<?xml version="1.0" standalone="yes"?> <Paper uid="W00-1324"> <Title>A query tool for syntactically annotated corpora*</Title> <Section position="4" start_page="191" end_page="192" type="metho"> <SectionTitle> 3 The query language </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="191" end_page="191" type="sub_section"> <SectionTitle> 3.1 Syntax </SectionTitle> <Paragraph position="0"> The query language chosen for the German Verbmobil corpus is a first-order logic without explicit quantification, in which variables are interpreted as existentially quantified. Negation is allowed only for atomic formulas. Even this very simple logic seems to give a high degree of expressive power with respect to the queries linguists are interested in (see for example (Kallmeyer, 2000) for theoretical investigations of query languages). However, the query language might be extended at a later stage.</Paragraph> <Paragraph position="1"> Let C (the node labels, i.e. syntactic categories and part-of-speech categories), E (the edge labels, i.e. grammatical functions) and T (the terminals, i.e. tokens) be pairwise disjoint finite sets. >, >>, .. are constants for the binary relations immediate dominance (parent relation), dominance (reflexive transitive closure of immediate dominance) and linear precedence. The set ℕ of natural numbers is used as the set of variables. Further, &amp;, |, ! are logical connectives (conjunction, disjunction, and negation).</Paragraph> <Paragraph position="2"> Definition 1 ((C, E, T)-queries) (C, E, T)-queries are inductively defined: (a) for all i ∈ ℕ, t ∈ T: token(i)=t and token(i)!=t are queries, (b) for all i ∈ ℕ, c ∈ C: cat(i)=c and cat(i)!=c are queries, (c) for all i ∈ ℕ, e ∈ E: fct(i)=e and fct(i)!=e are queries, (d) for all i, j ∈ ℕ:</Paragraph> <Paragraph position="4"> (e) for all queries q1, q2: q1 &amp; q2 and (q1 | q2) are queries. 
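The inductive definition can be mirrored by a small data structure. The following is only an illustrative sketch in Python, not part of the tool described here: the class names Unary, Binary, And and Or and the helper variables are hypothetical, and the form of the binary atoms in clause (d) is assumed to be a relation between two variables, optionally negated.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Unary:
    """Atoms of clauses (a)-(c): token(i)=t, cat(i)=c, fct(i)=e, or negated."""
    pred: str            # "token", "cat" or "fct"
    var: int             # node variable, a natural number
    value: str
    negated: bool = False


@dataclass(frozen=True)
class Binary:
    """Assumed shape of clause (d): a relation between two variables."""
    rel: str             # ">", ">>" or ".."
    left: int
    right: int
    negated: bool = False


@dataclass(frozen=True)
class And:
    left: "Query"
    right: "Query"


@dataclass(frozen=True)
class Or:
    left: "Query"
    right: "Query"


Query = Union[Unary, Binary, And, Or]


def variables(q):
    """Collect the node variables occurring in a query."""
    if isinstance(q, Unary):
        return {q.var}
    if isinstance(q, Binary):
        return {q.left, q.right}
    return variables(q.left) | variables(q.right)


# cat(1)=VF, conjoined with 1 >> 2 and fct(2)!=OA
example = And(Unary("cat", 1, "VF"),
              And(Binary(">>", 1, 2), Unary("fct", 2, "OA", negated=True)))
```

Because negation is a flag on atoms only, a negated conjunction or disjunction cannot be expressed in this representation, which matches the restriction stated above.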
Of course, when adapting this language to another corpus, other unary or binary predicates might be added to the query language, depending on the specific annotation scheme. This does not change the complexity of the query language in general.</Paragraph> <Paragraph position="5"> However, it is also possible that at a later point negation needs to be allowed in a general way or that quantification needs to be added to the query language for linguistic reasons. Such modifications would affect the complexity of the language and the performance of the tool. Therefore the decision was taken to keep the language as simple as possible in the beginning.</Paragraph> </Section> <Section position="2" start_page="191" end_page="192" type="sub_section"> <SectionTitle> 3.2 Intended models </SectionTitle> <Paragraph position="0"> In the case of the German Verbmobil corpus, the data structures are not trees, since structures as in Fig. 2, which shows the annotation of the long-distance wh-movement in (3), can occur. The structure in Fig. 2 does not have a unique root node, and the two nodes with label SIMPX stand in neither a dominance nor a linear precedence relation.</Paragraph> <Paragraph position="1"> (3) wen glaubst du liebt Maria whom believe you loves Maria 'whom do you believe Maria loves' Therefore, the models of our queries are defined as more general structures than finite trees.</Paragraph> <Paragraph position="2"> A model is a tuple (U, P, D, PS, μ, η, α) where U is the set of nodes, P, D and PS are the</Paragraph> <Paragraph position="4"/> </Section> </Section> <Section position="5" start_page="192" end_page="193" type="metho"> <SectionTitle> WFIN PPER VVFIN </SectionTitle> <Paragraph position="0"> binary relations immediate dominance (parent), dominance and linear precedence, μ is a function assigning syntactic categories or part-of-speech tags to nodes, η is a function mapping edges to grammatical functions, and α assigns tokens to the leaves (i.e. 
the nodes that do not dominate any other node).</Paragraph> <Paragraph position="1"> Definition 2 (Query model) Let C, E and T be disjoint alphabets.</Paragraph> <Paragraph position="2"> (U, P, D, PS, μ, η, α) is a query model with categories C, edge labels E and terminals T iff 1. U is a finite set with U ∩ (C ∪ E ∪ T) = ∅, the set of nodes.</Paragraph> <Paragraph position="3"> 2. P, PS, D ⊆ U × U, such that: (a) P is irreflexive, and for all x ∈ U there is at most one v ∈ U with (v, x) ∈ P.</Paragraph> <Paragraph position="4"> (b) D is the reflexive transitive closure of P, and D is antisymmetric.</Paragraph> <Paragraph position="5"> (c) PS is transitive.</Paragraph> <Paragraph position="6"> (d) for all x, y ∈ U: if (x, y) ∈ PS, then (x, y) ∉ D and (y, x) ∉ D.</Paragraph> <Paragraph position="7"> (e) for all x, y ∈ U: (x, y) ∈ PS iff for all z, w ∈ U with (x, z), (y, w) ∈ D, (z, w) ∈ PS holds.</Paragraph> <Paragraph position="8"> (f) for all x, y, z ∈ U: if (x, y), (x, z) ∈ D, then either (y, z) ∈ D or (z, y) ∈ D or (y, z) ∈ PS or (z, y) ∈ PS.</Paragraph> <Paragraph position="9"> 3. μ : U → C is a total function.</Paragraph> <Paragraph position="10"> 4. η : P → E is a total function.</Paragraph> <Paragraph position="11"> 5. α : {u ∈ U | there is no u' with (u, u') ∈ P} → T is a total function.</Paragraph> <Paragraph position="12"> With (b), (c) and (d), PS is also irreflexive and antisymmetric.</Paragraph> <Paragraph position="13"> In contrast to finite trees, our query models do not necessarily have a unique root node, i.e. a node that dominates all other nodes. Consequently, the so-called exhaustiveness property does not hold since two nodes in a query model might be completely disconnected. In other words, it does not hold in general that (x, y) ∈ D or (y, x) ∈ D or (x, y) ∈ PS or (y, x) ∈ PS for all x, y ∈ U. 
This holds only for nodes x, y ∈ U where a node z exists that dominates x and y.</Paragraph> <Section position="1" start_page="192" end_page="193" type="sub_section"> <SectionTitle> 3.3 Semantics </SectionTitle> <Paragraph position="0"> Satisfiability of a query q by a query model M is defined in the classical model-theoretic way with respect to an injective assignment g mapping node variables to nodes in the query model.</Paragraph> <Paragraph position="1"> Note that the condition that g needs to be injective means that different variables are considered to refer to different nodes. In this respect, Def. 3 differs from traditional model-theoretic semantics.</Paragraph> <Paragraph position="2"> As an example, consider the query for structures as in (1) that is shown in (4). The structure in Fig. 1 is a query model satisfying (4).</Paragraph> </Section> </Section> <Section position="6" start_page="193" end_page="195" type="metho"> <SectionTitle> 4 Storing the corpus in a database </SectionTitle> <Paragraph position="0"> As already mentioned, the general idea of the query tool is to store the information one wants to search for in a relational database and then to translate an expression in the query language presented in the previous section into an SQL expression that is evaluated on the database. The first part is performed by an initializing component and needs to be done only once per corpus, usually by the corpus administrator. The second part, i.e. the querying of the corpus, is performed by a query component.</Paragraph> <Paragraph position="1"> The tool is implemented in Java with Java Database Connectivity (JDBC) as interface and MySQL as database management system.</Paragraph> <Section position="1" start_page="193" end_page="193" type="sub_section"> <SectionTitle> 4.1 The relational database schema </SectionTitle> <Paragraph position="0"> The German Verbmobil corpus consists of several subcorpora. 
In the relational database there are two global tables, node_class and pair_class. Besides these, for each of the subcorpora identified by i there are tables tokens_i and node_pair_i. The database schema is shown in Fig. 3. The arrows represent foreign keys. The column cl_id in the table node_pair_i, for example, is a foreign key referring to the column cl_id in the table pair_class. This means that each entry for cl_id in node_pair_i uniquely refers to one entry for cl_id in pair_class.</Paragraph> <Paragraph position="1"> The content of the tables is as follows: * node_class contains node classes characterized by category (node label) and grammatical function (edge label between the node and its mother). Each node class has a unique identifier, namely the column n_id.</Paragraph> <Paragraph position="2"> * pair_class contains classes of node pairs characterized by the two node classes and the parent, dominance and linear precedence relation between the two nodes. The columns p1, p2, d1, d2, l1 and l2 stand for binary relations and have values 1 or 0 depending on whether the relation holds or not. p1 signifies immediate dominance of the first node over the second, p2 immediate dominance of the second over the first, d1 dominance of the first over the second, etc. Each node pair class has a unique identifier, namely its cl_id.</Paragraph> <Paragraph position="3"> * tokens_i contains all leaves from subcorpus i with their tokens (word).</Paragraph> <Paragraph position="4"> * node_pair_i contains all node pairs from subcorpus i with their pair class. Of course, only pairs of nodes belonging to one single annotation structure are stored.</Paragraph> </Section> <Section position="2" start_page="193" end_page="195" type="sub_section"> <SectionTitle> 4.2 Initializing the database </SectionTitle> <Paragraph position="0"> The storage of the corpus in the database is done by an initializing component. 
This component extracts information from the structures in export format (the format used for the German Verbmobil corpus) and stores it in the database. The export format explicitly encodes tokens, categories and edge labels, linear precedence between leaves and the parent (immediate dominance) relation. Dominance and linear precedence in general, however, need to be precompiled.</Paragraph> <Paragraph position="1"> First the dominance relation is computed simply as the reflexive transitive closure of the parent relation.</Paragraph> <Paragraph position="2"> Linear precedence on the leaves can be immediately extracted from the export format.</Paragraph> <Paragraph position="3"> When computing linear precedence for internal nodes, the specific properties of the data structures in Verbmobil (see Section 3) must be taken into account. Unlike in finite trees, for two nodes u1, u2, the fact that u1 dominates some x and u2 dominates some y and x is left of y is not sufficient to decide that u1 is left of u2. Instead (see axiom (e) in Def. 2) the following holds for two nodes u1, u2: u1 is left of u2 iff for all x, y dominated by u1, u2 respectively: x is left of y.</Paragraph> <Paragraph position="4"/> <Paragraph position="6"> [Fig. 4: sentence 24 in export format and corresponding structure]</Paragraph> <Paragraph position="7"> In general, the database schema itself does not reflect the concrete properties of the query model. In particular, the properties of the binary relations are not part of the database schema, e.g. considering only the database, the dominance and linear precedence relations are not necessarily transitive. Therefore, the query tool can be easily adapted to other data structures, for example to feature structures with reentrancy as annotations. 
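The precompilation of the two relations described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the Java code of the initializing component: the function names, the dictionary encoding of the parent relation and the toy node names S1 and S2 are hypothetical, and precedence between internal nodes is decided by comparing the leaves they dominate, which is one way to read axiom (e).

```python
def dominance(nodes, parent):
    """Reflexive transitive closure of the parent relation.

    parent maps a node to its mother; nodes without a mother are absent.
    The parent relation is assumed to be acyclic.
    """
    dom = {(n, n) for n in nodes}
    for n in nodes:
        anc = parent.get(n)
        while anc is not None:      # walk up to every ancestor of n
            dom.add((anc, n))
            anc = parent.get(anc)
    return dom


def precedence(nodes, parent, leaf_order):
    """u1 precedes u2 iff every leaf under u1 is left of every leaf under u2."""
    dom = dominance(nodes, parent)
    pos = {leaf: k for k, leaf in enumerate(leaf_order)}
    span = {}                       # node -> (leftmost, rightmost) leaf position
    for u, x in dom:
        if x in pos:
            lo, hi = span.get(u, (pos[x], pos[x]))
            span[u] = (min(lo, pos[x]), max(hi, pos[x]))
    return {(u1, u2)
            for u1 in span for u2 in span
            if span[u2][0] > span[u1][1]}


# Toy structure loosely modelled on example (3): two unconnected clauses.
leaves = ["wen", "glaubst", "du", "liebt", "Maria"]
parent = {"wen": "S1", "liebt": "S1", "Maria": "S1",
          "glaubst": "S2", "du": "S2"}
nodes = set(leaves) | {"S1", "S2"}
dom = dominance(nodes, parent)
prec = precedence(nodes, parent, leaves)
```

In the toy structure the leaves of S1 and S2 interleave, so neither node precedes the other, although "wen" precedes S2. In this sketch only these two functions know anything about the shape of the annotation structures, which is the part that would change if the tool were adapted to other data, for instance feature structures with reentrancy.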
In this case, a modification of the part of the initializing component that computes the binary relations would be sufficient.</Paragraph> <Paragraph position="8"> As an example, consider how sentence 24 in the subcorpus cd20 (identifier 20) is stored in the database. This sentence was chosen for the simple reason that it is not too long but contains enough nodes to provide a useful example. Besides this, its construction and its tokens are not of any interest here.</Paragraph> <Paragraph position="9"> Fig. 4 shows the sentence in its export format, i.e. the way it originally occurs in the corpus, together with a picture of the corresponding structure. Parts of the tables in the [...] corresponds to one node. The nodes are assigned numbers 0, 1, ... in the order of the lines in the export format. The nodes with tokens (i.e. the leaves) are inserted into the table tokens_20. Furthermore, each pair of nodes occurring in sentence 24 is inserted into the table node_pair_20 together with its pair class. Both orders of a pair are stored. [1] The pair classes and node classes belonging to a pair can be found in the two global tables. Consider for example the nodes 9 and 10 in sentence 24 (the node labelled NX that dominates beides zusammen and the topmost node with label NX). The cl_id of this pair is 1327. [1] In a previous version just one order was stored, but it turned out that for some queries this causes an exponential time complexity depending on the number of variables occurring in the query. This problem is avoided by storing both orders of a node pair.</Paragraph> <Paragraph position="10"> The corresponding entry in pair_class tells us that the second node is the mother of the first, that the second dominates the first, and that there is no linear precedence relation between the two nodes. 
Furthermore, the node classes identified by n_id1 and n_id2 are such that the first node has label NX and grammatical function HD whereas the second has label NX and no grammatical function.</Paragraph> </Section> <Section position="3" start_page="195" end_page="195" type="sub_section"> <SectionTitle> 4.3 The size of the database </SectionTitle> <Paragraph position="0"> So far, in order to test the tool, approximately one quarter of the German Verbmobil corpus is stored in the database, namely the following subcorpora (table columns: id, subcorpus, trees, tokens, pairs). The table pair_class has 23024 entries and node_class has 213 entries. The following table shows the current size of the files:</Paragraph> </Section> </Section> <Section position="7" start_page="195" end_page="195" type="metho"> <SectionTitle> 5 Searching the corpus </SectionTitle> <Paragraph position="0"> In order to search the corpus, one of course needs to know the specific properties of the annotation scheme. These are described in the STTS guidelines (Schiller et al., 1995) and the Verbmobil stylebook (Stegmann et al., 1998), both of which must be available to any user of the query tool.</Paragraph> <Paragraph position="1"> Currently, the query component does not yet process all possible expressions in the query language. In particular, it does not allow disjunctions and it does not allow querying for tokens. Other atomic queries combined with negations and conjunctions are possible. In particular, complex syntactic structures involving category and edge labels and binary relations can be searched. The query component will be completed very soon to process all queries defined in Section 3.</Paragraph> <Paragraph position="2"> The query component takes an expression in the query language as input and translates this into a corresponding SQL expression, which is then passed to the database. 
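To make the translation step concrete, here is a rough Python sketch of how a conjunction of binary atoms could be mapped onto the schema of Section 4.1. It is not the translation implemented in the tool: the function to_sql and its argument format are hypothetical, the column name cat in node_class is an assumption (the schema figure is not reproduced here), and only positive binary atoms and category constraints are handled. Category values are passed as parameters for ? placeholders, as one would do with a JDBC prepared statement.

```python
# Column of pair_class that records "relation holds from node1 to node2".
REL_COLUMN = {">": "p1", ">>": "d1", "..": "l1"}


def to_sql(subcorpus, atoms, cats):
    """Translate atoms [(rel, i, j), ...] plus {var: category} into SQL.

    Returns the SQL string and the list of parameters for its placeholders.
    """
    if not atoms:
        raise ValueError("at least one binary atom is required")
    tables, where, params, seen = [], [], [], {}
    for k, (rel, i, j) in enumerate(atoms, start=1):
        np_, pc = f"np{k}", f"pc{k}"
        tables += [f"node_pair_{subcorpus} {np_}", f"pair_class {pc}"]
        where.append(f"{np_}.cl_id={pc}.cl_id")
        where.append(f"{pc}.{REL_COLUMN[rel]}=1")
        if k > 1:                   # all pairs must come from the same tree
            where.append(f"np1.tree_id={np_}.tree_id")
        for var, node_col, class_col in ((i, "node1", "n_id1"),
                                         (j, "node2", "n_id2")):
            here = f"{np_}.{node_col}"
            if var in seen:         # variable reused: same node as before
                where.append(f"{seen[var]}={here}")
                continue
            seen[var] = here
            if var in cats:         # assumed column name: node_class.cat
                nc = f"nc{var}"
                tables.append(f"node_class {nc}")
                where.append(f"{pc}.{class_col}={nc}.n_id")
                where.append(f"{nc}.cat=?")
                params.append(cats[var])
    sql = ("SELECT DISTINCT np1.tree_id FROM " + ", ".join(tables)
           + " WHERE " + " AND ".join(where))
    return sql, params


# A VF (variable 1) dominating a PX (variable 2) that precedes node 3.
sql, params = to_sql(20, [(">>", 1, 2), ("..", 2, 3)], {1: "VF", 2: "PX"})
```

Each binary atom contributes one node_pair alias and one pair_class alias, so the number of joined tables grows with the number of atoms in the query. The sketch does not enforce that different variables refer to different nodes.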
As an example, consider again the query (4) repeated here as (5):</Paragraph> <Paragraph position="4"> For query (5) as input performed on the subcorpus cd20, the query component produces the following SQL query:</Paragraph> </Section> <Section position="8" start_page="195" end_page="195" type="metho"> <SectionTitle> SELECT DISTINCT np1.tree_id FROM </SectionTitle> <Paragraph position="0"> As a second example consider the search for long distance wh-movements as in (3). The annotation of (3) using the Verbmobil annotation scheme was shown in Fig. 2. Such structures might be characterized by the following properties: there is an interrogative pronoun (part-of-speech tag PWS for substituting interrogative pronoun) that is part of a simplex clause and there is another simplex clause containing a finite verb such that the two simplex clauses are not connected and the pronoun precedes the finite verb. This leads to the query (6): np1.cl_id=pc1.cl_id np1.node1=np2.node1 np1.node2=np3.node1 np1.tree_id=np2.tree_id np2.cl_id=pc2.cl_id np2.node2=np4.node1 np2.tree_id=np3.tree_id np3.cl_id=pc3.cl_id np3.node2=np4.node2 np3.tree_id=np4.tree_id np4.cl_id=pc4.cl_id; Currently the database and the tool are running on a Pentium II PC (400 MHz, 128 MB) under Linux. On this machine, example (5) takes 1.46 sec to be answered by MySQL, and example (6) takes 6.43 sec. This shows that although the queries, in particular the last one, are quite complex and involve many intermediate results, the system performs efficiently. The performance of course depends crucially on the size of intermediate results. In cases where more than one node pair is searched for (as in the two examples above) the order of the pairs is important since the result set of the first pair restricts the second pair. In (5) for example, first a node pair with a PX with function OA-MOD dominated by a VF is searched for. 
Afterwards, the search for the NX with function OA in the VF is restricted to those trees that were found when searching for the first pair. Obviously, the first pair is much more restrictive than the second. If the order is reversed, the query takes much longer to process. Currently the ordering of the pairs must be done by the user, i.e. it depends on the incoming query. However, we plan to implement, at least partly, an ordering of the binary conjuncts in the query depending on the frequency of the syntactic categories and grammatical functions involved in the pairs. The obvious advantage of using a relational database to store the corpus is that some parts of the work, such as the search of the corpus, are taken over by the database management system. Furthermore, and this is crucial, the indexing functionalities of the database management system can be used to increase the performance of the tool, e.g. indexes are put on cl_id in node_pair_i and on n_id1 and n_id2 in pair_class.</Paragraph> </Section> </Paper>