<?xml version="1.0" standalone="yes"?> <Paper uid="P05-3028"> <Title>A Flexible Stand-Off Data Model with Query Language for Multi-Level Annotation</Title> <Section position="4" start_page="0" end_page="109" type="metho"> <SectionTitle> 2 The Data Model </SectionTitle> <Paragraph position="0"> We propose a stand-off data model implemented in XML. The basedata is stored in a simple XML file which serves to identify individual tokens and associate an ID with each (Figure 1).</Paragraph> <Paragraph position="1"> In addition, there is one XML file for each annotation level. Each level has a unique, descriptive name, e.g. utterances or pos, and contains annotations in the form of &lt;markable&gt; elements. In the simplest case, a markable only identifies a sequence (i.e. span) of basedata elements (Figure 2).</Paragraph> <Paragraph position="2"> Normally, however, a markable is also associated with arbitrarily many user-defined attribute-value pairs (Figure 3, Figure 4). Markables can also be discontinuous, like markable 954 in Figure 4.</Paragraph> <Paragraph position="3"> For each level, admissible attributes and their values are defined in a separate annotation scheme file (not shown, cf. Müller &amp; Strube (2003)). Freetext attributes can have any string value, while nominal attributes can have one of a (user-defined) closed set of possible values. The data model also supports associative relations between markables: Markable set relations associate arbitrarily many markables with each other in a transitive, undirected way. 
The coref_class attribute in Figure 4 is an example of how such a relation can be used to represent a coreferential relation between markables (here: markable 954 and markable 963, rest of set &lt;?xml version=&quot;1.0&quot; encoding=&quot;US-ASCII&quot;?&gt; &lt;!DOCTYPE markables SYSTEM &quot;markables.dtd&quot;&gt; &lt;markables xmlns=&quot;www.eml.org/NameSpaces/pos&quot;&gt; ...</Paragraph> <Paragraph position="4"> &lt;markable id=&quot;markable_665&quot; span=&quot;word_1064&quot; pos=&quot;PRP$&quot;/&gt; &lt;markable id=&quot;markable_666&quot; span=&quot;word_1065&quot; pos=&quot;,&quot;/&gt; &lt;markable id=&quot;markable_667&quot; span=&quot;word_1066&quot; pos=&quot;UH&quot;/&gt; &lt;markable id=&quot;markable_668&quot; span=&quot;word_1067&quot; pos=&quot;,&quot;/&gt; &lt;markable id=&quot;markable_669&quot; span=&quot;word_1068&quot; pos=&quot;NN&quot;/&gt; &lt;markable id=&quot;markable_670&quot; span=&quot;word_1069&quot; pos=&quot;VBZ&quot;/&gt; &lt;markable id=&quot;markable_671&quot; span=&quot;word_1070&quot; pos=&quot;DT&quot;/&gt; &lt;markable id=&quot;markable_672&quot; span=&quot;word_1071&quot; pos=&quot;NNP&quot;/&gt; &lt;markable id=&quot;markable_673&quot; span=&quot;word_1072&quot; pos=&quot;NNP&quot;/&gt; &lt;markable id=&quot;markable_674&quot; span=&quot;word_1073&quot; pos=&quot;NNP&quot;/&gt; &lt;markable id=&quot;markable_675&quot; span=&quot;word_1074&quot; pos=&quot;NN&quot;/&gt; &lt;markable id=&quot;markable_676&quot; span=&quot;word_1075&quot; pos=&quot;IN&quot;/&gt; &lt;markable id=&quot;markable_677&quot; span=&quot;word_1076&quot; pos=&quot;IN&quot;/&gt; &lt;markable id=&quot;markable_678&quot; span=&quot;word_1077&quot; pos=&quot;NNP&quot;/&gt; &lt;markable id=&quot;markable_679&quot; span=&quot;word_1078&quot; pos=&quot;.&quot;/&gt;</Paragraph> </Section> <Section position="5" start_page="109" end_page="111" type="metho"> <SectionTitle> 3 Simplified MMAXQL </SectionTitle> <Paragraph position="0"> Simplified MMAXQL is a variant of the MMAXQL query language. 
It offers a simpler and more concise way to formulate certain types of queries for multi-level annotated corpora. Queries are automatically converted into the underlying query language and then executed. A query in simplified MMAXQL consists of a sequence of query tokens which are combined by means of relation operators. Each query token queries exactly one basedata element (i.e. word) or one markable.</Paragraph> <Section position="1" start_page="109" end_page="110" type="sub_section"> <SectionTitle> 3.1 Query Tokens </SectionTitle> <Paragraph position="0"> Basedata elements can be queried by matching regular expressions. Each basedata query token consists of a regular expression in single quotes, which must exactly match one basedata element. The query '[Tt]he' matches all definite articles, but not e.g. ether or there. For the latter two words to also match, wild-cards have to be used: '.*[Tt]he.*'. Sequences of basedata elements can be queried by simply concatenating several space-separated tokens. The query '[Tt]he [A-Z].+' will match sequences consisting of a definite article and a word beginning with a capital letter. Markables are the carriers of the actual annotation information. They can be queried by means of string matching and by means of attribute-value combinations. A markable query token has the form string/conditions where string is an optional regular expression and conditions specifies which attribute(s) the markable should match. The simplest 'condition' is just the name of a markable level, which will match all markables on that level. If a regular expression is also supplied, the query will return only the matching markables. The query [Aa]n?\s.*/ref_exp will return all markables from the ref_exp level beginning with the indefinite article.</Paragraph> <Paragraph position="1"> The conditions part of a markable query token can be much more complex. 
A main feature of simplified MMAXQL is that redundant parts of conditions can optionally be left out, making queries very concise. For example, the markable level name can be left out if the name of the attribute accessed by the query is unique across all active markable levels. Thus, the query /!coref_class=empty can be used to query markables from the ref_exp level which have a non-empty value in the coref_class attribute, granted that only one attribute of this name exists; otherwise, the query has to be disambiguated by prepending the markable level name. The same applies to the names of nominal attributes if the value specified in the query unambiguously points to this attribute. For instance, a query that specifies only the value pn can be used to query markables from the pos level which have that value, granted that there is exactly one nominal attribute with the possible value pn. Several conditions can be combined into one query token. Thus, the query /{poss_det,pron},!coref_class=empty returns all markables from the ref_exp level that are either possessive determiners or pronouns and that are part of some coreference set.</Paragraph> </Section> <Section position="2" start_page="110" end_page="111" type="sub_section"> <SectionTitle> 3.2 Relation Operators </SectionTitle> <Paragraph position="0"> The whole point of querying corpora with multi-level annotation is to relate markables from different levels to each other. The reference system with respect to which the relation between different markables is established is the sequence of basedata elements, which is the same for all markables on all levels. Since this bears some resemblance to different events occurring in several temporal relations to each other, we (like Heid et al. (2004), among others) adopt this as a metaphor for expressing the sequential and hierarchical relations between markables, and we use a set of relation operators that is inspired by Allen (1991). 
This set includes (among others) the operators before, meets (default), starts, during/in, contains/dom, equals, ends, and some inverse relations. The following examples give an idea of how individual query tokens can be combined by means of relation operators to form complex queries. The example uses the ICSI meeting corpus of spoken multi-party dialogue. This corpus contains, among others, a segment level with markables roughly corresponding to speaker turns, and a meta level containing markables representing e.g. pauses, emphases, or sounds like breathing or mike noise. These two levels and the basedata level can be combined to retrieve instances of you know that occur in segments spoken by female speakers which also contain a pause or an emphasis.</Paragraph> <Paragraph position="1"> Relation operators for associative relations (i.e. markable set and markable pointer) are nextpeer, anypeer and nexttarget, anytarget, respectively. Assuming the sample data from Section 2, the query /ref_exp nextpeer:coref_class /ref_exp retrieves pairs of anaphors (right) and their direct antecedents (left). The query can be modified to /ref_exp nextpeer:coref_class (/ref_exp equals /pron) to retrieve only anaphoric pronouns and their direct antecedents.</Paragraph> <Paragraph position="2"> If a query is too complex to be expressed as a single query token sequence, variables can be used to store intermediate results of sub-queries. The following query retrieves pairs of utterances (incl. 
the referring expressions embedded into them) that are more than 30 tokens apart, and assigns the resulting 4-tuples to the variable $distant_utts.</Paragraph> <Paragraph position="3"> (/utterances dom /ref_exp) before:31- (/utterances dom /ref_exp) -> $distant_utts The next query accesses the second and last column in the temporary result (by means of the zero-based column index) and retrieves those pairs of anaphors and their direct antecedents that occur in utterances that are more than 30 tokens apart: $distant_utts.1 nextpeer:coref_class $distant_utts.3</Paragraph> </Section> </Section> </Paper>