<?xml version="1.0" standalone="yes"?>
<Paper uid="A97-1014">
  <Title>An Annotation Scheme for Free Word Order Languages</Title>
  <Section position="4" start_page="88" end_page="88" type="metho">
    <SectionTitle>
[Phrase-structure tree with traces for "daran wird ihn Anna erkennen, dass er weint"]
</SectionTitle>
    <Paragraph position="0"> The fairly short sentence contains three non-local dependencies, marked by co-references between traces and the corresponding nodes. This hybrid representation makes the structure less transparent, and therefore more difficult to annotate.</Paragraph>
    <Paragraph position="1"> Apart from this rather technical problem, two further arguments speak against phrase structure as the structural pivot of the annotation scheme: * Phrase structure models stipulated tbr nonconfigura.tionM languages differ strongly from each other, presenting a challenge to the intended theory-independence of the schelne.</Paragraph>
    <Paragraph position="2"> * Constituent structure serves as an exl)la.natory device for word order variation, which is difficult to reconcile with the descriptivity requirement.</Paragraph>
    <Paragraph position="3"> Finally, the structural handling of free word order means stating well-formedness constraints on structures involving many trace-filler dependencies, which ha:s proved tedious. Since most methods of handling discontinuous constituents make the fornaalism more powerfifl, the efficiency of processing deteriorates, too.</Paragraph>
    <Paragraph position="4"> An Mternative solution is to make argurnent structure the main structural component of the formalism. This assumption underlies a growing number of recent syntactic theories which give up the context-free constituent ba.ckbone, cf. (McCawley, 1987), (Dowty, 1989), (Reape, 1993), (Kathol and Pollard, 1995). These approaches provide an adequate explanation for several issues problematic ibr phrase-structure grammars (clause union, extraposition, diverse second-position phenomena).</Paragraph>
    <Section position="1" start_page="88" end_page="88" type="sub_section">
      <SectionTitle>
2.4 Annotating Argument Structure
</SectionTitle>
      <Paragraph position="0"> Argument structure can be represented in terms of unordered trees (with crossing branches). In order to reduce their ambiguity potential, rather simple, 'flat' trees should be employed, while more information can be expressed by a rich system of function labels.</Paragraph>
      <Paragraph position="1"> Furthermore, the required theory-independence means that the form of syntactic trees should not reflect theory-specific assumptions, e.g. every syntactic structure has a unique hea.d. Thus, notions such as head should be distinguished at the level of syntactic flmctions rather than structures. This requirement speaks against the traditional sort of d~:pendency trees, in which heads are represented as non-terminal nodes, cf. (Hudson, 1984).</Paragraph>
      <Paragraph position="2"> A tree meeting these requirements is given below:</Paragraph>
      <Paragraph position="4"> daran wird ihn Anna erkennen, &amp;tss er weint Such a word order independent representation has the advantage of all structural ini'orrrlation being encoded in a single data structure. A unifbrm representation of local and non-local dependencies makes the structure more transparent 1 .</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="88" end_page="89" type="metho">
    <SectionTitle>
3 The Annotation Scheme
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="88" end_page="89" type="sub_section">
      <SectionTitle>
3.1 Architecture
</SectionTitle>
      <Paragraph position="0"> Argument structure, represented in terms of unordered trees.</Paragraph>
      <Paragraph position="1"> Grammatical functions, encoded in edge labels, e.g. SB (subject), MO (modifier), HD (head). Syntactic categories, expresse(l by category labels assigned to non-terminal nodes and by part-of-speech tags assigned to terlninals.</Paragraph>
    </Section>
    <Section position="2" start_page="89" end_page="89" type="sub_section">
      <SectionTitle>
3.2 Argument Structure
</SectionTitle>
      <Paragraph position="0"> A structure for (2) is shown in fig. 2.</Paragraph>
      <Paragraph position="1"> (2) schade, dM~ kein Arzt anwesend ist, tier pity that no doctor present is who sich auskennt is competent 'Pity that no competent doctor is here' Note that the root node does not have a head descendant (HD) as the sentence is a predicative construction consisting of a subject (SB) and a predicate (PD) without a copula. The subject is itself a sentence in which the copula (is 0 does occur and is assigned the tag HD 2.</Paragraph>
      <Paragraph position="2"> The tree resembles traditional constituent structures. The difference is its word order independence: structural units (&amp;quot;phrases&amp;quot;) need not be contiguous substrings. For instance, the extraposed relative clause (RC) is still treated as part of the subject NP.</Paragraph>
      <Paragraph position="3"> As the annotation scheme does not distinguish different bar levels or any similar intermediate categories, only a small set of node labels is needed (currently 16 tags, S, NP, AP ...).</Paragraph>
    </Section>
    <Section position="3" start_page="89" end_page="89" type="sub_section">
      <SectionTitle>
3.3 Grammatical Functions
</SectionTitle>
      <Paragraph position="0"> Due to the rudimentary character of the argument structure representations, a great deal of reformation has to be expressed by gramnlatical functions. Their further classification must reflect different kinds of linguistic information: morphology (e.g., case, inflection), category, dependency type (complementation vs. modification), thematic role, etc. 3 However, there is a trade-off between the granularity of information encoded in the labels and the speed and accuracy of annotation. In order to avoid inconsistencies, the corpus is annotated in two stages: basic annotalion and r'efincment. While in the first phase each annotator has to annotate structures as well as categories and functions, the refinement can be done separately for each representation level.</Paragraph>
      <Paragraph position="1"> During the first, phase, the focus is on almotating correct structures and a coarse-grained classification of grammatical functions, which represent the following areas of information: 2CP stands for conwlementizer, OA for accusative object and RC for relative clause. NK denotes a 'kernel NP' component (v. section 4.1).</Paragraph>
      <Paragraph position="2"> aFor an extensive use of gr;tnllnaticM functions Cf.</Paragraph>
      <Paragraph position="3"> (K~trlsson et al., 1995), (Voutilainen, 1994).</Paragraph>
      <Paragraph position="4"> Dependency type: complemcnls are fllrther classified according to features su(:h as category and case: clausal complements (OC), accusative objects (OA), datives (DA), etc. Modifiers are assigned the label MO (further classification with respect to thematic roles is planned). Separate labels are defined for dependencies that do not fit the complement/modifier dichotomy, e.g., pre- (GL) and postnominal genitives (GR).</Paragraph>
      <Paragraph position="5"> Headedness versus non-headedness: Headed and non-headed structures are distinguished by the presence or absence of a branch labeled HD.</Paragraph>
      <Paragraph position="6"> Morphological information: Another set of labels represents morphological information. PM stands for moTThological partich, a label tbr German infinitival zu aml superlative am. Separable verb prefixes are labeled SVP.</Paragraph>
      <Paragraph position="7"> During the second annotation stage, the annotation is enriched with information about, thematic roles, quantifier scope and anaphoric ret)rence. As already mentioned, this is done separately for each of the three information areas.</Paragraph>
    </Section>
    <Section position="4" start_page="89" end_page="89" type="sub_section">
      <SectionTitle>
3.4 Structure Sharing
</SectionTitle>
      <Paragraph position="0"> A phrase or a lexical item can perform multiple functions in a sentence. Consider ~.qui verbs where the subject of the infinitival VP is not realised syntactically, but co-referent with the subject or object of the matrix equi verb: (3) er bat reich ZU kolnlnen he asked me to come (mich is the imderstood subject of komm~.u.). In such cases, an additional edge is drawn from tim embed(led VP node to the controller, thus changing the syntactic tree into a graph. We call such additional edges secondary links and represent them as dotted lines, see fig. 4, showing the structure of (3).</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="89" end_page="90" type="metho">
    <SectionTitle>
4 Treatment of Selected Phenomena
</SectionTitle>
    <Paragraph position="0"> As theory-independence is one of our objectives, the annotation scheme incorporates a number of widely accepted linguistic analyses, especially ill the area of verbal, adverbial and adjectival syntax. However, some other s~andard analyse.s turn out to be proMemarie, mainly due to the partial, idealised character of competence grammars, which often margmalise or ignore such important phenolnena as 'deficient' (e.g. headless) constructions, apl)ositions, temporal expressions, etc.</Paragraph>
    <Paragraph position="1"> In the following paragraphs, we give annotations for a number of such phenomena.</Paragraph>
    <Section position="1" start_page="89" end_page="90" type="sub_section">
      <SectionTitle>
4.1 Noun Phrases
</SectionTitle>
      <Paragraph position="0"> Most linguistic theories treat NPs as structures hea(led by a unique lexical item (no,m) However, this  idealised model needs severa.l additional assumptions in order to account for such important phenomena as complex norninal NP components (cf. (4))  or nominalised a.djectives (of. (5)).</Paragraph>
      <Paragraph position="1"> (4) my uncle Peter Smith (5) tier sehr (41iickliche the very lta.ppy 'tire very ha.pl)y one'  In (4), different theories make different headedness predictions. In (5), either a lexical nominalisation rule for the adjective Gliicklichc is stipulated, or the existence of an empty nominal head. Moreover, the so-called DP analysis views the article der as the head of the phrase. Further differences concern the a.ttachment of the degree modifier ,ehr.</Paragraph>
      <Paragraph position="2"> Because of the intended theory-independence of the scheme, we annotate only the cornmon rninimum. We distinguish an NP kernel consisting of determiners, a.djective phrases and nouns. All components of this kernel are assigned the label NK aml trea.ted as sibling nodes.</Paragraph>
      <Paragraph position="3"> The diff&gt;rence between the particular NK's lies in the positional and part-of-speech information, which is also sufficient to recover theory-specific structures frorn our 'underspecified' representations. For instance, the first determiner among the NK's can be treated as the specifier of the phrase. The head of the phrase can be determined in a similar way according to theory-specific assumptions.</Paragraph>
      <Paragraph position="4"> In addition, a number of clear-cut NP components can be defined outside that juxtapositional kernel: pre- and postnorninal genitives (GL, GR), relative clauses (RC), clausal and sentential complements (OC). They are all treated as siblings of NK's regardless of their position (in situ or extraposed).</Paragraph>
    </Section>
    <Section position="2" start_page="90" end_page="90" type="sub_section">
      <SectionTitle>
4.2 Attachment Ambiguities
</SectionTitle>
      <Paragraph position="0"> Adjunct attachment often gives rise to structural ambiguities or structural uncertainty. However, fill or partial disambiguation takes place in context, and the annotators do not consider unrealistic readings.</Paragraph>
      <Paragraph position="1"> In addition, we have adopted a simple convention for those cases in which context information is insufficient f~)r total disaml~iguat,ion: the highest possible attachment site is chosen.</Paragraph>
      <Paragraph position="2"> A similar convention has been adopted ibr constructions in which scope ambiguities ha.ve syntactic effe, cts but a. one-to-one correspondence between scope a.nd attachment does not seem reasonable, cf.</Paragraph>
      <Paragraph position="3"> focus particles such a.s only or also. If the scope of such a word does not directly correspond to a tree node, the word is attached to the lowest node dominating all subconstituents a.pl)earing ill its scope.</Paragraph>
    </Section>
    <Section position="3" start_page="90" end_page="90" type="sub_section">
      <SectionTitle>
4.3 Coordination
</SectionTitle>
      <Paragraph position="0"> A problem for the rudimentary a.rgument structure representations is tile use of incomplete structures in natural language, i.e. t)henornena such as coordination and ellipsis. Since a precise structural description of non-constituent coordination would require a rich inventor.), of incomplete phrase types, we have agreed on a sort of nnderspecified representations: the coordinated units are assigned structures in which missing lexical material is not represented at the level of primary links. Fig. 3 shows the representation of the sentence: (6) sie wurde van preuliischen Truppen besetzt site was by Prussiaa, troops occupied und 1887 dem preutlischen Staat angegliedert and 1887 to-the Prussia.n state incorporated 'it was occupied by Prussian troops and incorporated into Prussia i,t 1887' The category of the coordination is labeled CVP here, where C stands for coordination, and VP tar the actual category. This extra, marking makes it easy to distinguish between 'normal' and coordinated categories.</Paragraph>
      <Paragraph position="1"> Multiple coordination as well a.s enumerations are annotated in the same way. An explicit coordinating conjunction need not be present.</Paragraph>
      <Paragraph position="2"> Structure-sharing is expressed using secondary links.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="90" end_page="92" type="metho">
    <SectionTitle>
5 The Annotation Tool
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="90" end_page="90" type="sub_section">
      <SectionTitle>
5.1 Requirements
</SectionTitle>
      <Paragraph position="0"> The development of linguistically interpreted corpora, presents a laborious and time-consuming task.</Paragraph>
      <Paragraph position="1"> In order to make the annotation process more efficient, extra effort has been put into the development of an annotation tool.</Paragraph>
      <Paragraph position="2"> The tool supports immediate graphical feedback and automatic error checking. Since our scheme permits crossing edges, visualisa.tion as bracketing and indentation would be insufficient. Instead, the con&gt; plete structure should be represented.</Paragraph>
      <Paragraph position="3"> The tool should also permit a convenient handling of node and edge hd)els. In particular, variable tagsets and label collections should be allowed.</Paragraph>
    </Section>
    <Section position="2" start_page="90" end_page="91" type="sub_section">
      <SectionTitle>
5.2 Implementation
</SectionTitle>
      <Paragraph position="0"> As the need for certain flmctionalities becomes obvious with growing annota.tion experience, we have decided to iml)lement the tool in two stages. In the first phase, the ma.in flmctionality for buihling and displaying unordered trees is supplied. In the second phase, secondary links and additional structural flmctions are supported. The implementation of the first phase as described in the following paragraphs is completed.</Paragraph>
      <Paragraph position="1"> As keyboard input is rnore efficient than mouse input (cf. (Lehmalm et al., 1!)95)) rnost effort has been put in developing an efficient keyboard interlace. Menus are supported as a. usefld way of getting  help on commands and labels. In addition to pure annotation, we can attach conlments to structures.</Paragraph>
      <Paragraph position="2"> Figure 1 shows a screen dump of the tool. The largest part of the window contains the graphical representation of tim structure being annot, ate(t. The tbllowing commands are available:  The three tagsets used by the annotation tool (for words, phrases, and edges) are variable and are stored together with the corpus. This allows easy modification if needed. The tool checks the appropriateness of the input.</Paragraph>
      <Paragraph position="3"> For the implementation, we used Tcl/Tk Version 4.1. The corpus is stored in a SQL database.</Paragraph>
    </Section>
    <Section position="3" start_page="91" end_page="91" type="sub_section">
      <SectionTitle>
5.3 Automation
</SectionTitle>
      <Paragraph position="0"> The degree of automation increases with the amount of data available. Sentences annotated in previous steps are used as training material for further processing. We distinguish five degrees of automation:  0) Completely manual annotation.</Paragraph>
      <Paragraph position="1"> 1) The user determines phrase boundaries and syntactic categories (S, NP, etc.). The program automatically assigns grammatical fimetion labels. The annotator can alter the assigned tags. 2) The user only determines the conrponents of a new phrase, the program determines its syntactic category and the grammatical functions of its elements. Again, the annotator has the option of altering the assigned tags.</Paragraph>
      <Paragraph position="2"> 3) Additionally, the program performs simple bracketing, i.e., finds 'kernel' phrases.</Paragraph>
      <Paragraph position="3"> 4) Tile tagger suggests partial or cornplete parses.  So far, about 1100 sentences of our corpus have been annotated. This amount of data suffices as training material to reliably assign the grammatical functions if the user determines the elements of a phrase and its type (step 1 of the list above).</Paragraph>
    </Section>
    <Section position="4" start_page="91" end_page="92" type="sub_section">
      <SectionTitle>
5.4 Assigning Grammatical Function Labels
</SectionTitle>
      <Paragraph position="0"> Grammatical functions are assigned using standard statistical part-of-speech tagging methods (cf. e.g.</Paragraph>
      <Paragraph position="1"> (Cutting et al., 1992) and (Feldweg, 1995)).</Paragraph>
      <Paragraph position="2"> For a phrase Q with children of type T ...... T~: and grammatical fimctions G,,..., (7~:, we use the lexical probabilities PO(GiITi) and the contextual (trigram) probabilities</Paragraph>
      <Paragraph position="4"> The lexical and contextual probabilities are determined separately for each type of phrase. During annotation, the highest rated granmlatical fimction labels Gi a.re calculated using the Viterbi algorithnr and a.ssigned to the structure, i.e., we. &lt;'Mculate</Paragraph>
      <Paragraph position="6"> To keep the human annotator from missing errors made by the tagger, we additionally calculate the strongest competitor for each label Gi. If its probability is close to the winner (closeness is defined by a threshold on the quotient), the assignment is regarded as unreliable, and the annotator is asked to confirm the assignment.</Paragraph>
      <Paragraph position="7"> For evaluation, the already annota.ted sentences were divided into two disjoint sets, one tbr training (90% of the corpus), the other one tbr testing (10%).</Paragraph>
      <Paragraph position="8"> The procedure was repeated 10 times with different partitionings.</Paragraph>
      <Paragraph position="9"> The tagger rates 90% of all assignments as reliable and carries them out fully automatically. Accuracy for these cases is 97%. Most errors are due to wrong identification of the subject and different kinds of objects in sentences and VPs. Accuracy of the unreliable 10% of assignments is 75%, i.e., the annotator has to alter the choice in 1 of 4 cases when asked ibr confirmation. Overall accuracy of the tagger is 95%.</Paragraph>
      <Paragraph position="10"> Owing to the partial automation, the average annotation efficiency improves by 25% (from around 4 minutes to 3 minutes per sentence).</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>