<?xml version="1.0" standalone="yes"?>
<Paper uid="W98-1207">
  <Title>Automation of Treebank Annotation</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Annotating Argument Structure
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Annotation Scheme
</SectionTitle>
      <Paragraph position="0"> Unlike most treebanks of English, our corpus is annotated with predicate-argumenl s~ructures and not phrase-structure trees. The reason is the free word order in German, a feature seriously affecting the transparency of traditional phrase structures. Thus local and non-local dependencies are represented in</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
[Figure 1: Sample structure from the treebank; word glosses: must, thought-over, be, about-it ('it has to be thought over'); POS tags shown include VMFIN, VVPP, VAINF]
</SectionTitle>
    <Paragraph position="0"/>
    <Paragraph position="1"> the same way, at the cost of allowing crossing tree branches, as shown in figure 1 (the nodes and edges are labeled with category and function symbols, respectively; see appendix A). Such a direct representation of the predicate-argument relation makes annotation easier than it would be if additional trace-filler co-references were used for encoding discontinuous constituents. Furthermore, our scheme facilitates automatic extraction of valence frames and the construction of semantic representations.</Paragraph>
    <Paragraph position="2"> On the other hand, the predicate-argument structures used for annotating our corpus can still be converted automatically into phrase-structure trees if necessary, cf. (Skut et al., 1997a). For more details on the annotation scheme v. (Skut et al., 1997b).</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 The Annotation Mode
</SectionTitle>
      <Paragraph position="0"> In order to make annotation more reliable, each sentence is annotated independently by two annotators.</Paragraph>
      <Paragraph position="1"> Afterwards, the results are compared, and both annotators have to agree on a unique structure. In 1 The nodes and edges are labeled with category and function symbols, respectively (see appendix A).</Paragraph>
      <Paragraph position="2"> Brants and Skut 49 Automation of Treebank Annotation Thorsten Brants and Wojciech Skut (1998) Automation of Treebank Annotation. In D.M.W. Powers (ed.) NeMLaP3/CoNLL98: New Methods in Language Processing and Computational Natural Language Learning, ACL, pp 49-57. case of persistent disagreement or uncertainty, the grammarian supervising the annotation work is consulted. null It has turned out that comparing annotations involves significantly more effort than annotation proper. As we do not want to abandon the annotateand-compare strategy, additional effort has been put into the development of tools supporting the comparison of annotated structures (see section 4).</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Automation
</SectionTitle>
    <Paragraph position="0"> The efficiency of annotation can be significantly increased by using automatic annotation tools. Nevertheless, some form of human supervision and hand-correction is necessary to ensure sufficient reliability. As pointed out by (Marcus, Santorini, and Marcinkiewicz, 1994), such a semi-automatic annotation strategy turns out to be superior to purely manual annotation in terms of accuracy and efficiency. Thus in most treebank projects, the task of the annotators consists in correcting the output of a parser, cf. (Marcus, Santorini, and Marcinkiewicz, 1994), (Black et al., 1996).</Paragraph>
    <Paragraph position="1"> As for our project, the unavailability of broad-coverage argument-structure and dependency parsers made us adopt a bootstrapping strategy.</Paragraph>
    <Paragraph position="2"> Having started with completely manual annotation, we axe gradually increasing the degree of automation. The corpus annotated so far serves as training material for annotation tools based on statistical NLP methods, and the degree of automation increases with the amount of annotated sentences.</Paragraph>
    <Paragraph position="3"> Automatic processing and manual input are combined interactively: the annotator specifies some information, another piece of information is added automatically, the annotator adds new information or corrects parts of the structure, new parts are added automatically, and so on. The size and type of such annotation increments depends on the size of the training corpus. Currently, manual annotation consists in specifying the hierarchical structure, whereas category and function labels as well as simple sub-structures are assigned automatically. These automation steps are described in the following sections. null</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Tagging Grammatical Functions
</SectionTitle>
      <Paragraph position="0"> Assigning grammatical functions to a given hierarchical structure is based on a generalization of standard paxt-of-speech tagging techniques.</Paragraph>
      <Paragraph position="1"> In contrast to a standard probabilistic POS tagger (e.g. (Cutting et al., 1992; Feldweg, 1995)), the tagger for grammatical functions works with lexical  and contextual probability measures PO.(') depending on the category of a mother node (Q). This additional parameter is necessary since the sequence of grammatical functions depends heavily on the type of phrase in which it occurs. Thus each category (S, VP, NP, PP etc.) defines a separate Markov model.</Paragraph>
      <Paragraph position="2"> Under this perspective, categories of daughter nodes correspond to the outputs of a Markov model (i.e., like words in POS tagging). Grammatical functions can be viewed as states of the model, analogously to tags in a standard part-of-speech tagger. Given a sequence of word and phrase categories T = T1...Tk and a parent category Q, we cal-</Paragraph>
      <Paragraph position="4"> and (using a trigram model)</Paragraph>
      <Paragraph position="6"> The contexts are smoothed by linear interpolation of unigrams, bigrams, and trigrams. Their weights are calculated by deleted interpolation (Brown et al.,  The structure of a sample sentence is shown jn figure 2. Here, the probability of the S node having this particular sequence of children is calculated as</Paragraph>
      <Paragraph position="8"> ($ indicates the start of the sequence).</Paragraph>
      <Paragraph position="9"> The predictions of the tagger are correct in approx. 94% of all cases. During the annotation process this is further increased by exploiting a precision/recall trade-off (cf. section 3.5).</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Tagging Phrasal Categories
</SectionTitle>
      <Paragraph position="0"> The second level of automation is the recognition of phrasal categories, which frees the annotator from typing phrase labels. The task is performed by an extension of the grammatical function tagger presented in the previous section.</Paragraph>
      <Paragraph position="1"> Recall that each phrasal category defines a different Markov model. Given the categories of the children nodes in a phrase, we can run these models in parallel. The model that assigns the most probable sequence of grammatical functions determines the category label to be assigned to the parent node.</Paragraph>
      <Paragraph position="2"> Formally, we calculate the phrase category Q (and at the same time the sequence of grammatical functions G = G1 ... Gk) on the basis of the sequence of daughters 7&amp;quot; = T1 ... Tk with argmax maxPQ(GIT). Q 6 This procedure can also be performed using one large {combined) Markov model that enables a very efficient calculation of the maximum.</Paragraph>
      <Paragraph position="3"> The overall accuracy of this approach is 95%.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Tagging Hierarchical Structure
</SectionTitle>
      <Paragraph position="0"> The next automation step is the recognition of syntactic structures. In general, this task is much more difficult than assigning category and function labels, and requires a significantly larger training corpus than the one currently available. What can be done at the present stage is the recognition of relatively simple structures such as NPs and PPs.</Paragraph>
      <Paragraph position="1"> (Church, 1988) used a simple mechanism to mark the boundaries of NPs. He used part-of-speech tagging and added two flags to the part-of-speech tags to mark the beginning and the end of an NP.</Paragraph>
      <Paragraph position="2"> Our goal is more ambitious in that we mark not only the phrase boundaries of NPs but also the complete structure of a wider class of phrases, starting with APs, NPs and PPs.</Paragraph>
      <Paragraph position="3">  assign two types of tags (start X and join X, where X denotes the type of the phrase) combined with a process to build trees.</Paragraph>
      <Paragraph position="4"> We go one step further and assign simple structures in one pass. Furthermore, the nodes and branches of these tree chunks have to be assigned category and function labels.</Paragraph>
      <Paragraph position="5"> The basic idea is to encode structures of limited depth using a finite number of tags. Given a sequence of words (w0, wl .... wn/, we consider the structural relation ri holding between wi and wi-1 for 1 &lt; i &lt; n. For the recognition of NPs and PPs, it is sufficient to distinguish the following seven values of rl which uniquely identify sub-structures of limited depth.</Paragraph>
      <Paragraph position="7"> If more than one of the conditions above are met, the first of the corresponding tags in the list is assigned. A structure tagged with these symbols is shown in figure 3.</Paragraph>
      <Paragraph position="8"> In addition, we encode the POS tag ti assigned to w~. On the basis of these two pieces of information we define structural tags as pairs Si = (ri, ti). Such Brants and Skut 51 Automation of Treebank Annotation tags constitute a finite alphabet of symbols describing the structure and syntactic category of phrases of depth &lt; 3.</Paragraph>
      <Paragraph position="9"> The task is to assign the most probable sequence of structural tags ((So, $1, ..., Sn)) to a sequence of part-of-speech tags (To, T1, ..., Tn).</Paragraph>
      <Paragraph position="10"> Given a sequence of part-of-speech tags T =</Paragraph>
      <Paragraph position="12"> The part-of-speech tags are encoded in the structural tag (t), so S uniquely determines T. Therefore, we have P(T\[S) = 1 ifTi = ii and 0 otherwise, which simplifies calculations:</Paragraph>
      <Paragraph position="14"> As in the previous models, the contexts are smoothed by linear interpolation of unigrams, bigrams, and trigrams. Their weights are calculated by deleted interpolation.</Paragraph>
      <Paragraph position="15"> This chunk tagging technique can be applied to treebank annotation in two ways. Firstly, we could use it as a preprocessor; the annotator would then complete and correct the output of the chunk tagger. The second alternative is to combine this chunking with manual input in an interactive way. Then the annotator has to determine the boundaries of the sub-structure that is to be build by the program.</Paragraph>
      <Paragraph position="16"> Obviously, the second solution is favorable since the user supplies information about chunk boundaries, while in the preprocessing mode the tagger has to find both the boundaries and the internal structure of the chunks.</Paragraph>
      <Paragraph position="17"> The assignment of structural tags is correct in more than 94% of the cases. For detailed results see section 3.6.3.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.4 Interaction and Alternation
</SectionTitle>
      <Paragraph position="0"> To illustrate the interaction of manual input and the automatic annotation techniques described above, we show the way in which the structure in figure 3 is constructed. The current version of the annotation tool supports automatic assignment of category and phrase labels, so the user has to specify the hierarchical structure step by step 2.</Paragraph>
      <Paragraph position="1"> The starting point is the plain string of words together with their part-of-speech tags. The annotator first selects the words Tel Aviv and executes the command &amp;quot;group&amp;quot; (this is all done with the mouse). Then the program inserts the category label MPN (multi-lexeme proper noun) and assigns the grammatical function PNC (proper noun component) to both words (cf. sections 3.2 and 3.1).</Paragraph>
      <Paragraph position="2"> Having completed the first sub-structure, the annotator selects the newly created MPN and the preposition in, and creates a new phrase. The tool automatically inserts the phrase label PP and the grammatical functions AC (adpositional case marker) and NK (noun kernel component). The following two steps are to determine the components of the AP and, finally, those of the NP.</Paragraph>
      <Paragraph position="3"> At any time, the annotator has the opportunity to change and correct entries made by the program.</Paragraph>
      <Paragraph position="4"> This interactive annotation mode is favorable from the point of view of consistency checking. The first reason is that the annotation increments are rather small, so the annotator corrects not an entire parse tree, but a fairly simple local structure. The automatic assignment of phrase and function labels is generally more reliable than manual input because it is free of typically human errors (see the precision results in (Brants, Skut, and Krenn, 1997)). Thus the annotator can concentrate on the more difficult task, i.e., building complex syntactic structures.</Paragraph>
      <Paragraph position="5"> The second reason is that errors corrected at lower levels in the structure facilitate the recognition of structures at higher levels, thus many wrong readings are excluded by confirming or correcting a choice at a lower level.</Paragraph>
      <Paragraph position="6"> The partial automation of the annotation process (automatic regocnition of phrase labels and grammatical functions) has reduced the average annotation time from about 10 to 1.5 - 2 minutes per sentence, i.e. 600 - 800 tokens per minute, which is comparable to the figures published by the creators of the Penn Treebank in (Marcus, Santorini, and Marcinkiewicz, 1994).</Paragraph>
      <Paragraph position="7"> The test version of the annotation tool using the statistical chunking technique described in section 3.3 permits even larger annotation increments and we expect a further increase in annotation speed.</Paragraph>
      <Paragraph position="8"> The user just has to select the words constituting an ~The chunk tagger has not yet been fully integrated into the annotation tool.</Paragraph>
      <Paragraph position="10"> NP or PP. The program assigns a sequence of structural tags to them; these tags are then converted to a tree structure and all labels are inserted.</Paragraph>
    </Section>
    <Section position="5" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.5 Reliability
</SectionTitle>
      <Paragraph position="0"> To make automatic annotation more reliable, the program assigning labels performs an additional reliability check. We do not only calculate the best assignment, but also the second-best alternative and its probability. If the probability of the alternative comes very close to that of the best sequence of labels, we regard the choice as unreliable, and the annotator is asked for confirmation.</Paragraph>
      <Paragraph position="1"> Currently, we employ three reliability levels, expressed by quotients of probabilities Pbest/Psecond-If this quotient is close to one (i.e., smaller than some threshold 01), the decision counts as unreliable, and annotation is left to the annotator. If the quotient is very large (i.e., greater than some threshold 02 &gt; 91), the decision is regarded as reliable and the respective annotation is made by the program.</Paragraph>
      <Paragraph position="2"> If the quotient fails between 91 and 02, the decision is tagged as &amp;quot;almost reliable&amp;quot;. The annotation is inserted by the program, but has to be confirmed by the annotator.</Paragraph>
      <Paragraph position="3"> This method enables the detection of a number of errors that are likely to be missed if the annotator is not asked for confirmation.</Paragraph>
      <Paragraph position="4"> The results of using these reliability levels are reported in the experiments section below.</Paragraph>
    </Section>
    <Section position="6" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.6 Experiments
</SectionTitle>
      <Paragraph position="0"> This section reports on the accuracy achieved by the methods described in the previous sections.</Paragraph>
      <Paragraph position="1"> At present, our corpus contains approx. 6300 sentences (115,000 tokens) of German newspaper text (Frankfurter Rundschan). Results of tagging grammatical functions and phrase categories have improved slightly compared to those reported for a smaller corpus of approx. 1200 sentences (Brants, Skut, and Krenn, 1997). Accuracy figures for tagging the hierarchical structure are published for the first time.</Paragraph>
      <Paragraph position="2"> For each experiment, the corpus was divided into two disjoint parts: 90% training data and 10% test data. This procedure was repeated ten times, and the results were averaged.</Paragraph>
      <Paragraph position="3"> The thresholds 01 and 02 determining the reliability levels were set to 91 = 5 and 02 = 100.</Paragraph>
      <Paragraph position="4">  We employ the technique described in section 3.1 to assign grammatical functions to a structure defined by an annotator. Grammatical functions are  cases in which the tagger assigned a correct grammatical function (or would have assigned ifa decision had been forced).</Paragraph>
      <Paragraph position="5">  represented by edge labels. Additionally, we exploit the recall/accuracy tradeoff as described in section 3.5. The tagset of grammatical functions consists of 45 tags.</Paragraph>
      <Paragraph position="6">  Tagging results are shown in table 1. Overall accuracy is 94.6%. 88% of all predictions are classified as reliable, which is the most important class for the actual annotation task. Accuracy in this class is 97.0%. It depends on the category of the phrase, e.g. accuracy for reliable cases reaches 99% for 51Ps and PPs.</Paragraph>
      <Paragraph position="7">  Now the task is to assign phrasal categories to a structure specified by the annotator, i.e., only the hierarchical structure is given. We employ the technique of competing Markov models as described in section 3.2 to assign phrase categories to the structure. Additionally, we compute alternatives to assign one of the three reliability levels to each decision as described in section 3.5. The tagset for phrasal categories consists of 25 tags.</Paragraph>
      <Paragraph position="8"> As can be seen from table 2, the results of assigning phrasal categories are even better than those of assigning grammatical functions. Overall accuracy is 95.4%. Tags that are regarded as reliable (76% of all cases) have an accuracy of 99.0%, which results Brants and Skut 53 Automation of Treebank Annotation  The chunk tagger described in section 3.3 assigns tags encoding structural information to a sequence of words and tags. The accuracy figures presented here refer to the correct assignments of these tags (see table 3).</Paragraph>
      <Paragraph position="9"> The assignment of structural tags allows us to construct a tree; the labels are afterwards assigned in a bottom-up fashion by the function/category label tagger described in earlier sections.</Paragraph>
      <Paragraph position="10"> Overall accuracy is 94.4% and reaches 95.8% in the reliable cases.</Paragraph>
      <Paragraph position="11"> A different measure of the chunker's correctness is the percentage of complete phrases recognized correctly. In order to determine this percentage, we extracted all chunks of the maximal depth recognizable by the chunker. In a cross evaluation, 87.3% of these chunks were recognized correctly as far as the hierarchical structure is concerned.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Comparing Trees
</SectionTitle>
    <Paragraph position="0"> Annotations produced by different annotators are compared automatically and differences are marked.</Paragraph>
    <Paragraph position="1"> The output of the comparison is given to the annotators. First, each of the annotators goes through the differences on his own and corrects obvious errors. Then remaining differences are resolved in a discussion of the annotators.</Paragraph>
    <Paragraph position="2"> Additionally, the program calculates the probabilities of the two different annotations. This is intended to be a first step towards resolving conflicting annotations automatically. Both parts, tree matching and the calculation of probabilities for complete trees are described in the following sections.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Tree Matching
</SectionTitle>
      <Paragraph position="0"> The problem addressed here is the comparison of two syntactic structures that share identical terminal nodes (the words of the annotated sentence).</Paragraph>
      <Paragraph position="1"> proc compare(A, B) for each non-terminal node X in A: search node Y in B such that yield(X) = yield(Y) if Y exists: emit different labels if any if Y does not exist:  annotation A with annotation B of the same sentence null (Calder, 1997) presents a method of comparing the structure of context free trees found in different annotations. This section presents an extension of this algorithm that compares predicate-argument structures possibly containing crossing branches (cf. figure 2). Node and edge labels, representing phrasal categories and grammatical functions, are also taken into account.</Paragraph>
      <Paragraph position="2"> Phrasal (non-terminal) nodes are compared on the basis of their yields: the yield of a nonterminal node X in an annotation A is the ordered set of terminals that are (directly or indirectly) dominated by X. The yield need not be contiguous since predicate-argument structures allow discontinuous constituents.</Paragraph>
      <Paragraph position="3"> If both annotations contain nonterminal nodes that cover the same terminal nodes, the labels of the nonterminal nodes and their edges are compared. This results in a combined measure of structural and labeling differences, which is very useful in cleaning the corpus and keeping track of the development of the treebank.</Paragraph>
      <Paragraph position="4"> We use the basic algorithm shown in figure 4 to determine the differences in two annotations A and B. The basic form is asymmetric. Therefore, a complete comparison consists of two runs, one for each direction, and the outputs of both runs are combined. null Figures 5 and 6 show examples of the output of the algorithm. These outputs can be directly used to mark the corresponding nodes and edges.</Paragraph>
      <Paragraph position="5"> The yield is sufficient to uniquely determine corresponding nodes since the annotations used here do not contain unary branching nodes. If unary branching occurs, both the parent and the child have the same terminal yield and further mechanism to determine corresponding nodes are needed. (Calder, 1997) points out possible solutions to this problem. Brants and Skut 54 Automation of Treebank Annotation  sentence in figure 2 (hie should be attached to S instead of VP), together with the output of the tree comparison algorithm* All nodes are numbered to enable identification. Additionally, this output can be used to highlight the corresponding nodes and edges.</Paragraph>
      <Paragraph position="6">  (1) edge: 5 (ADV) \[NG\] nie (3) edge: 5 (ADV) \[MO\] hie  sentence in figure 2 (nie should have grammatical function NG instead of MO), together with the output of the tree comparison algorithm.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Probabilities
</SectionTitle>
      <Paragraph position="0"> The probabilities of each sub-structure of depth one are calculated separately according to the model described in sections 3.1 and 3.2. Subsequently, the product of these probabilities is used as a scoring function for the complete structure* This method is based on the assumption that productions at different levels in a structure are independent, which is inherent to context free rules.</Paragraph>
      <Paragraph position="1"> Using the formulas from sections 3.1 and 3.2, the probability P(A) of an annotation A is evaluated as</Paragraph>
      <Paragraph position="3"> A annotation (structure) for a sentence nnt number of nonterminal nodes in A nt number of terminal nodes in A n number of nodes = nnt + nt Qi ith phrase in A T/ sequence of tags in Qi Gi sequence of gramm, func. in Qi ki number of elements in Qi tij tag of jth child in Qi gl,i grammatical function of jth child in Qi Probabilities computed in this way cannot be used directly to compare two annotations since they favor annotations with fewer nodes. Each new nonterminal node introduces a new element in the product and makes it smaller.</Paragraph>
      <Paragraph position="4"> Therefore, we normalize the probabilities w.r.t. the number of nodes in the annotation, which yields the perplexity PP(A) of an annotation A:</Paragraph>
      <Paragraph position="6"/>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.3 Application to a Corpus
</SectionTitle>
      <Paragraph position="0"> The procedures of tree matching and probability calculation were applied to our corpus, which currently consists of approx. 6300 sentences (115,000 tokens) of German newspaper text, each sentence annotated at least twice.</Paragraph>
      <Paragraph position="1"> We measured the agreement of independent annotations after first annotation but before correction Brants and Skut 55 Automation of Treebank Annotation  (1) ident, parent node (2) ident, gram. func.</Paragraph>
      <Paragraph position="2"> node level: (3) identical nodes (4) identical nodes/labels (5) ident, node/gram, func.</Paragraph>
      <Paragraph position="3"> sentence level: (6) identical structure (7) identical annotation  sion (2), which is the current stage of the corpus. The results are shown in table 4.</Paragraph>
      <Paragraph position="4"> As for measuring differences, we can count them at word, node and sentence level.</Paragraph>
      <Paragraph position="5"> At the word level, we are interested in (1) the number of correctly assigned parent categories (does a word belong to a PP, NP, etc.?), and (2) the number of correctly assigned grammatical functions (is a word a head, modifier, subject, etc.?). At the node level (non-terminals, phrases) we measure (3) the number of identical nodes, i.e., if there is a node in one annotation, we check whether it corresponds to a node in the other annotation having the same yield. Additionally, we count (4) the number of identical nodes having the same phrasal category, and (5) the number of identical nodes having the same phrasal category and the same grammatical function within its parent phrase. At the sentence level, we measure (6) the number of annotated sentences having the same structure, and, which is the strictest measure, (7) the number of sentences having the same structure and the same labels (i.e., exactly the same annotation).</Paragraph>
      <Paragraph position="6"> At the node level, we find 87.6% agreement in independent annotations. A large amount of the differences come from misinterpretation of the annotation guidelines by the annotators and are eliminated after comparison, which results in 98.1% agreement. This kind of comparison is the one most frequently used in the statistical parsing community for comparing parser output.</Paragraph>
      <Paragraph position="7"> The sentence level is the strictest measure, and the agreement is low (34.6% identical annotations after first annotation, 87.9% after comparison). But at this level, one error (e.g. a wrong label) renders Table 5: Using model perplexities to compare different annotations: Accuracy of using the hypothesis that a correct annotation has a lower perplexity than a wrong annotation.</Paragraph>
      <Paragraph position="8">  the whole annotation to be wrong and the sentence counts as an error.</Paragraph>
      <Paragraph position="9"> If we make the assumption that a correct annotation always has a lower perplexity than a wrong annotation for the same sentence, the system would make a correct decision for 65.8% of the sentences (see table 5, last row).</Paragraph>
      <Paragraph position="10"> For approx. 70% of all sentences, at least one of the initial annotations was completely correct. This means that the two initial annotations and the automatic comparison yield a corpus with approx. 65.8% x 70% = 46% completely correct annotations (complete structure and all tags).</Paragraph>
      <Paragraph position="11"> One can further increase precision at the cost of recall by requiring the difference in perplexity to exceed some minimum distance. This precision/recall tradeoff is also shown in table 5.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>