<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-1005">
  <Title>Antecedent Recovery: Experiments with a Trace Tagger</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Detecting empty elements
</SectionTitle>
    <Paragraph position="0"> Previous work (Dienes and Dubey, 2003) shows that detecting empty elements can be performed fairly reliably before parsing using a trace tagger, which tags words with information on EEs immediately preceding them. For example, the first occurrence of the word to in our example sentence (2) gets the tag EE=TT-NP, whereas the word wants is tagged as having no EE. The trace tagger uses three main types of features: (i) combination of POS tags in a window of five words around the EEs; (ii) lexical features of the words in a window of three lexical items; and (iii) long-distance cues (Table 2). An EE is correctly detected if and only if (i) the label matches that of the gold standard and (ii) it occurs between the same words. Dienes and Dubey (2003) report 79a3 1% labeled F-score on this evaluation metric, the  best published result on the EE detection task.</Paragraph>
    <Paragraph position="1"> While Dienes and Dubey (2003) report overall scores, they do not evaluate the relative importance of the features used by the tagger. This can be achieved by testing how the model fares if only a subset of the features are switched on (performance analysis). Another way to investigate the problem is to analyze the average weight and the activation frequency of each feature type.</Paragraph>
    <Paragraph position="2"> According to the performance analysis, the most important features are the ones encoding POSinformation. Indeed, by turning only these features on, the accuracy of the system is already fairly high: the labeled F-score is 71a3 2%. A closer look at the feature weights shows that the right context is slightly more informative than the left one. Lexicalization of the model contributes further 6% to the overall score (the following word being slightly more important than the preceding one), whereas the features capturing long-distance cues only improve the overall score by around 2%. Interestingly, long-distance features get higher weights in general, but their contribution to the overall performance is small since they are rarely activated. Finally, the model with only lexical features performs surprisingly well: the labeled F-score is 68a3 9%, showing that a very small window already contains valuable information for the task.</Paragraph>
    <Paragraph position="3"> In summary, the most important result here is that a relatively small window of up to five words contains important cues for detecting EEs.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Antecedent recovery
</SectionTitle>
    <Paragraph position="0"> Antecedent recovery requires knowledge of phrase structure, and hence calls for a parsing component.</Paragraph>
    <Paragraph position="1"> In this section, we show how to recover the antecedents given a parse tree, and how to incorporate information about EE-sites into the parser.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Antecedent recovery algorithm
</SectionTitle>
      <Paragraph position="0"> The main motivation for the introduction of gap+ variables is that they indicate a path from the EE to the antecedent. In case of a non-binary-branching grammar, however, this path only determines the node immediately dominating the antecedent, but does not indicate the child the EE should be co-indexed with. Moreover, a node might contain several gap+ variables, which further complicates antecedent recovery, even in the case of perfect trees.</Paragraph>
      <Paragraph position="1"> This calls for a sophisticated algorithm to recover antecedents.</Paragraph>
      <Paragraph position="2">  The algorithm, presented in Figure 3, runs after the best parse has been selected. It works in a bottom-up fashion, and for each empty node the main recursive function find antecedent is called separately (lines 1 and 2). At every call, the number of gap+ variables of type &amp;quot;gap&amp;quot; are calculated for the parent par of the current node node (p; line 6) and for all the children (ch; line 7). If the parent has at least as many unresolved gap+ variables as its children, we conclude that the current EE is re- null solved further up in the tree and call the same algorithm for the parent (line 20). If, however, the parent has fewer unresolved gaps (p a8 ch), the antecedent of the EE is among the children. Thus the algorithm attempts to find this antecedent (lines 1118). For an antecedent to be selected, the syntactic category must match, i.e. an NP-NP must resolve to a NP. The algorithm searches from left to right for a possible candidate, preferring non-adjuncts over adjuncts. The node found (if any) is returned as the antecedent for the EE. Finally, note that in line 9, we have to remove the threaded gap+ feature in order to avoid confusion if the same parent is visited again while resolving another EE.</Paragraph>
      <Paragraph position="3"> Although the algorithm is simple and works in a greedy manner, it does perform well. Tested on the gold standard trees containing the empty nodes without antecedent co-reference information, it is able to recover the antecedents with an F-score of 95% (c.f.</Paragraph>
      <Paragraph position="4"> Section 4.3).</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Method
</SectionTitle>
      <Paragraph position="0"> Antecedent recovery is tested using two parsers: an unlexicalized PCFG (Dienes and Dubey, 2003) and a lexicalized parser with near state-of-the-art performance (Collins, 1999). Both parsers treat EEs as words. In order to recover antecedents, both were modified to thread gap+ variables in the nonterminals as described in Section 2.</Paragraph>
      <Paragraph position="1"> Each parser is evaluated in two cases: (i) an upper bound case which uses the perfect EEs of the tree-bank (henceforth PERFECT) and (ii) a case that uses EEs suggested by the finite-state mechanism (henceforth TAGGER). In the TAGGER case, the parser simply takes the hypotheses of the finite-state mechanism as true.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.3 Evaluation
</SectionTitle>
      <Paragraph position="0"> We evaluate on all sentences in the test section of the treebank. As with trace detection, we use the measure introduced by Johnson (2002). This metric works by treating EEs and their antecedents as fourtuples, consisting of the type of the EE, its location, the type of its antecedent and the location(s) (beginning and end) of the antecedent. An antecedent is correctly recovered if all four values match the gold standard. We calculate the precision, recall, and Fscore; however for brevity's sake we only report the F-score for most experiments in this section.</Paragraph>
      <Paragraph position="1"> In addition to antecedent recovery, we also report parsing accuracy, using the bracketing F-Score, the combined measure of PARSEVAL-style labeled bracketing precision and recall (Magerman, 1995).</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.4 Results
</SectionTitle>
      <Paragraph position="0"> The results of the experiments are summarized in Table 3. UNLEX and LEX refer to the unlexicalized and lexicalized models, respectively. In the upper-bound case, PERFECT, the F-score for antecedent recovery is quite high in both the unlexicalized and lexicalized cases: 91.4% and 93.3%.</Paragraph>
      <Paragraph position="1">  ery results with the lexicalized parser and Johnson's (2002).</Paragraph>
      <Paragraph position="2"> Johnson (2002)'s metric includes EE without antecedents. To test how well the antecedent-detection algorithm works, it is useful, however, to count the results of only those EEs which have antecedents in the tree (NP-NP, PSEUDO attachments, and all WH traces). In these cases, the unlexicalized parser has an F-score of 70.4%, and the lexicalized parser 83.9%, both in the PERFECT case.</Paragraph>
      <Paragraph position="3"> In the TAGGER case, which is our main concern, the unlexicalized parser achieves an F-score of 72.6%, better than the 68.0% reported by Johnson (2002). The lexicalized parser outperforms both, yielding results of F-score of 74.6%.</Paragraph>
      <Paragraph position="4"> Table 4 gives a closer look at the antecedent recovery score for some common EE types using the lexicalized parser, also showing the results of Johnson (2002) for comparison.</Paragraph>
    </Section>
    <Section position="5" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.5 Discussion
</SectionTitle>
      <Paragraph position="0"> The pre-processing system does quite well, managing an F-score 6.6% higher than the post-processing system of Johnson (2002). However, while the lexicalized parser performs better than the unlexicalized one, the difference is quite small: only 2%. This suggests that many of the remaining errors are actually in the pre-processor rather than in the parser.</Paragraph>
      <Paragraph position="1"> Two particular cases of interest are NP-NPs and PRO-NPs. In both cases, a NP is missing, often in a to-infinitival clause. The two are only distinguished by their antecedent: NP-NP has an antecedent in the tree, while PRO-NP has none. The lexicalized parser has, for most types of EEs, quite high antecedent detection results, but the difficulty in telling the difference between these two cases results in low F-scores for antecedent recovery of NP-NP and PRO-NP, despite the fact that they are among the most common EE types. Even though this is a problem, our system still does quite well: 70.4% for NP-NP, and 69.5% for PRO-NP compared to the 60.0% and 50.0% reported by Johnson (2002).</Paragraph>
      <Paragraph position="2"> Since it appears the pre-processor is the cause of most of the errors, in-processing with a state-of-the-art lexicalized parser might outperform the pre-processing approach. In the next section, we explore this possibility.</Paragraph>
      <Paragraph position="3"> 5 Detecting empty elements in the parser Having compared pre-processing to post-processing in the previous section, in this section, we consider the relative advantages of pre-processing as compared to detecting EEs while parsing, with both an unlexicalized and a lexicalized model.</Paragraph>
      <Paragraph position="4"> In making the comparison between detecting EEs during pre-processing versus parsing, we are not only concerned with the accuracy of parsing, EE detection and antecedent recovery, but also with the running time of the parsers. In particular, Dienes and Dubey (2003) found that detecting EEs is infeasible with an unlexicalized parser: the parser was slow and inaccurate at EE detection.</Paragraph>
      <Paragraph position="5"> Recall that the runtime of many parsing algorithms depends on the size of the grammar or the number of nonterminals. The unlexicalized CYK parser we use has a worst-case asymptotic runtime of Oa0 n3N3a1 where n is the number of words and N is the number of nonterminals. Collins (1999) reports a worst-case asymptotic runtime of Oa0 n5N3a1 for a lexicalized parser.</Paragraph>
      <Paragraph position="6"> The Oa0 N3a1 bound becomes important when the parser is to insert traces because there are more nonterminals. Three factors contribute to this larger nonterminal set: (i) nonterminals are augmented with EE types that contain the parent node of the EE (i.e. S may become Sa2a4a3a6a5a8a7a4a9 , Sa2a4a3a6a5a10a9a4a9 , etc.) (ii) we must include combinations of EEs as nonterminals may dominate more than one unbound EE (i.e.</Paragraph>
    </Section>
    <Section position="6" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Sa7a4a9a11a5a8a7a4a9a6a12a8a2a4a3a6a5a8a7a4a9
</SectionTitle>
      <Paragraph position="0"> a1 and (iii) a single nonterminal may be repeated in the presence of co-ordination (i.e. Sa7a4a9a11a5a8a7a4a9 a12 a7a4a9a11a5a8a7a4a9 ). These three factors greatly increase the number of nonterminals, potentially reducing the efficiency of a parser that detects EEs. On the other hand, when EE-sites are pre-determined, the effect of the number of nonterminals on parsing speed is moot: the parser can ignore large parts of the grammar. null In this section, we empirically explore the relative advantages of pre-processing over in-processing, with respect to runtime efficiency and the accuracy of parsing and antecedent recovery.</Paragraph>
    </Section>
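A toy calculation of this blow-up; the base grammar, the EE inventory, and the cap of two threaded gaps per label are all illustrative:
```python
from itertools import combinations_with_replacement

base_labels = ["S", "SBAR", "VP", "NP", "PP"]        # toy base nonterminals
gap_types = ["WH-NP", "NP-NP", "PRO-NP", "WH-ADVP"]  # toy EE inventory

augmented = set()
for label in base_labels:
    for k in range(3):                               # 0, 1, or 2 threaded gaps
        for combo in combinations_with_replacement(gap_types, k):
            suffix = "+" + "+".join(combo) if combo else ""
            augmented.add(label + suffix)

# 5 base labels become 75 augmented ones; since CYK is cubic in N, even
# this toy factor of 15 inflates the N^3 term by 15^3 = 3375.
print(len(base_labels), "->", len(augmented))        # prints: 5 -> 75
```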
    <Section position="7" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.1 Method
</SectionTitle>
      <Paragraph position="0"> As in Section 4, we use the unlexicalized parser from Dienes and Dubey (2003), and as a lexicalized parser, an extension of Model 3 of Collins (1999).</Paragraph>
      <Paragraph position="1"> While Model 3 inserts WH-NP traces, it makes some assumptions that preclude it from being used here directly:  (i) it cannot handle multiple types of EEs; (ii) it does not allow multiple instances of EEs at a node; (iii) it expects all EEs to be complements, though some are not (e.g. WH-ADVP); (iv) it expects all EEs to have antecedents, though some do not (e.g. PRO-NP); (v) it cannot model EEs with dependents, for example COMP-. . . .</Paragraph>
      <Paragraph position="2">  Hence, Model 3 must be generalized to other types of discontinuities. In order to handle the first four problems, we propose generating 'gapcategorization' frames in the same way as subcategorization frames are used in the original model. We do not offer a solution to the final problem, as the syntactic structure (usually the unary production SBAR a1 S) will identify these cases.</Paragraph>
      <Paragraph position="3"> After calculating the probability of the head (with its gaps), the left and right gapcat frame are generated independently of each other (and of the subcat frames). For example, the probability for the rule:  Generating the actual EE is done in a similar fashion: the EE cancels the corresponding 'gapcat' requirement. If it is a complement (e.g. WH-NP), it also removes the corresponding element from the subcat frame. The original parsing algorithm was modified to accommodate 'gapcat' requirements and generate multiple types of EEs.</Paragraph>
      <Paragraph position="4"> We compare the parsing performance of the two parsers in four cases: the NOTRACE model which removes all traces from the test and training data, the TAGGER model of Section 4, and two cases where the parser inserts EEs (we will collectively refer to these cases as the INSERT models). In order to show the effects of increasing the size of nonterminal vocabulary, the first INSERT model only considers one EE type, WH-NP while the second (henceforth PRO&amp;WH) considers all WH traces as well as NP-NP and PRO-NP discontinuities.</Paragraph>
    </Section>
    <Section position="8" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.2 Results
</SectionTitle>
      <Paragraph position="0"> The results of the unlexicalized and lexicalized experiments are summarized in Tables 5 and Table 6, respectively. The tables compare relative parsing time (slowdown with respect to the NOTRACE model), and in the lexicalized case, PARSEVAL-style bracketing scores. However, in the case of the unlexicalized model, the increasing number of  missed parses precludes straightforward comparison of bracketing scores, therefore we report the percentage of sentences where the parser fails. In the case of the lexicalized parser, less than 1% of the parses are missed, hence the comparisons are reliable. Finally, we compare EE detection and antecedent recovery F-scores of the TAGGER and the PRO&amp;WH models for the overlapping EE types (Table 7).</Paragraph>
    </Section>
    <Section position="9" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.3 Discussion
</SectionTitle>
      <Paragraph position="0"> As noted by Dienes and Dubey (2003), unlexicalized parsing with EEs does not seem to be viable without pre-processing. However, the lexicalized parser is competitive with the pre-processing approach. null As for the bracketing scores, there are two interesting results. First, lexicalized models which handle EEs have lower bracketing scores than the NOTRACE model. Indeed, as the number of EEs increases, so does the number of nonterminals, which results in increasingly severe sparse data problem.</Paragraph>
      <Paragraph position="1"> Consequently, there is a trade-off between finding local phrase structure and long-distance dependencies. null Second, comparing the TAGGER and the PRO&amp;WH models, we find that the bracketing results are nearly identical. Nonetheless, the PRO&amp;WH model inserting EEs can match neither the accuracy for antecedent recovery nor the time efficiency of the pre-processing approach. Thus, the results show that treating EE-detection as a pre-processing step is beneficial to both to antecedent recovery accuracy and to parsing efficiency.</Paragraph>
      <Paragraph position="2"> Nevertheless, pre-processing is not necessarily the only useful strategy for trace detection. Indeed, by taking advantage of the insights that make the finite-state and lexicalized parsing models successful, it may be possible to generalize the results to other strategies as well. There are two key observations of importance here.</Paragraph>
      <Paragraph position="3"> The first observation is that lexicalization is very important for detecting traces, not just for the lexicalized parser, but, as discussed in Section 3, for the trace-tagger as well. The two models may contain overlapping information: in many cases, the lexical cue corresponds to the immediate head-word the EE depends on. However, other surrounding words (which frequently correspond to the head-word of grandparent of the empty node) often carry important information, especially for distinguishing NP-NP and PRO-NP nodes.</Paragraph>
      <Paragraph position="4"> Second, local information (i.e. a window of five words) proves to be informative for the task. This explains why the finite-state tagger is more accurate than the parser: this window always crosses a phrase boundary, and the parser cannot consider the whole window.</Paragraph>
      <Paragraph position="5"> These two observations give a set of features that seem to be useful for EE detection. We conjecture that a parser that takes advantage of these features might be more accurate in detecting EEs while parsing than the parsers presented here. Apart from the pre-processing approach presented here, there are a number of ways these features could be used: 1. in a pre-processing system that only detects EEs, as we have done here; 2. as part of a larger syntactic pre-processing system, such as supertagging (Joshi and Bangalore, 1994); 3. with a more informative beam search (Charniak et al., 1998); 4. or directly integrated into the parsing mechanism, for example by combining the finite-state and the parsing probability models.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>