<?xml version="1.0" standalone="yes"?>
<Paper uid="P03-1055">
  <Title>Deep Syntactic Processing by Combining Shallow Methods</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Parsing with empty elements
</SectionTitle>
    <Paragraph position="0"> The present section explores whether an unlexicalized PCFG parser can handle non-local dependencies: first, is it able to detect EEs and, second, can it find their antecedents? The answer to the first question turns out to be negative: for reasons of efficiency and because the model is inappropriate for the task, detecting all types of EEs is not feasible within the parser. Antecedents, however, can be reliably recovered provided the parser has perfect knowledge about the EEs occurring in the input. This shows that the main bottleneck is detecting the EEs and not finding their antecedents. In the following section, therefore, we explore how we can provide the parser with information about EE sites in the current sentence without relying on phrase structure information.</Paragraph>
    <Paragraph position="1"> Footnote 1: This technique fails for 82 sentences of the treebank where the antecedent does not c-command the corresponding EE.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Method
</SectionTitle>
      <Paragraph position="0"> There are three modifications required to allow a parser to detect EEs and resolve antecedents. First, it should be able to insert empty nodes. Second, it must thread the gap+ variables to the parent node of the antecedent. Knowing this node is not enough, though. Since the Penn Treebank grammar is not binary-branching, the final task is to decide which child of this node is the actual antecedent.</Paragraph>
      <Paragraph position="1"> The first two modifications are not difficult conceptually. A bottom-up parser can be easily modified to insert empty elements (cf. Dienes and Dubey (2003)).</Paragraph>
      <Paragraph position="2"> Likewise, the changes required to include gap+ categories are not complicated: we simply add the gap+ features to the non-terminal category labels.</Paragraph>
      <Paragraph position="3"> The final and perhaps most important concern with developing a gap-threading parser is to ensure it is possible to choose the correct child as the antecedent of an EE. To achieve this task, we employ the algorithm presented in Figure 2. At any node in the tree where the children, all together, have more gap+ features activated than the parent, the algorithm deduces that a gap+ must have an antecedent. It then picks a child as the antecedent and recursively removes the gap+ feature corresponding to its EE from the non-terminal labels. The algorithm has a shortcoming, though: it cannot reliably handle cases when the antecedent does not c-command its EE. This mostly happens with PSEUDOs (pseudo-attachments), where the algorithm gives up and (wrongly) assumes they have no antecedent.</Paragraph>
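      <Paragraph> The antecedent-selection step described above can be illustrated with a minimal Python sketch. It is not the implementation used in the experiments: the representation of a node as a list of (label, gap list) pairs, the function names, the `allowed` mapping from EE types to admissible antecedent categories, and the choice of the first matching pair of children are all assumptions made for illustration.

```python
from collections import Counter


def unresolved_gaps(parent_gaps, children_gaps):
    """Return gap+ features carried by the children but not by the parent.

    parent_gaps: list of EE types threaded through the parent label.
    children_gaps: one list of EE types per child label.
    """
    from_children = Counter()
    for gaps in children_gaps:
        from_children.update(gaps)
    # Counter subtraction keeps only the positive counts, i.e. the gaps
    # that must find their antecedent among the children of this node.
    return from_children - Counter(parent_gaps)


def pick_antecedent(ee_type, children, allowed):
    """Pick (j, k): child j is the antecedent, child k dominates the EE.

    children: list of (label, gaps) pairs, one per child (assumed layout).
    allowed: hypothetical map from EE type to admissible antecedent labels.
    """
    for j, (label_j, _) in enumerate(children):
        if label_j not in allowed.get(ee_type, ()):
            continue
        for k, (_, gaps_k) in enumerate(children):
            if k != j and ee_type in gaps_k:
                return j, k
    # No suitable pair: give up and assume there is no antecedent, as
    # happens when the antecedent does not c-command its EE.
    return None
```

For a node whose children are a WHNP and an S that carries a WH-NP gap, `pick_antecedent` returns the WHNP as antecedent and the S as the child dominating the EE; after that, the gap+ feature would be removed from the labels below the S.</Paragraph>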
      <Paragraph position="4"> Given the perfect trees of the development set, the antecedent recovery algorithm finds the correct antecedent with 95% accuracy, rising to 98% if PSEUDOs are excluded. Most of the remaining mistakes are caused either by annotation errors, or by binding NP-traces (NP-NP) to adjunct NPs, as opposed to subject NPs.</Paragraph>
      <Paragraph position="5"> The parsing experiments are carried out with an unlexicalized PCFG augmented with the antecedent recovery algorithm. We use an unlexicalized model to emphasize the point that even a simple model detects long distance dependencies successfully. The parser uses beam thresholding (Goodman, 1998) to ensure efficient parsing. PCFG probabilities are calculated in the standard way (Charniak, 1993). In order to keep the number of independently tunable parameters low, no smoothing is used. The parser is tested under two different conditions. First, to assess the upper bound an EE-detecting unlexicalized PCFG can achieve, the input of the parser contains the empty elements as separate words (PERFECT). Second, we let the parser introduce the EEs itself (INSERT).</Paragraph>
      <Paragraph position="7"> [Figure 2: The antecedent recovery algorithm. For a tree T, iterate over nodes bottom-up; for a node with rule P → C0 …; foreach EE of type e in M … N: pick a j such that e allows Cj as an antecedent; pick a k such that k ≠ j and Ck dominates an EE of type e; if no such j or k exist, …]</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Evaluation
</SectionTitle>
      <Paragraph position="0"> We evaluate on all sentences in the test section of the treebank. As our interest lies in trace detection and antecedent recovery, we adopt the evaluation measures introduced by Johnson (2002). An EE is correctly detected if our model gives it the correct label as well as the correct position (the words before and after it). When evaluating antecedent recovery, the EEs are regarded as four-tuples, consisting of the type of the EE, its location, the type of its antecedent and the location(s) (beginning and end) of the antecedent. An antecedent is correctly recovered if all four values match the gold standard. Precision, recall, and the combined F-score are presented for each experiment. Missed parses are ignored for evaluation purposes.</Paragraph>
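      <Paragraph> The scoring scheme above can be sketched in a few lines of Python. This is an illustrative sketch, not the evaluation script used for the experiments: the function names are hypothetical, each tuple is assumed to carry a sentence index in front of the four values so that tuples from different sentences cannot collide, and the combined F-score is assumed to be the balanced harmonic mean of precision and recall.

```python
def f_measure(precision, recall):
    """Balanced F-score: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f(gold, predicted):
    """Score predicted EE tuples against gold tuples by exact match.

    Each tuple is assumed to be
    (sentence index, EE type, EE location, antecedent type, antecedent span).
    """
    gold, predicted = set(gold), set(predicted)
    correct = len(gold.intersection(predicted))
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall, f_measure(precision, recall)
```

Under the harmonic-mean assumption, `f_measure(0.647, 0.403)` rounds to 49.7% and `f_measure(0.557, 0.350)` rounds to 43.0%, which agrees with the F-scores reported for the INSERT condition in Section 3.3.</Paragraph>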
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Results
</SectionTitle>
      <Paragraph position="0"> The main results for the two conditions are summarized in Table 2. In the INSERT case, the parser detects empty elements with precision 64.7%, recall 40.3% and F-score 49.7%. It recovers antecedents with overall precision 55.7%, recall 35.0% and F-score 43.0%. With a beam width of 1000, about half of the parses were missed, and successful parses take, on average, 21 seconds per sentence and enumerate 1.7 million edges. Increasing the beam size to 40000 decreases the number of missed parses marginally, while parsing time increases to nearly two minutes per sentence, with 2.9 million edges enumerated. [Table 2 caption, truncated: … times, and missed parses for the parser]</Paragraph>
      <Paragraph position="1"> In the PERFECT case, when the sites of the empty elements are known before parsing, only about 1.6% of the parses are missed and average parsing time goes down to 2.5 seconds per sentence. More importantly, the overall precision and recall of antecedent recovery is 91.4%.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.4 Discussion
</SectionTitle>
      <Paragraph position="0"> The result of the experiment where the parser is to detect long-distance dependencies is negative. The parser misses too many parses, regardless of the beam size. This cannot be due to the lack of smoothing: the model with perfect information about the EE-sites does not run into the same problem. Hence, the edges necessary to construct the required parse are available but, in the INSERT case, the beam search loses them due to unwanted local edges having a higher probability. Doing an exhaustive search might help in principle, but it is infeasible in practice. Clearly, the problem is with the parsing model: an unlexicalized PCFG parser is not able to detect where EEs can occur, hence necessary edges get low probability and are, thus, filtered out.</Paragraph>
      <Paragraph position="1"> The most interesting result, though, is the difference in speed and in antecedent recovery accuracy between the parser that inserts traces and the parser that uses perfect information from the treebank about the sites of EEs. Thus, the question naturally arises: could EEs be detected before parsing? The benefit would be two-fold: EEs might be found more reliably with a different module, and the parser would be fast and accurate in recovering antecedents. In the next section we show that it is indeed possible to detect EEs without explicit knowledge of phrase structure, using a simple finite-state tagger.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Detecting empty elements
</SectionTitle>
    <Paragraph position="0"> This section shows that EEs can be detected fairly reliably before parsing, i.e. without using phrase structure information. Specifically, we develop a finite-state tagger which inserts EEs at the appropriate sites. It is, however, unable to find the antecedents for the EEs; therefore, in the next section, we combine the tagger with the PCFG parser to recover the antecedents.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Method
</SectionTitle>
      <Paragraph position="0"> Detecting empty elements can be regarded as a simple tagging task: we tag words according to the existence and type of empty elements preceding them.</Paragraph>
      <Paragraph position="1"> For example, the word Sasha in the sentence Sam said COMP-SBAR Sasha snores.</Paragraph>
      <Paragraph position="2"> will get the tag EE=COMP-SBAR, whereas the word Sam is tagged with EE=* expressing the lack of an EE immediately preceding it. If a word is preceded by more than one EE, such as to in the following example, it is tagged with the concatenation of the two EEs, i.e., EE=COMP-WHNP PRO-NP.</Paragraph>
      <Paragraph position="3"> It would have been too late COMP-WHNP PRO-NP to think about on Friday.</Paragraph>
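      <Paragraph> The tagging scheme illustrated by the two example sentences can be written down as a short Python sketch. The function name and the small `EE_TYPES` set are assumptions for illustration only; the sketch also silently drops any EE that follows the last word, a case the description above does not cover.

```python
EE_TYPES = {"COMP-SBAR", "COMP-WHNP", "PRO-NP"}  # illustrative subset only


def encode_ee_tags(tokens, ee_types=EE_TYPES):
    """Turn a token stream with explicit EEs into (word, tag) pairs."""
    tagged, pending = [], []
    for token in tokens:
        if token in ee_types:
            pending.append(token)  # remember EEs until the next real word
        else:
            # Concatenate all EEs preceding the word, or use * for none.
            tag = " ".join(pending) if pending else "*"
            tagged.append((token, "EE=" + tag))
            pending = []
    return tagged
```

Applied to the tokens of "Sam said COMP-SBAR Sasha snores", the sketch tags Sasha with EE=COMP-SBAR and every other word with EE=*.</Paragraph>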
      <Paragraph position="4">  Although this approach is closely related to POS-tagging, there are certain differences which make this task more difficult. Despite the smaller tagset, the data exhibits extreme sparseness: even though more than 50% of the sentences in the Penn Treebank contain some EEs, the actual number of EEs is very small. In Section 0 of the WSJ corpus, out of the 46451 tokens only 3056 are preceded by one or more EEs, that is, approximately 93.5% of the words are tagged with the EE=* tag.</Paragraph>
      <Paragraph position="5"> The other main difference is the apparently non-local nature of the problem, which motivates our choice of a Maximum Entropy (ME) model for the tagging task (Berger et al., 1996). ME allows the flexible combination of different sources of information, i.e., local and long-distance cues characterizing possible sites for EEs. In the ME framework, linguistic cues are represented by (binary-valued) features ($f_i$), the relative importance (weight, $\lambda_i$) of which is determined by an iterative training algorithm. The weighted linear combination of the features amounts to the log-probability of the label ($l$) given the context ($c$): $$p(l \mid c) = \frac{1}{Z(c)} \exp\Big(\sum_i \lambda_i f_i(l, c)\Big)$$</Paragraph>
      <Paragraph position="7"> where $Z(c)$ is a context-dependent normalizing factor ensuring that $p(l \mid c)$ is a proper probability distribution. We determine weights for the features with a modified version of the Generalized Iterative Scaling algorithm (Curran and Clark, 2003).</Paragraph>
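      <Paragraph> As a small worked illustration of the log-linear form described above, the following Python sketch computes the label distribution for one context. It is a sketch under stated assumptions, not the tagger itself: the function name, the dictionary layout of the weights and the feature string used in the usage note are hypothetical, and training of the weights is not shown.

```python
import math


def label_distribution(weights, active_features, labels):
    """Return p(l | c) for every label l, given the features active in c.

    weights maps (feature, label) pairs to a learned weight. A missing
    pair contributes nothing, like a binary feature that is switched off.
    """
    scores = {
        label: sum(weights.get((feat, label), 0.0) for feat in active_features)
        for label in labels
    }
    # Z(c): sum of the exponentiated scores over all labels for this context.
    z = sum(math.exp(score) for score in scores.values())
    return {label: math.exp(score) / z for label, score in scores.items()}
```

With a single active feature whose weight for EE=COMP-SBAR is log 3 and no weight for EE=*, the two labels receive probabilities 0.75 and 0.25, which sum to one as the normalizing factor requires.</Paragraph>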
      <Paragraph position="8"> Templates for local features are similar to the ones employed by Ratnaparkhi (1996) for POS-tagging (Table 3), though as our input already includes POS tags, we can make use of part-of-speech information as well. Long-distance features are simple hand-written regular expressions matching possible sites for EEs (Table 4). Features and labels occurring fewer than 10 times in the training corpus are ignored.</Paragraph>
      <Paragraph position="9"> Since our main aim is to show that finding empty elements can be done fairly accurately without using a parser, the input to the tagger is a POS-tagged corpus, containing no syntactic information. The best label-sequence is approximated by a bigram Viterbi-search algorithm, augmented with variable width beam-search.</Paragraph>
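      <Paragraph> The search procedure can be sketched as follows. This is one possible reading, not the implementation used here: "bigram" is taken to mean that each label is scored given only the previous label, "variable width" is taken to mean that hypotheses are pruned by a score margin instead of a fixed count, and `log_prob` is a hypothetical scoring callback standing in for the ME model.

```python
import math


def beam_viterbi(words, labels, log_prob, beam=5.0):
    """Approximate the best label sequence under a previous-label model.

    log_prob(words, i, prev_label, label) returns a log-probability.
    Hypotheses scoring more than `beam` below the current best are pruned,
    so the number of survivors varies from position to position.
    """
    # best[label] = (score, best sequence ending in that label)
    best = {None: (0.0, [])}
    for i in range(len(words)):
        step = {}
        for prev, (score, seq) in best.items():
            for label in labels:
                cand = score + log_prob(words, i, prev, label)
                if label not in step or cand > step[label][0]:
                    step[label] = (cand, seq + [label])
        top = max(score for score, _ in step.values())
        best = {lab: hyp for lab, hyp in step.items() if hyp[0] >= top - beam}
    return max(best.values(), key=lambda hyp: hyp[0])[1]
```

Keeping one hypothesis per final label is the Viterbi part; the margin test is the beam. A smaller `beam` value prunes more aggressively at the cost of possibly losing the best sequence.</Paragraph>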
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4.2 Results
</SectionTitle>
    <Paragraph position="0"> The results of the EE-detection experiment are summarized in Table 5. The overall unlabeled F-score is 85.3%, whereas the labeled F-score is 79.1%, which amounts to 97.9% word-level tagging accuracy.</Paragraph>
    <Paragraph position="1"> For straightforward comparison with Johnson's results, we must conflate the categories PRO-NP and NP-NP. If the trace detector does not need to differentiate between these two categories, a distinction that is indeed important for semantic analysis, the overall labeled F-score increases to 83.0%, which outperforms Johnson's approach by 4%.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.3 Discussion
</SectionTitle>
      <Paragraph position="0"> The success of the trace detector is surprising, especially if compared to Johnson's algorithm which uses the output of a parser. The tagger can reliably detect extraction sites without explicit knowledge of the phrase structure. This shows that, in English, extraction can only occur at well-defined sites, where local cues are generally strong.</Paragraph>
      <Paragraph position="1"> Indeed, the strength of the model lies in detecting such sites (empty units, UNIT; NP traces, NP-NP) or where clear-cut long-distance cues exist (WH-S, COMP-SBAR). [Table caption, truncated: … comparison with Johnson (2002) (where applicable).]</Paragraph>
      <Paragraph position="2"> The accuracy of detecting uncontrolled PROs (PRO-NP) is rather low, since it is a difficult task to tell them apart from NP traces: they are confused in 10-15% of the cases. Furthermore, the model is unable to capture for ... to+INF constructions if the noun-phrase is long.</Paragraph>
      <Paragraph position="3"> The precision of detecting long-distance NP extraction (WH-NP) is also high, but recall is lower: in general, the model finds extracted NPs with overt complementizers. Detection of null WH-complementizers (COMP-WHNP), however, is fairly inaccurate (48.8% F-score), since finding it and the corresponding WH-NP requires information about the transitivity of the verb. The performance of the model is also low (59.5%) in detecting movement sites for extracted WH-adverbs (WH-ADVP) despite the presence of unambiguous cues (where, how, etc. starting the subordinate clause).</Paragraph>
      <Paragraph position="4"> The difficulty of the task lies in finding the correct verb-phrase as well as the end of the verb-phrase the constituent is extracted from without knowing phrase boundaries.</Paragraph>
      <Paragraph position="5"> One important limitation of the shallow approach described here is its inability to find the antecedents of the EEs, which clearly requires knowledge of phrase structure. In the next section, we show that the shallow trace detector and the unlexicalized PCFG parser can be coupled to efficiently and successfully tackle antecedent recovery.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Combining the models
</SectionTitle>
    <Paragraph position="0"> In Section 3, we found that parsing with EEs is only feasible if the parser knows the location of EEs before parsing. In Section 4, we presented a finite-state tagger which detects these sites before parsing takes place. In this section, we validate the two-step approach, by applying the parser to the output of the trace tagger, and comparing the antecedent recovery accuracy to Johnson (2002).</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.1 Method
</SectionTitle>
      <Paragraph position="0"> Theoretically, the 'best' way to combine the trace tagger and the parsing algorithm would be to build a unified probabilistic model. However, the two models are quite different in nature: the finite-state model is conditional, taking the words as given. The parsing model, on the other hand, is generative, treating the words as an unlikely event. There is a reasonable basis for building the probability models in different ways. Most of the tags emitted by the EE tagger are just EE=*, which would defeat generative models by making the 'hidden' state uninformative. Conditional parsing algorithms do exist, but they are difficult to train using large corpora (Johnson, 2001). However, we show that it is quite effective if the parser simply treats the output of the tagger as a certainty.</Paragraph>
      <Paragraph position="1"> Given this combination method, there are still two interesting variations: we may use only the EEs proposed by the tagger (henceforth the NOINSERT model), or we may allow the parser to insert even more EEs (henceforth the INSERT model). In both cases, EEs output by the tagger are treated as separate words, as in the PERFECT model of Section 3.</Paragraph>
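      <Paragraph> Treating the tagger output as certain means turning its tags back into explicit EE tokens before parsing. A minimal Python sketch of that hand-over follows; the function name and the (word, tag) input layout are assumptions for illustration, not the interface of the actual system.

```python
def insert_ee_tokens(tagged_words):
    """Expand (word, tag) pairs into a token stream with EEs as separate words."""
    tokens = []
    for word, tag in tagged_words:
        ees = tag[len("EE="):]  # strip the EE= prefix of the tag
        if ees != "*":
            tokens.extend(ees.split())  # one token per EE, in tagger order
        tokens.append(word)
    return tokens
```

In the NOINSERT setting the parser would receive exactly this token stream; in the INSERT setting it would receive the same stream but remain free to add further EEs of its own.</Paragraph>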
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.2 Results
</SectionTitle>
      <Paragraph position="0"> The NOINSERT model did better at antecedent de[…] [Table caption, truncated: … combined NOINSERT model and comparison with Johnson (2002).]</Paragraph>
      <Paragraph position="1"> The NOINSERT model was also faster, taking on average 2.7 seconds per sentence and enumerating about 160,000 edges whereas the INSERT model took 25 seconds on average and enumerated 2 million edges. The coverage of the NOINSERT model was higher than that of the INSERT model, missing 2.4% of all parses versus 5.3% for the INSERT model.</Paragraph>
      <Paragraph position="2"> Comparing our results to Johnson (2002), we find that the NOINSERT model outperforms that of Johnson by 4.6% (see Table 7). The strength of this system lies in its ability to tell unbound PROs and bound NP-NP traces apart.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.3 Discussion
</SectionTitle>
      <Paragraph position="0"> Combining the finite-state tagger with the parser seems to be invaluable for EE detection and antecedent recovery. Paradoxically, taking the combination to the extreme by allowing both the parser and the tagger to insert EEs performed worse.</Paragraph>
      <Paragraph position="1"> While the INSERT model here did have wider coverage than the parser in Section 3, it seems the real benefit of using the combined approach is to let the simple model reduce the search space of the more complicated parsing model. This search space reduction works because the shallow finite-state method takes information about adjacent words into account, whereas the context-free parser does not, since a phrase boundary might separate them.</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 Related Work
</SectionTitle>
    <Paragraph position="0"> Excluding Johnson (2002)'s pattern-matching algorithm, most recent work on finding head-dependencies with statistical parsers has used statistical versions of deep grammar formalisms, such as CCG (Clark et al., 2002) or LFG (Riezler et al., 2002). While these systems should, in theory, be able to handle discontinuities accurately, there has not yet been a study on how these systems handle such phenomena overall.</Paragraph>
    <Paragraph position="1"> The tagger presented here is not the first one proposed to recover syntactic information deeper than part-of-speech tags. For example, supertagging (Joshi and Bangalore, 1994) also aims to do more meaningful syntactic pre-processing. Unlike supertagging, our approach only focuses on detecting EEs.</Paragraph>
    <Paragraph position="2"> The idea of threading EEs to their antecedents in a stochastic parser was proposed by Collins (1997), following the GPSG tradition (Gazdar et al., 1985).</Paragraph>
    <Paragraph position="3"> However, we extend it to capture all types of EEs.</Paragraph>
  </Section>
</Paper>