<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-2009">
  <Title>A Pipeline Framework for Dependency Parsing</Title>
  <Section position="4" start_page="65" end_page="68" type="metho">
    <SectionTitle>
2 Efficient Dependency Parsing
</SectionTitle>
    <Paragraph position="0"> This section describes our DP algorithm and justi es its advantages as a pipeline model. We pro- null pose an improved pipeline framework based on the mentioned principles.</Paragraph>
    <Paragraph position="1"> For many languages such as English, Chinese and Japanese (with a few exceptions), projective dependency trees (that is, DPs without edge crossings) are suf cient to analyze most sentences. Our work is therefore concerned only with projective trees, which we de ne below.</Paragraph>
    <Paragraph position="2"> For words x, y in the sentence T we introduce the following notations:</Paragraph>
    <Paragraph position="4"> De nition 1 (Projective Language) (Nivre, 2003) [?]a, b, c [?] T, a - b and a &lt; c &lt; b imply that a -[?] c or b -[?] c.</Paragraph>
    <Section position="1" start_page="66" end_page="67" type="sub_section">
      <SectionTitle>
2.1 A Pipeline DP Algorithm
</SectionTitle>
      <Paragraph position="0"> Our parsing algorithm is a modi ed shift-reduce parser that makes use of the actions described below and applies them in a left to right manner on consecutive pairs of words (a, b) (a &lt; b) in the sentence. This is a bottom-up approach that uses machine learning algorithms to learn the parsing decisions (actions) between consecutive words in the sentences. The basic actions used in this model, as in (Yamada and Matsumoto, 2003), are: Shift: there is no relation between a and b, or the action is deferred because the relationship between a and b cannot be determined at this point.</Paragraph>
      <Paragraph position="1"> Right: b is the parent of a, Left: a is the parent of b.</Paragraph>
      <Paragraph position="2"> This is a true pipeline approach in that the classi ers are trained on individual decisions rather than on the overall quality of the parsing, and chained to yield the global structure. And, clearly, decisions make with respect to a pair of words affect what is considered next by the algorithm.</Paragraph>
      <Paragraph position="3"> In order to complete the description of the algorithm we need to describe which edge to consider once an action is taken. We describe it via the notion of the focus point: when the algorithm considers the pair (a, b), a &lt; b, we call the word a the current focus point.</Paragraph>
      <Paragraph position="4"> Next we describe several policies for determining the focus point of the algorithm following an action. We note that, with a few exceptions, determining the focus point does not affect the correctness of the algorithm. It is easy to show that for (almost) any focus point chosen, if the correct action is selected for the corresponding edge, the algorithm will eventually yield the correct tree (but may require multiple cycles through the sentence).</Paragraph>
      <Paragraph position="5"> In practice, the actions selected are noisy, and a wasteful focus point policy will result in a large number of actions, and thus in error accumulation.</Paragraph>
      <Paragraph position="6"> To minimize the number of actions taken, we want to nd a good focus point placement policy.</Paragraph>
      <Paragraph position="7"> After S, the focus point always moves one word to the right. After L or R there are there natural placement policies to consider: Start Over: Move focus to the rst word in T.</Paragraph>
      <Paragraph position="8"> Stay: Move focus to the next word to the right.</Paragraph>
      <Paragraph position="9"> That is, for T = (a, b, c), and focus being a, an L action will result is the focus being a, while R action results in the focus being b.</Paragraph>
      <Paragraph position="10"> Step Back: The focus moves to the previous word (on the left). That is, for T = (a, b, c), and focus being b, in both cases, a will be the focus point.</Paragraph>
      <Paragraph position="11"> In practice, different placement policies have a signi cant effect on the number of pairs considered by the algorithm and, therefore, on the nal accuracy1. The following analysis justi es the Step Back policy. We claim that if Step Back is used, the algorithm will not waste any action.</Paragraph>
      <Paragraph position="12"> Thus, it achieves the goal of minimizing the number of actions in pipeline algorithms. Notice that using this policy, when L is taken, the pair (a, b) is reconsidered, but with new information, since now it is known that c is the child of b. Although this seems wasteful, we will show this is a necessary movement to reduce the number of actions.</Paragraph>
      <Paragraph position="13"> As mentioned above, each of these policies yields the correct tree. Table 1 compares the three policies in terms of the number of actions required to build a tree.</Paragraph>
      <Paragraph position="14">  all the trees for the sentences in section 23 of Penn Treebank (Marcus et al., 1993) as a function of the focus point placement policy. The statistics are taken with the correct (gold-standard) actions.</Paragraph>
      <Paragraph position="15"> It is clear from Table 1 that the policies result 1Note that (Yamada and Matsumoto, 2003) mention that they move the focus point back after R, but do not state what they do after executing L actions, and why. (Yamada, 2006) indicates that they also move focus point back after L.  Algorithm 2 Pseudo Code of the dependency parsing algorithm. getFeatures extracts the features describing the word pair currently considered; getAction determines the appropriate action for the pair; assignParent assigns a parent for the child word based on the action; and deleteWord deletes the child word in T at the focus once the action is taken.</Paragraph>
      <Paragraph position="16"> Let t represents for a word token</Paragraph>
      <Paragraph position="18"> in very different number of actions and that Step Back is the best choice. Note that, since the actions are the gold-standard actions, the policy affects only the number of S actions used, and not the L and R actions, which are a direct function of the correct tree. The number of required actions in the testing stage shows the same trend and the Step Back also gives the best dependency accuracy. Algorithm 2 depicts the parsing algorithm.</Paragraph>
    </Section>
    <Section position="2" start_page="67" end_page="68" type="sub_section">
      <SectionTitle>
2.2 Correctness and Pipeline Properties
</SectionTitle>
      <Paragraph position="0"> We can prove two properties of our algorithm.</Paragraph>
      <Paragraph position="1"> First we show that the algorithm builds the dependency tree in only one pass over the sentence.</Paragraph>
      <Paragraph position="2"> Then, we show that the algorithm does not waste actions in the sense that it never considers a word pair twice in the same situation. Consequently, this shows that under the assumption of a perfect action predictor, our algorithm makes the smallest possible number of actions, among all algorithms that build a tree sequentially in one pass.</Paragraph>
      <Paragraph position="3"> Note that this may not be true if the action classi er is not perfect, and one can contrive examples in which an algorithm that makes several passes on a sentence can actually make fewer actions than a single pass algorithm. In practice, however, as our experimental data shows, this is unlikely.</Paragraph>
      <Paragraph position="4"> Lemma 1 A dependency parsing algorithm that uses the Step Back policy completes the tree when it reaches the end of the sentence for the rst time.</Paragraph>
      <Paragraph position="5"> In order to prove the algorithm we need the following de nition. We call a pair of words (a, b) a free pair if and only if there is a relation between a and b and the algorithm can perform L or R actions on that pair when it is considered. Formally, De nition 2 (free pair) A pair (a, b) considered by the algorithm is a free pair, if it satis es the following conditions:  1. a - b 2. a, b are consecutive in T (not necessary in the original sentence).</Paragraph>
      <Paragraph position="6"> 3. No other word in T is the child of a or b. (a  and b are now part of a complete subtree.) Proof. : It is easy to see that there is at least one free pair in T, with |T |&gt; 1. The reason is that if no such pair exists, there must be three words {a, b, c} s.t. a - b, a &lt; c &lt; b and !(a - c [?] b - c). However, this violates the properties of a projective language.</Paragraph>
      <Paragraph position="7"> Assume {a, b, d} are three consecutive words in T. Now, we claim that when using Step Back, the focus point is always to the left of all free pairs in T. This is clearly true when the algorithm starts. Assume that (a, b) is the rst free pair in T and let c be just to the left of a and b. Then, the algorithm will not make a L or R action before the focus point meets (a, b), and will make one of these actions then. It's possible that (c, a [?] b) becomes a free pair after removing a or b in T so we need to move the focus point back. However, we also know that there is no free pair to the left of c. Therefore, during the algorithm, the focus point will always remain to the left of all free pairs. So, when we reach the end of the sentence, every free pair in the sentence has been taken care of, and the sentence has been completely parsed. a50 Lemma 2 All actions made by a dependency parsing algorithm that uses the Step Back policy are necessary.</Paragraph>
      <Paragraph position="8"> Proof. : We will show that a pair (a, b) will never be considered again given the same situation, that is, when there is no additional information about relations a or b participate in. Note that if R or  L is taken, either a or b will become a child word and be eliminate from further consideration by the algorithm. Therefore, if the action taken on (a, b) is R or L, it will never be considered again.</Paragraph>
      <Paragraph position="9"> Assume that the action taken is S, and, w.l.o.g.</Paragraph>
      <Paragraph position="10"> that this is the rightmost S action taken before a non-S action happens. Note that it is possible that there is a relation between a and b, but we cannot perform R or L now. Therefore, we should consider (a, b) again only if a child of a or b has changed. When Step Back is used, we will consider (a, b) again only if the next action is L. (If next action is R, b will be eliminated.) This is true because the focus point will move back after performing L, which implies that b has a new child so we are indeed in a new situation. Since, from Lemma 1, the algorithm only requires one round.</Paragraph>
      <Paragraph position="11"> we therefore consider (a, b) again only if the situation has changed. a50</Paragraph>
    </Section>
    <Section position="3" start_page="68" end_page="68" type="sub_section">
      <SectionTitle>
2.3 Improving the Parsing Action Set
</SectionTitle>
      <Paragraph position="0"> In order to improve the accuracy of the action predictors, we suggest a new (hierarchical) set of actions: Shift, Left, Right, WaitLeft, WaitRight. We believe that predicting these is easier due to ner granularity the S action is broken to sub-actions in a natural way.</Paragraph>
      <Paragraph position="1"> WaitLeft: a &lt; b. a is the parent of b, but it's possible that b is a parent of other nodes. Action is deferred. If we perform Left instead, the child of b can not nd its parents later.</Paragraph>
      <Paragraph position="2"> WaitRight: a &lt; b. b is the parent of a, but it's possible that a is a parent of other nodes. Similar to WL, action is deferred.</Paragraph>
      <Paragraph position="3"> Thus, we also change the algorithm to perform S only if there is no relationship between a and b2. The new set of actions is shown to better support our parsing algorithm, when tested on different placement policies. When WaitLeft or WaitRight is performed, the focus will move to the next word. It is very interesting to notice that WaitRight is not needed in projective languages if Step Back is used. This give us another strong reason to use Step Back, since the classi cation becomes more accurate a more natural class of actions, with a smaller number of candidate actions.</Paragraph>
      <Paragraph position="4"> Once the parsing algorithm, along with the focus point policy, is determined, we can train the 2Interestingly, (Yamada and Matsumoto, 2003) mention the possibility of an additional single Wait action, but do not add it to the model.</Paragraph>
      <Paragraph position="5"> action classi ers. Given an annotated corpus, the parsing algorithm is used to determine the action taken for each consecutive pair; this is used to train a classi er to predict one of the ve actions. The details of the classi er and the feature used are given in Section 4.</Paragraph>
      <Paragraph position="6"> When the learned model is evaluated on new data, the sentence is processed left to right and the parsing algorithm, along with the action classi er, are used to produce the dependency tree. The evaluation process is somewhat more involved, since the action classi er is not used as is, but rather via a look ahead inference step described next.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="68" end_page="69" type="metho">
    <SectionTitle>
3 A Pipeline Model with Look Ahead
</SectionTitle>
    <Paragraph position="0"> The advantage of a pipeline model is that it can use more information, based on the outcomes of previous predictions. As discussed earlier, this may result in accumulating error. The importance of having a reliable action predictor in a pipeline model motivates the following approach. We devise a look ahead algorithm and use it as a look ahead policy, when determining the predicted action.</Paragraph>
    <Paragraph position="1"> This approach can be used in any pipeline model but we illustrate it below in the context of our dependency parser.</Paragraph>
    <Paragraph position="2"> The following example illustrates a situation in which an early mistake in predicting an action causes a chain reaction and results in further mistakes. This stresses the importance of correct early decisions, and motivates our look ahead policy.</Paragraph>
    <Paragraph position="3"> Let (w, x, y, z) be a sentence of four words, and assume that the correct dependency relations are as shown in the top part of Figure 1. If the system mistakenly predicts that x is a child of w before y and z becomes x's children, we can only consider the relationship between w and y in the next stage.</Paragraph>
    <Paragraph position="4"> Consequently, we will never nd the correct parent for y and z. The previous prediction error propagates and impacts future predictions. On the other hand, if the algorithm makes a correct prediction, in the next stage, we do not need to consider w and y. As shown, getting useful rather than misleading information in a pipeline model, requires correct early predictions. Therefore, it is necessary to utilize some inference framework to that may help resolving the error accumulation problem.</Paragraph>
    <Paragraph position="5"> In order to improve the accuracy of the action prediction, we might want to examine all possible combinations of action sequences and choose the one that maximizes some score. It is clearly in-</Paragraph>
    <Paragraph position="7"> tions between w, x, y and z. Bottom gure: if the algorithm mistakenly decides that x is a child of w before deciding that y and z are x's children, we cannot nd the correct parent for y and z.</Paragraph>
    <Paragraph position="8"> tractable to nd the global optimal prediction sequences in a pipeline model of the depth we consider. Therefore, we use a look ahead strategy, implemented via a local search framework, which uses additional information but is still tractable.</Paragraph>
    <Paragraph position="9"> The local search algorithm is presented in Algorithm 3. The algorithm accepts three parameters, model, depth and State. We assume a classi er that can give a con dence in its prediction. This is represented here by model.</Paragraph>
    <Paragraph position="10"> As our learning algorithm we use a regularized variation of the perceptron update rule, as incorporated in SNoW (Roth, 1998; Carlson et al., 1999), a multi-class classi er that is tailored for large scale learning tasks and has been used successfully in a large number of NLP tasks (e.g., (Punyakanok et al., 2005)). SNoW uses softmax over the raw activation values as its con dence measure, which can be shown to produce a reliable approximation of the labels' conditional probabilities.</Paragraph>
    <Paragraph position="11"> The parameter depth is to determine the depth of the search procedure. State encodes the con guration of the environment (in the context of the dependency parsing this includes the sentence, the focus point and the current parent and children for each word). Note that State changes when a prediction is made and that the features extracted for the action classi er also depend on State.</Paragraph>
    <Paragraph position="12"> The search algorithm will perform a search of length depth. Additive scoring is used to score the sequence, and the rst action in this sequence is selected and performed. Then, the State is updated, the new features for the action classi ers are computed and search is called again.</Paragraph>
    <Paragraph position="13"> One interesting property of this framework is that it allows that use of future information in addition to past information. The pipeline model naturally allows access to all the past information.</Paragraph>
    <Paragraph position="14"> Algorithm 3 Pseudo code for the look ahead algorithm. y represents a action sequence. The function search considers all possible action sequences with |depth |actions and returns the sequence with the highest score.</Paragraph>
    <Paragraph position="16"> Since the algorithm uses a look ahead policy, it also uses future predictions. The signi cance of this becomes clear in Section 4.</Paragraph>
    <Paragraph position="17"> There are several parameters, in addition to depth that can be used to improve the ef ciency of the framework. For example, given that the action predictor is a multi-class classi er, we do not need to consider all future possibilities in order to decide the current action. For example, in our experiments, we only consider two actions with highest score at each level (which was shown to produce almost the same accuracy as considering all four actions).</Paragraph>
  </Section>
  <Section position="6" start_page="69" end_page="70" type="metho">
    <SectionTitle>
4 Experiments and Results
</SectionTitle>
    <Paragraph position="0"> We use the standard corpus for this task, the Penn Treebank (Marcus et al., 1993). The training set consists of sections 02 to 21 and the testing set is section 23. The POS tags for the evaluation data sets were provided by the tagger of (Toutanova et al., 2003) (which has an accuracy of 97.2% section</Paragraph>
    <Section position="1" start_page="70" end_page="70" type="sub_section">
      <SectionTitle>
23 of the Penn Treebank).
4.1 Features for Action Classi cation
</SectionTitle>
      <Paragraph position="0"> For each word pair (w1, w2) we use the words, their POS tags and also these features of the children of w1 and w2. We also include the lexicon and POS tags of 2 words before w1 and 4 words after w2 (as in (Yamada and Matsumoto, 2003)).</Paragraph>
      <Paragraph position="1"> The key additional feature we use, relative to (Yamada and Matsumoto, 2003), is that we include the previous predicted action as a feature. We also add conjunctions of above features to ensure expressiveness of the model. (Yamada and Matsumoto, 2003) makes use of polynomial kernels of degree 2 which is equivalent to using even more conjunctive features. Overall, the average number of active features in an example is about 50.</Paragraph>
    </Section>
    <Section position="2" start_page="70" end_page="70" type="sub_section">
      <SectionTitle>
4.2 Evaluation
</SectionTitle>
      <Paragraph position="0"> We use the same evaluation metrics as in (McDonald et al., 2005). Dependency accuracy (DA) is the proportion of non-root words that are assigned the correct head. Complete accuracy (CA) indicates the fraction of sentences that have a complete correct analysis. We also measure that root accuracy (RA) and leaf accuracy (LA), as in (Yamada and Matsumoto, 2003). When evaluating the result, we exclude the punctuation marks, as done in (Mc-Donald et al., 2005) and (Yamada and Matsumoto, 2003).</Paragraph>
    </Section>
    <Section position="3" start_page="70" end_page="70" type="sub_section">
      <SectionTitle>
4.3 Results
</SectionTitle>
      <Paragraph position="0"> We present the results of several of the experiments that were intended to help us analyze and understand several of the design decisions in our pipeline algorithm.</Paragraph>
      <Paragraph position="1"> To see the effect of the additional action, we present in Table 2 a comparison between a system that does not have the WaitLeft action (similar to the (Yamada and Matsumoto, 2003) approach) with one that does. In both cases, we do not use the look ahead procedure. Note that, as stated above, the action WaitRight is never needed for our parsing algorithm. It is clear that adding WaitLeft increases the accuracy signi cantly.</Paragraph>
      <Paragraph position="2"> Table 3 investigates the effect of the look ahead, and presents results with different depth parameters (depth= 1 means no search ), showing a consistent trend of improvement.</Paragraph>
      <Paragraph position="3"> Table 4 breaks down the results as a function of the sentence length; it is especially noticeable that the system also performs very well for long  sentences, another indication for its global performance robustness.</Paragraph>
      <Paragraph position="4"> Table 5 shows the results with three settings of the POS tagger. The best result is, naturally, when we use the gold standard also in testing. However, it is worthwhile noticing that it is better to train with the same POS tagger available in testing, even if its performance is somewhat lower. Table 6 compares the performances of several of the state of the art dependency parsing systems with ours. When comparing with other dependency parsing systems it is especially worth noticing that our system gives signi cantly better accuracy on completely parsed sentences.</Paragraph>
      <Paragraph position="5"> Interestingly, in the experiments, we allow the parsing algorithm to run many rounds to parse a sentece in the testing stage. However, we found that over 99% sentences can be parsed in a single round. This supports for our justi cation about the correctness of our model.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="70" end_page="71" type="metho">
    <SectionTitle>
5 Further Work and Conclusion
</SectionTitle>
    <Paragraph position="0"> We have addressed the problem of using learned classi ers in a pipeline fashion, where a task is decomposed into several stages and stage classi ers are used sequentially, where each stage may use the outcome of previous stages as its input. This is a common computational strategy in natural language processing and is known to suffer from error accumulation and an inability to correct mistakes in previous stages.</Paragraph>
    <Paragraph position="1"> Sent. Len. DA RA CA LA  periment is done with depth = 4.</Paragraph>
    <Paragraph position="2">  ging in a pipeline model. We set depth= 4 in all the experiments of this table.</Paragraph>
    <Paragraph position="3">  work with other dependency parsing systems. We abstracted two natural principles, one which calls for making the local classi ers used in the computation more reliable and a second, which suggests to devise the pipeline algorithm in such a way that minimizes the number of decisions (actions) made.</Paragraph>
    <Paragraph position="4"> We study this framework in the context of designing a bottom up dependency parsing. Not only we manage to use this framework to justify several design decisions, but we also show experimentally that following these results in improving the accuracy of the inferred trees relative to existing models. Interestingly, we can show that the trees produced by our algorithm are relatively good even for long sentences, and that our algorithm is doing especially well when evaluated globally, at a sentence level, where our results are signi cantly better than those of existing approaches perhaps showing that the design goals were achieved.</Paragraph>
    <Paragraph position="5"> Our future work includes trying to generalize this work to non-projective dependency parsing, as well as attempting to incorporate additional sources of information (e.g., shallow parsing information) into the pipeline process.</Paragraph>
  </Section>
class="xml-element"></Paper>