<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-2407">
  <Title>Memory-Based Dependency Parsing</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Method
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Target Function and Approximation
</SectionTitle>
      <Paragraph position="0"> The function we want to approximate is a mapping f from parser configurations to parser actions, where each action consists of a transition and (unless the transition is Shift or Reduce) a dependency type:</Paragraph>
      <Paragraph position="2"> Here Config is the set of all possible parser configurations and R is the set of dependency types as before.</Paragraph>
      <Paragraph position="3"> However, in order to make the problem tractable, we try to learn a function ^f whose domain is a finite space of parser states, which are abstractions over configurations.</Paragraph>
      <Paragraph position="4"> For this purpose we define a number of features that can be used to define different models of parser state. The features used in this study are listed in Table 1.</Paragraph>
      <Paragraph position="5"> The first five features (TOP-TOP.RIGHT) deal with properties of the token on top of the stack. In addition to the word form itself (TOP), we consider its part-of-speech (as assigned by an automatic part-of-speech tagger in a preprocessing phase), the dependency type by which it is related to its head (which may or may not be available in a given configuration depending on whether the head is to the left or to the right of the token in question), and the dependency types by which it is related to its leftmost and rightmost dependent, respectively (where the current rightmost dependent may or may not be the rightmost dependent in the complete dependency tree).</Paragraph>
      <Paragraph position="6"> The following three features (NEXT-NEXT.LEFT) refer to properties of the next input token. In this case, there are no features corresponding to TOP.DEP and TOP.RIGHT, since the relevant dependencies can never be present at decision time. The final feature (LOOK) is a simple lookahead, using the part-of-speech of the next plus one input token.</Paragraph>
      <Paragraph position="7"> In the experiments reported below, we have used two different parser state models, one called the lexical model, which includes all nine features, and one called the non-lexical model, where the two lexical features TOP and NEXT are omitted. For both these models, we have used memory-based learning with different parameter settings, as implemented TiMBL.</Paragraph>
      <Paragraph position="8"> For comparison, we have included an earlier classifier that uses the same features as the non-lexical model, but where prediction is based on maximum conditional likelihood estimation. This classifier always predicts the most probable transition given the state and the most probable dependency type given the transition and the state, with conditional probabilities being estimated by the empirical distribution in the training data. Smoothing is performed only for zero frequency events, in which case the classifier backs off to more general models by omitting first the features TOP.LEFT and LOOK and then the features TOP.RIGHT and NEXT.LEFT; if even this does not help, the classifier predicts Reduce if permissible and Shift otherwise. This model, which we will refer to as the MCLE model, is described in more detail in Nivre (2004).</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Data
</SectionTitle>
      <Paragraph position="0"> It is standard practice in data-driven approaches to natural language parsing to use treebanks both for training and evaluation. Thus, the Penn Treebank of American English (Marcus et al., 1993) has been used to train and evaluate the best available parsers of unrestricted English text (Collins, 1999; Charniak, 2000). One problem when developing a parser for Swedish is that there is no comparable large-scale treebank available for Swedish.</Paragraph>
      <Paragraph position="1"> For the experiments reported in this paper we have used a manually annotated corpus of written Swedish, created at Lund University in the 1970's and consisting mainly of informative texts from official sources (Einarsson, 1976). Although the original annotation scheme is an eclectic combination of constituent structure, dependency structure, and topological fields (Teleman, 1974), it has proven possible to convert the annotated sentences to dependency graphs with fairly high accuracy.</Paragraph>
      <Paragraph position="2"> In the conversion process, we have reduced the original fine-grained classification of grammatical functions to a more restricted set of 16 dependency types, which are listed in Table 2. We have also replaced the original (manual) part-of-speech annotation by using the same automatic tagger that is used for preprocessing in the parser. This is a standard probabilistic tagger trained on the Stockholm-Ume@a Corpus of written Swedish (SUC, 1997) and found to have an accuracy of 95-96% when tested on held-out data.</Paragraph>
      <Paragraph position="3"> Since the function we want to learn is a mapping from parser states to transitions (and dependency types), the treebank data cannot be used directly as training and test  data. Instead, we have to simulate the parser on the tree-bank in order to derive, for each sentence, the transition sequence corresponding to the correct dependency tree.</Paragraph>
      <Paragraph position="4"> Given the result of this simulation, we can construct a data set consisting of pairs &lt;s,t&gt; , where s is a parser state and t is the correct transition from that state (including a dependency type if applicable). Unlike standard shift-reduce parsing, the simulation of the current algorithm is almost deterministic and is guaranteed to be correct if the input dependency tree is well-formed.</Paragraph>
      <Paragraph position="5"> The complete converted treebank contains 6316 sentences and 97623 word tokens, which gives a mean sentence length of 15.5 words. The treebank has been divided into three non-overlapping data sets: 80% for training 10% for development/validation, and 10% for final testing (random samples). The results presented below are all from the validation set. (The final test set has not been used at all in the experiments reported in this paper.) When talking about test and validation data, we make a distinction between the sentence data, which refers to the original annotated sentences in the treebank, and the transition data, which refers to the transitions derived by simulating the parser on these sentences. While the sentence data for validation consists of 631 sentences, the corresponding transition data contains 15913 instances.</Paragraph>
      <Paragraph position="6"> For training, only transition data is relevant and the training data set contains 371977 instances.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Evaluation
</SectionTitle>
      <Paragraph position="0"> The output of the memory-based learner is a classifier that predicts the next transition (including dependency type), given the current state of the parser. The quality of this classifier has been evaluated with respect to both prediction accuracy and parsing accuracy.</Paragraph>
      <Paragraph position="1"> Prediction accuracy refers to the quality of the classifier as such, i.e. how well it predicts the next transition given the correct parser state, and is measured by the classification accuracy on unseen transition data (using a 0-1 loss function). We use McNemar's test for statistical significance. null Parsing accuracy refers to the quality of the classifier as a guide for the deterministic parser and is measured by the accuracy obtained when parsing unseen sentence data. More precisely, parsing accuracy is measured by the attachment score, which is a standard measure used in studies of dependency parsing (Eisner, 1996; Collins et al., 1999). The attachment score is computed as the proportion of tokens (excluding punctuation) that are assigned the correct head (or no head if the token is a root). Since parsing is a sentence-level task, we believe that the overall attachment score should be computed as the mean attachment score per sentence, which gives an estimate of the expected attachment score for an arbitrary sentence. However, since most previous studies instead use the mean attachment score per word (Eisner, 1996; Collins et al., 1999), we will give this measure as well.</Paragraph>
      <Paragraph position="2"> In order to measure label accuracy, we also define a labeled attachment score, where both the head and the label must be correct, but which is otherwise computed in the same way as the ordinary (unlabeled) attachment score.</Paragraph>
      <Paragraph position="3"> For parsing accuracy, we use a paired t-test for statistical significance.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>