<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-2407">
  <Title>Memory-Based Dependency Parsing</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
2 Background
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Dependency Graphs
</SectionTitle>
      <Paragraph position="0"> The linguistic tradition of dependency grammar comprises a large and fairly diverse family of theories and formalisms that share certain basic assumptions about syntactic structure, in particular the assumption that syntactic structure consists of lexical nodes linked by binary relations called dependencies (see, e.g., Tesni`ere (1959), Sgall (1986), Mel'Vcuk (1988), Hudson (1990)). Thus, the common formal property of dependency structures, as compared to the representations based on constituency (or phrase structure), is the lack of nonterminal nodes.</Paragraph>
      <Paragraph position="1"> In a dependency structure, every word token is dependent on at most one other word token, usually called its head or regent, which means that the structure can be represented as a directed graph, with nodes representing word tokens and arcs representing dependency relations.</Paragraph>
      <Paragraph position="2"> In addition, arcs may be labeled with specific dependency types. Figure 1 shows a labeled dependency graph for a simple Swedish sentence, where each word of the sentence is labeled with its part of speech and each arc labeled with a grammatical function.</Paragraph>
      <Paragraph position="3"> Formally, we define dependency graphs in the following way:</Paragraph>
      <Paragraph position="5"> 1. Let R = {r1,...,rm} be the set of permissible dependency types (arc labels).</Paragraph>
      <Paragraph position="6"> 2. A dependency graph for a string of words W = w1***wn is a labeled directed graph D = (W,A), where (a) W is the set of nodes, i.e. word tokens in the input string, (b) A is a set of labeled arcs (wi,r,wj) (where wi,wj [?] W and r [?] R).</Paragraph>
      <Paragraph position="7">  We write wi &lt; wj to express that wi precedes wj in the string W (i.e., i &lt; j); we write wi r- wj to say that there is an arc from wi to wj labeled r, and wi - wj to say that there is an arc from wi to wj (regardless of the label); we use -[?] to denote the reflexive and transitive closure of the unlabeled arc relation; and we use - and -[?] for the corresponding undirected relations, i.e. wi - wj iff wi - wj  or wj - wi.</Paragraph>
      <Paragraph position="8"> 3. A dependency graph D = (W,A) is well-formed iff the five conditions given in Figure 2 are satisfied.  For a more detailed discussion of dependency graphs and well-formedness conditions, the reader is referred to Nivre (2003).</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Parsing Algorithm
</SectionTitle>
      <Paragraph position="0"> The parsing algorithm presented in Nivre (2003) is in many ways similar to the basic shift-reduce algorithm for context-free grammars (Aho et al., 1986), although the parse actions are different given that no nonterminal symbols are used. Moreover, unlike the algorithm of Yamada and Matsumoto (2003), the algorithm considered here actually uses a blend of bottom-up and top-down processing, constructing left-dependencies bottom-up and rightdependencies top-down, in order to achieve incrementality. For a similar but nondeterministic approach to dependency parsing, see Obrebski (2003).</Paragraph>
      <Paragraph position="1"> Parser configurations are represented by triples &lt;S,I,A&gt; , where S is the stack (represented as a list), I is the list of (remaining) input tokens, and A is the (current) arc relation for the dependency graph. Given an input string W, the parser is initialized to &lt;nil,W,[?]&gt; and terminates when it reaches a configuration &lt;S,nil,A&gt; (for any  list S and set of arcs A). The input string W is accepted if the dependency graph D = (W,A) given at termination is well-formed; otherwise W is rejected. The behavior of the parser is defined by the transitions defined in Figure 3 (where wi, wj and wk are arbitrary word tokens, and r and rprime are arbitrary dependency relations): 1. The transition Left-Arc (LA) adds an arc wj r-wi from the next input token wj to the token wi on top of the stack and reduces (pops) wi from the stack.</Paragraph>
      <Paragraph position="2"> 2. The transition Right-Arc (RA) adds an arc wi r-wj from the token wi on top of the stack to the next input token wj, and shifts (pushes) wj onto the stack.</Paragraph>
      <Paragraph position="3"> 3. The transition Reduce (RE) reduces (pops) the token wi on top of the stack.</Paragraph>
      <Paragraph position="4"> 4. The transition Shift (SH) shifts (pushes) the next in- null put token wi onto the stack.</Paragraph>
      <Paragraph position="5"> The transitions Left-Arc and Right-Arc are subject to conditions that ensure that the graph conditions Unique label and Single head are satisfied. By contrast, the Reduce transition can only be applied if the token on top of the stack already has a head. For Shift, the only condition is that the input list is non-empty.</Paragraph>
      <Paragraph position="6"> As it stands, this transition system is nondeterministic, since several transitions can often be applied to the same configuration. Thus, in order to get a deterministic parser, we need to introduce a mechanism for resolving transition conflicts. Regardless of which mechanism is used, the parser is guaranteed to terminate after at most 2n transitions, given an input string of length n (Nivre, 2003). This means that as long as transitions can be performed in constant time, the running time of the parser will be linear in the length of the input. Moreover, the parser is guaranteed to produce a dependency graph that is acyclic and projective (and satisfies the unique-label and single-head constraints). This means that the dependency graph given at termination is well-formed if and only if it is connected (Nivre, 2003).</Paragraph>
      <Paragraph position="7"> Unique label (wi r-wj [?] wi rprime-wj) = r = rprime</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.3 Guided Parsing
</SectionTitle>
      <Paragraph position="0"> One way of turning a nondeterministic parser into a deterministic one is to use a guide (or oracle) that can inform the parser at each nondeterministic choice point; cf. Kay (2000), Boullier (2003). Guided parsing is normally used to improve the efficiency of a nondeterministic parser, e.g. by letting a simpler (but more efficient) parser construct a first analysis that can be used to guide the choice of the more complex (but less efficient) parser. This is the approach taken, for example, in Boullier (2003).</Paragraph>
      <Paragraph position="1"> In our case, we rather want to use the guide to improve the accuracy of a deterministic parser, starting from a baseline of randomized choice. One way of doing this is to use a treebank, i.e. a corpus of analyzed sentences, to train a classifier that can predict the next transition (and dependency type) given the current configuration of the parser. However, in order to maintain the efficiency of the parser, the classifier must also be implemented in such a way that each transition can still be performed in constant time.</Paragraph>
      <Paragraph position="2"> Previous work in this area includes the use of memory-based learning to guide a standard shift-reduce parser (Veenstra and Daelemans, 2000) and the use of support vector machines to guide a deterministic dependency parser (Yamada and Matsumoto, 2003). In the experiments reported in this paper, we apply memory-based learning within a deterministic dependency parsing framework.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.4 Memory-Based Learning
</SectionTitle>
      <Paragraph position="0"> Memory-based learning and problem solving is based on two fundamental principles: learning is the simple storage of experiences in memory, and solving a new problem is achieved by reusing solutions from similar previously solved problems (Daelemans, 1999). It is inspired by the nearest neighbor approach in statistical pattern recognition and artificial intelligence (Fix and Hodges, 1952), as well as the analogical modeling approach in linguistics (Skousen, 1989; Skousen, 1992). In machine learning terms, it can be characterized as a lazy learning method, since it defers processing of input until needed and processes input by combining stored data (Aha, 1997).</Paragraph>
      <Paragraph position="1"> Memory-based learning has been successfully applied to a number of problems in natural language processing, such as grapheme-to-phoneme conversion, part-of-speech tagging, prepositional-phrase attachment, and base noun phrase chunking (Daelemans et al., 2002).</Paragraph>
      <Paragraph position="2"> Most relevant in the present context is the use of memory-based learning to predict the actions of a shift-reduce parser, with promising results reported in Veenstra and Daelemans (2000).</Paragraph>
      <Paragraph position="3"> The main reason for using memory-based learning in the present context is the flexibility offered by similarity-based extrapolation when classifying previously unseen configurations, since previous experiments with a probabilistic model has shown that a fixed back-off sequence does not work well in this case (Nivre, 2004). Moreover, the memory-based approach can easily handle multi-class classification, unlike the support vector machines used by Yamada and Matsumoto (2003).</Paragraph>
      <Paragraph position="4"> For the experiments reported in this paper, we have used the software package TiMBL (Tilburg Memory Based Learner), which provides a variety of metrics, algorithms, and extra functions on top of the classical k nearest neighbor classification kernel, such as value distance metrics and distance weighted class voting (Daelemans et al., 2003).</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>