<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-2041">
  <Title>Discriminative Classifiers for Deterministic Dependency Parsing</Title>
  <Section position="4" start_page="316" end_page="318" type="metho">
    <SectionTitle>
2 Inductive Dependency Parsing
</SectionTitle>
    <Paragraph position="0"> The system we use for the experiments uses no grammar but relies completely on inductive learning from treebank data. The methodology is based  on three essential components: 1. Deterministic parsing algorithms for building dependency graphs (Kudo and Matsumoto, 2002; Yamada and Matsumoto, 2003; Nivre, 2003) 2. History-based models for predicting the next parser action (Black et al., 1992; Magerman, 1995; Ratnaparkhi, 1997; Collins, 1999) 3. Discriminative learning to map histories to  parser actions (Kudo and Matsumoto, 2002; Yamada and Matsumoto, 2003; Nivre et al., 2004) In this section we will define dependency graphs, describe the parsing algorithm used in the experiments and finally explain the extraction of features for the history-based models.</Paragraph>
    <Section position="1" start_page="316" end_page="317" type="sub_section">
      <SectionTitle>
2.1 Dependency Graphs
</SectionTitle>
      <Paragraph position="0"> A dependency graph is a labeled directed graph, the nodes of which are indices corresponding to the tokens of a sentence. Formally: Definition 1 Given a set R of dependency types (arc labels), a dependency graph for a sentence</Paragraph>
      <Paragraph position="2"> The set V of nodes (or vertices) is the set Zn+1 = {0,1,2,...,n} (n [?] Z+), i.e., the set of non-negative integers up to and including n. This means that every token index i of the sentence is a node (1 [?] i [?] n) and that there is a special node 0, which does not correspond to any token of the sentence and which will always be a root of the dependency graph (normally the only root). We use V + to denote the set of nodes corresponding to tokens (i.e., V + = V [?]{0}), and we use the term token node for members of V +.</Paragraph>
      <Paragraph position="3"> The set E of arcs (or edges) is a set of ordered pairs (i,j), where i and j are nodes. Since arcs are used to represent dependency relations, we will say that i is the head and j is the dependent of the arc (i,j). As usual, we will use the notation i - j to mean that there is an arc connecting i and j (i.e., (i,j) [?] E) and we will use the notation i -[?] j for the reflexive and transitive closure of the arc relation E (i.e., i -[?] j if and only if i = j or there is a path of arcs connecting i to j).</Paragraph>
      <Paragraph position="4"> The function L assigns a dependency type (arc label) r [?] R to every arc e [?] E.</Paragraph>
      <Paragraph position="5"> Definition 2 A dependency graph G is well- null formed if and only if: 1. The node 0 is a root.</Paragraph>
      <Paragraph position="6"> 2. Every node has in-degree at most 1.</Paragraph>
      <Paragraph position="7"> 3. G is connected.1 4. G is acyclic.</Paragraph>
      <Paragraph position="8"> 5. G is projective.2  Conditions 1-4, which are more or less standard in dependency parsing, together entail that the graph is a rooted tree. The condition of projectivity, by contrast, is somewhat controversial, since the analysis of certain linguistic constructions appears to 1To be more exact, we require G to be weakly connected, which entails that the corresponding undirected graph is connected, whereas a strongly connected graph has a directed path between any pair of nodes.</Paragraph>
      <Paragraph position="9"> 2An arc (i,j) is projective iff there is a path from i to every node k such that i &lt; j &lt; k or i &gt; j &gt; k. A graph G is projective if all its arcs are projective.</Paragraph>
      <Paragraph position="11"> require non-projective dependency arcs. For the purpose of this paper, however, this assumption is unproblematic, given that all the treebanks used in the experiments are restricted to projective dependency graphs.</Paragraph>
      <Paragraph position="12"> Figure 1 shows a well-formed dependency graph for an English sentence, where each word of the sentence is tagged with its part-of-speech and each arc labeled with a dependency type.</Paragraph>
    </Section>
    <Section position="2" start_page="317" end_page="317" type="sub_section">
      <SectionTitle>
2.2 Parsing Algorithm
</SectionTitle>
      <Paragraph position="0"> We begin by defining parser configurations and the abstract data structures needed for the definition of history-based feature models.</Paragraph>
      <Paragraph position="1"> Definition 3 Given a set R = {r0,r1,...rm} of dependency types and a sentence x = (w1,...,wn), a parser configuration for x is a  quadruple c = (s,t,h,d), where: 1. s is a stack of tokens nodes.</Paragraph>
      <Paragraph position="2"> 2. t is a sequence of token nodes.</Paragraph>
      <Paragraph position="3"> 3. h : V +x - V is a function from token nodes to nodes.</Paragraph>
      <Paragraph position="4"> 4. d : V +x - R is a function from token nodes to dependency types.</Paragraph>
      <Paragraph position="5"> 5. For every token node i [?] V +x , h(i) = 0 if  and only if d(i) = r0.</Paragraph>
      <Paragraph position="6"> The idea is that the sequence t represents the remaining input tokens in a left-to-right pass over the input sentence x; the stack s contains partially processed nodes that are still candidates for dependency arcs, either as heads or dependents; and the functions h and d represent a (dynamically defined) dependency graph for the input sentence x. We refer to the token node on top of the stack as the top token and the first token node of the input sequence as the next token.</Paragraph>
      <Paragraph position="7"> When parsing a sentence x = (w1,...,wn), the parser is initialized to a configuration c0 = (o,(1,...,n),h0,d0) with an empty stack, with all the token nodes in the input sequence, and with all token nodes attached to the special root node 0 with a special dependency type r0. The parser terminates in any configuration cm = (s,o,h,d) where the input sequence is empty, which happens after one left-to-right pass over the input.</Paragraph>
      <Paragraph position="8"> There are four possible parser transitions, two of which are parameterized for a dependency type r [?] R.</Paragraph>
      <Paragraph position="9">  1. LEFT-ARC(r) makes the top token i a (left) dependent of the next token j with dependency type r, i.e., j r- i, and immediately pops the stack.</Paragraph>
      <Paragraph position="10"> 2. RIGHT-ARC(r) makes the next token j a (right) dependent of the top token i with dependency type r, i.e., i r- j, and immediately pushes j onto the stack.</Paragraph>
      <Paragraph position="11"> 3. REDUCE pops the stack.</Paragraph>
      <Paragraph position="12"> 4. SHIFT pushes the next token i onto the stack.  The choice between different transitions is nondeterministic in the general case and is resolved by a classifier induced from a treebank, using features extracted from the parser configuration.</Paragraph>
    </Section>
    <Section position="3" start_page="317" end_page="318" type="sub_section">
      <SectionTitle>
2.3 Feature Models
</SectionTitle>
      <Paragraph position="0"> The task of the classifier is to predict the next transition given the current parser configuration, where the configuration is represented by a feature vector Ph(1,p) = (ph1,...,php). Each feature phi is a function of the current configuration, defined in terms of an address function aphi, which identifies a specific token in the current parser configuration, and an attribute function fphi, which picks out a specific attribute of the token.</Paragraph>
      <Paragraph position="1"> Definition 4 Let c = (s,t,h,d) be the current parser configuration.</Paragraph>
      <Paragraph position="2"> 1. For every i (i [?] 0), si and ti are address functions identifying the ith token of s and t, respectively (with indexing starting at 0).</Paragraph>
      <Paragraph position="3">  2. If a is an address function, then h(a), l(a), and r(a) are address functions, identifying the head (h), the leftmost child (l), and the rightmost child (r), of the token identified by a (according to the function h).</Paragraph>
      <Paragraph position="4"> 3. If a is an address function, then p(a), w(a)  and d(a) are feature functions, identifying the part-of-speech (p), word form (w) and dependency type (d) of the token identified by a. We call p, w and d attribute functions.</Paragraph>
      <Paragraph position="5"> A feature model is defined by specifying a vector of feature functions. In section 4.2 we will define the feature models used in the experiments.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="318" end_page="318" type="metho">
    <SectionTitle>
3 Learning Algorithms
</SectionTitle>
    <Paragraph position="0"> The learning problem for inductive dependency parsing, defined in the preceding section, is a pure classification problem, where the input instances are parser configurations, represented by feature vectors, and the output classes are parser transitions. In this section, we introduce the two machine learning methods used to solve this problem in the experiments.</Paragraph>
    <Section position="1" start_page="318" end_page="318" type="sub_section">
      <SectionTitle>
3.1 MBL
</SectionTitle>
      <Paragraph position="0"> MBL is a lazy learning method, based on the idea that learning is the simple storage of experiences in memory and that solving a new problem is achieved by reusing solutions from similar previously solved problems (Daelemans and Van den Bosch, 2005). In essence, this is a k nearest neighbor approach to classification, although a variety of sophisticated techniques, including different distance metrics and feature weighting schemes can be used to improve classification accuracy.</Paragraph>
      <Paragraph position="1"> For the experiments reported in this paper we use the TIMBL software package for memory-based learning and classification (Daelemans and Van den Bosch, 2005), which directly handles multi-valued symbolic features. Based on results from previous optimization experiments (Nivre et al., 2004), we use the modified value difference metric (MVDM) to determine distances between instances, and distance-weighted class voting for determining the class of a new instance. The parameters varied during experiments are the number k of nearest neighbors and the frequency threshold l below which MVDM is replaced by the simple Overlap metric.</Paragraph>
    </Section>
    <Section position="2" start_page="318" end_page="318" type="sub_section">
      <SectionTitle>
3.2 SVM
</SectionTitle>
      <Paragraph position="0"> SVM in its simplest form is a binary classifier that tries to separate positive and negative cases in training data by a hyperplane using a linear kernel function. The goal is to find the hyperplane that separates the training data into two classes with the largest margin. By using other kernel functions, such as polynomial or radial basis function (RBF), feature vectors are mapped into a higher dimensional space (Vapnik, 1998; Kudo and Matsumoto, 2001). Multi-class classification with n classes can be handled by the one-versus-all method, with n classifiers that each separate one class from the rest, or the one-versus-one method, with n(n [?] 1)/2 classifiers, one for each pair of classes (Vural and Dy, 2004). SVM requires all features to be numerical, which means that symbolic features have to be converted, normally by introducing one binary feature for each value of the symbolic feature.</Paragraph>
      <Paragraph position="1"> For the experiments reported in this paper we use the LIBSVM library (Wu et al., 2004; Chang and Lin, 2005) with the polynomial kernel K(xi,xj) = (gxTi xj +r)d,g &gt; 0, where d, g and r are kernel parameters. Other parameters that are varied in experiments are the penalty parameter C, which defines the tradeoff between training error and the magnitude of the margin, and the termination criterion o, which determines the tolerance of training errors.</Paragraph>
      <Paragraph position="2"> We adopt the standard method for converting symbolic features to numerical features by binarization, and we use the one-versus-one strategy for multi-class classification. However, to reduce training times, we divide the training data into smaller sets, according to the part-of-speech of the next token in the current parser configuration, and train one set of classifiers for each smaller set. Similar techniques have previously been used by Yamada and Matsumoto (2003), among others, without significant loss of accuracy. In order to avoid too small training sets, we pool together all parts-of-speech that have a frequency below a certain threshold t (set to 1000 in all the experiments).</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="318" end_page="320" type="metho">
    <SectionTitle>
4 Experimental Setup
</SectionTitle>
    <Paragraph position="0"> In this section, we describe the experimental setup, including data sets, feature models, parameter optimization, and evaluation metrics. Experimental results are presented in section 5.</Paragraph>
    <Section position="1" start_page="319" end_page="319" type="sub_section">
      <SectionTitle>
4.1 Data Sets
</SectionTitle>
      <Paragraph position="0"> The data set used for Swedish comes from Talbanken (Einarsson, 1976), which contains both written and spoken Swedish. In the experiments, the professional prose section is used, consisting of about 100k words taken from newspapers, text-books and information brochures. The data has been manually annotated with a combination of constituent structure, dependency structure, and topological fields (Teleman, 1974). This annotation has been converted to dependency graphs and the original fine-grained classification of grammatical functions has been reduced to 17 dependency types. We use a pseudo-randomized data split, dividing the data into 10 sections by allocating sentence i to section i mod 10. Sections 1-9 are used for 9-fold cross-validation during development and section 0 for final evaluation.</Paragraph>
      <Paragraph position="1"> The English data are from the Wall Street Journal section of the Penn Treebank II (Marcus et al., 1994). We use sections 2-21 for training, section 0 for development, and section 23 for the final evaluation. The head percolation table of Yamada and Matsumoto (2003) has been used to convert constituent structures to dependency graphs, and a variation of the scheme employed by Collins (1999) has been used to construct arc labels that can be mapped to a set of 12 dependency types.</Paragraph>
      <Paragraph position="2"> The Chinese data are taken from the Penn Chinese Treebank (CTB) version 5.1 (Xue et al., 2005), consisting of about 500k words mostly from Xinhua newswire, Sinorama news magazine and Hong Kong News. CTB is annotated with a combination of constituent structure and grammatical functions in the Penn Treebank style, and has been converted to dependency graphs using essentially the same method as for the English data, although with a different head percolation table and mapping scheme. We use the same kind of pseudo-randomized data split as for Swedish, but we use section 9 as the development test set (training on section 1-8) and section 0 as the final test set (training on section 1-9).</Paragraph>
      <Paragraph position="3"> A standard HMM part-of-speech tagger with suffix smoothing has been used to tag the test data with an accuracy of 96.5% for English and 95.1% for Swedish. For the Chinese experiments we have used the original (gold standard) tags from the treebank, to facilitate comparison with results previously reported in the literature.</Paragraph>
      <Paragraph position="5"/>
    </Section>
    <Section position="2" start_page="319" end_page="319" type="sub_section">
      <SectionTitle>
4.2 Feature Models
</SectionTitle>
      <Paragraph position="0"> Table 1 describes the five feature models Ph1-Ph5 used in the experiments, with features specified in column 1 using the functional notation defined in section 2.3. Thus, p(s0) refers to the part-of-speech of the top token, while d(l(t0)) picks out the dependency type of the leftmost child of the next token. It is worth noting that models Ph1-Ph2 are unlexicalized, since they do not contain any features of the form w(a), while models Ph3-Ph5 are all lexicalized to different degrees.</Paragraph>
    </Section>
    <Section position="3" start_page="319" end_page="320" type="sub_section">
      <SectionTitle>
4.3 Optimization
</SectionTitle>
      <Paragraph position="0"> As already noted, optimization of learning algorithm parameters is a prerequisite for meaningful comparison of different algorithms, although an exhaustive search of the parameter space is usually impossible in practice.</Paragraph>
      <Paragraph position="1"> For MBL we have used the modified value difference metric (MVDM) and class voting weighted by inverse distance (ID) in all experiments, and performed a grid search for the optimal values of the number k of nearest neighbors and the frequency threshold l for switching from MVDM to the simple Overlap metric (cf.</Paragraph>
      <Paragraph position="2"> section 3.1). The best values are different for different combinations of data sets and models but are generally found in the range 3-10 for k and in the range 1-8 for l.</Paragraph>
      <Paragraph position="3"> The polynomial kernel of degree 2 has been used for all the SVM experiments, but the kernel parameters g and r have been optimized together with the penalty parameter C and the termination</Paragraph>
    </Section>
    <Section position="4" start_page="320" end_page="320" type="sub_section">
      <SectionTitle>
4.4 Evaluation Metrics
</SectionTitle>
      <Paragraph position="0"> The evaluation metrics used for parsing accuracy are the unlabeled attachment score ASU, which is the proportion of tokens that are assigned the correct head (regardless of dependency type), and the labeled attachment score ASL, which is the proportion of tokens that are assigned the correct head and the correct dependency type. We also consider the unlabeled exact match EMU, which is the proportion of sentences that are assigned a completely correct dependency graph without considering dependency type labels, and the labeled exact match EML, which also takes dependency type labels into account. Attachment scores are presented as mean scores per token, and punctuation tokens are excluded from all counts. For all experiments we have performed a McNemar test of significance at a = 0.01 for differences between the two learning methods. We also compare learning and parsing times, as measured on an AMD 64-bit processor running Linux.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML