<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-2932">
<Title>Multilingual Dependency Analysis with a Two-Stage Discriminative Parser</Title>
<Section position="4" start_page="216" end_page="216" type="metho">
<SectionTitle> 2 Stage 1: Unlabeled Parsing </SectionTitle>
<Paragraph position="0"> The first stage of our system creates an unlabeled parse y for an input sentence x. This system is primarily based on the parsing models described by McDonald and Pereira (2006). That work extends the maximum spanning tree dependency parsing framework (McDonald et al., 2005a; McDonald et al., 2005b) to incorporate features over multiple edges in the dependency graph. An exact projective and an approximate non-projective parsing algorithm are presented, since it is shown that non-projective dependency parsing becomes NP-hard when features are extended beyond a single edge.</Paragraph>
<Paragraph position="1"> That system uses MIRA, an online large-margin learning algorithm, to compute model parameters.</Paragraph>
<Paragraph position="2"> Its power lies in the ability to define a rich set of features over parsing decisions, as well as surface-level features relative to these decisions. For instance, the system of McDonald et al. (2005a) incorporates features over the parts of speech of words occurring between and around a possible head-dependent relation. These features are highly important to overall accuracy since they eliminate unlikely scenarios such as a preposition modifying a noun not directly to its left, or a noun modifying a verb with another verb occurring between them.</Paragraph>
<Paragraph position="3"> We augmented this model to incorporate morphological features derived from each token. Consider a proposed dependency of a dependent x_j on the head x_i, each with morphological features M_j and M_i respectively. We then add to the representation of the edge: M_i as head features, M_j as dependent features, and also each conjunction of a feature from both sets. These features play the obvious role of explicitly modeling consistencies and commonalities between a head and its dependents in terms of attributes like gender, case, or number. Not all data sets in our experiments include morphological features, so we use them only when available.</Paragraph>
</Section>
<Section position="5" start_page="216" end_page="217" type="metho">
<SectionTitle> 3 Stage 2: Label Classification </SectionTitle>
<Paragraph position="0"> The second stage takes the output parse y for sentence x and classifies each edge (i,j) ∈ y with a particular label l(i,j). Ideally one would like to make all parsing and labeling decisions jointly so that the shared knowledge of both decisions will help resolve any ambiguities. However, the parser is fundamentally limited by the scope of the local factorizations that make inference tractable. In our case this means we are forced to consider features over only single edges or pairs of edges. However, in a two-stage system we can incorporate features over the entire output of the unlabeled parser, since that structure is fixed as input. The simplest labeler would be to take as input an edge (i,j) ∈ y for sentence x and find the label with highest score,</Paragraph>
<Paragraph position="1"> l(i,j) = argmax_l s(l, (i,j), y, x) </Paragraph>
<Paragraph position="2"> Doing this for each edge in the tree would produce the final output. Such a model could easily be trained using the provided training data for each language; a minimal sketch of this per-edge labeler is given below.
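To make the per-edge labeler concrete, here is a minimal sketch under simplifying assumptions: each edge (i,j) of the fixed unlabeled parse is labeled independently with the argmax of a dot-product score. The toy feature template, the sentence representation (a list of token dictionaries), and all function names are illustrative assumptions, not the feature set or code actually used in the system.

```python
# Minimal sketch of the per-edge labeler: label each edge of the fixed parse y
# independently with the highest-scoring label under a linear (dot-product) score.
# The feature template below is an assumption for illustration only.

def edge_features(label, head, dep, sentence):
    """Sparse binary features for assigning `label` to the edge (head, dep)."""
    h, d = sentence[head], sentence[dep]          # tokens as {"form": ..., "pos": ...}
    direction = "R" if dep > head else "L"
    return {
        f"hpos={h['pos']}|dpos={d['pos']}|lab={label}",
        f"hword={h['form']}|lab={label}",
        f"dword={d['form']}|lab={label}",
        f"dir={direction}|lab={label}",
    }

def score(features, weights):
    """Dot product between a sparse binary feature set and a weight dictionary."""
    return sum(weights.get(f, 0.0) for f in features)

def label_edges(parse, sentence, labels, weights):
    """parse: list of (head, dep) index pairs of the unlabeled tree.
    Returns a dict mapping each edge to its highest-scoring label."""
    return {
        (h, d): max(labels, key=lambda lab: score(edge_features(lab, h, d, sentence), weights))
        for (h, d) in parse
    }
```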
However, it might be advantageous to know the labels of other nearby edges. For instance, if we consider a head x_i with dependents x_{j_1}, ..., x_{j_M}, it is often the case that many of these dependencies will have correlated labels. To model this we treat the labeling of the edges (i,j_1), ..., (i,j_M) as a sequence labeling problem,</Paragraph>
<Paragraph position="3"> (l(i,j_1), ..., l(i,j_M)) = argmax_{l̄} s(l̄, i, y, x), where l̄ ranges over label sequences for these M edges. </Paragraph>
<Paragraph position="4"> We use a first-order Markov factorization of the score,</Paragraph>
<Paragraph position="5"> s(l̄, i, y, x) = Σ_{m=2}^{M} s(l(i,j_m), l(i,j_{m-1}), i, y, x) </Paragraph>
<Paragraph position="6"> in which each factor is the score of labeling the adjacent edges (i,j_m) and (i,j_{m-1}) in the tree y. We attempted higher-order Markov factorizations, but they did not improve performance uniformly across languages and training became significantly slower.</Paragraph>
<Paragraph position="7"> For score functions, we use simple dot products between high-dimensional feature representations and a weight vector,</Paragraph>
<Paragraph position="8"> s(l(i,j_m), l(i,j_{m-1}), i, y, x) = w · f(l(i,j_m), l(i,j_{m-1}), i, y, x) </Paragraph>
<Paragraph position="9"> Assuming we have an appropriate feature representation, we can find the highest-scoring label sequence with the Viterbi algorithm (a minimal sketch of this decoding step is given below). We use the MIRA online learner to set the weights (Crammer and Singer, 2003; McDonald et al., 2005a), since we found that it trained quickly and provided good performance. Furthermore, it made the system homogeneous in terms of learning algorithms, since MIRA is also used to train our unlabeled parser (McDonald and Pereira, 2006). Of course, we have to define a set of suitable features. We used the following:
* Edge Features: Word/prefix/suffix/part-of-speech (POS)/morphological feature identity of the head and the dependent (affix lengths 2 and 3). Do the head and its dependent share a prefix/suffix? Attachment direction. Which morphological features do the head and the dependent have the same value for? Is the dependent the first/last word in the sentence?
* Sibling Features: Word/POS/prefix/suffix/morphological feature identity of the dependent's nearest left/right siblings in the tree (siblings are words with the same parent in the tree). Do any of the dependent's siblings share its POS?
* Context Features: POS tag of each intervening word between head and dependent. Do any of the words between the head and the dependent have a parent other than the head? Are any of the words between the head and the dependent not a descendant of the head (i.e., a non-projective edge)?
* Non-local: How many children does the dependent have? Which morphological features do the grandparent and the dependent have identical values for? Is this the left/rightmost dependent for the head? Is this the first dependent to the left/right of the head?
Various conjunctions of these were included based on performance on held-out data. Note that many of these features are beyond the scope of the edge-based factorizations of the unlabeled parser. Thus a joint model of parsing and labeling could not easily include them without some form of re-ranking or approximate parameter estimation.</Paragraph>
</Section>
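Because the factorization above only couples the labels of adjacent edges (i,j_{m-1}) and (i,j_m) under the same head, decoding reduces to ordinary first-order Viterbi over a head's ordered dependents. The sketch below is a minimal illustration under stated assumptions, not the system's implementation; `factor_score(prev_label, label, m)` is a stand-in for the dot product w · f(l(i,j_m), l(i,j_{m-1}), i, y, x), with its arguments simplified to the two labels and the position.

```python
# Minimal first-order Viterbi sketch over the labels of one head's dependents.
# Under the factorization above, the total score is the sum of pairwise factors
# for m = 2..M, so the first edge contributes no separate unary term here.

def viterbi_labels(num_edges, labels, factor_score):
    """Return the highest-scoring label sequence for edges 1..num_edges of one head."""
    best = [{lab: 0.0 for lab in labels}]   # best[m][lab]: best score with edge m labeled lab
    back = []                               # back[m-1][lab]: best label for the previous edge
    for m in range(1, num_edges):
        scores, pointers = {}, {}
        for lab in labels:
            prev = {p: best[m - 1][p] + factor_score(p, lab, m) for p in labels}
            pointers[lab] = max(prev, key=prev.get)
            scores[lab] = prev[pointers[lab]]
        best.append(scores)
        back.append(pointers)
    # Follow back-pointers from the best final label to recover the sequence.
    seq = [max(best[-1], key=best[-1].get)]
    for pointers in reversed(back):
        seq.append(pointers[seq[-1]])
    return list(reversed(seq))
```

In the full system each factor would be scored with the edge, sibling, context, and non-local features listed above, with the weight vector learned by MIRA.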
<Section position="6" start_page="217" end_page="217" type="metho">
<SectionTitle> 4 Results </SectionTitle>
<Paragraph position="0"> We trained models for all 13 languages provided by the CoNLL organizers (Buchholz et al., 2006).</Paragraph>
<Paragraph position="1"> Based on performance on a held-out section of the training data, we used non-projective parsing algorithms for Czech, Danish, Dutch, German, Japanese, Portuguese and Slovene, and projective parsing algorithms for Arabic, Bulgarian, Chinese, Spanish, Swedish and Turkish. Furthermore, for Arabic and Spanish, we used lemmas instead of inflected word forms, again based on performance on held-out data.</Paragraph>
[Table 1: Unlabeled (UA) and Labeled Accuracy (LA).]
<Paragraph position="3"> Results on the test set are given in Table 1. Performance is measured by unlabeled accuracy, the percentage of words that modify the correct head in the dependency graph, and labeled accuracy, the percentage of words that modify the correct head and label the dependency edge correctly. These results show that the discriminative spanning tree parsing framework (McDonald et al., 2005b; McDonald and Pereira, 2006) is easily adapted to all of these languages. Only Arabic, Turkish and Slovene have parsing accuracies significantly below 80%; these languages have relatively small training sets and/or are highly inflected with little to no word-order constraints. Furthermore, these results show that a two-stage system can achieve relatively high performance. In fact, for every language our models perform significantly better than the average performance of all the systems reported in Buchholz et al. (2006).</Paragraph>
<Paragraph position="4"> For the remainder of the paper we provide a general error analysis across a wide set of languages plus a detailed error analysis of Spanish and Arabic.</Paragraph>
</Section>
</Paper>