<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-2923">
  <Title>LingPars, a Linguistically Inspired, Language-Independent Machine Learner for Dependency Treebanks</Title>
  <Section position="3" start_page="0" end_page="171" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> This paper describes LingPars, a Constraint Grammar-inspired, language-independent treebank-learner developed from scratch between January 9th and March 9th 2006 in the context of the CoNLL-X 2006 shared task (http://nextens.uvt.nl/~conll/), organized by Sabine Buchholz, Erwin Marsi, Yuval Krymolowski and Amit Dubey. Training treebanks and test data were provided for 13 different languages: Arabic (Smrz et al. 2002), Chinese (Chen et al. 2003), Czech (Hajic et al. 2001), Danish (Kromann 2003), Dutch (van der Beek et al. 2002), German (Brants et al. 2002), Japanese (Kawata and Bartels), Portuguese (Afonso et al. 2002), Slovene (Dzeroski et al. 2006), Spanish (Palomar et al.</Paragraph>
    <Paragraph position="1"> 2004), Swedish (Nilsson et al. 2005), Turkish (Oflazer et al. 2003 and Nart et al. 2003), Bulgarian (Simov et al. 2005). A number of these treebanks were not originally annotated in dependency style, but were transformed from constituent-tree style for the task, and all differ widely in terms of tag granularity (21-302 part-of-speech tags, 7-82 function labels). Also, not all treebanks included morphological information, and only half offered a lemma field. Such descriptive variation proved to be a considerable constraint for our parser design, as will be explained in section 2. No external resources and no structural preprocessing were used1.</Paragraph>
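The descriptive variation mentioned above (missing morphology, optional lemma field) shows up directly in the shared-task data. A minimal sketch of reading one token line, assuming the standard tab-separated CoNLL-X column layout (the Danish example line and the function name parse_token are invented here for illustration):

```python
# Standard CoNLL-X column names, one tab-separated token per line.
CONLL_FIELDS = ["id", "form", "lemma", "cpostag", "postag",
                "feats", "head", "deprel", "phead", "pdeprel"]

def parse_token(line):
    """Split a CoNLL-X token line into a field dict."""
    values = line.rstrip("\n").split("\t")
    token = dict(zip(CONLL_FIELDS, values))
    # LEMMA and FEATS may be absent in some treebanks: the underscore
    # placeholder marks missing information; FEATS values are |-separated.
    token["feats"] = [] if token["feats"] == "_" else token["feats"].split("|")
    return token

# Invented Danish-style example line:
tok = parse_token("1\tKatten\tkat\tN\tNC\tdef=yes|num=sg\t2\tsubj\t_\t_")
print(tok["form"], tok["head"], tok["deprel"])
```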
    <Paragraph position="2"> 2 Language independence versus theory independence
While manual annotation and/or linguistic, rule-based parsers are necessary for the creation of its training data, only a machine-learning-based parser (as targeted in the CoNLL shared task) can hope to be truly language-independent in its design. The question is, however, whether this necessarily implies independence of linguistic/descriptive theory.</Paragraph>
    <Paragraph position="3"> In our own approach, LingPars, we thus departed from the Constraint Grammar descriptive model (Karlsson et al. 2005), where syntactic function tags (called DEPREL, or dependency relations, in the shared task) rank higher than dependency/constituency and are established before head attachments, rather than vice versa (as would be the case for many probabilistic, chunker-based systems, or the classical Penn Treebank descriptive model)1.</Paragraph>
    <Paragraph position="4"> In our hand-written, rule-based parsers, dependency treebanks are constructed by using sequential attachment rules, generally attaching functions (e.g. subject, object, postnominal) to forms (finite verb, noun) or lexical tags (tense, auxiliary, transitive), with a direction condition and the possibility of added target, context or barrier conditions (Bick 2005). [Footnote 1: The only exception is what we consider a problem in the dependency version of the German TIGER treebank, where postnominal attributes of nouns appear as dependents of that noun's head if the latter is a preposition, but not otherwise (e.g. if the head's head is a preposition). LingPars failed to learn this somewhat idiosyncratic distinction, but performance improved when the analysis was preprocessed with an additional np-layer (to be re-flattened after parsing).]</Paragraph>
    <Paragraph position="5"> In LingPars, we tried to mimic this methodology by trying to learn probabilities for both CG-style syntactic-function contexts and function-to-form attachment rules. We could not, however, implement the straightforward idea of learning probabilities and optimal ordering for an existing body of (manual) seeding rules, because the 13 treebanks were not harmonized in their tag sets and descriptive conventions2.</Paragraph>
    <Paragraph position="6"> As an example, imagine a linguistic rule that triggers "subclause-hood" for a verb-headed dependency node as soon as a subordinator attaches to it, and then, implementing "subclause-hood", tries to attach the verb not to the root, but to another verb left of the subordinator, or, to the right, to a root-attaching verb. For the given set of treebanks, probabilities and ordering priorities for this rule cannot be learned by one and the same parser, simply because some treebanks attach the verb to the subordinator rather than vice versa, and, for verb chains, there is no descriptive consensus as to whether the auxiliary/construction verb (e.g. Spanish) or the main verb (e.g. Swedish) is regarded as head.</Paragraph>
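The descriptive divergence can be made concrete with a toy example. The clause and the head assignments below are invented for illustration (they are not drawn from the actual treebanks), but they show why a single learned attachment probability cannot satisfy both conventions:

```python
# Toy clause: "because it rains", under two treebank conventions.
clause = ["because", "it", "rains"]

# Convention A: the subordinator governs the verb, so the subclause
# attaches to the matrix clause via "because".
heads_a = {"rains": "because", "it": "rains", "because": "MATRIX-VERB"}

# Convention B: the verb governs the subordinator, so the subclause
# attaches via "rains".
heads_b = {"because": "rains", "it": "rains", "rains": "MATRIX-VERB"}

# The verb/subordinator edge points in opposite directions, so a parser
# that learns one direction for this configuration mislearns the other:
print(heads_a["rains"], "vs", heads_b["rains"])
```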
  </Section>
</Paper>