<?xml version="1.0" standalone="yes"?>
<Paper uid="N03-3006">
  <Title>A low-complexity, broad-coverage probabilistic Dependency Parser for English</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Dependency Grammar
</SectionTitle>
    <Paragraph position="0"> This system quite strictly follows DG assumptions. Dependency Grammar (DG) is essentially a valency grammar in which the valency concept is extended from verbs to nouns and adjectives and finally to all word classes.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Relation to Constituency
</SectionTitle>
      <Paragraph position="0"> In its simplest definition, a projective DG is a binary version (except for valency, see 2.2) of a constituent grammar which only knows lexical items, which entails that * for every mother node, the mother node and exactly one of its daughters, the so-called head, are isomorphic null * projection is deterministic, endocentric and can thus not fail, which gives DG a robustness advantage * equivalent constituency CFG trees can be derived * it is in Chomsky Normal Form (CNF), the efficient CYK parsing algorithm can thus be used Any DG has an equivalent constituency counterpart (Covington, 1994). Figure 1 shows a dependency structure and its unlabeled constituency counterpart.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Valency as an isomorphism constraint
</SectionTitle>
      <Paragraph position="0"> Total equivalence between mother and head daughter could not prevent a verb from taking an infinite number of subjects or objects. Therefore, valency theory is as vital a part of DG as is its constituency counterpart, subcategorization. The manually written rules check the most obvious valency constraints. Verbal valency is modeled by a PCFG for VP production.</Paragraph>
      <Paragraph position="1"> What did you think Mary said</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.3 Functionalism
</SectionTitle>
      <Paragraph position="0"> DG was originally conceived to be a deep-syntactic, proto-semantic theory (Tesni`ere, 1959). The version of DG used here retains syntactic functions as dependency labels, like in LFG, which means that the dependency analyses returned by the parser are also a simple version of LFG f-structures, a hierarchy of syntactic relations between lexical heads which serves as a bridgehead to semantics. Functional DG only accepts content words as heads. This has the advantage that no empty heads (for example empty complementizers for zero-relatives) are needed. It also means that its syntactical structures are closer to argument-structure representation than traditional constituency-based structures such as those of GB or the Treebank. The closeness to argument-structure makes them especially useful for subsequent stages of knowledge management processing.</Paragraph>
      <Paragraph position="1"> A restricted use of Tesni`ere-style translations is also made. Adjectives outside a noun chunk may function as a nominal constituent (the poor/JJ are the saint/JJ). Participles may function as adjectives (Western industrialized/VBN countries). Present participles may also function as nouns (after winning/VBG the race).</Paragraph>
      <Paragraph position="2"> Traditional constituency analyses such as those in the Treebank contain many discontinuous constituents, also known as long-distance dependencies, expressed by the use of structure-copying methods. This parser deals with them by allowing non-projectivity in specific, well-defined situations, such as in WH-questions (Figure 2). But in order to keep complexity low, discontinuity is restricted to a minimum. Many long-distance dependencies are not strictly necessary. For example, the analysis of passive clauses does not need to involve discontinuity, in which a subordinate VP whose absent object is structure-shared with the subject of the superordinate VP. Because the verb form allows a clear identification of passive clauses, a surface analysis is sufficient, as long as an appropriate probability model is used. In this parser, passive subjects use their own probability model, which is completely distinct from active subjects.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.4 Mapping the Treebank to Functional
Dependency
</SectionTitle>
      <Paragraph position="0"> A popular query tool for the extraction of tree structures from Treebanks, tgrep, has been used for the mapping to dependencies. The mapping from a configurational paradigm to a functional one turns out to be non-trivial (Musillo and Sima'an, 2002). A relatively simple example, the verb-object (obj) relation is discussed now.</Paragraph>
      <Paragraph position="1"> In a first approximation, a verb-object relation holds between the head of a VP and the head of the NP immediately under the VP. In most cases, the VP head is the lowest verb and the NP head is the lowest rightmost noun.</Paragraph>
      <Paragraph position="2"> As tgrep seriously overgenerates, a large number of highly specific subqueries had to be used, specifying all possible configurations of arbitrarily nested NPs and VPs. Since hundreds of possible configurations are thus mapped onto one dependency relation, statistical models based on them are much less sparse than lexicalized PCFGs, which is an advantage as lexicalized models often suffer from sparseness. In order to extract relations compatible to the parser's treatment of conjunction and apposition, the queries had to be further specified, thereby missing few structures that should match.</Paragraph>
      <Paragraph position="3"> In order to restrict discontinuity to where it is strictly necessary, copular verb complements and small clause complements are also treated as objects. Since the function of such objects can be unambiguously derived from a verb's lexical entry this is a linguistically viable decision. The mapping from the Penn treebank to dependencies by means of tgrep is a close approximation but not a complete mapping. A few structures corresponding to a certain dependency are almost certain to be missed or doubled. Also, structures involving genuine discontinuity like the verb-object relation in figure 2 are not extracted.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Probabilistic Models of Language
</SectionTitle>
    <Paragraph position="0"> Writing grammar rules is an easy task for a linguist, particularly when using a framework that is close to traditional school grammar assumptions, such as DG. Acknowledged facts such as the one that a verb has typically one but never two subjects are expressed in hand-written declarative rules. The rules of this parser are based on the Treebank tags of heads of chunks. Since the tagset is limited and dependency rules are binary, even a broad-coverage set of rules can be written in relatively little time.</Paragraph>
    <Paragraph position="1"> What is much more difficult, also for a linguist, is to assess the scope of application of a rule and the amount of ambiguity it creates. Long real-world sentences typically have dozens to hundreds of syntactically correct complete analyses and thousands of partial analyses, although most of them are semantically so odd that one would never think of them. Here, machine-learning approaches, such as probabilizing the manually written rules, are vital to any parser, for two reasons: first, the syntactically possible analyses can be ranked according to their probabilities. For subsequent processing stages like semantic interpretation or document classification it then often suffices to take the first ranked or the n first ranked readings. Second, in the course of the parsing process, very improbable analyses can be abandoned, which greatly improves parsing efficiency (see section 4).</Paragraph>
    <Paragraph position="2"> The parser uses two linguistic probability models. The first one is based on the lexical probabilities of the heads of phrases. Two simple extensions for the interaction between several dependents of the same mother node are also used. The second probability model is a PCFG for the expansion of the VP.</Paragraph>
    <Paragraph position="3"> Since the parser aims at a global disambiguation, all local probabilities are stored in the parsing chart. The global probability of a parse is the product of all its local probabilities, a product of disambiguation decisions.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Lexical Dependencies
</SectionTitle>
      <Paragraph position="0"> Given two adjacent lexical heads (say a and b), the probabilities of the possible dependency relations between them are calculated as Maximum Likelihood (MLE) estimates. In a binary CFG, constituents which are adjacent at some stage in the parsing process are candidates for the right-hand side (RHS) of a rewrite rule. If a rule exists for these constituents (say A and B), then in a DG or in Bare Phrase Structure, one of these is isomorphic to the LHS, i.e. the head. DG rules additionally use a syntactic relation label R, for which the probabilities are calculated in this probability model. The dependency rules used are based on Treebank tags, the relation probabilities are conditioned on them and on the lexical heads.</Paragraph>
      <Paragraph position="2"> relation the dependency is towards the right, it is therefore rewritten as right.</Paragraph>
      <Paragraph position="4"> Such a probability model is used to model the local competition between object and adjunct relation (he left town vs. he left yesterday), in which the verb is always the left RHS constituent. But in some cases, the direction is also a parameter, for example in the subject-verb relation (she said versus said she). There, the probability space is divided into two equal sections.</Paragraph>
      <Paragraph position="6"> The PP-attachment model probabilities are conditioned on three lexical heads - the verb, the preposition and the description noun (Collins and Brooks, 1995). The probability model is backed off across several levels. In addition to backing off to only partly lexicalized counts (ibid.), semantic classes are also used in all the modeled relations, for verbs the Levin classes (Levin, 1993), for nouns the top Wordnet class (Fellbaum, 1998) of the most frequent sense. As an alternative to backing-off, linear interpolation with the back-off models has also been tried, but the difference in performance is very small.</Paragraph>
      <Paragraph position="7"> A large subset of syntactic relations, the ones which are considered to be most relevant for argument structure, are modeled, specifically:  verb-subord. clause sentobj saw (they) came verb-prep. phrase pobj slept in bed noun-prep. phrase modpp draft of paper noun-participle modpart report written verb-complementizer compl to eat apples noun-preposition prep to the house Until now one relation has two distinct probability models: verb-subject is different for active and passive verbs, henceforth referred to as asubj and psubj, where needed. The disambiguation between complementizer and preposition is necessary as the Treebank tagset unfortunately uses the same tag (IN) for both. Many relations have slightly individualized models. As an example the modpart relation will be discussed in detail.</Paragraph>
      <Paragraph position="8"> 3.1.1 An Example: Modification by Participle The noun-participle relation is also known as reduced relative clause. In the Treebank, reduced relative clauses are adjoined to the NP they modify, and under certain conditions also have an explicit RRC label. Reduced relative clauses are frequent enough to warrant a probabilistic treatment, but considerably sparser than verb-non-passive-subject or verb-object relations. They are in direct competition with the subject-verb relation, because its candidates are also a NP followed by a VP. We probably have a subject-verb relation in the report announced the deal and a noun-participle relation in the report announced yesterday. The majority of modification by participle relations, if the participle is a past participle, functionally correspond to passive constructions (the report written [?]= the report which has been written). In order to reduce data sparseness, which could lead to giving preference to a verb-non-passive-subject reading (asubj), the verb-passive-subject counts (psubj) are added to the noun-participle counts. Some past participles also express adjunct readings (the week ended Friday); therefore the converse, i.e. adding noun-participle counts to verb-passive-subject counts, is not recommended.</Paragraph>
      <Paragraph position="9"> The next back-off step maps the noun a to its Wordnetclass @a and the verb b to its Levin-class @b. If the counts are still zero, counts on only the verb and then only the noun are used.</Paragraph>
      <Paragraph position="11"> As the last backoff, a low non-zero probability is assigned. In the verb-adjunct relation, which drastically increases complexity but can only occur with a closed class of nouns (mostly adverbial expressions of time), this last backoff is not used.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Interaction between Several Dependents
</SectionTitle>
      <Paragraph position="0"> For the verb-prepositional-phrase relation, two models that take the interaction between the several PPs of the same verb into account have been implemented. They are based on the verbal head and the prepositions.</Paragraph>
      <Paragraph position="1"> The first one estimates the probability of attaching a PP introduced by preposition p2, given that the verb to which it could be attached already has another PP introduced by the preposition p1. Back-offs using the verb-class @v and then the preposition(s) only are used.</Paragraph>
      <Paragraph position="3"> The second model estimates the probability of attaching a PP introduced by preposition p2 as a non-first PP.</Paragraph>
      <Paragraph position="4"> The usual backoffs are not printed here.</Paragraph>
      <Paragraph position="6"> As prepositions are a closed class, a zero probability is assigned if the last back-offs fail.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 PCFG for Verbal Subcategorization and VP
Production
</SectionTitle>
      <Paragraph position="0"> Verbs often have several dependents. Ditransive verbs, for example, have up to three NP complements, the subject, the direct object and the indirect object. An indeterminate number of adjuncts can be added. Transitivity, expressed by a verb's subcategorization, is strongly lexicalized. But because the Treebank does not distinguish arguments and complements, and because a standard lexicon does not contain probabilistic subcategorization, a probabilistic model has advantages. Dependency models as discussed hitherto fail to model complex dependencies between the dependents of the same mother, unlike PCFGs. A simple PCFG model for the production of the VP rule which is lexicalized on the VP head and has a non-lexicalized backoff, is therefore used. RHS constituents C, for the time being, are unlexicalized phrasal categories like NP,PP, Comma, etc. At some stage in the parsing process, given an attachment candidate Cn and a verbal head v which already has attached constituents C1 to Cn[?]1, the probability of attaching Cn is estimated. This probability can also be seen as the probability of continuing versus ending the VP under production. null</Paragraph>
      <Paragraph position="2"/>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Implementation
</SectionTitle>
    <Paragraph position="0"> The parser has been implemented in Prolog, it runs in SWI-Prolog and Sicstus Prolog. For SWI-Prolog, a graphical interface has also been programmed in XPCE1.</Paragraph>
    <Paragraph position="1">  If no analysis spanning the entire length of the sentence can be found, an optimal path of partial structures spanning as much of the sentence as possible is searched. The algorithm devised for this accepts the first-ranked of the longest of all the partial analyses found, say S. Then, it recursively searches for the first-ranked of the longest of the partial analyses found to the left and to the right of S, and so on, until all or most of the sentence is spanned.</Paragraph>
    <Paragraph position="2"> The parser uses the preprocessed input of a finite-state tagger-chunker. Finite-state technology is fast enough for unlimited amounts of data, taggers and chunkers are known to be reliable but not error-free, with typical error rates between 2 and 5 %. Tagging and chunking is done by a standard tagger and chunker, LTPos (Mikheev, 1997). Heads are extracted from the chunks and lemmatized (Minnen et al., 2000). Parsing takes place only between the heads of phrases, and only using the best tag suggested by the tagger, which leads to a reduction in complexity. The parser uses the CYK algorithm, which has parsing complexity of O(n3), where n is the number of words in a word-based, but only chunks in a headof-chunk-based model. The chunk to word relation is 1.52 for Treebank section 0. In a test with a toy NP and verb-group grammar parsing was about 4 times slower when using unchunked input. Due to the insufficiency of the toy grammar the lingusitic quality and the number of complete parses decreased. The average number of tags per token is 2.11 for the entire Treebank. With untagged input, every possible tag would have to be taken into consideration. Although untested, at least a similar slowdown as for unchunked input can be expected.</Paragraph>
    <Paragraph position="3"> In a hand-written grammar, some typical parsing errors can be corrected by the grammar engineer, or rules can explicitly ignore particularly error-prone distinctions. Examples of rules that can correct tagging errors without introducing many new errors are allowing VBD to act as a participle or the possible translation of VBG to an adjective. As an example of ignoring error-prone distinctions, the disambiguation between prepositions and verbal particles is unreliable. The grammar therefore makes no distinction and treats all verbal particles as prepositions, which leads to an incorrect but consistent analysis for phrasal verbs. A hand-written grammar allows to model complex but important phenomena which overstep manageable ML search spaces, such as discontinous analysis of questions can be expressed, while on the other hand rare and marginal rules can be left out to free resources. For tagging, (Samuelsson and Voutilainen, 1997) have shown that a manually built tagger can equal a statistical tagger.</Paragraph>
  </Section>
class="xml-element"></Paper>