File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/04/n04-1013_metho.xml
Size: 23,774 bytes
Last Modified: 2025-10-06 14:08:52
<?xml version="1.0" standalone="yes"?> <Paper uid="N04-1013"> <Title>Speed and Accuracy in Shallow and Deep Stochastic Parsing</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Stochastic Parsing with LFG </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Parsing with Lexical-Functional Grammar </SectionTitle> <Paragraph position="0"> The grammar used for this experiment was developed in the ParGram project (Butt et al., 2002). It uses LFG as a formalism, producing c(onstituent)-structures (trees) and f(unctional)-structures (attribute value matrices) as output. The c-structures encode constituency and linear order. F-structures encode predicate-argument relations and other grammatical information, e.g., number, tense, statement type. The XLE parser was used to produce packed representations, specifying all possible grammar analyses of the input.</Paragraph> <Paragraph position="1"> In our system, tokenization and morphological analysis are performed by finite-state transductions arranged in a compositional cascade. Both the tokenizer and the morphological analyzer can produce multiple outputs. For example, the tokenizer will optionaly lowercase sentence initial words, and the morphological analyzer will produce walk +Verb +Pres +3sg and walk +Noun +Pl for the input form walks. The resulting tokenized and morphologically analyzed strings are presented to the symbolic LFG grammar.</Paragraph> <Paragraph position="2"> The grammar can parse input that has XML delimited named entity markup: <company>Columbia Savings</company> is a major holder of so-called junk bonds. To allow the grammar to parse this markup, the tokenizer includes an additional tokenization of the strings whereby the material between the XML markup is treated as a single token with a special morphological tag (+NamedEntity). As a fall back, the tokenization that the string would have received without that markup is also produced. The named entities have a single multiword predicate. This helps in parsing both because it means that no internal structure has to be built for the predicate and because predicates that would otherwise be unrecognized by the grammar can be parsed (e.g., Cie.</Paragraph> <Paragraph position="3"> Financiere de Paribas). As described in section 5, it was also important to use named entity markup in these experiments to more fairly match the analyses in the PARC 700 dependency bank.</Paragraph> <Paragraph position="4"> To increase robustness, the standard grammar is augmented with a FRAGMENT grammar. This allows sentences to be parsed as well-formed chunks specified by the grammar, in particular as Ss, NPs, PPs, and VPs, with unparsable tokens possibly interspersed. These chunks have both c-structures and f-structures corresponding to them. The grammar has a fewest-chunk method for determining the correct parse.</Paragraph> <Paragraph position="5"> The grammar incorporates a version of Optimality Theory that allows certain (sub)rules in the grammar to be prefered or disprefered based on OT marks triggered by the (sub)rule (Frank et al., 1998). The Complete version of the grammar uses all of the (sub)rules in a multi-pass system that depends on the ranking of the OT marks in the rules. For example, topicalization is disprefered, but the topicalization rule will be triggered if no other parse can be built. A one-line rewrite of the Complete grammar creates a Core version of the grammar that moves the majority of the OT marks into the NOGOOD space. This effectively removes the (sub)rules that they mark from the grammar. So, for example, in the Core grammar there is no topicalization rule, and sentences with topics will receive a FRAGMENT parse. This single-pass Core grammar is smaller than the Complete grammar and hence is faster.</Paragraph> <Paragraph position="6"> The XLE parser also allows the user to adjust performance parameters bounding the amount of work that is done in parsing for efficient but accurate operation.</Paragraph> <Paragraph position="7"> XLE's ambiguity management technology takes advantage of the fact that relatively few f-structure constraints apply to constituents that are far apart in the c-structure, so that sentences are typically parsed in polynomial time even though LFG parsing is known to be an NP-complete problem. But the worst-case exponential behavior does begin to appear for some constructions in some sentences, and the computational effort is limited by a SKIMMING mode whose onset is controlled by a user-specified parameter. When skimming, XLE will stop processing the subtree of a constituent whenever the amount of work exceeds that user-specified limit. The subtree is discarded, and the parser will move on to another subtree. This guarantees that parsing will be finished within reasonable limits of time and memory but at a cost of possibly lower accuracy if it causes the best analysis of a constituent to be discarded. As a separate parameter, XLE also lets the user limit the length of medial constituents, i.e., constituents that do not appear at the beginning or the end of a sentence (ignoring punctuation). The rationale behind this heuristic is to limit the weight of constituents in the middle of the sentence but still to allow sentence-final heavy constituents. This discards constituents in a somewhat more principled way as it tries to capture the psycholinguistic tendency to avoid deep center-embedding.</Paragraph> <Paragraph position="8"> When limiting the length of medial constituents, cubic-time parsing is possible for sentences up to that length, even with a deep, non-context-free grammar, and linear parsing time is possible for sentences beyond that length.</Paragraph> <Paragraph position="9"> The Complete grammar achieved 100% coverage of section 23 as unseen unlabeled data: 79% as full parses, 21% FRAGMENT and/or SKIMMED parses.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Dynamic Programming for Estimation and Stochastic Disambiguation </SectionTitle> <Paragraph position="0"> The stochastic disambiguation model we employ defines an exponential (a.k.a. log-linear or maximum-entropy) probability model over the parses of the LFG grammar.</Paragraph> <Paragraph position="1"> The advantage of this family of probability distributions is that it allows the user to encode arbitrary properties of the parse trees as feature-functions of the probability model, without the feature-functions needing to be independent and non-overlapping. The general form of conditional exponential models is as follows:</Paragraph> <Paragraph position="3"> where Zl(y) = summationtextx[?]X(y)el*f(x) is a normalizing constant over the set X(y) of parses for sentence y, l is a vector of log-parameters, f is a vector of featurevalues, and l * f(x) is a vector dot product denoting the (log-)weight of parse x.</Paragraph> <Paragraph position="4"> Dynamic-programming algorithms that allow the efficient estimation and searching of log-linear models from a packed parse representation without enumerating an exponential number of parses have been recently presented by Miyao and Tsujii (2002) and Geman and Johnson (2002). These algorithms can be readily applied to the packed and/or-forests of Maxwell and Kaplan (1993), provided that each conjunctive node is annotated with feature-values of the log-linear model. In the notation of Miyao and Tsujii (2002), such a feature forest Ph is defined as a tuple <C,D,r,g,d> where C is a set of conjunctive nodes, D is a set of disjunctive nodes, r [?] C is the root node, g : D - 2C is a conjunctive daughter function, and d : C - 2D is a disjunctive daughter function.</Paragraph> <Paragraph position="5"> A dynamic-programming solution to the problem of finding most probable parses is to compute the weight phd of each disjunctive node as the maximum weight of its conjunctive daugher nodes, i.e.,</Paragraph> <Paragraph position="7"> and to recursively define the weight phc of a conjunctive node as the product of the weights of all its descendant disjunctive nodes and of its own weight:</Paragraph> <Paragraph position="9"> Keeping a trace of the maximally weighted choices in a computaton of the weightphr of the root conjunctive node r allows us to efficiently recover the most probable parse of a sentence from the packed representation of its parses.</Paragraph> <Paragraph position="10"> The same formulae can be employed for an efficient calculation of probabilistic expectations of feature-functions for the statistical estimation of the parameters l. Replacing the maximization in equation 1 by a summation defines the inside weight of disjunctive node. Correspondingly, equation 2 denotes the inside weight of a conjunctive node. The outside weight psc of a conjunctive node is defined as the outside weight of its disjunctive mother node(s):</Paragraph> <Paragraph position="12"> The outside weight of a disjunctive node is the sum of the product of the outside weight(s) of its conjunctive mother(s), the weight(s) of its mother(s), and the inside</Paragraph> <Paragraph position="14"> From these formulae, the conditional expectation of a feature-function fi can be computed from a chart with root node r for a sentence y in the following way:</Paragraph> <Paragraph position="16"> Formula 5 is used in our system to compute expectations for discriminative Bayesian estimation from partially labeled data using a first-order conjugate-gradient routine.</Paragraph> <Paragraph position="17"> For a more detailed description of the optimization problem and the feature-functions we use for stochastic LFG parsing see Riezler et al. (2002). We also employed a combined lscript1 regularization and feature selection technique described in Riezler and Vasserman (2004) that considerably speeds up estimation and guarantees small feature sets for stochastic disambiguation. In the experiments reported in this paper, however, dynamic programming is crucial for efficient stochastic disambiguation, i.e. to efficiently find the most probable parse from a packed parse forest that is annotated with feature-values.</Paragraph> <Paragraph position="18"> There are two operations involved in stochastic disambiguation, namely calculating feature-values from a parse forest and calculating node weights from a feature forest.</Paragraph> <Paragraph position="19"> Clearly, the first one is more expensive, especially for the extraction of values for non-local feature-functions over large charts. To control the cost of this computation, our stochastic disambiguation system includes a user-specified parameter for bounding the amount of work that is done in calculating feature-values. When the user-specified threshold for feature-value calculation is reached, this computation is discontinued, and the dynamic programming calculation for most-probable-parse search is computed from the current feature-value annotation of the parse forest. Since feature-value computation proceeds incrementally over the feature forest, i.e. for each node that is visited all feature-functions that apply to it are evaluated, a complete feature annotation can be guaranteed for the part of the and/or-forest that is visited until discontinuation. As discussed below, these parameters were set on a held-out portion of the PARC700 which was also used to set the Collins parameters.</Paragraph> <Paragraph position="20"> In the experiments reported in this paper, we used a threshold on feature-extraction that allowed us to cut off feature-extraction in 3% of the cases at no loss in accuracy. Overall, feature extraction and weight calculation accounted for 5% of the computation time in combined parsing and stochastic selection.</Paragraph> </Section> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 The Gold-Standard Dependency Bank </SectionTitle> <Paragraph position="0"> We used the PARC 700 Dependency Bank (DEPBANK) as the gold standard in our experiments. The DEPBANK consists of dependency annotations for 700 sentences that were randomly extracted from section 23 of the UPenn Wall Street Journal (WSJ) treebank. As described by (King et al., 2003), the annotations were boot-strapped by parsing the sentences with a LFG grammar and transforming the resulting f-structures to a collection of dependency triples in the DEPBANK format. To prepare a true gold standard of dependencies, the tentative set of dependencies produced by the robust parser was then corrected and extended by human validators2. In this format each triple specifies that a particular relation holds between a head and either another head or a feature value, for example, that the SUBJ relation holds between the heads run and dog in the sentence The dog ran. Average sentence length of sentences in DEPBANK is 19.8 words, and the average number of dependencies per sentence is 65.4.</Paragraph> <Paragraph position="1"> The corpus is freely available for research and evaluation, as are documentation and tools for displaying and pruning structures.3 In our experiments we used a Reduced version of the DEPBANK, including just the minimum set of dependencies necessary for reading out the central semantic relations and properties of a sentence. We tested against this Reduced gold standard to establish accuracy on a lower bound of the information that a meaning-sensitive application would require. The Reduced version contained all the argument and adjunct dependencies shown in Fig.</Paragraph> <Paragraph position="2"> 1, and a few selected semantically-relevant features, as shown in Fig. 2. The features in Fig. 2 were chosen be-</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Function Meaning adjunct adjuncts </SectionTitle> <Paragraph position="0"> aquant adjectival quantifiers (many, etc.) comp complement clauses (that, whether) conj conjuncts in coordinate structures focus int fronted element in interrogatives mod noun-noun modifiers number numbers modifying nouns obj objects obj theta secondary objects obl oblique obl ag demoted subject of a passive obl compar comparative than/as clauses poss possessives (John's book) pron int interrogative pronouns pron rel relative pronouns quant quantifiers (all, etc.) subj subjects topic rel fronted element in relative clauses xcomp non-finite complements verbal and small clauses cause it was felt that they were fundamental to the meaning of the sentences, and in fact they are required by the semantic interpreter we have used in a knowledge-based application (Crouch et al., 2002).</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Feature Meaning </SectionTitle> <Paragraph position="0"> adegree degree of adjectives and adverbs (positive, comparative, superlative) coord form form of a coordinating conjunction (e.g., and, or) det form form of a determiner (e.g., the, a) num number of nouns (sg, pl) number type cardinals vs. ordinals passive passive verb (e.g., It was eaten.) perf perfective verb (e.g., have eaten) precoord form either, neither prog progressive verb (e.g., were eating) pron form form of a pronoun (he, she, etc.) prt form particle in a particle verb (e.g., They threw it out.) stmt type statement type (declarative, interrogative, etc.) subord form subordinating conjunction (e.g. that) tense tense of the verb (past, present, etc.) Figure 2: Selected features for Reduced DEPBANK .</Paragraph> <Paragraph position="1"> As a concrete example, the dependency list in Fig. 3 is the Reduced set corresponding to the following sentence: He reiterated his opposition to such funding, but expressed hope of a compromise.</Paragraph> <Paragraph position="2"> An additional feature of the DEPBANK that is relevant to our comparisons is that dependency heads are represented by their standard citation forms (e.g. the verb swam in a sentence appears as swim in its dependencies). We believe that most applications will require a conversion to canonical citation forms so that semantic relations can be mapped into application-specific databases or ontologies. The predicates of LFG f-structures are already represented as citation forms; for a fair comparison we ran the leaves of the Collins tree through the same stemmer modules as part of the tree-to-dependency translation. We also note that proper names appear in the DEPBANK as single multi-word expressions without any internal structure. That is, there are no dependencies holding among the parts of people names (A. Boyd Simpson), company names (Goldman, Sachs & Co), and organization names (Federal Reserve). This multiword analysis was chosen because many applications do not require the internal structure of names, and the identification of named entities is now typically carried out by a separate non-syntactic pre-processing module. This was captured for the LFG parser by using named entity markup and for the Collins parser by creating complex word forms with a single POS tag (section 5).</Paragraph> <Paragraph position="4"/> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Conversion to Dependency Bank Format </SectionTitle> <Paragraph position="0"> A conversion routine was required for each system to transform its output so that it could be compared to the DEPBANK dependencies. While it is relatively straightforward to convert LFG f-structures to the dependency bank format because the f-structure is effectively a dependency format, it is more difficult to transform the output trees of the Model 3 Collins parser in a way that fairly allocates both credits and penalties.</Paragraph> <Paragraph position="1"> LFG Conversion We discarded the LFG tree structures and used a general rewriting system previously developed for machine translation to rewrite the relevant f-structure attributes as dependencies (see King et al. (2003)). The rewritings involved some deletions of irrelevant features, some systematic manipulations of the analyses, and some trivial respellings. The deletions involved features produced by the grammar but not included in the PARC 700 such as negative values of PASS, PERF, and PROG and the feature MEASURE used to mark measure phrases. The manipulations are more interesting and are necessary to map systematic differences between the analyses in the grammar and those in the dependency bank. For example, coordination is treated as a set by the LFG grammar but as a single COORD dependency with several CONJ relations in the dependency bank. Finally, the trivial rewritings were used to, for example, change STMT-TYPE decl in the grammar to STMT-TYPE declarative in the dependency bank. For the Reduced version of the PARC 700 substantially more features were deleted.</Paragraph> <Paragraph position="2"> Collins Model 3 Conversion An abbreviated representation of the Collins tree for the example above is shown in Fig. 4. In this display we have eliminated the head lexical items that appear redundantly at all the nonterminals in a head chain, instead indicating by a single number which daughter is the head. Thus, S~2 indicates that the head of the main clause is its second daughter, the VP, and its head is its first VP daughter. Indirectly, then, the lexical head of the S is the first verb reiterated.</Paragraph> <Paragraph position="3"> position to such funding, but expressed hope of a compromise. null The Model 3 output in this example includes standard phrase structure categories, indications of the heads, and the additional -A marker to distinguish arguments from adjuncts. The terminal nodes of this tree are inflected forms, and the first phase of our conversion replaces them with their citation forms (the verbs reiterate and express, and the decapitalized and standardized he for He and his). We also adjust for systematic differences in the choice of heads. The first conjunct tends to be marked as the head of a coordination in Model 3 output, whereas the dependency bank has a more symmetric representation: it introduces a new COORD head and connects that up to the conjunction, and it uses a separate CONJ relation for each of the coordinated items. Similarly, Model 3 identifies the syntactic markers to and that as the heads of complements, whereas the dependency bank treats these as selectional features and marks the main predicate of the complements as the head. These adjustments are carried out without penalty. We also compensate for the differences in the representation of auxiliaries: Model 3 treats these as main verbs with embedded complements instead of the PERF, PROG, and PASSIVE features of the DEP-BANK, and our conversion flattens the trees so that the features can be read off.</Paragraph> <Paragraph position="4"> The dependencies are read off after these and a few other adjustments are made. NPs under VPs are read off either as objects or adjuncts, depending on whether or not the NP is annotated with the argument indicator (-A) as in this example; the -A presumably would be missing in a sentence like John arrived Friday, and Friday would be treated as an ADJUNCT. Similarly, NP-As under S are read off as subject. In this example, however, this principle of conversion does not lead to a match with the dependency bank: in the DEPBANK grammatical relations that are factored out of conjoined structures are distributed back into those structures, to establish the correct semantic dependencies (in this case, that he is the subject of both reiterate and express and not of the introduced coord). We avoided the temptation of building coordinate distribution into the conversion routine because, first, it is not always obvious from the Model 3 output when distribution should take place, and second, that would be a first step towards building into the conversion routine the deep lexical and syntactic knowledge (essentially the functional component of our LFG grammar) that the shallow approach explicitly discounts4.</Paragraph> <Paragraph position="5"> For the same reasons our conversion routine does not identify the subjects of infinitival complements with particular arguments of matrix verbs. The Model 3 trees provide no indication of how this is to be done, and in many cases the proper assignment depends on lexical information about specific predicates (to capture, for example, the well-known contrast between promise and persuade).</Paragraph> <Paragraph position="6"> Model 3 trees also provide information about certain 4However, we did explore a few of these additional transformations and found only marginal F-score increases.</Paragraph> <Paragraph position="7"> long-distance dependencies, by marking with -g annotations the path between a filler and a gap and marking the gap by an explicit TRACE in the terminal string. The filler itself is not clearly identified, but our conversion treats all WH categories under SBAR as potential fillers and attempts to propagate them down the gap-chain to link them up to appropriate traces.</Paragraph> <Paragraph position="8"> In sum, it is not a trivial matter to convert a Model 3 tree to an appropriate set of dependency relations, and the process requires a certain amount of intuition and skill.</Paragraph> <Paragraph position="9"> For our experiments we tried to define a conversion that gives appropriate credit to the dependencies that can be read from the trees without relying on an undue amount of sophisticated linguistic knowledge5.</Paragraph> </Section> class="xml-element"></Paper>