<?xml version="1.0" standalone="yes"?>
<Paper uid="P02-1035">
<Title>Parsing the Wall Street Journal using a Lexical-Functional Grammar and Discriminative Estimation Techniques</Title>
<Section position="3" start_page="0" end_page="0" type="metho">
<SectionTitle> 2 Robust Parsing using LFG </SectionTitle>
<Paragraph position="0"/>
<Section position="1" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 2.1 A Broad-Coverage LFG </SectionTitle>
<Paragraph position="0"> The grammar used for this project was developed in the ParGram project (Butt et al., 1999). It uses LFG as a formalism, producing c(onstituent)-structures (trees) and f(unctional)-structures (attribute-value matrices) as output. The c-structures encode constituency; f-structures encode predicate-argument relations and other grammatical information, e.g., number and tense. The XLE parser (Maxwell and Kaplan, 1993) was used to produce packed representations, specifying all possible grammar analyses of the input.</Paragraph>
<Paragraph position="1"> The grammar has 314 rules with regular-expression right-hand sides which compile into a collection of finite-state machines with a total of 8,759 states and 19,695 arcs. The grammar uses several lexicons and two guessers: one guesser for words recognized by the morphological analyzer but not in the lexicons, and one for those not recognized.</Paragraph>
<Paragraph position="2"> As such, most nouns, adjectives, and adverbs have no explicit lexical entry. The main verb lexicon contains 9,652 verb stems and 23,525 subcategorization frame-verb stem entries; there are also lexicons for adjectives and nouns with subcategorization frames and for closed-class items.</Paragraph>
<Paragraph position="3"> For estimation purposes using the WSJ treebank, the grammar was modified to parse part-of-speech tags and labeled bracketing. A stripped-down version of the WSJ treebank was created that used only those POS tags and labeled brackets relevant for determining grammatical relations. The WSJ labeled brackets are given LFG lexical entries which constrain both the c-structure and the f-structure of the parse. For example, the WSJ's ADJP-PRD label must correspond to an AP in the c-structure and an XCOMP in the f-structure. In this version of the corpus, all WSJ labels with -SBJ are retained and are restricted to phrases corresponding to SUBJ in the LFG grammar; in addition, it contains NP under VP (OBJ and OBJth in the LFG grammar), all -LGS tags (OBL-AG), all -PRD tags (XCOMP), VP under VP (XCOMP), SBAR- (COMP), and verb POS tags under VP (V in the c-structure). For example, our labeled bracketing of wsj_1305.mrg is [NP-SBJ His credibility] is/VBZ also [PP-PRD on the line] in the investment community.</Paragraph>
<Paragraph position="4"> Some mismatches between the WSJ labeled bracketing and the LFG grammar remain. These often arise when a given constituent fills a grammatical role in more than one clause. For example, in wsj_1303.mrg Japan's Daiwa Securities Co. named Masahiro Dozen president., the noun phrase Masahiro Dozen is labeled as an NP-SBJ. However, the LFG grammar treats it as the OBJ of the matrix clause. As a result, the labeled-bracketed version of this sentence does not receive a full parse, even though its unlabeled, string-only counterpart is well-formed. Some other bracketing mismatches remain, usually the result of adjunct attachment. Such mismatches occur in part because, besides minor modifications to match the bracketing for special constructions, e.g., negated infinitives, the grammar was not altered to mirror the idiosyncrasies of the WSJ bracketing.</Paragraph>
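<Paragraph> As an illustration of this reduction, the following Python sketch derives a stripped labeled bracketing from a Penn-Treebank-style tree. It is our reconstruction for illustration only: the tree encoding, the keep_label rules, and the verb-tag list are simplified assumptions, not the actual preprocessing code.

# Hypothetical sketch: reduce a WSJ tree to the stripped labeled bracketing
# described above. Nonterminals are (label, children); preterminals are
# (pos_tag, word). The selection rules below approximate the description
# in the text and are not the authors' code.

KEPT_SUFFIXES = ("-SBJ", "-LGS", "-PRD")          # labels always retained
VERB_POS = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}

def is_preterminal(node):
    return isinstance(node[1], str)

def keep_label(label, parent_label):
    base = label.split("-")[0].split("=")[0]
    if any(suffix in label for suffix in KEPT_SUFFIXES):
        return True
    if base in ("NP", "VP") and parent_label.startswith("VP"):
        return True           # NP under VP (OBJ/OBJth), VP under VP (XCOMP)
    return base == "SBAR"     # SBAR (COMP)

def strip_tree(node, parent_label=""):
    """Return the stripped bracketing as a list of tokens."""
    label = node[0]
    if is_preterminal(node):
        word, pos = node[1], label
        # verb POS tags under VP are kept as word/TAG; other words are bare
        if pos in VERB_POS and parent_label.startswith("VP"):
            return [word + "/" + pos]
        return [word]
    inner = []
    for child in node[1]:
        inner.extend(strip_tree(child, label))
    if keep_label(label, parent_label):
        return ["[" + label] + inner + ["]"]
    return inner

tree = ("S", [
    ("NP-SBJ", [("PRP$", "His"), ("NN", "credibility")]),
    ("VP", [("VBZ", "is"),
            ("ADVP", [("RB", "also")]),
            ("PP-PRD", [("IN", "on"),
                        ("NP", [("DT", "the"), ("NN", "line")])])]),
])
print(" ".join(strip_tree(tree)))
# [NP-SBJ His credibility ] is/VBZ also [PP-PRD on the line ]
</Paragraph>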
</Section>
<Section position="2" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 2.2 Robustness Techniques </SectionTitle>
<Paragraph position="0"> To increase robustness, the standard grammar has been augmented with a FRAGMENT grammar. This grammar parses the sentence as well-formed chunks specified by the grammar, in particular as Ss, NPs, PPs, and VPs. These chunks have both c-structures and f-structures corresponding to them. Any token that cannot be parsed as one of these chunks is parsed as a TOKEN chunk. The TOKENs are also recorded in the c- and f-structures. The grammar has a fewest-chunk method for determining the correct parse. For example, if a string can be parsed as two NPs and a VP or as one NP and an S, the NP-S option is chosen. A sample FRAGMENT c-structure and f-structure are shown in Fig. 1 for wsj_0231.mrg (The golden share was scheduled to expire at the beginning of), an incomplete sentence; the parser builds one S chunk and then one TOKEN for the stranded preposition.</Paragraph>
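<Paragraph> The fewest-chunk preference can be pictured as a shortest-cover search over chunk analyses. The following Python sketch is hypothetical (it does not show XLE's actual mechanism); the chart spans and categories are invented for illustration.

import functools

def fewest_chunks(n, spans):
    """n: sentence length; spans: dict (start, end) -> category, end exclusive.
    Any single token can always be a TOKEN chunk. Returns a minimal cover."""
    @functools.lru_cache(maxsize=None)
    def best(i):
        if i == n:
            return ()
        # fall back to a one-token chunk (a real chunk if one exists there)
        candidates = [((i, i + 1, spans.get((i, i + 1), "TOKEN")),) + best(i + 1)]
        for (s, e), cat in spans.items():
            if s == i and e > i + 1:
                candidates.append(((s, e, cat),) + best(e))
        return min(candidates, key=len)   # fewest-chunk preference
    return best(0)

# "two NPs and a VP" vs. "one NP and an S" over a 5-token string:
spans = {(0, 2): "NP", (2, 5): "S", (2, 3): "VP", (3, 5): "NP"}
print(fewest_chunks(5, spans))   # ((0, 2, 'NP'), (2, 5, 'S')) -- NP-S wins
</Paragraph>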
<Paragraph position="1"> A final capability of XLE that increases coverage of the standard-plus-FRAGMENT grammar is a SKIMMING technique. Skimming is used to avoid timeouts and memory problems. When the amount of time or memory spent on a sentence exceeds a threshold, XLE goes into skimming mode for the constituents whose processing has not been completed. When XLE skims these remaining constituents, it does a bounded amount of work per subtree. This guarantees that XLE finishes processing a sentence in a polynomial amount of time. In parsing section 23, 7.2% of the sentences were skimmed; 26.1% of these resulted in full parses, while 73.9% were FRAGMENT parses.</Paragraph>
<Paragraph position="2"> The grammar achieved 100% coverage of section 23 as unseen unlabeled data: 74.7% as full parses and 25.3% as FRAGMENT and/or SKIMMED parses.</Paragraph>
</Section>
</Section>
<Section position="4" start_page="0" end_page="0" type="metho">
<SectionTitle> 3 Discriminative Statistical Estimation from Partially Labeled Data </SectionTitle>
<Paragraph position="0"/>
<Section position="1" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 3.1 Exponential Models on LFG Parses </SectionTitle>
<Paragraph position="0"> We employed the well-known family of exponential models for stochastic disambiguation. In this paper we are concerned with conditional exponential models of the form
$$p_\lambda(x \mid y) = Z_\lambda(y)^{-1}\, e^{\lambda \cdot f(x)},$$
where $Z_\lambda(y) = \sum_{x \in X(y)} e^{\lambda \cdot f(x)}$ is a normalizing constant, $\lambda = (\lambda_1, \ldots, \lambda_n) \in \mathbb{R}^n$ is a vector of log-parameters, $f = (f_1, \ldots, f_n)$ is a vector of property-functions $f_i : X \to \mathbb{R}$ for $i = 1, \ldots, n$ on the set of parses $X$, and $\lambda \cdot f(x)$ is the vector dot product $\sum_{i=1}^{n} \lambda_i f_i(x)$.</Paragraph>
<Paragraph position="1"> In our experiments, we used around 1000 complex property-functions comprising information about c-structure, f-structure, and lexical elements in parses, similar to the properties used in Johnson et al. (1999). For example, there are property functions for c-structure nodes and c-structure subtrees, indicating attachment preferences. High versus low attachment is indicated by property functions counting the number of recursively embedded phrases.</Paragraph>
<Paragraph position="2"> Other property functions are designed to refer to f-structure attributes, which correspond to grammatical functions in LFG, or to atomic attribute-value pairs in f-structures. More complex property functions are designed to indicate, for example, the branching behaviour of c-structures and the (non)parallelism of coordinations on both c-structure and f-structure levels. Furthermore, properties referring to lexical elements, based on the auxiliary distribution approach presented in Riezler et al. (2000), are included in the model. Here tuples of head words, argument words, and grammatical relations are extracted from the training sections of the WSJ and fed into a finite mixture model for clustering grammatical relations. The clustering model itself is then used to yield smoothed probabilities as values for property functions on head-argument-relation tuples of LFG parses.</Paragraph>
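<Paragraph> As a toy illustration of this model, the following Python sketch computes $p_\lambda(x \mid y)$ for a small parse set, assuming the property-function values $f_i(x)$ have already been extracted; all numbers are invented.

import math

def conditional_probs(lam, parse_features):
    """lam: log-parameters lambda; parse_features: one property vector f(x)
    per parse x in X(y). Returns p_lambda(x|y) for each parse."""
    scores = [math.exp(sum(l * f for l, f in zip(lam, fx)))
              for fx in parse_features]
    z = sum(scores)                     # normalizing constant Z_lambda(y)
    return [s / z for s in scores]

lam = [0.5, -1.2, 0.1]                  # invented weights lambda_i
X_y = [[1.0, 0.0, 3.0],                 # invented property values f_i(x)
       [0.0, 2.0, 1.0],
       [1.0, 1.0, 0.0]]
print(conditional_probs(lam, X_y))      # probabilities summing to 1
</Paragraph>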
</Section>
<Section position="2" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 3.2 Discriminative Estimation </SectionTitle>
<Paragraph position="0"> Discriminative estimation techniques have recently received great attention in the statistical machine learning community and have already been applied to statistical parsing (Johnson et al., 1999; Collins, 2000; Collins and Duffy, 2001). In discriminative estimation, only the conditional relation of an analysis given an example is considered relevant, whereas in maximum likelihood estimation the joint probability of the training data is maximized so as to best describe the observations. Since the discriminative task is kept in mind during estimation, discriminative methods can yield improved performance. In our case, discriminative criteria cannot be defined directly with respect to &quot;correct labels&quot; or &quot;gold standard&quot; parses since the WSJ annotations are not sufficient to disambiguate the more complex LFG parses. However, instead of retreating to unsupervised estimation techniques or creating small LFG treebanks by hand, we use the labeled bracketing of the WSJ training sections to guide discriminative estimation. That is, discriminative criteria are defined with respect to the set of parses consistent with the WSJ annotations.1</Paragraph>
<Paragraph position="1"> 1 An earlier approach using partially labeled data for estimating stochastic parsers is Pereira and Schabes's (1992) work on training PCFGs from partially bracketed data. Their approach differs from the one we use here in that Pereira and Schabes take an EM-based approach maximizing the joint likelihood of the parses and strings of their training data, while we maximize the conditional likelihood of the sets of parses given the corresponding strings in a discriminative estimation setting.</Paragraph>
<Paragraph position="2"> The objective function in our approach, denoted by $P(\lambda)$, is the sum of the negative log-likelihood $L(\lambda)$ and a Gaussian regularization term $G(\lambda)$ on the parameters $\lambda$. Let $\{(y_j, z_j)\}_{j=1}^{m}$ be a set of training data, consisting of pairs of sentences $y$ and partial annotations $z$, let $X(y,z)$ be the set of parses for sentence $y$ consistent with annotation $z$, and let $X(y)$ be the set of all parses produced by the grammar for sentence $y$. Furthermore, let $p[f]$ denote the expectation of function $f$ under distribution $p$. Then $P(\lambda)$ can be defined for a conditional exponential model $p_\lambda(z \mid y)$ as:
$$P(\lambda) = -L(\lambda) + G(\lambda) = -\sum_{j=1}^{m} \log \frac{\sum_{x \in X(y_j, z_j)} e^{\lambda \cdot f(x)}}{\sum_{x \in X(y_j)} e^{\lambda \cdot f(x)}} + \sum_{i=1}^{n} \frac{\lambda_i^2}{2\sigma_i^2},$$
whose critical points satisfy $\sum_{j=1}^{m} p_\lambda[f_i \mid y_j, z_j] = \sum_{j=1}^{m} p_\lambda[f_i \mid y_j] + \lambda_i/\sigma_i^2$.</Paragraph>
<Paragraph position="3"> Intuitively, the goal of estimation is to find model parameters which make the two expectations in the last equation equal, i.e. which adjust the model parameters to put all the weight on the parses consistent with the annotations, modulo a penalty term from the Gaussian prior for too large or too small weights.</Paragraph>
<Paragraph position="4"> Since a closed-form solution for such parameters is not available, numerical optimization methods have to be used. In our experiments, we applied a conjugate gradient routine, yielding a fast-converging optimization algorithm where at each iteration the objective $P(\lambda)$ and the gradient vector have to be evaluated.2 For our task the gradient takes the form:
$$\frac{\partial P(\lambda)}{\partial \lambda_i} = \sum_{j=1}^{m} \Big( p_\lambda[f_i \mid y_j] - p_\lambda[f_i \mid y_j, z_j] \Big) + \frac{\lambda_i}{\sigma_i^2}.$$
The derivatives in the gradient vector intuitively are again just a difference of two expectations, modulo the contribution of the prior. Note also that this expression shares many common terms with the likelihood function, suggesting an efficient implementation of the optimization routine.</Paragraph>
<Paragraph position="5"> 2 An alternative numerical method would be a combination of iterative scaling techniques with a conditional EM algorithm (Jebara and Pentland, 1998). However, it has been shown experimentally that conjugate gradient techniques can outperform iterative scaling techniques by far in running time (Minka, 2001).</Paragraph>
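<Paragraph> A minimal sketch of evaluating $P(\lambda)$ and its gradient on the representation from the previous snippet, under the simplifying assumption of a single prior variance $\sigma^2$ for all parameters; this is our illustration and omits the packed-representation bookkeeping and numerical stabilization that a real implementation would need.

import math

def log_z(lam, feats):
    # log of the sum of exp(lambda . f(x)) over a parse set (naive form)
    return math.log(sum(math.exp(sum(l * f for l, f in zip(lam, fx)))
                        for fx in feats))

def expectations(lam, feats):
    # expectation of each property function under the model
    # renormalized over the given parse set
    scores = [math.exp(sum(l * f for l, f in zip(lam, fx))) for fx in feats]
    z = sum(scores)
    return [sum(s * fx[i] for s, fx in zip(scores, feats)) / z
            for i in range(len(lam))]

def objective_and_gradient(lam, data, sigma2):
    # data: list of (X_yz, X_y) pairs of property-vector lists, where
    # X_yz holds the parses consistent with the partial annotation
    p = sum(l * l / (2 * sigma2) for l in lam)      # Gaussian term G(lambda)
    grad = [l / sigma2 for l in lam]
    for X_yz, X_y in data:
        p += log_z(lam, X_y) - log_z(lam, X_yz)     # -log p(z|y)
        e_y = expectations(lam, X_y)
        e_yz = expectations(lam, X_yz)
        for i in range(len(lam)):
            grad[i] += e_y[i] - e_yz[i]             # difference of expectations
    return p, grad
</Paragraph>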
</Section>
</Section>
<Section position="5" start_page="0" end_page="0" type="metho">
<SectionTitle> 4 Experimental Evaluation </SectionTitle>
<Paragraph position="0"/>
<Section position="1" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 4.1 Training </SectionTitle>
<Paragraph position="0"> The basic training data for our experiments are sections 02-21 of the WSJ treebank. As a first step, all sections were parsed, and the packed parse forests unpacked and stored. For discriminative estimation, this data set was restricted to sentences which receive a full parse (in contrast to a FRAGMENT or SKIMMED parse) for both their partially labeled and their unlabeled variants. Furthermore, only sentences which received at most 1,000 parses were used.</Paragraph>
<Paragraph position="1"> From this set, sentences of which a discriminative learner cannot possibly take advantage, i.e. sentences where the set of parses assigned to the partially labeled string was not a proper subset of the parses assigned to the unlabeled string, were removed.</Paragraph>
<Paragraph position="2"> These successive selection steps resulted in a final training set consisting of 10,000 sentences, each with parses for partially labeled and unlabeled versions. Altogether there were 150,000 parses for partially labeled input and 500,000 for unlabeled input.</Paragraph>
<Paragraph position="3"> For estimation, a simple property selection procedure was applied to the full set of around 1000 properties. This procedure is based on a frequency cutoff on instantiations of properties for the parses in the labeled training set. The result of this procedure is a reduction of the property vector to about half its size. Furthermore, a held-out data set was created from section 24 of the WSJ treebank for experimental selection of the variance parameter of the prior distribution. This set consists of 120 sentences which received only full parses, out of which the most plausible parse was selected manually.</Paragraph>
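<Paragraph> The successive selection steps above can be summarized in a short hypothetical sketch; the record fields are invented for illustration and do not reflect the actual data format.

def select_training(records, max_parses=1000):
    # each record is assumed to carry the parse sets of the unlabeled
    # string and of its partially labeled variant, plus flags saying
    # whether each variant received a full (non-FRAGMENT, non-SKIMMED) parse
    selected = []
    for r in records:
        if not (r["full_parse_labeled"] and r["full_parse_unlabeled"]):
            continue                                  # full parses only
        if len(r["parses_unlabeled"]) > max_parses:
            continue                                  # at most 1,000 parses
        labeled = set(r["parses_labeled"])
        unlabeled = set(r["parses_unlabeled"])
        # a discriminative learner needs the labeled parse set to be
        # a proper subset of the unlabeled one
        if labeled and labeled.issubset(unlabeled) and labeled != unlabeled:
            selected.append(r)
    return selected
</Paragraph>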
</Section>
<Section position="2" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 4.2 Testing </SectionTitle>
<Paragraph position="0"> Two different sets of test data were used: (i) 700 sentences randomly extracted from section 23 of the WSJ treebank and given gold-standard f-structure annotations according to our LFG scheme, and (ii) 500 sentences from the Brown corpus given gold-standard annotations by Carroll et al. (1999) according to their dependency relations (DR) scheme.3 Annotating the WSJ test set was bootstrapped by parsing the test sentences using the LFG grammar and also checking for consistency with the Penn Treebank annotation. Starting from the (sometimes fragmentary) parser analyses and the Treebank annotations, gold-standard parses were created by manual corrections and extensions of the LFG parses. Manual corrections were necessary in about half of the cases. The average sentence length of the WSJ f-structure bank is 19.8 words; the average number of predicate-argument relations in the gold-standard f-structures is 31.2.</Paragraph>
<Paragraph position="1"> 3 Both corpora are available online: the WSJ f-structure bank at www.parc.com/istl/groups/nltt/fsbank/, and Carroll et al.'s corpus at www.cogs.susx.ac.uk/lab/nlp/carroll/greval.html.</Paragraph>
<Paragraph position="2"> Performance on the LFG-annotated WSJ test set was measured using both the LFG and DR metrics, thanks to an f-structure-to-DR annotation mapping. Performance on the DR-annotated Brown test set was only measured using the DR metric.</Paragraph>
<Paragraph position="3"> The LFG evaluation metric is based on the comparison of full f-structures, represented as triples relation(predicate, argument). The predicate-argument relations of the f-structure for one parse of the sentence Meridian will pay a premium of $30.5 million to assume $2 billion in deposits are shown in Fig. 2.</Paragraph>
<Paragraph position="4"> The DR annotation for our example sentence, obtained via a mapping from f-structures to Carroll et al.'s annotation scheme, is shown in Fig. 3.</Paragraph>
<Paragraph position="5"> Fig. 3 (DR relation representation): (aux pay will) (subj pay Meridian) (detmod premium a) (mod million 30.5) (mod $ million) (mod of premium $) (dobj pay premium) (mod billion 2)</Paragraph>
<Paragraph position="6"> Superficially, the LFG and DR representations are very similar. One difference between the annotation schemes is that the LFG representation in general specifies more relation tuples than the DR representation. Also, multiple occurrences of the same lexical item are indicated explicitly in the LFG representation but not in the DR representation. The main conceptual difference between the two annotation schemes is the fact that the DR scheme crucially refers to phrase-structure properties and word order as well as to grammatical relations in the definition of dependency relations, whereas the LFG scheme abstracts away from serialization and phrase structure. Facts like this can make a correct mapping of LFG f-structures to DR relations problematic. Indeed, we believe that we still underestimate by a few points because of DR mapping difficulties.4</Paragraph>
<Paragraph position="7"> 4 See Carroll et al. (1999) for more detail on the DR annotation scheme, and see Crouch et al. (2002) for more detail on the differences between the DR and the LFG annotation schemes, as well as on the difficulties of the mapping from LFG f-structures to DR annotations.</Paragraph>
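<Paragraph> To make the triple-based metrics concrete, here is a minimal hypothetical Python sketch (not the actual evaluation software): it scores multisets of relation(predicate, argument) triples, and its dependency_only flag anticipates the DO measure introduced in the next subsection, which ignores the relation type.

from collections import Counter

def multiset_overlap(a, b):
    ca, cb = Counter(a), Counter(b)
    return sum(min(ca[k], cb[k]) for k in ca)

def f_score(gold, test, dependency_only=False):
    if dependency_only:                  # DO: keep only governor and dependent
        gold = [(g, d) for _, g, d in gold]
        test = [(g, d) for _, g, d in test]
    hits = multiset_overlap(gold, test)
    precision, recall = hits / len(test), hits / len(gold)
    return 2 * precision * recall / (precision + recall)

gold = [("subj", "pay", "Meridian"), ("aux", "pay", "will"),
        ("dobj", "pay", "premium")]
test = [("subj", "pay", "Meridian"), ("xcomp", "pay", "will"),
        ("dobj", "pay", "premium")]
print(f_score(gold, test))                        # 0.667: typed triples
print(f_score(gold, test, dependency_only=True))  # 1.0: DO ignores the label
</Paragraph>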
</Section>
<Section position="3" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 4.3 Results </SectionTitle>
<Paragraph position="0"> In our evaluation, we report F-scores for both types of annotation, LFG and DR, and for three types of parse selection: (i) lower bound: random choice of a parse from the set of analyses (averaged over 10 runs), (ii) upper bound: selection of the parse with the best F-score according to the annotation scheme used, and (iii) stochastic: the parse selected by the stochastic disambiguator. F-score is defined as $2 \cdot \mathrm{precision} \cdot \mathrm{recall} / (\mathrm{precision} + \mathrm{recall})$. The error reduction row lists the reduction in error rate achieved by the stochastic disambiguation model relative to the upper and lower bounds, i.e. $(\mathrm{stochastic} - \mathrm{lower})/(\mathrm{upper} - \mathrm{lower})$; for the LFG measure in Table 1, $(78.6 - 75.5)/(84.1 - 75.5) \approx 36\%$.</Paragraph>
<Paragraph position="1"> Table 1 gives results for 700 examples randomly selected from section 23 of the WSJ treebank, using both LFG and DR measures.</Paragraph>
<Paragraph position="2"> Table 1: Results for 700 randomly selected examples from section 23 of the WSJ treebank using LFG and DR measures.

                    LFG    DR
    upper bound     84.1   80.7
    stochastic      78.6   73.0
    lower bound     75.5   68.8
    error reduction 36     35
</Paragraph>
<Paragraph position="3"> The effect of the quality of the parses on disambiguation performance can be illustrated by breaking down the F-scores according to whether the parser yields full parses, FRAGMENT, SKIMMED, or SKIMMED+FRAGMENT parses for the test sentences. The percentages of test examples which belong to the respective classes of quality are listed in the first row of Table 2. F-scores broken down according to classes of parse quality are recorded in the following rows. The first column shows F-scores for all parses in the test set, as in Table 1. The second column shows the best F-scores when restricting attention to examples which receive only full parses. The third column reports F-scores for examples which receive only non-full parses, i.e. FRAGMENT or SKIMMED parses or SKIMMED+FRAGMENT parses. Columns 4-6 break down non-full parses according to examples which receive only FRAGMENT, only SKIMMED, or only SKIMMED+FRAGMENT parses.</Paragraph>
<Paragraph position="4"> Results of the evaluation on Carroll et al.'s Brown test set are given in Table 3. Evaluation results for the DR measure applied to the Brown corpus test set, broken down according to parse quality, are shown in Table 2.</Paragraph>
<Paragraph position="5"> In Table 3 we show the DR measure along with an evaluation measure which facilitates a direct comparison of our results to those of Carroll et al. (1999). Following Carroll et al. (1999), we count a dependency relation as correct if the gold standard has a relation with the same governor and dependent but perhaps with a different relation type. This dependency-only (DO) measure thus does not reflect mismatches between arguments and modifiers in a small number of cases. Note that since no held-out data were available for the Brown corpus to adjust the variance parameter of a Bayesian model, we used a plain maximum-likelihood model for disambiguation on this test set.</Paragraph>
</Section>
</Section>
</Paper>