<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-1130">
  <Title>Robust PCFG-Based Generation using Automatically Acquired LFG Approximations (Sydney, July 2006. ©2006 Association for Computational Linguistics)</Title>
  <Section position="4" start_page="0" end_page="1033" type="metho">
    <SectionTitle>
2 Lexical Functional Grammar
</SectionTitle>
    <Paragraph position="0"> Lexical Functional Grammar (LFG) (Kaplan and Bresnan, 1982) is a constraint-based theory of grammar. It (minimally) posits two levels of representation, c(onstituent)-structure and f(unctional)-structure. C-structure is represented by context-free phrase-structure trees, and captures surface grammatical configurations such as word order.</Paragraph>
    <Paragraph position="3"> The nodes in the trees are annotated with functional equations (attribute-value structure constraints) which are resolved to produce an f-structure. F-structures are recursive attribute-value matrices, representing abstract syntactic functions. F-structures approximate to basic predicate-argument-adjunct structures or dependency relations. Figure 1 shows the c- and f-structures for the sentence &quot;They believe John resigned&quot;.</Paragraph>
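    <Paragraph> As an informal illustration of the "recursive attribute-value matrix" view, the sketch below encodes a plausible f-structure for the example sentence as a nested mapping. Figure 1 is not reproduced in this text, so the concrete feature values (the TENSE values, the features of the pronoun, and the NUM value of John) are assumptions made for the example; only the attribute names follow the feature sets mentioned in the text. The helper names are hypothetical.

```python
# Hypothetical encoding of an f-structure as a recursive attribute-value matrix.
# Feature values are illustrative assumptions, not copied from Figure 1.
f_structure = {
    "PRED": "believe",
    "TENSE": "pres",
    "SUBJ": {"PRED": "pro", "NUM": "pl", "PERS": 3},
    "COMP": {
        "PRED": "resign",
        "TENSE": "past",
        "SUBJ": {"PRED": "John", "NUM": "sg", "PERS": 3},
    },
}


def local_features(fs):
    """Return the sorted attribute names at the top level of an f-structure."""
    return sorted(fs)


def depth(fs):
    """Return the nesting depth of an f-structure (an atomic value has depth 0)."""
    if not isinstance(fs, dict):
        return 0
    return 1 + max((depth(value) for value in fs.values()), default=0)


print(local_features(f_structure))          # ['COMP', 'PRED', 'SUBJ', 'TENSE']
print(local_features(f_structure["COMP"]))  # ['PRED', 'SUBJ', 'TENSE']
```

The attribute names at each level are the kind of local feature set that the generation model described in Section 3 conditions on.</Paragraph>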
  </Section>
  <Section position="5" start_page="1033" end_page="1035" type="metho">
    <SectionTitle>
3 PCFG-Based Generation for
Treebank-Based LFG Resources
</SectionTitle>
    <Paragraph position="0"> Cahill et al. (2004) present a method to automatically acquire wide-coverage, robust probabilistic LFG approximations from treebanks. The method is based on an automatic f-structure annotation algorithm that associates nodes in treebank trees with f-structure equations. For each tree, the equations are collected and passed on to a constraint solver which produces an f-structure for the tree. Cahill et al. (2004) present two parsing architectures: the pipeline and the integrated parsing architecture. In the pipeline architecture, a PCFG (or a history-based lexicalised generative parser) is extracted from the treebank and used to parse unseen text into trees; the resulting trees are annotated with f-structure equations by the f-structure annotation algorithm, and a constraint solver produces an f-structure. [Footnote 1: The resources are approximations in that (i) they do not enforce LFG completeness and coherence constraints and (ii) PCFG-based models can only approximate LFG and similar constraint-based formalisms (Abney, 1997).]</Paragraph>
    <Paragraph position="1"> In the integrated architecture, the treebank trees are first automatically annotated with f-structure information, and f-structure annotated PCFGs with rules of the form NP(↑OBJ=↓) → DT(↑=↓) NN(↑=↓) are extracted. Syntactic categories followed by equations are treated as monadic CFG categories during grammar extraction and parsing. Unseen text is parsed into trees with f-structure annotations, the annotations are collected, and a constraint solver produces an f-structure.</Paragraph>
    <Paragraph position="2"> The generation architecture presented here builds on the integrated parsing architecture resources of Cahill et al. (2004). The generation process takes an f-structure (such as the f-structure on the right in Figure 1) as input and outputs the most likely f-structure annotated tree (such as the tree on the left in Figure 1) given the input f-structure: $\operatorname{argmax}_{Tree} P(Tree \mid F\text{-}Str)$. The probability of a tree given an f-structure is decomposed as the product of the probabilities of all f-structure annotated productions contributing to the tree. In addition to conditioning on the LHS of the production (as in the integrated parsing architecture of Cahill et al. (2004)), each production $X \rightarrow Y$ is now also conditioned on the set of f-structure features $Feats$ φ-linked to the LHS of the rule. For an f-structure annotated tree $Tree$ and f-structure $F\text{-}Str$: $$P(Tree \mid F\text{-}Str) := \prod_{X \rightarrow Y \in Tree} P(X \rightarrow Y \mid X, Feats), \qquad P(X \rightarrow Y \mid X, Feats) \approx \frac{\#(X \rightarrow Y, Feats)}{\#(X \rightarrow \ldots, Feats)}$$</Paragraph>
    <Paragraph position="3"> Table 1: Example VP rule expansions with their probabilities.
      Conditioning F-Structure Features | Grammar Rule | Probability
      {PRED, SUBJ, COMP, TENSE} | VP(↑=↓) → VBD(↑=↓) SBAR(↑COMP=↓) | 0.4998
      {PRED, SUBJ, COMP, TENSE} | VP(↑=↓) → VBP(↑=↓) SBAR(↑COMP=↓) | 0.0366
      {PRED, SUBJ, COMP, TENSE} | VP(↑=↓) → VBD(↑=↓) , S(↑COMP=↓) | 6.48e-6
      {PRED, SUBJ, COMP, TENSE} | VP(↑=↓) → VBD(↑=↓) S(↑COMP=↓) | 3.88e-6
      {PRED, SUBJ, COMP, TENSE} | VP(↑=↓) → VBP(↑=↓) , SBARQ(↑COMP=↓) | 7.86e-7
      {PRED, SUBJ, COMP, TENSE} | VP(↑=↓) → VBD(↑=↓) SBARQ(↑COMP=↓) | 1.59e-7
    </Paragraph>
    <Paragraph position="4"> The rule probabilities are estimated using a simple MLE and rule counts (#) from the automatically f-structure annotated treebank resource of Cahill et al. (2004). Lexical rules (rules expanding preterminals) are conditioned on the full set of (atomic) feature-value pairs φ-linked to the RHS. The intuition for conditioning rules in this way is that local f-structure components of the input f-structure drive the generation process. This conditioning effectively turns the f-structure annotated PCFGs of Cahill et al. (2004) into probabilistic generation grammars. For example, in Figure 1 (where φ-links are represented as arrows), we automatically extract the rule S(↑=↓) → NP(↑SUBJ=↓) VP(↑=↓) conditioned on the feature set {PRED, SUBJ, COMP, TENSE}. The probability of the rule is then calculated by counting the number of occurrences of that rule (and the associated set of features), divided by the number of occurrences of rules with the same LHS and set of features. Table 1 gives example VP rule expansions with their probabilities when we train a grammar from Sections 02-21 of the Penn Treebank.</Paragraph>
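    <Paragraph> The relative-frequency estimate described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name and the (lhs, rhs, feats) observation format are assumptions, the categories are written without their f-structure annotations for brevity, and the toy counts are invented rather than taken from the treebank.

```python
from collections import Counter


def estimate_rule_probs(observations):
    """Relative-frequency (MLE) estimate for feature-conditioned rules.

    Each observation is (lhs, rhs, feats). The estimate is
    #(lhs -> rhs, feats) / #(lhs -> anything, feats).
    """
    rule_counts = Counter()
    context_counts = Counter()
    for lhs, rhs, feats in observations:
        key = frozenset(feats)
        rule_counts[(lhs, tuple(rhs), key)] += 1
        context_counts[(lhs, key)] += 1
    return {
        rule: count / context_counts[(rule[0], rule[2])]
        for rule, count in rule_counts.items()
    }


# Toy observations (invented for illustration, not treebank figures).
feats = {"PRED", "SUBJ", "COMP", "TENSE"}
toy = (
    [("VP", ["VBD", "SBAR"], feats)] * 3
    + [("VP", ["VBP", "SBAR"], feats)] * 1
)
probs = estimate_rule_probs(toy)
print(probs[("VP", ("VBD", "SBAR"), frozenset(feats))])  # 0.75
```

Because the denominator is keyed on both the LHS and the feature set, the probabilities of all expansions of one LHS under one feature set sum to one, which is the normalisation the text describes.</Paragraph>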
    <Section position="1" start_page="1034" end_page="1034" type="sub_section">
      <SectionTitle>
3.1 Chart Generation Algorithm
</SectionTitle>
      <Paragraph position="0"> The generation algorithm is based on chart generation as first introduced by Kay (1996) with Viterbi-pruning. The generation grammar is first converted into Chomsky Normal Form (CNF). We recursively build a chart-like data structure in a bottom-up fashion. In contrast to packing of locally equivalent edges (Carroll and Oepen, 2005), in our approach if two chart items have equivalent rule left-hand sides and lexical coverage, only the most probable one is kept. Each grammatical function-labelled (sub-)f-structure in the overall f-structure indexes a (sub-)chart. The chart for each f-structure generates the most probable tree for that f-structure, given the internal set of conditioning f-structure features and its grammatical function label. At each level, grammatical function indexed charts are initially unordered. Charts are linearised by generation grammar rules once the charts themselves have produced the most probable tree for the chart. Our example in Figure 1 generates the following grammatical function indexed, embedded and (at each level of embedding)</Paragraph>
    </Section>
    <Section position="2" start_page="1034" end_page="1035" type="sub_section">
      <SectionTitle>
3.2 A Worked Example
</SectionTitle>
      <Paragraph position="0"> As an example, we step through the construction of the COMP-indexed chart at level f3 of the f-structure in Figure 1. For lexical rules, we check the feature set at the sub-f-structure level and the values of the features. Only features associated with lexical material are considered. The SUBJ-indexed sub-chart f4 is constructed by first adding the rule NNP(↑=↓) → John(↑PRED='John', ↑NUM=sg, ↑PERS=3). If more than one lexical rule corresponds to a particular set of features and values in the f-structure, we add all rules with different LHS categories. If two or more rules with equal LHS categories match the feature set, we only add the most probable one.</Paragraph>
      <Paragraph position="1"> Unary productions are applied if the RHS of the unary production matches the LHS of an item already in the chart and the feature set of the unary production matches the conditioning feature set of the local sub-f-structure. In our example, this results in the rule NP(↑SUBJ=↓) → NNP(↑=↓), conditioned on {NUM, PERS, PRED}, being added to the sub-chart at level f4 (the probability associated with this item is the probability of the rule multiplied by the probability of the previous chart item which combines with the new rule). When a rule is added to the chart, it is automatically associated with the yield of the rule, allowing us to propagate chunks of generated material upwards in the chart. If two items in the chart have the same LHS (and the same yield independent of word order), only the item with the highest probability is kept.</Paragraph>
      <Paragraph position="2"> This Viterbi-style pruning ensures that processing is efficient.</Paragraph>
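      <Paragraph> The pruning step just described can be sketched as a dictionary keyed on the LHS category and the yield treated as a bag of words. The chart layout and the function name are assumptions made for this illustration; the actual data structures of the system are not given in the text.

```python
from collections import Counter


def add_item(chart, lhs, words, prob):
    """Insert an item, keeping only the most probable item per (LHS, bag of words).

    The key ignores word order, so two items with the same LHS and the same
    multiset of words compete. The chart maps key to (prob, words).
    Returns True if the new item was stored.
    """
    key = (lhs, frozenset(Counter(words).items()))
    current = chart.get(key)
    if current is not None and current[0] >= prob:
        return False
    chart[key] = (prob, tuple(words))
    return True


chart = {}
add_item(chart, "S", ["John", "resigned"], 0.02)
add_item(chart, "S", ["resigned", "John"], 0.001)  # same LHS and bag: pruned
add_item(chart, "VP", ["resigned"], 0.3)
print(len(chart))  # 2
```

Keeping a single best item per key is what bounds the number of competing edges, at the cost of discarding alternatives that packing would retain.</Paragraph>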
      <Paragraph position="3"> At sub-chart f4 there are no binary rules that can be applied. At this stage, it is not possible to add any more items to the sub-chart, therefore we propagate items in the chart that are compatible with the sub-chart index SUBJ. In our example, only the rule NP(↑SUBJ=↓) → NNP(↑=↓) (which yields the string John) is propagated to the next level up in the overall chart for consideration in the next iteration. If the yield of an item being propagated upwards in the chart is subsumed by an element already at that level, the subsumed item is removed. This results in efficiently treating the well-known problem originally described in Kay (1996), where one unnecessarily retains sub-optimal strings. For example, generating the string &quot;The very tall strong athletic man&quot;, one does not want to keep variations such as &quot;The very tall man&quot;, or &quot;The athletic man&quot;, if one can generate the entire string. Our method ensures that only the most probable tree with the longest yield will be propagated upwards.</Paragraph>
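      <Paragraph> One way to read the subsumption check is as strict containment between bags of words. The sketch below makes that assumption explicit; it ignores LHS categories and probabilities, which the real system also takes into account, and both function names are hypothetical.

```python
from collections import Counter


def is_subsumed(words, other):
    """True if the bag of `words` is strictly contained in the bag of `other`."""
    small, big = Counter(words), Counter(other)
    return small != big and not (small - big)


def drop_subsumed(yields):
    """Keep only the yields that are not subsumed by some other yield."""
    return [y for y in yields if not any(is_subsumed(y, o) for o in yields)]


candidates = [
    "The very tall strong athletic man".split(),
    "The very tall man".split(),
    "The athletic man".split(),
]
print([" ".join(y) for y in drop_subsumed(candidates)])
# ['The very tall strong athletic man']
```

Counter subtraction drops non-positive counts, so an empty difference means every word of the smaller yield is covered by the larger one.</Paragraph>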
      <Paragraph position="4"> The COMP-indexed chart at level f3 of the f-structure is constructed in a similar fashion. First the lexical rule V(↑=↓) → resigned is added.</Paragraph>
      <Paragraph position="5"> Next, conditioning on {PRED, SUBJ, TENSE}, the unary rule VP(↑=↓) → V(↑=↓) (with yield resigned) is added. We combine the new VP(↑=↓) rule with the NP(↑SUBJ=↓) already present from the previous iteration to enable us to add the rule S(↑=↓) → NP(↑SUBJ=↓) VP(↑=↓), conditioned on {PRED, SUBJ, TENSE}. The yield of this rule is John resigned. Next, conditioning on the same feature set, we add the rule SBAR(↑COMP=↓) → S(↑=↓) with yield John resigned to the chart. It is not possible to add any more new rules, so at this stage, only the SBAR(↑COMP=↓) rule with yield John resigned is propagated up to the next level.</Paragraph>
      <Paragraph position="6"> The process continues until at the outermost level of the f-structure, there are no more rules to be added to the chart. At this stage, we search for the most probable rule with TOP as its LHS category and return the yield of this rule as the output of the generation process. Generation fails if there is no rule with LHS TOP at this level in the chart.</Paragraph>
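      <Paragraph> The final selection step can be pictured as follows. The tuple layout (lhs, words, prob) and the probabilities are assumptions chosen for the example; only the use of TOP as the root category comes from the text.

```python
def best_output(items, root="TOP"):
    """Return the yield of the most probable item whose LHS is `root`, or None."""
    roots = [item for item in items if item[0] == root]
    if not roots:
        return None  # generation fails: no TOP item in the outermost chart
    _, words, _ = max(roots, key=lambda item: item[2])
    return " ".join(words)


# Invented items for the outermost chart of the running example.
items = [
    ("TOP", ["They", "believe", "John", "resigned"], 1e-9),
    ("TOP", ["They", "believe", "resigned", "John"], 1e-12),
    ("S", ["They", "believe", "John", "resigned"], 1e-8),
]
print(best_output(items))  # They believe John resigned
```

Returning None when no TOP item exists mirrors the failure case described above.</Paragraph>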
    </Section>
    <Section position="3" start_page="1035" end_page="1035" type="sub_section">
      <SectionTitle>
3.3 Lexical Smoothing
</SectionTitle>
      <Paragraph position="0"> Currently, the only smoothing in the system applies at the lexical level. Our backoff uses the built-in lexical macros (footnote 4) of the automatic f-structure annotation algorithm of Cahill et al. (2004) to identify potential part-of-speech categories corresponding to a particular set of features. Following Baayen and Sproat (1996) we assume that unknown words have a probability distribution similar to hapax legomena. We add a lexical rule for each POS tag that corresponds to the f-structure features at that level to the chart with a probability computed from the original POS tag probability distribution multiplied by a very small constant. This means that lexical rules seen during training have a much higher probability than lexical rules added during the smoothing phase. Lexical smoothing has the advantage of boosting coverage (as shown in Tables 3, 4, 5 and 6 below) but slightly degrades the quality of the strings generated. We believe that the tradeoff in terms of quality is worth the increase in coverage.</Paragraph>
      <Paragraph position="2"> Smoothing is not carried out when there is no suitable phrasal grammar rule that applies during the process of generation. This can lead to the generation of partial strings, since some f-structure components may fail to generate a corresponding string. In such cases, generation outputs the concatenation of the strings generated by the remaining components.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="1035" end_page="1037" type="metho">
    <SectionTitle>
4 Experiments
</SectionTitle>
    <Paragraph position="0"> We train our system on WSJ Sections 02-21 of the Penn-II Treebank, test on Section 23 and use Section 22 as our development set. As part of our evaluation, we experiment with sentences of varying length (20, 25, 30, 40, all), both in training and testing. Table 2 gives the number of training and test sentences for each sentence length. In each case, we use the automatically generated f-structures from Cahill et al. (2004) from the original Section 23 treebank trees as f-structure input to our generation experiments. We automatically mark adjunct and coordination scope in the input f-structure. Notice that these automatically generated f-structures are not &quot;perfect&quot;, i.e. they are not guaranteed to be complete and coherent (Kaplan and Bresnan, 1982): a local f-structure may contain material that is not supposed to be there (incoherence) and/or may be missing material that is supposed to be there (incompleteness). The results presented below show that our method is robust with respect to the quality of the f-structure input and will always attempt to generate partial output rather than fail. We consider this an important property as pristine generation input cannot always be guaranteed in realistic application scenarios, such as probabilistic transfer-based machine translation where generation input may contain a certain amount of noise. [Footnote 4: … features, for example the tag NNS (plural noun) is associated with the features ↑PRED=$LEMMA and ↑NUM=pl.]</Paragraph>
    <Section position="1" start_page="1036" end_page="1036" type="sub_section">
      <SectionTitle>
4.1 Pre-Training Treebank Transformations
</SectionTitle>
      <Paragraph position="0"> During the development of the generation system, we carried out error analysis on our development set (WSJ Section 22 of the Penn-II Treebank). We identified some initial pre-training transformations to the treebank that help generation.</Paragraph>
      <Paragraph position="1"> Punctuation: Punctuation is not usually encoded in f-structure representations. Because our architecture is completely driven by rules conditioned by f-structure information automatically extracted from an f-structure annotated treebank, its placement of punctuation is not principled.</Paragraph>
      <Paragraph position="2"> This led to anomalies such as full stops appearing mid-sentence and quotation marks appearing in undesired locations. One partial solution to this was to reduce the amount of punctuation that the system trained on. We removed all punctuation apart from commas and full stops from the training data. We did not remove any punctuation from the evaluation test set (Section 23), but our system will only ever produce commas and full stops. In the evaluation (Tables 3, 4, 5 and 6) we are penalised for the missing punctuation. To solve the problem of full stops appearing mid-sentence, we carry out a punctuation post-processing step on all generated strings. This removes mid-sentence full stops and adds missing full stops at the end of generated sentences prior to evaluation. We are working on a more appropriate solution allowing the system to generate all punctuation.</Paragraph>
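      <Paragraph> The post-processing step can be approximated in a few lines. This sketch assumes the generated string is already tokenised and that a full stop is the separate token "."; abbreviations that carry their own period inside the token are therefore left untouched. The function name is hypothetical.

```python
def fix_full_stops(tokens):
    """Remove mid-sentence full stops and make sure the string ends with one."""
    kept = [tok for tok in tokens if tok != "."]
    return kept + ["."] if kept else kept


print(" ".join(fix_full_stops("They believe . John resigned".split())))
# They believe John resigned .
```

Dropping every "." token and appending exactly one covers both cases named in the text: stray mid-sentence stops and a missing final stop.</Paragraph>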
      <Paragraph position="3"> Case: English does not have much case marking, and for parsing no special treatment was encoded. However, when generating, it is very important that the first person singular pronoun is I in the nominative case and me in the accusative. Given the original grammar used in parsing, our generation system was not able to distinguish nominative from accusative contexts. The solution we implemented was to carry out a grammar transformation in a pre-processing step, to automatically annotate personal pronouns with their case information. This resulted in phrasal and lexical rules such as NP(↑SUBJ=↓) → PRP^nom(↑=↓) and PRP^nom(↑=↓) → I and greatly improved the accuracy of the pronouns generated.</Paragraph>
    </Section>
    <Section position="2" start_page="1036" end_page="1037" type="sub_section">
      <SectionTitle>
4.2 String-Based Evaluation
</SectionTitle>
      <Paragraph position="0"> We evaluate the output of our generation system against the raw strings of Section 23 using the Simple String Accuracy and BLEU (Papineni et al., 2002) evaluation metrics. Simple String Accuracy is based on the string edit distance between the output of the generation system and the gold standard sentence. BLEU is the weighted average of n-gram precision against the gold standard sentences. We also measure coverage as the percentage of input f-structures that generate a string. For evaluation, we automatically expand all contracted words. We only evaluate strings produced by the system (similar to Nakanishi et al. (2005)).</Paragraph>
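      <Paragraph> For concreteness, the sketch below computes a token-level edit distance, a Simple String Accuracy score, and coverage. The exact normalisation of Simple String Accuracy is not given in the text; the version here (one minus the edit distance divided by the reference length, floored at zero) is one common formulation and should be read as an assumption. BLEU is not reimplemented.

```python
def edit_distance(hyp, ref):
    """Token-level Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def simple_string_accuracy(hyp, ref):
    """1 - edit_distance / reference length, floored at 0 (assumed formulation)."""
    if not ref:
        return 1.0 if not hyp else 0.0
    return max(0.0, 1.0 - edit_distance(hyp, ref) / len(ref))


def coverage(outputs):
    """Percentage of inputs for which a string was generated (None = failure)."""
    return 100.0 * sum(o is not None for o in outputs) / len(outputs)


ref = "They believe John resigned".split()
hyp = "They believe resigned John".split()
print(simple_string_accuracy(hyp, ref))  # 0.5
print(coverage([hyp, None, ref, ref]))   # 75.0
```

In the example the two swapped words count as two substitutions against a four-word reference, which gives the score of 0.5.</Paragraph>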
      <Paragraph position="1"> We conduct a total of four experiments. The parameters we investigate are lexical smoothing (Section 3.3) and partial output. Partial output is a robustness feature for cases where a sub-f-structure component fails to generate a string and the system outputs a concatenation of the strings generated by the remaining components, rather than fail completely.</Paragraph>
      <Paragraph position="2"> Varying the length of the sentences included in the training data (Tables 3 and 5) shows that results improve (both in terms of coverage and string quality) as the maximum length of the sentences included in the training data increases.</Paragraph>
      <Paragraph position="3"> Tables 3 and 5 give the results for the experiments including lexical smoothing and varying partial output. Table 3 (+partial, +smoothing) shows that training on sentences of all lengths and evaluating all strings (including partial outputs), our system achieves coverage of 98.05%, a BLEU score of 0.6651 and string accuracy of 0.6808. Table 5 (-partial, +smoothing) shows that coverage drops to 89.49%, BLEU score increases to 0.6979 and string accuracy to 0.7012, when the system is trained on sentences of all lengths. Similarly, for strings of length ≤20, coverage drops from 98.65% to 95.26%, BLEU increases from 0.7077 to 0.7227 and string accuracy from 0.7373 to 0.7476. Including partial output increases coverage (by more than 8.5 percentage points for all sentences) and hence robustness while slightly decreasing quality.</Paragraph>
      <Paragraph position="4"> Tables 3 (+partial, +smoothing) and 4 (+partial, -smoothing) give results for the experiments including partial output but varying lexical smoothing. With no lexical smoothing (Table 4), the system (trained on all sentence lengths) produces strings for 90.11% of the input f-structures and achieves a BLEU score of 0.5590 and string accuracy of 0.6207. Switching off lexical smoothing has a negative effect on all evaluation metrics (coverage and quality), because many more strings produced are now partial (since for PRED values unseen during training, no lexical entries are added to the chart).</Paragraph>
      <Paragraph position="5"> Comparing Tables 5 (-partial, +smoothing) and 6 (-partial, -smoothing), where the system does not produce any partial outputs and lexical smoothing is varied, shows that training on all sentence lengths, BLEU score increases from 0.6979 to 0.7147 and string accuracy increases from 0.7012 to 0.7192. At the same time, coverage drops dramatically from 89.49% (Table 5) to 47.60% (Table 6).</Paragraph>
      <Paragraph position="6"> Comparing Tables 4 and 6 shows that while partial output almost doubles coverage, this comes at a price of a severe drop in quality (BLEU score drops from 0.7147 to 0.5590). On the other hand, comparing Tables 5 and 6 shows that lexical smoothing achieves a similar increase in coverage with only a very slight drop in quality.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="1037" end_page="1038" type="metho">
    <SectionTitle>
5 Discussion
</SectionTitle>
    <Paragraph position="0"> Nakanishi et al. (2005) achieve 90.56% coverage and a BLEU score of 0.7723 on Section 23 sentences, restricted to length ≤20 for efficiency reasons. Langkilde-Geary's (2002) best system achieves 82.8% coverage, a BLEU score of 0.924 and string accuracy of 0.945 against Section 23 sentences of all lengths. Callaway (2003) achieves 98.7% coverage and a string accuracy of 0.6607 on sentences of all lengths. Our best results for sentences of length ≤20 are coverage of 95.26%, a BLEU score of 0.7227 and string accuracy of 0.7476. For all sentence lengths, our best results are coverage of 89.49%, a BLEU score of 0.6979 and string accuracy of 0.7012.</Paragraph>
    <Paragraph position="1"> Using hand-crafted grammar-based generation systems (Langkilde-Geary, 2002; Callaway, 2003), it is possible to achieve very high results. However, hand-crafted systems are expensive to construct and not easily ported to new domains or other languages. Our methodology, on the other hand, is based on resources automatically acquired from treebanks and easily ported to new domains and languages, simply by retraining on suitable data. Recent work on the automatic acquisition of multilingual LFG resources from treebanks for Chinese, German and Spanish (Burke et al., 2004; Cahill et al., 2005; O'Donovan et al., 2005) has shown that given a suitable treebank, it is possible to automatically acquire high quality LFG resources in a very short space of time. The generation architecture presented here is easily ported to those different languages and treebanks.</Paragraph>
  </Section>
</Paper>