<?xml version="1.0" standalone="yes"?>
<Paper uid="P00-1061">
  <Title>Lexicalized Stochastic Modeling of Constraint-Based Grammars using Log-Linear Measures and EM Training</Title>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Property Design and Lexicalization
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Basic Configurational Properties
</SectionTitle>
      <Paragraph position="0"> The 190 basic properties employed in our models are similar to the properties of Johnson et al. (1999), which incorporate general linguistic principles into a log-linear model. They refer to both the c(onstituent)-structure and the f(eature)-structure of the LFG parses. Examples are properties for: c-structure nodes, corresponding to standard production properties; c-structure subtrees, indicating argument versus adjunct attachment; f-structure attributes, corresponding to grammatical functions used in LFG; atomic attribute-value pairs in f-structures; complexity of the phrase being attached to, thus indicating both high and low attachment; non-right-branching behavior of nonterminal nodes; and non-parallelism of coordinations.</Paragraph>
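The configurational properties above are count-valued functions on parses. A minimal sketch, assuming a parse is represented simply as a list of observed events (productions, subtree shapes, f-structure attributes); all names here are illustrative, not from the paper:

```python
from collections import Counter

def property_vector(parse_events, properties):
    """Map a parse (a list of observed configurational events, e.g.
    c-structure productions or f-structure attribute-value pairs)
    to a vector of counts, one entry per property function."""
    counts = Counter(parse_events)
    return [counts[p] for p in properties]

# Hypothetical toy example: two c-structure productions and one
# f-structure attribute serving as properties.
properties = ["NP->DET N", "VP->V NP", "SUBJ"]
parse = ["NP->DET N", "VP->V NP", "NP->DET N", "SUBJ"]
print(property_vector(parse, properties))  # [2, 1, 1]
```

In a log-linear model, this vector is what gets dotted with the parameter vector to score a parse.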
      <Paragraph position="1"> If we observe complete data x ∈ X, the expectation q_λ[·] corresponds to the empirical expectation p̃[·]. If we observe incomplete data y ∈ Y, the expectation q_λ[·] is replaced by the conditional expectation p̃[k_{λ_0}[·]] given the observed data y and the current parameter value λ_0.</Paragraph>
      <Paragraph position="3"/>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Class-Based Lexicalization
</SectionTitle>
      <Paragraph position="0"> Our approach to grammar lexicalization is class-based in the sense that we use class-based estimated frequencies f_c(v, n) of</Paragraph>
      <Paragraph position="2"> verbs v and argument head-nouns n, instead of pure frequency statistics or class-based probabilities of head-word dependencies. Class-based estimated frequencies are introduced in Prescher et al. (2000) as the frequency f(v, n) of a (v, n)-pair in the training corpus, weighted by the best estimate of the class-membership probability p(c|v, n) of an EM-based clustering model on (v, n)-pairs.</Paragraph>
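A minimal sketch of this weighting, under the assumption that f_c(v, n) is the corpus count scaled by the best class posterior; the data structures and figures are hypothetical:

```python
def class_based_frequency(freq, class_post):
    """Class-based estimated frequency f_c(v, n): the corpus count
    f(v, n) weighted by the best class-membership posterior
    p(c | v, n) of an EM-based clustering model (after Prescher
    et al. 2000).
    freq:       dict mapping (v, n) -> corpus count
    class_post: dict mapping (v, n) -> {class: p(c | v, n)}"""
    return {
        pair: count * max(class_post[pair].values())
        for pair, count in freq.items()
    }

# Hypothetical toy figures.
freq = {("read", "book"): 8, ("read", "table"): 2}
post = {("read", "book"): {"c1": 0.75, "c2": 0.25},
        ("read", "table"): {"c1": 0.5, "c2": 0.5}}
print(class_based_frequency(freq, post))
# {('read', 'book'): 6.0, ('read', 'table'): 1.0}
```

Pairs whose class membership is uncertain are thus discounted relative to pairs the clustering model assigns confidently.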
      <Paragraph position="4"> As is shown in Prescher et al. (2000) in an evaluation on lexical ambiguity resolution, a gain of about 7% can be obtained by using the class-based estimated frequency f_c(v, n)</Paragraph>
      <Paragraph position="6"> as disambiguation criterion instead of class-based probabilities p(n|v). In order to make the most direct use possible of this fact, we incorporated the decisions of the disambiguator directly into 45 additional properties for the grammatical relations of the subject, direct object, indirect object, infinitival object, oblique and adjunctival dative and accusative preposition, for active and passive forms of the first three verbs in each parse. Let v_r(x) be the verbal head of grammatical relation r in parse x, and n_r(x)</Paragraph>
      <Paragraph position="8"> thus predisambiguates the parses x ∈ X(y) of a sentence y according to f_c(v, n), and stores the best parse directly instead of taking the actual estimated frequencies as its value. In Sec. 4, we will see that incorporating this pre-disambiguation routine into the models improves disambiguation performance by about 10%.</Paragraph>
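A sketch of one such binary pre-disambiguation property, assuming it fires only on the parse(s) of a sentence whose head pair for a given relation maximizes f_c(v, n); representations and names are illustrative:

```python
def predisambiguation_property(parses, fc, relation="SUBJ"):
    """Binary pre-disambiguation property for one grammatical relation:
    among the parses X(y) of a sentence y, the property takes value 1
    on the parse(s) whose (verb, noun) head pair for `relation`
    maximizes the class-based estimated frequency f_c(v, n), and 0
    elsewhere. Sketch only; the paper uses 45 such properties, one per
    relation and active/passive form.
    parses: list of dicts mapping relation -> (verb, noun) head pair
    fc:     dict mapping (verb, noun) -> class-based estimated freq."""
    scores = [fc.get(p.get(relation), 0.0) for p in parses]
    best = max(scores)
    return [1 if s == best else 0 for s in scores]

# Hypothetical sentence with two competing subject attachments.
parses = [{"SUBJ": ("sleep", "dog")}, {"SUBJ": ("sleep", "idea")}]
fc = {("sleep", "dog"): 5.4, ("sleep", "idea"): 0.3}
print(predisambiguation_property(parses, fc))  # [1, 0]
```

Storing the argmax decision rather than the raw frequency is what makes the property a fixed 0/1 feature that the log-linear model can then weight.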
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="68" type="metho">
    <SectionTitle>
4 Experiments
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="3" type="sub_section">
      <SectionTitle>
4.1 Incomplete Data and Parsebanks
</SectionTitle>
      <Paragraph position="0"> In our experiments, we used an LFG grammar for German for parsing unrestricted text. Since training was faster than parsing, we parsed in advance and stored the resulting packed c/f-structures. The low ambiguity rate of the German LFG grammar allowed us to restrict the training data to sentences with at most 20 parses. The resulting training corpus of unannotated, incomplete data consists of approximately 36,000 sentences of German newspaper text available online, comprising approximately 250,000 parses.</Paragraph>
      <Paragraph position="1"> In order to compare the contribution of unambiguous and ambiguous sentences to the estimation results, we extracted a subcorpus of 4,000 sentences, for which the LFG grammar produced a unique parse, from the full training corpus. (The German LFG grammar is being implemented in the Xerox Linguistic Environment (XLE; see Maxwell and Kaplan (1996)) as part of the Parallel Grammar (ParGram) project at the IMS Stuttgart. The coverage of the grammar is about 50% for unrestricted newspaper text. For the experiments reported here, the effective coverage was lower, since the corpus preprocessing we applied was minimal. Note that for the disambiguation task we were interested in, the overall grammar coverage was of subordinate relevance.) The average sentence length of 7.5 for this automatically constructed parsebank is only slightly smaller than that of 10.5 for the full set of 36,000 training sentences and 250,000 parses. Thus, we conjecture that the parsebank includes a representative variety of linguistic phenomena. Estimation from this automatically disambiguated parsebank enjoys the same complete-data estimation properties as training from manually disambiguated treebanks. This makes it interesting to compare complete-data estimation from this parsebank with incomplete-data estimation from the full set of training data.</Paragraph>
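The corpus construction described above can be sketched as two filters over the pre-parsed sentences; the data layout is an assumption for illustration:

```python
def build_corpora(parsed, max_parses=20):
    """Split pre-parsed sentences into (i) the incomplete-data training
    corpus (all sentences with 1..max_parses parses) and (ii) the
    automatically disambiguated parsebank (unique-parse sentences,
    paired with their single parse). Sketch of the selection described
    in the text; names are illustrative.
    parsed: list of (sentence, list_of_parses) pairs."""
    training = [(s, ps) for s, ps in parsed if 1 <= len(ps) <= max_parses]
    parsebank = [(s, ps[0]) for s, ps in training if len(ps) == 1]
    return training, parsebank

# Toy example: one unambiguous, one ambiguous, one over-ambiguous sentence.
parsed = [("s1", ["p1"]), ("s2", ["p1", "p2"]), ("s3", ["p"] * 25)]
training, parsebank = build_corpora(parsed)
print(len(training), len(parsebank))  # 2 1
```

The parsebank side supports complete-data estimation; the full training side requires the incomplete-data (EM-style) regime.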
    </Section>
    <Section position="2" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
4.2 Test Data and Evaluation Tasks
</SectionTitle>
      <Paragraph position="0"> To evaluate our models, we constructed two different test corpora. We first parsed with the LFG grammar 550 sentences which are used for illustrative purposes in the foreign-language learner's grammar of Helbig and Buscha (1996). In a next step, the correct parse was indicated by a human disambiguator, according to the reading intended in Helbig and Buscha (1996). Thus a precise indication of correct c/f-structure pairs was possible. (Complete-data estimation guarantees, for example, convergence to the global maximum of the complete-data log-likelihood function, which is a good precondition for highly precise statistical disambiguation.)</Paragraph>
      <Paragraph position="1"> However, the average ambiguity of this corpus is only 5.4 parses per sentence, for sentences with on average 7.5 words. In order to evaluate on sentences with a higher ambiguity rate, we manually disambiguated a further 375 sentences of LFG-parsed newspaper text.</Paragraph>
      <Paragraph position="2"> The sentences of this corpus have on average 25 parses and 11.2 words.</Paragraph>
      <Paragraph position="3"> We tested our models on two evaluation tasks. The statistical disambiguator was tested on an exact match task, where exact correspondence of the full c/f-structure pair of the hand-annotated correct parse and the most probable parse is checked. Another evaluation was done on a frame match task, where exact correspondence only of the subcategorization frame of the main verb of the most probable parse and the correct parse is checked. Clearly, the latter task involves a smaller effective ambiguity rate, and is thus to be interpreted as an evaluation of the combined system of highly constrained symbolic parsing and statistical disambiguation.</Paragraph>
      <Paragraph position="4"> Performance on these two evaluation tasks was assessed according to the following evaluation measures: precision (P), the proportion of correct parses among the sentences on which the system makes a decision, and effectiveness (E), the proportion of correct parses among all test sentences. Correct and incorrect specify success/failure on the respective evaluation tasks; don't-know cases are cases where the system is unable to make a decision, i.e. cases with more than one most probable parse.</Paragraph>
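Assuming the standard forms of these two measures (precision over decided cases, effectiveness over all cases including don't-knows), they can be computed as:

```python
def precision_effectiveness(correct, incorrect, dont_know):
    """Assumed definitions of the two measures: precision counts only
    the sentences on which the system made a decision; effectiveness
    additionally charges the don't-know cases (more than one most
    probable parse) against the system."""
    decided = correct + incorrect
    total = decided + dont_know
    return correct / decided, correct / total

# Hypothetical counts on a 100-sentence test set.
P, E = precision_effectiveness(correct=86, incorrect=10, dont_know=4)
```

Under these definitions P >= E always, with equality exactly when there are no don't-know cases.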
    </Section>
    <Section position="3" start_page="3" end_page="68" type="sub_section">
      <SectionTitle>
4.3 Experimental Results
</SectionTitle>
      <Paragraph position="0"> For each task and each test corpus, we calculated a random baseline by averaging over several models with randomly chosen parameter values. This baseline measures the disambiguation power of the pure symbolic parser. The results of an exact-match evaluation on the Helbig-Buscha corpus are shown in Fig. 2. The random baseline was around 33% for this case. The columns list different models according to their property-vectors.</Paragraph>
      <Paragraph position="1"> Basic models consist of 190 configurational properties as described in Sec. 3.1. Lexicalized models are extended by 45 lexical pre-disambiguation properties as described in Sec. 3.2. Selected + lexicalized models result from a simple property selection procedure where a cutoff on the number of parses with non-negative value of the property-functions was set. Estimation of basic models from complete data gave 68% precision (P), whereas training lexicalized and selected models from incomplete data gave 86.1% precision, which is an improvement of 18%. Comparing lexicalized models across estimation methods shows that incomplete-data estimation gives an improvement of 12% precision over training from the parsebank. A comparison of models trained from incomplete data shows that lexicalization yields a gain of 13% in precision. Note also the gain in effectiveness (E) due to the pre-disambiguation routine included in the lexicalized properties. The gain due to property selection, both in precision and effectiveness, is minimal. A similar pattern of performance arises in an exact match evaluation on the newspaper corpus with an ambiguity rate of 25. The lexicalized and selected model trained from incomplete data achieved here 60.1% precision and 57.9% effectiveness, for a random baseline of around 17%.</Paragraph>
      <Paragraph position="2"> As shown in Fig. 3, the improvement in performance due to both lexicalization and EM training is smaller for the easier task of frame evaluation. Here the random baseline is 70% for frame evaluation on the newspaper corpus with an ambiguity rate of 25. An overall gain of roughly 10% can be achieved by going from unlexicalized parsebank models (80.6% precision) to lexicalized EM-trained models (90% precision). Again, the contribution to this improvement is about the same for lexicalization and incomplete-data training. Applying the same evaluation to the Helbig-Buscha corpus shows 97.6% precision and 96.7% effectiveness for the lexicalized and selected incomplete-data model, compared to around 80% for the random baseline.</Paragraph>
      <Paragraph position="3"> Optimal iteration numbers were decided by repeated evaluation of the models at every fifth iteration. Fig. 4 shows the precision of lexicalized and selected models on the exact match task plotted against the number of iterations of the training algorithm. For parsebank training, the maximal precision value is obtained at 35 iterations. Iterating further shows a clear overtraining effect. For incomplete-data estimation, more iterations are necessary to reach a maximal precision value. A comparison of models with random or uniform starting values shows an increase in precision of 10% to 40% for the latter.</Paragraph>
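This early-stopping scheme, probing held-out precision every fifth iteration and keeping the best checkpoint, can be sketched as follows; the precision curve here is a made-up toy, not the paper's data:

```python
def best_iteration(precision_at, max_iter, step=5):
    """Pick the training iteration with maximal held-out precision,
    probing every `step`-th iteration as described in the text.
    precision_at: callable mapping an iteration number to precision
                  on the evaluation corpus (e.g. by loading and
                  scoring a saved model checkpoint)."""
    probes = range(step, max_iter + 1, step)
    return max(probes, key=precision_at)

# Toy precision curve that peaks at iteration 35, then overtrains.
curve = lambda i: 0.8 - abs(i - 35) / 100
print(best_iteration(curve, 100))  # 35
```

In practice `precision_at` would evaluate a stored model snapshot, so the probing cost is one evaluation pass per fifth iteration rather than per iteration.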
      <Paragraph position="4"> In terms of maximization of likelihood, this corresponds to the fact that uniform starting values immediately push the likelihood up to nearly its final value, whereas random starting values yield an initial likelihood which has to be increased by factors of 2 to 20 to an often lower final value.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="68" end_page="68" type="metho">
    <SectionTitle>
5 Discussion
</SectionTitle>
    <Paragraph position="0"> The most direct points of comparison of our method are the approaches of Johnson et al. (1999) and Johnson and Riezler (2000). In the first approach, log-linear models on LFG grammars using about 200 configurational properties were trained on treebanks of about 400 sentences by maximum pseudo-likelihood estimation. Precision was evaluated on an exact match task in a 10-way cross-validation paradigm for an ambiguity rate of 10, and achieved 59% for the first approach.</Paragraph>
    <Paragraph position="1"> Johnson and Riezler (2000) achieved a gain of 1% over this result by including a class-based lexicalization. Our best models clearly outperform these results, both in terms of precision relative to ambiguity and in terms of relative gain due to lexicalization. A comparison of performance is more difficult for the lexicalized PCFG of Beil et al. (1999), which was trained by EM on 450,000 sentences of German newspaper text. There, a 70.4% precision is reported on a verb frame recognition task on 584 examples. However, the gain achieved by Beil et al. (1999) due to grammar lexicalization is only 2%, compared to about 10% in our case. A comparison is difficult also for most other state-of-the-art PCFG-based statistical parsers, since different training and test data, and most importantly, different evaluation criteria were used. A comparison of the performance gain due to grammar lexicalization shows that our results are on a par with those reported in Charniak (1997).</Paragraph>
  </Section>
</Paper>