<?xml version="1.0" standalone="yes"?>
<Paper uid="P99-1010">
  <Title>Supervised Grammar Induction using Training Data with Limited Constituent Information *</Title>
  <Section position="4" start_page="74" end_page="74" type="metho">
    <SectionTitle>
3 Training Data Annotation
</SectionTitle>
    <Paragraph position="0"> The training sets are annotated in multiple ways, falling into two categories. First, we construct training sets annotated with random sub-sets of constituents consisting 0%, 25~0, 50%, 75% and 100% of the brackets in the fully annotated corpus. Second, we construct sets training in which only a certain type of constituent is annotated. We study five linguistic categories.</Paragraph>
    <Paragraph position="1"> Table 1 summarizes the annotation differences between the five classes and lists the percentage of brackets in each class with respect to the total number of constituents 1 for ATIS and WSJ. In an AI1NP training set, all and only the noun phrases in the sentences are labeled.</Paragraph>
    <Paragraph position="2"> For the BaseNP class, we label only simple noun phrases that contain no embedded noun phrases. Similarly for a BaseP set, all simple phrases made up of only lexical items are labeled. Although there is a high intersection between the set of BaseP labels and the set of BaseNP labels, the two classes are not identical.</Paragraph>
    <Paragraph position="3"> A BaseNP may contain a BaseP. For the example in Table 1, the phrase &amp;quot;at most one stop&amp;quot; is a BaseNP that contains a quantifier BaseP &amp;quot;at most one.&amp;quot; NotBaseP is the complement of BaseP. The majority of the constituents in a sentence belongs to this category, in which at least one of the constituent's sub-constituents is not a simple lexical item. Finally, in a HighP set, we label only complex phrases that decom1 For computing the percentage of brackets, the outermost bracket around the entire sentence and the brackets around singleton phrases (e.g., the pronoun &amp;quot;r' as a BaseNP) are excluded because they do not contribute to the pruning of parses.</Paragraph>
    <Paragraph position="4"> pose into sub-phrases that may be either another HighP or a BaseP. That is, a HighP constituent does not directly subsume any lexical word. A typical HighP is a sentential clause or a complex noun phrase. The example sentence in Table 1 contains 3 HighP constituents: a complex noun phrase made up of a BaseNP and a prepositional phrase; a sentential clause with an omitted subject NP; and the full sentence.</Paragraph>
  </Section>
  <Section position="5" start_page="74" end_page="75" type="metho">
    <SectionTitle>
4 Induction Strategies
</SectionTitle>
    <Paragraph position="0"> To induce a grammar from the sparsely bracketed training data previously described, we use a variant of the Inside-Outside re-estimation algorithm proposed by Pereira and Schabes (1992). The inferred grammars are represented in the Probabilistic Lexicalized Tree Insertion Grammar (PLTIG) formalism (Schabes and Waters, 1993; Hwa, 1998a), which is lexicalized and context-free equivalent. We favor the PLTIG representation for two reasons. First, it is amenable to the Inside-Outside re-estimation algorithm (the equations calculating the inside and outside probabilities for PLTIGs can be found in Hwa (1998b)). Second, its lexicalized representation makes the training process more efficient than a traditional PCFG while maintaining comparable parsing qualities.</Paragraph>
    <Paragraph position="1"> Two training strategies are considered: direct induction, in which a grammar is induced from scratch, learning from only the sparsely labeled training data; and adaptation, a two-stage learning process that first uses direct induction to train the grammar on an existing fully labeled corpus before retraining it on the new corpus. During the retraining phase, the probabilities of the grammars are re-estimated based on the new training data. We expect the adaptive method to induce better grammars than direct induction when the new corpus is only partially  annotated because the adapted grammars have collected better statistics from the fully labeled data of another corpus.</Paragraph>
  </Section>
  <Section position="6" start_page="75" end_page="75" type="metho">
    <SectionTitle>
5 Experiments and Results
</SectionTitle>
    <Paragraph position="0"> We perform two experiments. The first uses ATIS as the corpus from which the different types of partially labeled training sets are generated. Both induction strategies train from these data, but the adaptive strategy pretrains its grammars with fully labeled data drawn from the WSJ corpus. The trained grammars are scored on their parsing abilities on unseen ATIS test sets. We use the non-crossing bracket measurement as the parsing metric. This experiment will show whether annotations of a particular linguistic category may be more useful for training grammars than others. It will also indicate the comparative merits of the two induction strategies trained on data annotated with these linguistic categories. However, pretraining on the much more complex WSJ corpus may be too much of an advantage for the adaptive strategy. Therefore, we reverse the roles of the corpus in the second experiment. The partially labeled data are from the WSJ corpus, and the adaptive strategy is pretrained on fully labeled ATIS data. In both cases, part-of-speech(POS) tags are used as the lexical items of the sentences. Backing off to POS tags is necessary because the tags provide a considerable intersection in the vocabulary sets of the two corpora. null</Paragraph>
    <Section position="1" start_page="75" end_page="75" type="sub_section">
      <SectionTitle>
5.1 Experiment 1: Learning ATIS
</SectionTitle>
      <Paragraph position="0"> The easier learning task is to induce grammars to parse ATIS sentences. The ATIS corpus consists of 577 short sentences with simple structures, and the vocabulary set is made up of 32 * POS tags, a subset of the 47 tags used for the WSJ. Due to the limited size of this corpus, ten sets of randomly partitioned train-test-held-out triples are generated to ensure the statistical significance of our results. We use 80 sentences for testing, 90 sentences for held-out data, and the rest for training. Before proceeding with the main discussion on training from the ATIS, we briefly describe the pretraining stage of the adaptive strategy.</Paragraph>
      <Paragraph position="1">  The idea behind the adaptive method is simply to make use of any existing labeled data. We hope that pretraining the grammars on these data might place them in a better position to learn from the new, sparsely labeled data. In the pretraining stage for this experiment, a grammar is directly induced from 3600 fully labeled WSJ sentences. Without any further training on ATIS data, this grammar achieves a parsing score of 87.3% on ATIS test sentences.</Paragraph>
      <Paragraph position="2"> The relatively high parsing score suggests that pretraining with WSJ has successfully placed the grammar in a good position to begin training with the ATIS data.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="75" end_page="78" type="metho">
    <SectionTitle>
ATIS
</SectionTitle>
    <Paragraph position="0"> We now return to the main focus of this experiment: learning from sparsely annotated ATIS training data. To verify whether some constituent classes are more informative than others, we could compare the parsing scores of the grammars trained using different constituent class labels. But this evaluation method does not take into account that the distribution of the constituent classes is not uniform. To normalize for this inequity, we compare the parsing scores to a baseline that characterizes the relationship between the performance of the trained grammar and the number of bracketed constituents in the training data. To generate the baseline, we create training data in which 0%, 25%, 50%, 75%, and 100% of the constituent brackets are randomly chosen to be included.</Paragraph>
    <Paragraph position="1"> One class of linguistic labels is better than another if its resulting parsing improvement over the baseline is higher than that of the other.</Paragraph>
    <Paragraph position="2"> The test results of the grammars induced from these different training data are summarized in Figure 1. Graph (a) plots the outcome of using the direct induction strategy, and graph (b) plots the outcome of the adaptive strategy. In each graph, the baseline of random constituent brackets is shown as a solid line. Scores of grammars trained from constituent type specific data sets are plotted as labeled dots. The dotted horizontal line in graph (b) indicates the ATIS parsing score of the grammar trained on WSJ alone.</Paragraph>
    <Paragraph position="3"> Comparing the five constituent types, we see that the HighP class is the most informative  for the adaptive strategy, resulting in a grammar that scored better than the baseline. The grammars trained on the AllNP annotation performed as well as the baseline for both strategies. Grammars trained under all the other training conditions scored below the baseline.</Paragraph>
    <Paragraph position="4"> Our results suggest that while an ideal training condition would include annotations of both higher-level phrases and simple phrases, complex clauses are more informative. This interpretation explains the large gap between the parsing scores of the directly induced grammar and the adapted grammar trained on the same HighP data. The directly induced grammar performed poorly because it has never seen a labeled example of simple phrases. In contrast, the adapted grammar was already exposed to labeled WSJ simple phrases, so that it successfully adapted to the new corpus from annotated examples of higher-level phrases. On the other hand, training the adapted grammar on annotated ATIS simple phrases is not successful even though it has seen examples of WSJ higher-level phrases. This also explains why grammars trained on the conglomerate class NotBaseP performed on the same level as those trained on the AllNP class. Although the NotBaseP set contains the most brackets, most of the brackets are irrelevant to the training process, as they are neither higher-level phrases nor simple phrases.</Paragraph>
    <Paragraph position="5"> Our experiment also indicates that induction strategies exhibit different learning characteristics under partially supervised training conditions. A side by side comparison of Figure 1 (a) and (b) shows that the adapted grammars perform significantly better than the directly induced grammars as the level of supervision decreases. This supports our hypothesis that pretraining on a different corpus can place the grammar in a good initial search space for learning the new domain. Unfortunately, a good initial state does not obviate the need for supervised training. We see from Figure l(b) that retraining with unlabeled ATIS sentences actually lowers the grammar's parsing accuracy.</Paragraph>
    <Section position="1" start_page="75" end_page="78" type="sub_section">
      <SectionTitle>
5.2 Experiment 2: Learning WSJ
</SectionTitle>
      <Paragraph position="0"> In the previous section, we have seen that annotations of complex clauses are the most helpful for inducing ATIS-style grammars. One of the goals of this experiment is to verify whether the result also holds for the WSJ corpus, which is structurally very different from ATIS. The WSJ corpus uses 47 POS tags, and its sentences are longer and have more embedded clauses.</Paragraph>
      <Paragraph position="1"> As in the previous experiment, we construct training sets with annotations of different constituent types and of different numbers of randomly chosen labels. Each training set consists of 3600 sentences, and 1780 sentences are used as held-out data. The trained grammars are tested on a set of 2245 sentences.</Paragraph>
      <Paragraph position="2"> Figure 2 (a) and (b) summarize the outcomes  function of the number of brackets present in the training corpus. There is a total of 46463 brackets in the training corpus.</Paragraph>
      <Paragraph position="3"> of this experiment. Many results of this section are similar to the ATIS experiment. Higher-level phrases still provide the most information; the grammars trained on the HighP labels are the only ones that scored as well as the baseline. Labels of simple phrases still seem the least informative; scores of grammars trained on BaseP and BaseNP remained far below the baseline.</Paragraph>
      <Paragraph position="4"> Different from the previous experiment, however, the AI1NP training sets do not seem to provide as much information for this learning task. This may be due to the increase in the sentence complexity of the WSJ, which further de-emphasized the role of the simple phrases.</Paragraph>
      <Paragraph position="5"> Thus, grammars trained on AllNP labels have comparable parsing scores to those trained on HighP labels. Also, we do not see as big a gap between the scores of the two induction strategies in the HighP case because the adapted grammar's advantage of having seen annotated ATIS base nouns is reduced. Nonetheless, the adapted grammars still perform 2% better than the directly induced grammars, and this improvement is statistically significant. 2 Furthermore, grammars trained on NotBaseP do not fall as far below the baseline and have higher parsing scores than those trained on HighP and AllNP. This suggests that for more complex domains, other linguistic constituents 2A pair-wise t-test comparing the parsing scores of the ten test sets for the two strategies shows 99% confidence in the difference.</Paragraph>
      <Paragraph position="6"> such as verb phrases 3 become more informative.</Paragraph>
      <Paragraph position="7"> A second goal of this experiment is to test the adaptive strategy under more stringent conditions. In the previous experiment, a WSJ-style grammar was retrained for the simpler ATIS corpus. Now, we reverse the roles of the corpora to see whether the adaptive strategy still offers any advantage over direct induction.</Paragraph>
      <Paragraph position="8"> In the adaptive method's pretraining stage, a grammar is induced from 400 fully labeled ATIS sentences. Testing this ATIS-style grammar on the WSJ test set without further training renders a parsing accuracy of 40%. The low score suggests that fully labeled ATIS data does not teach the grammar as much about the structure of WSJ. Nonetheless, the adaptive strategy proves to be beneficial for learning WSJ from sparsely labeled training sets. The adapted grammars out-perform the directly induced grammars when more than 50% of the brackets are missing from the training data.</Paragraph>
      <Paragraph position="9"> The most significant difference is when the training data contains no label information at all. The adapted grammar parses with 60.1% accuracy whereas the directly induced grammar parses with 49.8% accuracy.</Paragraph>
      <Paragraph position="10"> SV~e have not experimented with training sets containing only verb phrases labels (i.e., setting a pair of bracket around the head verb and its modifiers). They are a subset of the NotBaseP class.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>