<?xml version="1.0" standalone="yes"?>
<Paper uid="P04-1058">
  <Title>Alternative Approaches for Generating Bodies of Grammar Rules</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Grammatical Framework
</SectionTitle>
    <Paragraph position="0"> We briefly detail the grammars we work with (PCW-grammars), how automata give rise to these grammars, and how we parse using them.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 PCW-Grammars
</SectionTitle>
      <Paragraph position="0"> We need a grammatical framework that models rule bodies as instances of a regular language and that allows us to transform automata to grammars as directly as possible. We decided to embed them in the general grammatical framework of CW-grammars (Infante-Lopez and de Rijke, 2003): based on PCFGs, they have a clear and well-understood mathematical background and we do not need to implement ad-hoc parsing algorithms.</Paragraph>
      <Paragraph position="1"> A probabilistic constrained W-grammar (PCWgrammar) consists of two different sets of PCF-like rules called pseudo-rules and meta-rules respectively and three pairwise disjoint sets of symbols: variables, non-terminals and terminals. Pseudorules and meta-rules provide mechanisms for building 'real' rewrite rules. We use a w== b to indicate that a should be rewritten as b. In the case of PCWgrammars, rewrite rules are built by first selecting a pseudo-rule, and then using meta-rules for instantiating all the variables in the body of the pseudo-rule. To illustrate these concepts, we provide an example. Let W = (V,NT,T,S, m[?]-, s[?]-) be a CW-grammar such that the set of variable, non-terminals meta-rules pseudo-rules Adj m[?]-0.5 AdjAdj S s[?]-1 AdjNoun Adj m[?]-0.5 Adj Adj s[?]-0.1 big Noun s[?]-1 ball ...</Paragraph>
      <Paragraph position="2"> and terminals are defined as follows: V = {Adj}, NT = {S, Adj, Noun}, T = {ball, big, fat, red, green, ...}. As usual, the numbers attached to the arrows indicate the probabilities of the rules. The rules defined by W have the following shape: S w== Adj[?] Noun. Suppose now that we want to build the rule S w== Adj Adj Noun. We take the pseudo-rule S s[?]-1 Adj Noun and instantiate the variable Adj with Adj Adj to get the desired rule.</Paragraph>
      <Paragraph position="3"> The probability for it is 1 x 0.5 x 0.5, that is, the probability of the derivation for Adj Adj times the probability of the pseudo-rule used. Trees for this particular grammar are flat, with a main node S and all the adjectives in it as daughters. An example derivation is given in Figure 1(a).</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 From Automata to Grammars
</SectionTitle>
      <Paragraph position="0"> Now that we have introduced PCW-grammars, we describe how we build them from the automata that we are going to induce in Section 4. Since we will induce two families of automata (&amp;quot;Many-Automata&amp;quot; where we use two automata per POS, and &amp;quot;One-Automaton&amp;quot; where we use only two automata to fit every POS), we need to describe two automata-to-grammar transformations.</Paragraph>
      <Paragraph position="1"> Let's start with the case where we build two automata per POS. Let w be a POS in the PTB; let AwL and AwR be the two automata associated to it. Let GwL and GwR be the PCFGs equivalent to AwL and AwR, respectively, following (Abney et al., 1999), and let SwL and SwR be the starting symbols of GwL and GwR, respectively. We build our final grammar G with starting symbol S, by defining its meta-rules as the disjoint union of all rules in GwL and GwR (for all POS w), its set of pseudo-rules as the union of the sets {W s[?]-1 SwLwSwR and S s[?]-1 SwLwSwR}, where W is a unique new variable symbol associated to w.</Paragraph>
      <Paragraph position="2"> When we use two automata for all parts of speech, the grammar is defined as follows. Let AL and AR be the two automata learned. Let GL and GR be the PCFGs equivalent to AL and AR, and let SL and SR be the starting symbols of GL and GR, respectively. Fix a POS w in the PTB. Since the automata are deterministic, there exist states SwL and SwR that are reachable from SL and SR, respectively, by following the arc labeled with w. Define a grammar as in the previous case. Its starting symbol is S, its set of meta-rules is the disjoint union of all rules in GwL and GwR (for all POS w), its set of pseudo-rules is {W s[?]-1 SwLwSwR,S s[?]-1 SwLwSwR : w is a POS in the PTB and W is a unique new variable symbol associated to w}.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Parsing PCW-Grammars
</SectionTitle>
      <Paragraph position="0"> Parsing PCW-grammars requires two steps: a generation-rule step followed by a tree-building step. We now explain how these two steps can be carried out in one go. Parsing with PCW-grammars can be viewed as parsing with PCF grammars. The main difference is that in PCW-parsing derivations for variables remain hidden in the final tree. To clarify this, consider the trees depicted in Figure 1; the tree in part (a) is the CW-tree corresponding to the word red big green ball, and the tree in part (b) is the same tree but now the instantiations of the meta-rules that were used have been made visible.</Paragraph>
      <Paragraph position="1">  tree with meta-rule derivations made visible.</Paragraph>
      <Paragraph position="2"> To adapt a PCFG to parse CW-grammars, we need to define a PCF grammar for a given PCWgrammar by adding the two sets of rules while making sure that all meta-rules have been marked somehow. In Figure 1(b) the head symbols of meta-rules have been marked with the superscript 1. After parsing the sentence with the PCF parser, all marked rules should be collapsed as shown in part (a).</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="30" type="metho">
    <SectionTitle>
4 Building Automata
</SectionTitle>
    <Paragraph position="0"> The four grammars we intend to induce are completely defined once the underlying automata have been built. We now explain how we build those automata from the training material. We start by detailing how the material is generated.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Building the Sample Sets
</SectionTitle>
      <Paragraph position="0"> We transform the PTB, sections 2-22, to dependency structures, as suggested by (Collins, 1999).</Paragraph>
      <Paragraph position="1"> All sentences containing CC tags are filtered out, following (Eisner, 1996). We also eliminate all word information, leaving only POS tags. For each resulting dependency tree we extract a sample set of right and left sequences of dependents as shown in Figure 2. From the tree we generate a sample set with all right sequences of dependents {epsilon1,epsilon1,epsilon1}, and another with all left sequences {epsilon1,epsilon1,red big green}. The sample set used for automata induction is the union of all individual tree sample sets.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Learning Probabilistic Automata
</SectionTitle>
      <Paragraph position="0"> Probabilistic deterministic finite state automata (PDFA) inference is the problem of inducing a stochastic regular grammar from a sample set of strings belonging to an unknown regular language.</Paragraph>
      <Paragraph position="1"> The most direct approach for solving the task is by  adds a state to the resulting automaton for each sequence of symbols of length n it has seen in the training material; it also adds an arc between states ab and bb labeled b, if the sequence abb appears in the training set. The probability assigned to the arc (ab,bb) is proportional to the number of times the sequence abb appears in the training set. For the remainder, we take n-grams to be bigrams.</Paragraph>
      <Paragraph position="2"> There are other approaches to inducing regular grammars besides ones based on n-grams. The first algorithm to learn PDFAs was ALERGIA (Carrasco and Oncina, 1994); it learns cyclic automata with the so-called state-merging method. The Minimum Discrimination Information (MDI) algorithm (Thollard et al., 2000) improves over ALERGIA and uses Kullback-Leibler divergence for deciding when to merge states. We opted for the MDI algorithm as an alternative to n-gram based induction algorithms, mainly because their working principles are radically different from the n-gram-based algorithm.</Paragraph>
      <Paragraph position="3"> The MDI algorithm first builds an automaton that only accepts the strings in the sample set by merging common prefixes, thus producing a tree-shaped automaton in which each transition has a probability proportional to the number of times it is used while generating the positive sample.</Paragraph>
      <Paragraph position="4"> The MDI algorithm traverses the lattice of all possible partitions for this general automaton, attempting to merge states that satisfy a trade-off that can be specified by the user. Specifically, assume that A1 is a temporary solution of the algorithm and that A2 is a tentative new solution derived from A1. [?](A1,A2) = D(A0||A2) [?] D(A0||A1) denotes the divergence increment while going from A1 to A2, where D(A0||Ai) is the Kullback-Leibler divergence or relative entropy between the two distributions generated by the corresponding automata (Cover and Thomas, 1991). The new solution A2 is compatible with the training data if the divergence increment relative to the size reduction, that is, the reduction of the number of states, is small enough. Formally, let alpha denote a compatibility threshold; then the compatibility is satisfied if [?](A1,A2) |A1|[?]|A2 |&lt; alpha. For this learning algorithm, alpha is the unique parameter; we tuned it to get better quality automata.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="30" type="sub_section">
      <SectionTitle>
4.3 Optimizing Automata
</SectionTitle>
      <Paragraph position="0"> We use three measures to evaluate the quality of a probabilistic automaton (and set the value of alpha optimally). The first, called test sample perplexity (PP), is based on the per symbol log-likelihood of strings x belonging to a test sample according to the distribution defined by the automaton. Formally, LL = [?] 1|S|summationtextx[?]S log(P(x)), where P(x) is the probability assigned to the string x by the automata. The perplexity PP is defined as PP = 2LL. The minimal perplexity PP = 1 is reached when the next symbol is always predicted with probability 1 from the current state, while PP = |S |corresponds to uniformly guessing from an alphabet of size |S|.</Paragraph>
      <Paragraph position="1"> The second measure we used to evaluate the quality of an automaton is the number of missed samples (MS). A missed sample is a string in the test sample that the automaton failed to accept. One such instance suffices to have PP undefined (LL infinite).</Paragraph>
      <Paragraph position="2"> Since an undefined value of PP only witnesses the presence of at least one MS we decided to count the number of MS separately, and compute PP without taking MS into account. This choice leads to a more accurate value of PP, while, moreover, the value of MS provides us with information about the generalization capacity of automata: the lower the value of MS, the larger the generalization capacities of the automaton. The usual way to circumvent undefined perplexity is to smooth the resulting automaton with unigrams, thus increasing the generalization capacity of the automaton, which is usually paid for with an increase in perplexity. We decided not to use any smoothing techniques as we want to compare bigram-based automata with MDI-based automata in the cleanest possible way. The PP and MS measures are relative to a test sample; we transformed section 00 of the PTB to obtain one.1 1If smoothing techniques are used for optimizing automata based on n-grams, they should also be used for optimizing MDI-based automata. A fair experiment for comparing the two automata-learning algorithms using smoothing techniques would consist of first building two pairs of automata. The first pair would consist of the unigram-based automaton together The third measure we used to evaluate the quality of automata concerns the size of the automata. We compute NumEdges and NumStates (the number of edges and the number of states of the automaton).</Paragraph>
      <Paragraph position="3"> We used PP, US, NumEdges, and NumStates to compare automata. We say that one automaton is of a better quality than another if the values of the 4 indicators are lower for the first than for the second. Our aim is to find a value of alpha that produces an automaton of better quality than the bigram-based counterpart. By exhaustive search, using all training data, we determined the optimal value of alpha. We selected the value of alpha for which the MDI-based automaton outperforms the bigram-based one.2 We exemplify our procedure by considering automata for the &amp;quot;One-Automaton&amp;quot; setting (where we used the same automata for all parts of speech). In Figure 3 we plot all values of PP and MS computed for different values of alpha, for each training set (i.e., left and right). From the plots we can identify values of alphathat produce automata having better values of PP and MS than the bigram-based ones.</Paragraph>
      <Paragraph position="4"> All such alphas are the ones inside the marked areas; automata induced using those alphas possess a lower value of PP as well as a smaller number of MS, as required. Based on these explorations  case, with alpha = 0.0001.</Paragraph>
      <Paragraph position="5"> we selected alpha = 0.0001 for building the automata used for grammar induction in the &amp;quot;One-Automaton&amp;quot; case. Besides having lower values of PP and MS, the resulting automata are smaller than the bigram based automata (Table 1). MDI compresses information better; the values in the tables with an MDI-based automaton outperforming the unigram-based one. The second one, a bigram-based automata together with an MDI-based automata outperforming the bigram-based one. Second, the two n-gram based automata smoothed into a single automaton have to be compared against the two MDI-based automata smoothed into a single automaton. It would be hard to determine whether the differences between the final automata are due to smoothing procedure or to the algorithms used for creating the initial automata. By leaving smoothing out of the picture, we obtain a clearer understanding of the differences between the two automata induction algorithms. 2An equivalent value of alpha can be obtained independently of the performance of the bigram-based automata by defining a measure that combines PP and MS. This measure should reach its maximum when PP and MS reach their minimums. null suggest that MDI finds more regularities in the sample set than the bigram-based algorithm.</Paragraph>
      <Paragraph position="6"> To determine optimal values for the &amp;quot;Many-Automata&amp;quot; case (where we learned two automata for each POS) we used the same procedure as for the &amp;quot;One-Automaton&amp;quot; case, but now for every individual POS. Because of space constraints we are not able to reproduce analogues of Figure 3 and Table 1 for all parts of speech. Figure 4 contains representative plots; the remaining plots are available online at http://www.science.</Paragraph>
      <Paragraph position="7"> uva.nl/~infante/POS.</Paragraph>
      <Paragraph position="8"> Besides allowing us to find the optimal alphas, the plots provide us with a great deal of information. For instance, there are two remarkable things in the plots for VBP (Figure 4, second row). First, it is one of the few examples where the bigram-based algorithm performs better than the MDI algorithm. Second, the values of PP in this plot are relatively high and unstable compared to other POS plots. Lower perplexity usually implies better quality automata, and as we will see in the next section, better automata produce better parsers. How can we obtain lower PP values for the VBP automata? The class of words tagged with VBP harbors many different behaviors, which is not surprising, given that verbs can differ widely in terms of, e.g., their sub-categorization frames. One way to decrease the PP values is to split the class of words tagged with VBP into multiple, more homogeneous classes. Note from Figures 3 and 4 that splitting the original sample sets into POS-dependent sets produces a huge decrease on PP. One attempt to implement this idea is lexicalization: increasing the information in the POS tag by adding the lemma to it (Collins, 1997; Sima'an, 2000). Lexicalization splits the class of verbs into a family of singletons producing more homogeneous classes, as desired. A different approach (Klein and Manning, 2003) consists in adding head information to dependents; words tagged with VBP are then split into classes according to the words that dominate them in the training corpus.</Paragraph>
      <Paragraph position="9"> Some POS present very high perplexities, but tags such as DT present a PP close to 1 (and 0 MS) for all values of alpha. Hence, there is no need to introduce further distinctions in DT, doing so will not increase the quality of the automata but will increase their number; splitting techniques are bound to add noise to the resulting grammars. The plots also indicate that the bigram-based algorithm captures them as well as the MDI algorithm.</Paragraph>
      <Paragraph position="10"> In Figure 4, third row, we see that the MDI-based automata and the bigram-based automata achieve the same value of PP (close to 5) for NN, but  the MDI misses fewer examples for alphas bigger than 1.4e [?] 04. As pointed out, we built the One-Automaton-MDI using alpha = 0.0001 and even though the method allows us to fine-tune each alpha in the Many-Automata-MDI grammar, we used a fixed alpha = 0.0002 for all parts of speech, which, for most parts of speech, produces better automata than bigrams. Table 2 lists the sizes of the automata. The differences between MDI-based and bigram-based automata are not as dramatic as in the &amp;quot;One-Automaton&amp;quot; case (Table 1), but the former again have consistently lower NumEdges and NumStates values, for all parts of speech, even where bigram-based automata have a lower perplexity.</Paragraph>
      <Paragraph position="11">  in the &amp;quot;Many-Automata&amp;quot; case, with alpha = 0.0002 for parts of speech.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="30" end_page="30" type="metho">
    <SectionTitle>
5 Parsing the PTB
</SectionTitle>
    <Paragraph position="0"> We have observed remarkable differences in quality between MDI-based and bigram-based automata.</Paragraph>
    <Paragraph position="1"> Next, we present the parsing scores, and discuss the meaning of the measures observed for automata in the context of the grammars they produce. The measure that translates directly from automata to grammars is automaton size. Since each automaton is transformed into a PCFG, the number of rules in the resulting grammar is proportional to the number of arcs in the automaton, and the number of non-terminals is proportional to the number of states.</Paragraph>
    <Paragraph position="2"> From Table 3 we see that MDI compresses information better: the sizes of the grammars produced by the MDI-based automata are an order of magnitude smaller that those produced using bigram-based automata. Moreover, the &amp;quot;One-Automaton&amp;quot; versions substantially reduce the size of the resulting grammars; this is obviously due to the fact that all POS share the same underlying automaton so that information does not need to be duplicated across parts of speech. To understand the meaning of PP and  MS in the context of grammars it helps to think of PCW-parsing as a two-phase procedure. The first phase consists of creating the rules that will be used in the second phase. And the second phase consists in using the rules created in the first phase as a PCFG and parsing the sentence using a PCF parser.</Paragraph>
    <Paragraph position="3"> Since regular expressions are used to build rules, the values of PP and MS quantify the quality of the set of rules built for the second phase: MS gives us a measure of the number rule bodies that should be created but that will not be created, and, hence, it gives us a measure of the number of &amp;quot;correct&amp;quot; trees that will not be produced. PP tells us how uncertain the first phase is about producing rules.</Paragraph>
    <Paragraph position="4"> Finally, we report on the parsing accuracy. We use two measures, the first one (%Words) was proposed by Lin (1995) and was the one reported in (Eisner, 1996). Lin's measure computes the fraction of words that have been attached to the right word. The second one (%POS) marks as correct a word attachment if, and only if, the POS tag of the head is the same as that of the right head, i.e., the word was attached to the correct word-class, even though the word is not the correct one in the sentence. Clearly, the second measure is always higher than the first one. The two measures try to capture the performance of the PCW-parser in the two phases described above: (%POS) tries to capture the performance in the first phase, and (%Words) in the second phase. The measures reported in Table 4 are the mean values of (%POS) and (%Words) computed over all sentences in section 23 having length at most 20. We parsed only those sentences because the resulting grammars for bigrams are too big: parsing all sentences without any serious pruning techniques was simply not feasible. From Table 4  we see that the grammars induced with MDI out-perform the grammars created with bigrams. Moreover, the grammar using different automata per POS outperforms the ones built using only a single automaton per side (left or right). The results suggest that an increase in quality of the automata has a direct impact on the parsing performance.</Paragraph>
  </Section>
  <Section position="7" start_page="30" end_page="30" type="metho">
    <SectionTitle>
6 Related Work and Discussion
</SectionTitle>
    <Paragraph position="0"> Modeling rule bodies is a key component of parsers.</Paragraph>
    <Paragraph position="1"> N-grams have been used extensively for this purpose (Collins 1996, 1997; Eisner, 1996). In these formalisms the generative process is not considered in terms of probabilistic regular languages. Considering them as such (like we do) has two advantages. First, a vast area of research for inducing regular languages (Carrasco and Oncina, 1994; Thollard et al., 2000; Dupont and Chase, 1998) comes in sight. Second, the parsing device itself can be viewed under a unifying grammatical paradigm like PCW-grammars (Chastellier and Colmerauer, 1969; Infante-Lopez and de Rijke, 2003). As PCW-grammars are PCFGs plus post tree transformations, properties of PCFGs hold for them too (Booth and Thompson, 1973).</Paragraph>
    <Paragraph position="2"> In our comparison we optimized the value of alpha, but we did not optimize the n-grams, as doing so would mean two different things. First, smoothing techniques would have to be used to combine different order n-grams. To be fair, we would also have to smooth different MDI-based automata, which would leave us in the same point.</Paragraph>
    <Paragraph position="3"> Second, the degree of the n-gram. We opted for n = 2 as it seems the right balance of informativeness and generalization. N-grams are used to model sequences of arguments, and these hardly ever have length &gt; 3, making higher degrees useless. To make a fair comparison for the Many-Automata grammars we did not tune the MDI-based automata individually, but we picked a unique alpha.</Paragraph>
    <Paragraph position="4"> MDI presents a way to compact rule information on the PTB; of course, other approaches exists. In particular, Krotov et al. (1998) try to induce a CW-grammar from the PTB with the underlying assumption that some derivations that were supposed to be hidden were left visible. The attempt to use algorithms other than n-grams-based for inducing of regular languages in the context of grammar induction is not new; for example, Kruijff (2003) uses profile hidden models in an attempt to quantify free order variations across languages; we are not aware of evaluations of his grammars as parsing devices.</Paragraph>
  </Section>
class="xml-element"></Paper>