<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-0725">
  <Title>A Comparison of PCFG Models*</Title>
  <Section position="3" start_page="0" end_page="123" type="metho">
    <SectionTitle>
2 A generalized k-gram model
</SectionTitle>
    <Paragraph position="0"> Recall that k-gram models are stochastic models for the generation of sequences sl,s2,...</Paragraph>
    <Paragraph position="1"> based on conditional probabilities, that is:  1. the probability P(sls2... stlM) of a sequence in the model M is computed as a product pM( Sl )pM( S2\[Sl ) &amp;quot; &amp;quot; &amp;quot; pM( St\[Sl S2 . . . St-l), and 2. the dependence of the probabilities PM  on previous history is assumed to be restricted to the immediate preceding context, in particular, the last k - 1 words: PM(St\[Sl . . . St-1) ---- pM(St\[St-k+l . . . St-1). Note that in this kind of models, the probability that the observation st is generated at time t is computed as a function of the subsequence of length k - 1 that immediately precedes st (this is called a state). However, in the case of trees, it is not obvious what context should be taken in to account. Indeed, there is a natural preference when processing strings (the usual left-to-right  order) but there are at least two standard ways of processing trees: ascending (or bottom-up) analysis and descending (or top-down) analysis. Ascending tree automata recognize a wider class of languages (Nivat and Podelski, 1!197; G~cseg and Steinby, 1984) and, therefore, they allow for richer descriptions.</Paragraph>
    <Paragraph position="2"> Thus, our model will compute the expansion probability for a given node as a function of the subtree of depth k - 2 that the node generates 1, i.e., every state stores a subtree of depth k - 2. In the particular case k = 2, only the label of the node is taken into account (this is analogous to the standard bigram model for strings) and the model coincides with the simple rule-counting approach. For instance, for the tree depicted in Fig. 1, the following rules are obtained:</Paragraph>
    <Paragraph position="4"> However, in case k = 3, the expansion probabilities depend on the states that are defined by the node label, the number of descendents the node and the sequence of labels in the descendents (if any). Therefore, for the same tree the following rules are obtained in this case:</Paragraph>
    <Paragraph position="6"> where each state has the form X(Z1...Zm).</Paragraph>
    <Paragraph position="7"> This is equivalent to a relabeling of the parse tree before extracting the rules.</Paragraph>
    <Paragraph position="8"> Finally, in the parent annotated model (PA) described in (Johnson, 1998) the states depend 1Note that in our notation a single node tree has depth 0. This is in contrast to strings, where a single symbol has length 1.</Paragraph>
    <Paragraph position="9"> on both the node label and the node's parent  It is obvious that the k = 3 and PA models incorporate contextual information that is not present in the case k = 2 and, then, a higher number of rules for a fixed number of categories is possible. In practice, due to the finite size of the training corpus, the number of rules is always moderate. However, as higher values of k lead to an enormous number of possible rules, huge data sets would be necessary in order to have a reliable estimate of the probabilities for values above k = 3. A detailed mathematical description of these type of models can be found in (Rico-Juan et al., 2000)</Paragraph>
  </Section>
  <Section position="4" start_page="123" end_page="124" type="metho">
    <SectionTitle>
3 Experimental results
</SectionTitle>
    <Paragraph position="0"> The following table shows some data obtained with the three different models and the WSJ corpus. The second column contains the number of rules in the grammar obtained from a training subset of the corpus (24500 sentences, about the first half in the corpus) and the last one contains the percentage of sentences in a test set (2000 sentences) that cannot be parsed by the grammar.</Paragraph>
    <Paragraph position="1">  As expected, the number of rules obtained increases as more information is conveyed by the node label, although this increase is not extreme. On the other hand, as the generalization power decreases, some sentences in the test set become unparsable, that is, they cannot be generated by the grammar. The number of unparsed sentences is very small for the parent annotated model but cannot be neglected for the k = 3 model.</Paragraph>
    <Paragraph position="2"> As we will use the perplexity of a test sample S = {wl, ..., w\]s\] } as an indication of the quality of the model,</Paragraph>
    <Paragraph position="4"> , unparsable sentences would produce an infinite perplexity. Therefore, we studied the perplexity of the test set for a linear combination of two models Mi and Mj with p(wklMi -- Mj) = )~p(wklMi) + (1 - ~)p(wklMj). The mixing parameter ~ was chosen in order to minimize the perplexity. Figure 2 shows that there is always  Lower line: k = 2 and k = 3.</Paragraph>
    <Paragraph position="5"> a minimum perplexity for an intermediate value of ),. The best results were obtained with a mixture of the k-gram models for k = 2 and k = 3 with a heavier component (73%) of the last one. The minimum perplexity PPm and the corresponding value of ~ obtained are shown in the following table: Mixture model PPm Am k = 2 and PA 107.9 0.58 k = 2 and k = 3 91.0 0.27 It is also worth to remark that the model k = 3 is the less ambiguous model and, then, parsing of sentences becomes much faster.</Paragraph>
  </Section>
class="xml-element"></Paper>