<?xml version="1.0" standalone="yes"?>
<Paper uid="N01-1029">
  <Title>References</Title>
  <Section position="2" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Classic PLCG parsing
</SectionTitle>
    <Paragraph position="0"> The parameters of a PLCG are called projection probabilities. They are of the form p(Z → Xν | X, G), to be read as "given a completed constituent X dominated by a goal category G, the probability that there is a Z that has X as its first daughter and ν as its next daughters". A PLCG contains essentially the same rules as a probabilistic context-free grammar (PCFG), but the latter conditions the rule probabilities on the mother category Z (production probabilities). In both cases the joint probability of the entire parse tree and the parsed sentence is the product of the production resp. projection probabilities of the local trees it consists of.</Paragraph>
    <Paragraph position="1"> While PCFG parsing proceeds from the top down or from the bottom up, PLCG naturally leads to a parsing scheme that is a mixture of both. The advantages of this are made clear in the subsections below. Formally, a PLCG parser has three elementary operations: SHIFT: given that an unexpanded constituent G starts from position i, shift the next word w_i with probability p_s(w_i | G) (G is called the goal category); PROJECT: given a complete constituent X, dominated by a goal category G, starting in position i and ending in j, predict a mother constituent Z starting in position i and completed up till position j, and zero or more unexpanded sister constituents ν starting in j, with probability p_p(Z → Xν | X, G); ATTACH: given a complete constituent X dominated by a goal category G, identify the former with the latter with probability p_a(X, G).</Paragraph>
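To make the three distributions concrete, here is a toy sketch (our own illustration, not taken from the paper): the SHIFT, PROJECT and ATTACH parameters are stored as plain lookup tables, and the joint probability of a derivation and the words it covers is the product of the probabilities of its steps. All categories, words and numbers below are invented.

```python
# Toy illustration (not from the paper): the three PLCG parameter tables as
# plain dictionaries, and the joint probability of a derivation as the product
# of the SHIFT / PROJECT / ATTACH probabilities applied along it.
p_shift   = {("the", "NP"): 0.4}                 # p_s(word | goal category G)
p_project = {("NP", ("N",), "DT", "NP"): 0.3}    # p_p(Z -> X nu | X, G), here nu = ("N",)
p_attach  = {("NP", "NP"): 0.6}                  # p_a(X, G)

# An abridged derivation: one SHIFT, one PROJECT and one ATTACH step.
derivation = [
    ("shift",   ("the", "NP")),
    ("project", ("NP", ("N",), "DT", "NP")),
    ("attach",  ("NP", "NP")),
]

tables = {"shift": p_shift, "project": p_project, "attach": p_attach}
prob = 1.0
for op, key in derivation:
    prob *= tables[op][key]          # product of local probabilities
print(f"joint probability of derivation and covered words: {prob:.4f}")
```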
  </Section>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Extending the PLCG framework
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Synchronous chart parsing with PLCG
</SectionTitle>
      <Paragraph position="0"> In this subsection we present the basic parsing algorithm and its data structures and operations. In the subsections that follow, we will introduce lexicalization and context-sensitivity by extending this framework.</Paragraph>
      <Paragraph position="1"> The PLCG parsing process is interpreted as a search through a network of states, a compact representation of the search space. The network nodes correspond to states and the arcs to operations (annotated with transition probabilities). A (partial) parse corresponds to a (partial) path through the network. The joint probability of a partial parse and the covered part of the sentence is equal to the partial path probability, i.e. the product of the probabilities of the transitions in the path.</Paragraph>
      <Paragraph position="2">  We write a state q as q = (G; Z →_i X ? ·_j ν; α, γ) (1) where G is the goal category, Z is the category of a constituent from position i complete up till position j, X is the first daughter category, ν denotes the remaining unresolved daughters of Z, and α and γ are the forward and inner probabilities defined below. The wildcard ? symbolizes zero or more resolved daughter categories: we make abstraction of the identities of resolved daughters (except the first one), because further parser moves do not depend on them. If ν is empty, q is called a complete state, otherwise q is a goal state.</Paragraph>
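As a concrete reading of (1), the following is a minimal sketch (our own field names, not the paper's code) of a state record and of a chart keyed by the (i, j) span; the forward and inner probabilities attached to each state are discussed next.

```python
# Minimal sketch of the state in (1) and of a chart indexed by (start, end).
from dataclasses import dataclass
from typing import Tuple, Dict, List

@dataclass
class State:
    goal: str                    # G, the goal category
    mother: str                  # Z, the category being built
    first_daughter: str          # X (other resolved daughters are abstracted away by "?")
    remaining: Tuple[str, ...]   # nu, the unresolved daughters right of the dot
    start: int                   # i
    end: int                     # j, position up to which Z is complete
    alpha: float = 0.0           # forward probability
    gamma: float = 0.0           # inner probability

    def is_complete(self) -> bool:
        # complete state: no unresolved daughters left; otherwise a goal state
        return len(self.remaining) == 0

# chart cell (i, j) holds all states starting in i and completed up to j
Chart = Dict[Tuple[int, int], List[State]]
chart: Chart = {}
```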
      <Paragraph position="3">  Given a state q as defined in (1), we define its forward probability α = α(q) as the sum of the probabilities of the paths ending in q, starting in the initial state and generating w_0^{j-1}. As a consequence, α(q) = p(w_0^{j-1}, q) (a joint probability).</Paragraph>
      <Paragraph position="4"> The inner probability γ = γ(q) is the sum of the probabilities of the paths generating w_i^{j-1}, ending in q and starting with a SHIFT of w_i. As a consequence, γ(q) = p(w_i^{j-1}, q).</Paragraph>
      <Paragraph position="5"> Note that the forward and inner probabilities of the final state should be identical and equal to p(S).  In this paragraph we reformulate the classic PLCG parser operations in terms of transitions between states. We hereby specify update formulas for forward and inner probabilities.</Paragraph>
      <Paragraph position="6"> Shift The SHIFT operation starts from a goal state q = (G; Z →_i X ? ·_j Yν; α, γ) (2) and shifts the next word w at position j of the input by updating q′ or generating a new state q′ where2</Paragraph>
      <Paragraph position="8"> If q′ already lives in the chart, only its forward probability is updated. The given update formula is justified by the relation</Paragraph>
      <Paragraph position="10"> where the sum is over all SHIFT transitions from q to q′ and p(q ⇒ q′) denotes the transition probability from q to q′. Computing γ(q′) is a trivial case of the definition.</Paragraph>
      <Paragraph position="11">  2 α′ += p means that α′ is set to p if there was no q′ in the chart yet; otherwise α′ is incremented with p.</Paragraph>
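A hedged sketch of the SHIFT transition follows, reusing the State record and chart from the sketch above. Since equations (3) and (4) are not reproduced in this text, the exact shape of the shifted state is our assumption: the shifted word is treated as a complete item under the goal category Y that q was waiting for, its forward probability accumulates α(q)·p_s(w | Y), and its inner probability is simply p_s(w | Y), following the "+=" convention of footnote 2.

```python
# Hedged sketch of a SHIFT transition; the shape of the shifted state is an
# assumption, since equation (3) is not reproduced in this text.
def shift(chart, q, word, p_shift_table):
    """q is a goal state; its first unresolved daughter Y becomes the new goal."""
    goal_y = q.remaining[0]
    p = p_shift_table.get((word, goal_y), 0.0)   # p_s(word | Y)
    if p == 0.0:
        return None
    j = q.end
    key = (j, j + 1)
    # reuse an equivalent state if it already lives in the chart ("+=" convention);
    # in that case only its forward probability is updated
    for q_new in chart.setdefault(key, []):
        if (q_new.goal, q_new.mother, q_new.first_daughter) == (goal_y, word, word) \
                and q_new.is_complete():
            q_new.alpha += q.alpha * p
            return q_new
    q_new = State(goal=goal_y, mother=word, first_daughter=word,
                  remaining=(), start=j, end=j + 1,
                  alpha=q.alpha * p, gamma=p)
    chart[key].append(q_new)
    return q_new
```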
      <Paragraph position="12"> Projection From a complete state, two transitions are possible: ATTACH to a goal state with a probability p_a or PROJECT with a probability 1 − p_a. PROJECT starts from a complete state q = (G; Z →_i X ? ·_j; α, γ) (5) and generates or updates a state q′ = (G; T →_i Z ? ·_j ν; α′ += αp, γ′ += γp) (6) with transition probability p = p_p(T, ν | Z, G) · (1 − p_a(Z, G)). (7)  Again, the forward probability is computed recursively as a sum of products. Now γ′ needs to be accumulated, too: the constituent Z in general may be resolved with more than one different X, which each time adds to γ′.</Paragraph>
      <Paragraph position="13"> Note that a mother constituent inherits G from her first daughter (left-corner).</Paragraph>
      <Paragraph position="14"> Attachment Given a complete state q as in (5) where G = Z and some goal state q″ in the partial path leading to q</Paragraph>
      <Paragraph position="16"> Why can α′ not be computed as in (3) and (6)? The reason is that ATTACH makes use of non-local constraints: the transition from q to q′ is only possible if a matching goal state q″ occurred in a path leading to q. Therefore computing α′ as in (3) and (6) would include all paths that generate q′, also those that do not contain q″. Instead, the update of α′ in (9) combines all paths leading to q″ with the paths starting from q″ and ending in q. The update of γ′ follows an analogous reasoning.</Paragraph>
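The PROJECT and ATTACH updates can be sketched in the same style, again reusing the State record and chart from the earlier sketch. The probability tables and the lookup of the matching goal state q″ are simplified stand-ins for the parser's bookkeeping: PROJECT implements α′ += αp and γ′ += γp with p = p_p(T, ν | Z, G)(1 − p_a(Z, G)), and ATTACH combines the forward and inner probabilities of q″ with the inner probability of q, as discussed for (9) and (10).

```python
# Hedged sketch of PROJECT and ATTACH; data structures are our own.
def find_or_new(chart, key, goal, mother, first, remaining, start, end):
    for s in chart.setdefault(key, []):
        if (s.goal, s.mother, s.first_daughter, s.remaining) == (goal, mother, first, remaining):
            return s
    s = State(goal, mother, first, remaining, start, end)
    chart[key].append(s)
    return s

def project(chart, q, mother_t, remaining_nu, p_project, p_attach):
    """q: complete state (G; Z ->_i X ? ._j; alpha, gamma).  Creates or updates
    q' = (G; T ->_i Z ? ._j nu) per (5)-(7); remaining_nu is a tuple of labels."""
    p = p_project.get((mother_t, remaining_nu, q.mother, q.goal), 0.0) \
        * (1.0 - p_attach.get((q.mother, q.goal), 0.0))
    q_new = find_or_new(chart, (q.start, q.end), q.goal, mother_t, q.mother,
                        remaining_nu, q.start, q.end)
    q_new.alpha += q.alpha * p          # alpha' += alpha * p
    q_new.gamma += q.gamma * p          # gamma' += gamma * p
    return q_new

def attach(chart, q, q_goal, p_attach):
    """q: complete state with G = Z; q_goal: a goal state q'' on a path leading to q
    (we assume q_goal.remaining[0] matches q.mother)."""
    p = p_attach.get((q.mother, q.goal), 0.0)
    q_new = find_or_new(chart, (q_goal.start, q.end), q_goal.goal, q_goal.mother,
                        q_goal.first_daughter, q_goal.remaining[1:], q_goal.start, q.end)
    q_new.alpha += q_goal.alpha * q.gamma * p   # combine paths to q'' with paths q'' -> q
    q_new.gamma += q_goal.gamma * q.gamma * p
    return q_new
```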
      <Paragraph position="17">  The parser produces a set of states that can be conveniently organized in a staircase-shaped chart similar to the one used by the CYK parser. In the chart cell with coordinates (i, j) we store all the states starting in i and completed up till position j.</Paragraph>
      <Paragraph position="18">  Following (Chelba, 2000), we represent a sentence by a sequence of word identities starting with a sentence-begin token <s>, which is used in the context but not predicted, followed by a sentence-end token </s>, which is predicted by the model. We collect the sentence proper together with </s> under a node labeled TOP′, and the TOP′ node together with <s> under a TOP node. The parser starts from the initial state</Paragraph>
      <Paragraph position="20"> After processing the sentence S = w_0^{N-1} and provided a full parse was found, the final state</Paragraph>
      <Paragraph position="22"> is found in cell (−1, N).</Paragraph>
      <Paragraph position="23"> Now we are ready to formulate the parsing algorithm. Note that we treat an ATTACH operation as a special PROJECT, as explained in Sec. 4.1.</Paragraph>
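A small sketch of the sentence representation described above. The bracketing into TOP and TOP′ nodes and the coordinates of the initial state are not reproduced here, so only the wrapping and the final cell are shown; the position bookkeeping is our assumption (<s> at position −1, the sentence proper plus </s> at positions 0 to N−1).

```python
# Sketch of the sentence wrapping and the final chart cell (assumed positions).
def wrap_sentence(words):
    # <s> is used as context but not predicted; </s> is predicted by the model
    return ["<s>"] + list(words) + ["</s>"]

def final_cell(wrapped):
    # a completed full parse (a TOP state) ends up in chart cell (-1, N),
    # where N counts the predicted tokens w_0 .. w_{N-1}
    n = len(wrapped) - 1          # exclude <s>
    return (-1, n)

sentence = wrap_sentence(["the", "dog", "barks"])
print(sentence, final_cell(sentence))
```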
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Lexicalization and context-sensitivity
</SectionTitle>
      <Paragraph position="0"> Probably the most important shortcoming of PCFGs is the assumption of context-free rule probabilities, i.e. the probability distribution over possible right-hand sides given a left-hand side is independent of the function or position of the left-hand side. This assumption is quite wrong. For instance, in the Penn Treebank an NP in subject position produces a personal pronoun in 13.7% of the cases, while in object position it only does so in 2.1% of the cases (Manning and Carpenter, 1997).</Paragraph>
      <Paragraph position="1"> Furthermore, findings from corpus-based linguistic studies and developments in functional grammar indicate that the lexical realization of a context, besides its syntactic analysis, strongly influences patterns of syntactic preference. Today's best automatic parsers are made substantially more efficient and accurate by applying lexicalized grammars (Manning and Schütze, 1999).</Paragraph>
      <Paragraph position="2">  In our work we did not attempt to find semantic generalizations (such as casting a verb form to its infinitive form or finding semantic attributes); our simple (but probably suboptimal) approach, borrowed from (Magerman, 1994; Collins, 1996; Chelba, 2000), is to percolate words upward in the parse tree in the form in which they appear in the sentence. In our experiments, we opted to hardcode the head positions as part of the projection rules.3 The nodes of the resulting partial parse trees thus are annotated with a category label (the CAT feature) and a lexical label (the WORD feature).</Paragraph>
      <Paragraph position="3"> The notation (1) of a state is now replaced with</Paragraph>
      <Paragraph position="5"> where z is the WORD of the mother (possibly empty), x is the WORD of the first daughter (not empty), and the extended context contains</Paragraph>
      <Paragraph position="7"> goal state dominating q1.</Paragraph>
      <Paragraph position="8"> If the grammar only contains unary and binary rules, L1 and L2 correspond with Chelba's concept of exposed heads -- which was in fact the idea behind the definition above. The mixed bottom-up and top-down parsing order of PLCG makes it possible to condition q on a goal constituent G higher up in the partial tree containing q; this turns out to significantly improve efficiency with respect to Jelinek's bottom-up chart parser.</Paragraph>
      <Paragraph position="9"> 3Inserting a probabilistic head percolation model, as in (Chelba, 2000), may be an alternative.</Paragraph>
      <Paragraph position="10">  In this section, we extend the parser operations of Sec. 3.1.3 to handle context-sensitive and lexicalized states. The forward and inner probability update formulas remain formally the same and are not repeated here.</Paragraph>
      <Paragraph position="11"> The SHIFT operation q ⇒_s q′ is a transition from q to q′ with probability p where</Paragraph>
      <Paragraph position="13"> The PROJECT operation q ⇒_p q′ is a transition from q to q′ with probability p where</Paragraph>
      <Paragraph position="15"> If Z is in head position, t = z; otherwise t is left unspecified.</Paragraph>
      <Paragraph position="16"> The ATTACH operation q ⇒_a q′ is a transition from q to q′ given q″ with a probability p where</Paragraph>
      <Paragraph position="18"> If Y is in head position, z′ = y; otherwise, z′ = z.</Paragraph>
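The head-word (WORD feature) propagation rule stated above can be sketched as two small helpers; the data structures are our own, not the paper's code.

```python
# Sketch of WORD-feature propagation during PROJECT and ATTACH.
from typing import Optional

def project_word(daughter_word: str, daughter_is_head: bool) -> Optional[str]:
    # PROJECT: t = z if Z is in head position, otherwise t is left unspecified
    return daughter_word if daughter_is_head else None

def attach_word(mother_word: Optional[str], attached_word: str,
                attached_is_head: bool) -> Optional[str]:
    # ATTACH: z' = y if Y is in head position, otherwise z' = z
    return attached_word if attached_is_head else mother_word

print(project_word("barks", True), attach_word(None, "dog", False))
```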
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 PLCG-based language model
</SectionTitle>
      <Paragraph position="0"> A language model (LM) is a word sequence predictor (or an estimator of word sequence probabilities). Following common practice in language modeling for speech recognition, we predict words in a sentence from left to right4 with probabilities of the form p(w_j | w_0^{j-1}). Suppose the parser has worked its way through w_0^{j-1} and is about to make the SHIFT transitions for w_j. Then we can write</Paragraph>
      <Paragraph position="2"> stages of the search.</Paragraph>
      <Paragraph position="3"> where a^1_j is the set of goal states in position j. The factor p(w_j | q) is given by the transition probability associated with the SHIFT operation.5 On the other hand, note that</Paragraph>
      <Paragraph position="4"> where a^3_j is the set of states in position j that resulted from SHIFT operations. The first equation holds because there are only PROJECT and ATTACH transitions between the elements of a^3_j and a^1_j: the outgoing transition probabilities of each state in that region sum to 1, and therefore the total probability mass is preserved. By inserting (15) into</Paragraph>
      <Paragraph position="7"/>
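The word-prediction step can be sketched as follows. This is a minimal sketch assuming states expose their forward probability as `.alpha` (as in the earlier record) and that `shift_prob` is a stand-in callable for p(w_j | q): the next-word probability is the α-weighted sum of the SHIFT probabilities over the goal states in position j, normalized by the summed forward mass, which by the argument above equals p(w_0^{j-1}).

```python
# Minimal sketch of p(w_j | w_0^{j-1}) as described above; `goal_states_at_j`
# plays the role of the set a^1_j.
def next_word_prob(goal_states_at_j, word, shift_prob):
    norm = sum(q.alpha for q in goal_states_at_j)        # equals p(w_0^{j-1})
    if norm == 0.0:
        return 0.0
    mass = sum(q.alpha * shift_prob(word, q) for q in goal_states_at_j)
    return mass / norm
```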
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.4 Model reestimation
</SectionTitle>
      <Paragraph position="0"> The p_p, p_s and p_a submodels can be reestimated with iterative expectation-maximization, which requires the computation of frequency expectations. For this purpose we define the outer probability of a state q, written as β(q), as the sum of the probabilities of precisely that part of the paths that is not included in the inner probability of q. The outer probability of a complete state is analogous to Baker's (1979) definition of an outside probability.</Paragraph>
      <Paragraph position="1"> The outer probabilities are computed in the reverse direction starting from q_F, provided that a list of backward references was stored with each state</Paragraph>
      <Paragraph position="3"> Reverse ATTACH (cfr. (8′), (9′), (10′)): β += β′γ″p and β″ += β′γp. These formulas are made clear in Fig. 1.</Paragraph>
      <Paragraph position="4"> Reverse PROJECT (cfr. (5′), (6′), (7′)): β += β′p. A reverse SHIFT is not necessary. Note that the outer probability of a state must be fully accumulated before it propagates to other items; a topological sort could serve this purpose.</Paragraph>
      <Paragraph position="6"> Figure 1: Probabilities along a single path at the attachment of q to q″, resulting in q′.</Paragraph>
      <Paragraph position="7"> Now the expected frequency of a transition o ∈ {s, p, a} from q to q′ in a full parse of S is</Paragraph>
      <Paragraph position="9"> The expected frequencies required for the reestimation of the conditional distributions are then obtained by summing (18) over the state attributes on which the required distribution does not depend.</Paragraph>
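The bookkeeping of this E/M step can be sketched schematically. Since formula (18) itself is not reproduced in this text, the per-transition expected count is assumed to be given; the sketch only shows how counts are pooled over the attributes that the conditional distribution ignores and then renormalized, here for the PROJECT submodel.

```python
# Schematic E/M bookkeeping sketch (our own structure; the expected counts from
# (18) are assumed to be precomputed over the whole corpus).
from collections import defaultdict

def reestimate_project(transitions):
    """transitions: iterable of (T, nu, Z, G, expected_count) tuples."""
    num = defaultdict(float)    # c(T, nu, Z, G)
    den = defaultdict(float)    # c(Z, G): summed over the attributes that
                                # p_p(T, nu | Z, G) does not condition on
    for T, nu, Z, G, c in transitions:
        num[(T, nu, Z, G)] += c
        den[(Z, G)] += c
    # relative-frequency M-step for the conditional distribution
    return {key: c / den[(key[2], key[3])] for key, c in num.items()}
```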
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Empirical evaluation
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Modeling
</SectionTitle>
      <Paragraph position="0"> We have trained two sets of models. The first set was trained on sections 0-20 of the Penn Treebank (PTB) (Marcus et al., 1995) using sections 21-22 for development decisions and tested on sections 23-24. The second set was trained on the BLLIP WSJ Corpus (BWC), which is a machine-parsed (Charniak, 2000) version of (a selection of) the ACL/DCI corpus, very similar to the selection made for the WSJ0/1 CSR corpus. As the training set, we used the BWC minus the WSJ0/1 "dfiles" and "efiles" intended for CSR development and evaluation testing.</Paragraph>
      <Paragraph position="1"> The PTB devset was used for fixing submodel parameterizations and software debugging, while perplexities are measured on the PTB testset. The BWC trainset was used in rescoring N-best lists in order to assess the models' potential in speech recognition. Both the PTB and BWC underwent the following preprocessing steps: (a) A vocabulary was fixed as the 10k (PTB) resp. 30k (BWC) most frequent words; out-of-vocabulary words were replaced by <unk>. Numbers in Arabic digits were replaced by one token 'N'. (b) Punctuation was removed. (c) All characters were converted to lowercase. (d) All parse trees were binarized in much the same way as detailed in (Chelba, 2000, pp. 12-17); non-terminal unary productions were eliminated by collapsing two nodes connected by a unary branch into one node annotated with a combined label. This step allowed a simple implementation and comparison of results with related publications. We distinguished 1891 different projections, 143 different non-terminal categories and 41 different parts-of-speech. (e) All constituents were annotated with a lexical head using the deterministic rules of Magerman (1994).</Paragraph>
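Steps (a)-(c) of this preprocessing can be sketched as follows; the exact tokenization, punctuation test and number pattern are our assumptions, chosen only for illustration.

```python
# Sketch of preprocessing steps (a)-(c); patterns are illustrative assumptions.
import re
from collections import Counter

def build_vocab(corpus_tokens, size):
    # (a) keep the `size` most frequent words (10k for PTB, 30k for BWC)
    return {w for w, _ in Counter(corpus_tokens).most_common(size)}

def preprocess(tokens, vocab):
    out = []
    for tok in tokens:
        tok = tok.lower()                                    # (c) lowercase
        if any(ch.isdigit() for ch in tok) and re.fullmatch(r"[\d.,/-]+", tok):
            tok = "N"                                        # (a) collapse Arabic numbers
        elif re.fullmatch(r"[^\w]+", tok):
            continue                                         # (b) drop punctuation
        elif tok not in vocab:
            tok = "<unk>"                                    # (a) out-of-vocabulary words
        out.append(tok)
    return out
```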
      <Paragraph position="2"> The training then proceeded by decomposing all parse trees into sequences of SHIFT, PROJECT and ATTACH transitions. The submodels were finally estimated from smoothed relative counts of transitions using standard language modeling techniques: Good-Turing back-off (Katz, 1987) and deleted interpolation (Jelinek, 1997).</Paragraph>
      <Paragraph position="3"> Shift submodel The SHIFT submodel implements (4′). Finding a good parameterization entails fixing the features that should explicitly appear in the context and in which order, so that all information-bearing elements are incorporated, with limited data fragmentation. This is not a straightforward task. We went through an iterative process of intuitively guessing which feature should be added to or removed from the context or changing the order, building a corresponding model and evaluating its conditional perplexity (CPPL) against the devset. The CPPL of a SHIFT submodel is its perplexity measured on a test set consisting of (context, word to be predicted) pairs (i.e. the SHIFT transitions according to a certain parameterization) extracted from the correct parse trees of a parsed test corpus. In other words, the CPPL is a lower bound of the PPL in that it would be the PPL obtained from an ideal parser. We finally concluded that the parameterization (notation consistent with (2′))</Paragraph>
      <Paragraph position="5"> where the conditioning sequence is ordered from most to least significant, is optimal for our purposes in the given experimental conditions.</Paragraph>
      <Paragraph position="6"> Table 1: Test set perplexities.
model                                      GT    DI
(a) word trigram                          190   193
(b) PLCG-based LM                         185   187
(c) linear interpolation: .6(a) + .4(b)   159   166
The CPPL of this model on the PTB devset is 48, which demonstrates the great potential of a correct syntactic partial parse to predict the next word.</Paragraph>
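The CPPL measurement described above can be sketched as an ordinary perplexity over (context, word) pairs read off the correct parses of a test corpus; `shift_prob` is a stand-in for a smoothed SHIFT submodel and is assumed to return nonzero probabilities.

```python
# Sketch of conditional perplexity (CPPL) over (context, word) pairs.
import math

def conditional_perplexity(pairs, shift_prob):
    """pairs: iterable of (context, word); shift_prob(word, context) -> probability > 0."""
    log_sum, n = 0.0, 0
    for context, word in pairs:
        log_sum += math.log(shift_prob(word, context))
        n += 1
    return math.exp(-log_sum / n)
```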
      <Paragraph position="7"> Project/attach submodel The ATTACH submodel can be incorporated into the PROJECT submodel by treating the attachment as a special kind of projection. This approach was systematically applied since it sped up parsing. Having the possibility to choose different parameterizations in separate PROJECT and ATTACH submodels did not lower perplexity and increased execution time. Therefore, we always used combined PROJECT/ATTACH submodels in further experiments. The PROJECT/ATTACH submodel implements (7′) and (10′). The process used to find an appropriate parameterization for the SHIFT submodel was also applied here. Finally we concluded that the parameterization (notation consistent with (5′)) p_p(T, ν | Z, G, z) (20) is optimal for our purposes in the given experimental conditions.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Evaluation of PTB models
</SectionTitle>
      <Paragraph position="0"> Table 1 lists test set perplexities (excluding OOVs and unparsed parts of sentences) of Good-Turing smoothed back-off models (GT) and deleted-interpolation smoothed models (DI) trained on the PTB trainset and tested on the PTB testset. We observed similar results with both smoothing methods. As a baseline, a word trigram (a) was trained and tested on the same material. The PPL obtained with the PLCG-based LM (b), using parametrizations (19) and (20), is not much lower than the baseline PPL.7 Interpolation (c) with the baseline, however, yields a relative PPL reduction of 14 to 16% with respect to the baseline.</Paragraph>
      <Paragraph position="1"> 7Using parametrizations p_p(T, ν | z, G, L1.CAT) for projection from W-items and p_p(T, ν | G, Z, X, z) for other projections, we recently obtained a PPL of 178 (and 155 when interpolated). This result is left out of the discussion in order to keep it clear and complete.</Paragraph>
      <Paragraph position="2">  Table 2: Word error rates (WER) after N-best rescoring on the DARPA WSJ Nov '92 evaluation test set, non-verbalized punctuation. The models are smoothed with Good-Turing back-off (WER results in column GT) or deleted interpolation (DI).</Paragraph>
      <Paragraph position="3"> rescoring model               GT     DI
(a) DARPA word trigram         10.44
(b) BWC word trigram           11.31  11.08
(c) BWC Chelba-Jelinek SLM     10.86
(d) (a) and (c) combined        9.82
(e) (b) and (c) combined       10.60
(f) BWC PLCG-based SLM         11.45  11.48
(g) (a) and (f) combined        9.85   9.87
(h) (b) and (f) combined       10.38  10.58
(i) Best possible               4.46   4.46
Parse accuracy is around 79% for both labeled precision and recall on section 23 of the PTB (excluding unparsed sentences, about 4% of all sentences). In comparison, with our own implementation of Chelba-Jelinek, we measured a labeled precision and recall of 57% and 75% on the same input. These results seem fairly low compared to other recent work on large-scale parsing, but may be partly due to the left-to-right restriction of our language models,8 which for instance prohibits word lookahead. Moreover, while we measured accuracy against a binarized version of the PTB, the original parses are rather flat, which may allow higher accuracies.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.3 Evaluation of BWC-models
</SectionTitle>
      <Paragraph position="0"> The main target application of our research into language modeling is speech recognition. We performed N-best list rescoring experiments on the DARPA WSJ Nov '92 evaluation test set (non-verbalized punctuation).</Paragraph>
      <Paragraph position="1"> The N-best lists were obtained from the L&H Voice Xpress v4 speech recognizer using the standard trigram model included in the test suite (20k open vocabulary, no punctuation).</Paragraph>
      <Paragraph position="2"> In Table 2 we report word-recognition error rates (WER) after rescoring using the Chelba-Jelinek and PLCG-based models. Both DI and GT smoothing methods yielded very comparable results. Due to technical limitations, all the models except the baseline trigram were trimmed by ignoring highest-order events that occurred only once.</Paragraph>
      <Paragraph position="3"> The best PLCG-based SLM trained on the BWC train set (f) performs worse than the official word trigram (a). However, since the BWC does not completely cover the WSJ0 LM training material</Paragraph>
      <Paragraph position="4"> and slightly differs in tokenization, it is fairer to compare with the performance of a word trigram trained on the BWC train set (b). Results (g) and (h) show that the PLCG-based SLM lowers WER by 4% relative when used in combination with the baseline models. A comparable result was obtained with the Chelba-Jelinek SLM (results (d) and (e)). 8Not to be confused with left-to-right parsing.</Paragraph>
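The rescoring setup described in this subsection can be sketched as follows. The weights, the linear interpolation of the two language models and the log-linear combination with the acoustic score are illustrative assumptions, not the paper's exact recipe.

```python
# Sketch of N-best rescoring with an interpolated language model (assumed setup).
import math

def rescore_nbest(hypotheses, lm1, lm2, lam=0.4, lm_weight=10.0):
    """hypotheses: list of (words, acoustic_logprob);
    lm1, lm2: functions mapping a word sequence to a probability."""
    def total_score(words, ac_logprob):
        # linear interpolation of the two LMs, then log-linear combination
        lm_prob = (1 - lam) * lm1(words) + lam * lm2(words)
        return ac_logprob + lm_weight * math.log(max(lm_prob, 1e-12))
    return max(hypotheses, key=lambda h: total_score(*h))
```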
    </Section>
  </Section>
class="xml-element"></Paper>