<?xml version="1.0" standalone="yes"?>
<Paper uid="P97-1064">
<Title>A Structured Language Model</Title>
<Section position="3" start_page="0" end_page="498" type="metho">
<SectionTitle> 2 The Basic Idea and Terminology </SectionTitle>
<Paragraph position="0"> Consider predicting the word barked in the sentence: the dog I heard yesterday barked again. A 3-gram approach would predict barked from (heard, yesterday), whereas it is clear that the predictor should use the word dog, which is outside the reach of even 4-grams. Our assumption is that what enables us to make a good prediction of barked is the syntactic structure in the past. The correct partial parse of the word history when predicting barked is shown in Figure 1. The word dog is called the headword of the constituent ( the (dog (...) )) and dog is an exposed headword when predicting barked -- the topmost headword in the largest constituent that contains it. The syntactic structure in the past filters out irrelevant words and points to the important ones, thus enabling the use of long-distance information when predicting the next word. Our model will assign a probability P(W, T) to every sentence W with every possible binary branching parse T and every possible headword annotation for every constituent of T. Let W be a sentence of length l words to which we have prepended <s> and appended </s> so that w_0 = <s> and w_{l+1} = </s>. Let W_k be the word k-prefix w_0 ... w_k of the sentence and W_kT_k the word-parse k-prefix. To stress this point, a word-parse k-prefix contains only those binary trees whose span is completely included in the word k-prefix, excluding w_0 = <s>. Single words can be regarded as root-only trees. Figure 2 shows a word-parse k-prefix; h_0 ... h_{-m} are the exposed headwords. A complete parse -- Figure 3 -- is any binary parse of the w_1 ... w_l </s> sequence with the restriction that </s> is the only allowed headword.</Paragraph>
<Paragraph position="1"> Note that (w_1 ... w_l) needn't be a constituent, but for the parses where it is, there is no restriction on which of its words is the headword.</Paragraph>
<Paragraph position="2"> The model will operate by means of two modules:
* PREDICTOR predicts the next word w_{k+1} given the word-parse k-prefix and then passes control to the PARSER;
* PARSER grows the already existing binary branching structure by repeatedly generating the transitions adjoin-left or adjoin-right until it passes control to the PREDICTOR by taking a null transition.</Paragraph>
<Paragraph position="3"> The operations performed by the PARSER ensure that all possible binary branching parses with all possible headword assignments for the w_1 ... w_k word sequence can be generated. They are illustrated by Figures 4-6.</Paragraph>
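As an informal aside, the word-parse prefix and the two adjoin transitions can be pictured with a short code sketch. The Python fragment below is only illustrative: the class and method names (Constituent, WordParsePrefix, predict, adjoin_left, adjoin_right) are assumptions made for the example, and the direction of headword percolation assumed here (left child for adjoin-left, right child for adjoin-right) is likewise an assumption; the paper defines the transitions through Figures 4-6.

from dataclasses import dataclass
from typing import Optional


@dataclass
class Constituent:
    """A binary subtree; only its headword is visible to the model."""
    headword: str
    left: Optional["Constituent"] = None
    right: Optional["Constituent"] = None


class WordParsePrefix:
    """Holds the exposed headwords h_{-m} ... h_{-1}, h_0 (last element is h_0)."""

    def __init__(self):
        # <s> is kept at the bottom of the stack as a start-of-sentence marker.
        self.exposed = [Constituent("<s>")]

    def predict(self, word: str) -> None:
        """PREDICTOR step: append the next word as a root-only tree."""
        self.exposed.append(Constituent(word))

    def adjoin_left(self) -> None:
        """PARSER step: join h_{-1} and h_0; the new headword is taken from h_{-1} (assumed convention)."""
        right, left = self.exposed.pop(), self.exposed.pop()
        self.exposed.append(Constituent(left.headword, left, right))

    def adjoin_right(self) -> None:
        """PARSER step: join h_{-1} and h_0; the new headword is taken from h_0 (assumed convention)."""
        right, left = self.exposed.pop(), self.exposed.pop()
        self.exposed.append(Constituent(right.headword, left, right))


# Building one parse of a fragment of the example sentence:
p = WordParsePrefix()
p.predict("the")
p.predict("dog")
p.adjoin_right()       # (the dog), headed by "dog"
p.predict("barked")    # "dog" is an exposed headword when "barked" is predicted
print([c.headword for c in p.exposed])   # ['<s>', 'dog', 'barked']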
<Paragraph position="4"> The following algorithm describes how the model generates a word sequence with a complete parse (see Figures 3-6 for notation):</Paragraph>
<Paragraph position="5"> It is easy to see that any given word sequence with a possible parse and headword annotation is generated by a unique sequence of model actions.</Paragraph>
</Section>
<Section position="4" start_page="498" end_page="499" type="metho">
<SectionTitle> 3 Probabilistic Model </SectionTitle>
<Paragraph position="0"> The probability P(W, T) can be broken into:
P(W, T) = prod_{k=1}^{l+1} [ P(w_k/W_{k-1}T_{k-1}) (1) * prod_{i=1}^{N_k} P(t_i^k/w_k, W_{k-1}T_{k-1}, t_1^k ... t_{i-1}^k) (2) ], where:
* W_{k-1}T_{k-1} is the word-parse (k-1)-prefix;
* w_k is the word predicted by the PREDICTOR;
* N_k is the number of PARSER operations carried out at position k, the last one, t_{N_k}^k, being the null transition;
* t_i^k denotes the i-th PARSER operation carried out at position k in the word string.</Paragraph>
<Paragraph position="6"> As can be seen, (w_k, W_{k-1}T_{k-1}, t_1^k ... t_{i-1}^k) is one of the N_k word-parse k-prefixes of W_kT_k, i = 1 ... N_k, at position k in the sentence.</Paragraph>
<Paragraph position="7"> To ensure a proper probabilistic model we have to make sure that (1) and (2) are well-defined conditional probabilities and that the model halts with probability one. A few provisions need to be taken:</Paragraph>
<Paragraph position="9"> one provision ensures that the headword of a complete parse is </s>;</Paragraph>
<Paragraph position="11"> another ensures that the model halts with probability one.</Paragraph>
<Section position="1" start_page="498" end_page="499" type="sub_section">
<SectionTitle> 3.1 The first model </SectionTitle>
<Paragraph position="0"> The first term (1) can be reduced to an n-gram LM, P(w_k/W_{k-1}T_{k-1}) = P(w_k/w_{k-1}, w_{k-2}, ..., w_{k-n+1}).</Paragraph>
<Paragraph position="2"> A simple alternative to this degenerate approach would be to build a model which predicts the next word based on the preceding p-1 exposed headwords and n-1 words in the history, thus making the following equivalence classification: P(w_k/W_{k-1}T_{k-1}) = P(w_k/h_0, h_{-1}, ..., h_{-(p-2)}, w_{k-1}, ..., w_{k-n+1}).</Paragraph>
<Paragraph position="4"> The approach is similar to the trigger LM (Lau93), the difference being that in the present work triggers are identified using the syntactic structure.</Paragraph>
</Section>
<Section position="2" start_page="499" end_page="499" type="sub_section">
<SectionTitle> 3.2 The second model </SectionTitle>
<Paragraph position="0"> Model (2) assigns probability to different binary parses of the word k-prefix by chaining the elementary operations described above. The workings of the PARSER are very similar to those of Spatter (Jelinek94). It can be brought to the full power of Spatter by changing the action of the adjoin operation so that it takes into account the terminal/nonterminal labels of the constituent proposed by adjoin and also predicts the nonterminal label of the newly created constituent; the PREDICTOR will then predict the next word along with its POS tag. The best equivalence classification of the W_kT_k word-parse k-prefix is yet to be determined. The Collins parser (Collins96) shows that dependency-grammar-like bigram constraints may be the most adequate, so the equivalence classification [W_kT_k] should contain at least {h_0, h_{-1}}.</Paragraph>
</Section>
</Section>
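As a small illustration of the decomposition above, the Python sketch below chains PREDICTOR and PARSER probabilities over the unique action sequence of a (W, T) pair. The names (Action, log_p_sentence_and_parse, p_word, p_trans) and the uniform toy distributions are assumptions made for this example, not the paper's models.

import math
from typing import Callable, List, Tuple

# An action is either ("predict", word) or ("parse", transition),
# with transition in {"adjoin-left", "adjoin-right", "null"}.
Action = Tuple[str, str]


def log_p_sentence_and_parse(
    actions: List[Action],
    p_word: Callable[[str, list], float],    # stands in for P(w_k/W_{k-1}T_{k-1})
    p_trans: Callable[[str, list], float],   # stands in for P(t_i^k/w_k, W_{k-1}T_{k-1}, t_1^k ... t_{i-1}^k)
) -> float:
    """Sum the log-probabilities of the action sequence that generates (W, T)."""
    history: list = []
    logp = 0.0
    for kind, value in actions:
        if kind == "predict":
            logp += math.log(p_word(value, history))
        else:
            logp += math.log(p_trans(value, history))
        history.append((kind, value))
    return logp


vocab = ["the", "dog", "barked", "</s>"]


def uniform_word(word, history):
    """Toy PREDICTOR model: uniform over a tiny vocabulary."""
    return 1.0 / len(vocab)


def uniform_trans(transition, history):
    """Toy PARSER model: uniform over the three transitions."""
    return 1.0 / 3.0


# An action sequence for a short word prefix (not a complete parse).
derivation = [("predict", "the"), ("parse", "null"),
              ("predict", "dog"), ("parse", "adjoin-right"), ("parse", "null"),
              ("predict", "barked"), ("parse", "null")]
print(log_p_sentence_and_parse(derivation, uniform_word, uniform_trans))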
<Section position="5" start_page="499" end_page="499" type="metho">
<SectionTitle> 4 Preliminary Experiments </SectionTitle>
<Paragraph position="0"> Assuming that the correct partial parse is a function of the word prefix, it makes sense to compare the word-level perplexity (PP) of a standard n-gram LM with that of the P(w_k/W_{k-1}T_{k-1}) model.</Paragraph>
<Paragraph position="1"> We developed and evaluated four LMs:
* 2 bigram LMs P(w_k/W_{k-1}T_{k-1}) = P(w_k/w_{k-1}), referred to as W and w, respectively; w_{k-1} is the previous (word, POS tag) pair;</Paragraph>
<Paragraph position="2"> * 2 headword bigram LMs P(w_k/W_{k-1}T_{k-1}) = P(w_k/h_0), referred to as H and h, respectively; h_0 is the previous exposed (headword, POS/non-term tag) pair; the parses used in this model were those assigned manually in the Penn Treebank (Marcus95) after undergoing headword percolation and binarization.</Paragraph>
<Paragraph position="3"> All four LMs predict a word w_k and they were implemented using the Maximum Entropy Modeling Toolkit (Ristad97). The constraint templates in the {W,H} models were:</Paragraph>
<Paragraph position="5"> and in the {w,h} models they were: 4 <= <*>_<*> <?>; 2 <= <?>_<*> <?>; <*> denotes a don't-care position, <?>_<?> a (word, tag) pair; for example, 4 <= <?>_<*> <?> will trigger on all ((word, any tag), predicted-word) pairs that occur more than 3 times in the training data. The sentence boundary is not included in the PP calculation. Table 1 shows the PP results along with</Paragraph>
</Section>
</Paper>
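As a rough illustration of the count-threshold constraint templates quoted above, the sketch below instantiates a <?>_<*> <?>-style template: it keeps every ((history word, any tag), predicted word) pair that occurs at least a given number of times in the training events. The function name, the Event layout, and the toy data are assumptions made for the example and are not the interface of the Maximum Entropy Modeling Toolkit.

from collections import Counter
from typing import List, Set, Tuple

# A training event: (previous (word, tag) pair, predicted word).
Event = Tuple[Tuple[str, str], str]


def word_any_tag_features(events: List[Event], threshold: int) -> Set[Tuple[str, str]]:
    """Instantiate a <?>_<*> <?> template: ignore the tag of the history word and
    keep every (history word, predicted word) pair seen at least `threshold` times."""
    counts = Counter((prev_word, predicted) for (prev_word, _tag), predicted in events)
    return {pair for pair, count in counts.items() if count >= threshold}


# Toy data: ("dog", NN) -> "barked" occurs 4 times, ("cat", NN) -> "sat" twice.
events = [(("dog", "NN"), "barked")] * 4 + [(("cat", "NN"), "sat")] * 2
print(word_any_tag_features(events, threshold=4))   # {('dog', 'barked')}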