File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/96/p96-1025_metho.xml
Size: 23,293 bytes
Last Modified: 2025-10-06 14:14:19
<?xml version="1.0" standalone="yes"?> <Paper uid="P96-1025"> <Title>A New Statistical Parser Based on Bigram Lexical Dependencies</Title> <Section position="4" start_page="0" end_page="188" type="metho"> <SectionTitle> 2 The Statistical Model </SectionTitle> <Paragraph position="0"> The aim of a parser is to take a tagged sentence as input (for example Figure l(a)) and produce a phrase-structure tree as output (Figure l(b)). A statistical approach to this problem consists of two components. First, the statistical model assigns a probability to every candidate parse tree for a sentence. Formally, given a sentence S and a tree T, the model estimates the conditional probability P(T\[S).</Paragraph> <Paragraph position="1"> The most likely parse under the model is then:</Paragraph> <Paragraph position="3"> Second, the parser is a method for finding Tbest.</Paragraph> <Paragraph position="4"> This section describes the statistical model, while section 3 describes the parser.</Paragraph> <Paragraph position="5"> The key to the statistical model is that any tree such as Figure l(b) can be represented as a set of baseNPs 2 and a set of dependencies as in Figure l(c). We call the set of baseNPs B, and the set of dependencies D; Figure l(d) shows B and D for this example. For the purposes of our model,</Paragraph> <Paragraph position="7"> S is the sentence with words tagged for part of speech. That is, S =< (wl,tl), (w2,t2)...(w~,t,) >.</Paragraph> <Paragraph position="8"> For POS tagging we use a maximum-entropy tagger described in (Ratnaparkhi 96). The tagger performs at around 97% accuracy on Wall Street Journal Text, and is trained on the first 40,000 sentences of the Penn Treebank (Marcus et al. 93).</Paragraph> <Paragraph position="9"> Given S and B, the reduced sentence :~ is defined as the subsequence of S which is formed by removing punctuation and reducing all baseNPs to their head-word alone.</Paragraph> <Paragraph position="10"> ~A baseNP or 'minimal' NP is a non-recursive NP, i.e. none of its child constituents are NPs. The term was first used in (l:tamshaw and Marcus 95).</Paragraph> <Paragraph position="11"> parse-tree (the correct one); (c) A dependency representation of (b). Square brackets enclose baseNPs (heads of baseNPs are marked in bold). Arrows show modifier --* head dependencies. Section 2.1 describes how arrows are labeled with non-terminal triples from the parse-tree. Non-head words within baseNPs are excluded from the dependency structure; (d) B, the set of baseNPs, and D, the set of dependencies, are extracted from (c).</Paragraph> <Paragraph position="12"> Thus the reduced sentence is an array of word/tag pairs, S=< (t~l,tl),(@2,f2)...(@r~,f,~)>, where m _~ n. For example for Figure l(a) Example 1 S = < (Smith, ggP), (president, NN), (of, IN), (IBM, NNP), (announced, VBD), (resignation, N N), (yesterday, N g) > Sections 2.1 to 2.4 describe the dependency model.</Paragraph> <Paragraph position="13"> Section 2.5 then describes the baseNP model, which uses bigram tagging techniques similar to (Ramshaw and Marcus 95; Church 88).</Paragraph> <Section position="1" start_page="184" end_page="185" type="sub_section"> <SectionTitle> 2.1 The Mapping from Trees to Sets of Dependencies </SectionTitle> <Paragraph position="0"> The dependency model is limited to relationships between words in reduced sentences such as Example 1. The mapping from trees to dependency structures is central to the dependency model. It is defined in two steps: 1. 
<Section position="1" start_page="184" end_page="185" type="sub_section"> <SectionTitle> 2.1 The Mapping from Trees to Sets of Dependencies </SectionTitle> <Paragraph position="0"> The dependency model is limited to relationships between words in reduced sentences such as Example 1. The mapping from trees to dependency structures is central to the dependency model. It is defined in two steps:

1. For each constituent P -> <C1 ... Cn> in the parse tree a simple set of rules3 identifies which of the children Ci is the 'head-child' of P. For example, NN would be identified as the head-child of NP -> <DET JJ JJ NN>, and VP would be identified as the head-child of S -> <NP VP>. Head-words propagate up through the tree, each parent receiving its head-word from its head-child. For example, in S -> <NP VP>, S gets its head-word, announced, from its head-child, the VP.

2. A set of dependencies is then extracted from the tree in Figure 2. Figure 3 illustrates how each constituent contributes a set of dependency relationships. VBD is identified as the head-child of VP -> <VBD NP NP>. The head-words of the two NPs, resignation and yesterday, both modify the head-word of the VBD, announced. Dependencies are labeled by the modifier non-terminal, NP in both of these cases, the parent non-terminal, VP, and finally the head-child non-terminal, VBD. The triple of non-terminals at the start, middle and end of the arrow specifies the nature of the dependency relationship: <NP,S,VP> represents a subject-verb dependency, <PP,NP,NP> denotes prepositional phrase modification of an NP, and so on4.</Paragraph> <Paragraph position="1"> 3 The rules are essentially the same as in (Magerman 95; Jelinek et al. 94). These rules are also used to find the head-word of baseNPs, enabling the mapping from S and B to S̄.</Paragraph> <Paragraph position="2"> [Figure 3: each constituent with n children (in this case n = 3) contributes n - 1 dependencies.]</Paragraph> <Paragraph position="3"> Each word in the reduced sentence, with the exception of the sentential head 'announced', modifies exactly one other word. We use the notation

$AF(j) = (h_j, R_j)$

to state that the jth word in the reduced sentence is a modifier to the h_jth word, with relationship R_j5. AF stands for 'arrow from'. R_j is the triple of labels at the start, middle and end of the arrow. For example, w̄1 = Smith in this sentence and w̄5 = announced, so AF(1) = (5, <NP,S,VP>).</Paragraph> <Paragraph position="4"> 4 The triple can also be viewed as representing a semantic predicate-argument relationship, with the three elements being the type of the argument, result and functor respectively. This is particularly apparent in Categorial Grammar formalisms (Wood 93), which make an explicit link between dependencies and functional application.</Paragraph> <Paragraph position="5"> 5 For the sentential head, h_j = 0 and R_j is the label at the root of the tree; in this case, AF(5) = (0, <S>).</Paragraph> <Paragraph position="6"> D is now defined as the m-tuple of dependencies: D = (AF(1), AF(2) ... AF(m)). The model assumes that the dependencies are independent, so that:

$P(D \mid S, B) = \prod_{j=1}^{m} P(AF(j) \mid S, B)$

</Paragraph> </Section>
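The two steps of the mapping can be sketched in code. The head rules below are a tiny illustrative subset chosen for this example; the paper uses the rules of (Magerman 95; Jelinek et al. 94), which are not reproduced here, and the tree representation is an assumption.

```python
# Sketch of the tree-to-dependencies mapping (steps 1 and 2 above).
# A tree node is (label, children) for non-terminals, (tag, word) for leaves.
HEAD_RULES = {"S": ["VP"], "VP": ["VBD", "VB"], "NP": ["NN", "NNP"], "PP": ["IN"]}

def head_child(label, children):
    """Pick the head-child by the first matching rule, defaulting to the last child."""
    for target in HEAD_RULES.get(label, []):
        for i, (child_label, _) in enumerate(children):
            if child_label == target:
                return i
    return len(children) - 1

def dependencies(node, deps):
    """Return the head word of node, appending (modifier, head, triple) to deps."""
    label, children = node
    if isinstance(children, str):        # leaf: (tag, word)
        return children
    h = head_child(label, children)
    head_word = dependencies(children[h], deps)
    for i, child in enumerate(children):
        if i == h:
            continue
        mod_word = dependencies(child, deps)
        # Triple: <modifier non-terminal, parent non-terminal, head-child non-terminal>
        deps.append((mod_word, head_word, (child[0], label, children[h][0])))
    return head_word

# VP -> <VBD NP NP> from Figure 3 (baseNPs already reduced to their heads):
vp = ("VP", [("VBD", "announced"), ("NP", "resignation"), ("NP", "yesterday")])
d = []
dependencies(vp, d)
print(d)  # [('resignation', 'announced', ('NP','VP','VBD')), ('yesterday', ...)]
```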
<Section position="2" start_page="185" end_page="186" type="sub_section"> <SectionTitle> 2.2 Calculating Dependency Probabilities </SectionTitle> <Paragraph position="0"> This section describes the way P(AF(j)|S,B) is estimated. The same sentence is very unlikely to appear both in training and test data, so we need to back off from the entire sentence context. We believe that lexical information is crucial to attachment decisions, so it is natural to condition on the words and tags. Let 𝒱 be the vocabulary of all words seen in training data, 𝒯 be the set of all part-of-speech tags, and TRAIN be the training set, a set of reduced sentences. We define the following functions:

* C(<a,b>, <c,d>), for a, c ∈ 𝒱 and b, d ∈ 𝒯, is the number of times <a,b> and <c,d> are seen in the same reduced sentence in training data.6 Formally,

$C(\langle a,b \rangle, \langle c,d \rangle) = \sum_{\bar{S} \in TRAIN} \; \sum_{k,l = 1..|\bar{S}|,\; l \neq k} h(\bar{S}[k] = \langle a,b \rangle,\; \bar{S}[l] = \langle c,d \rangle)$

where h(x) is an indicator function which is 1 if x is true, 0 otherwise.

* C(R, <a,b>, <c,d>) is the number of times <a,b> and <c,d> are seen in the same reduced sentence in training data, and <a,b> modifies <c,d> with relationship R. Formally,

$C(R, \langle a,b \rangle, \langle c,d \rangle) = \sum_{\bar{S} \in TRAIN} \; \sum_{k,l = 1..|\bar{S}|,\; l \neq k} h(\bar{S}[k] = \langle a,b \rangle,\; \bar{S}[l] = \langle c,d \rangle,\; AF(k) = (l, R)) \quad (6)$

* F(R | <a,b>, <c,d>) is the probability that <a,b> modifies <c,d> with relationship R, given that <a,b> and <c,d> appear in the same reduced sentence. The maximum-likelihood estimate of F(R | <a,b>, <c,d>) is:

$\hat{F}(R \mid \langle a,b \rangle, \langle c,d \rangle) = \frac{C(R, \langle a,b \rangle, \langle c,d \rangle)}{C(\langle a,b \rangle, \langle c,d \rangle)} \quad (7)$

We can now make the following approximation:

$P(AF(j) \mid S, B) \approx \frac{\hat{F}(R_j \mid \langle \bar{w}_j, \bar{t}_j \rangle, \langle \bar{w}_{h_j}, \bar{t}_{h_j} \rangle)}{\sum_{k=1..m,\; k \neq j} \sum_{p \in \mathcal{P}} \hat{F}(p \mid \langle \bar{w}_j, \bar{t}_j \rangle, \langle \bar{w}_k, \bar{t}_k \rangle)} \quad (9)$

where 𝒫 is the set of all triples of non-terminals. The denominator is a normalising factor which ensures that the probabilities of all modification choices for the jth word sum to one.</Paragraph> <Paragraph position="1"> 6 Note that we count multiple co-occurrences in a single sentence, e.g. if S̄ = (<a,b>, <c,d>, <c,d>) then C(<a,b>, <c,d>) = C(<c,d>, <a,b>) = 2.</Paragraph> <Paragraph position="2"> The denominator of (9) is constant, so maximising P(D|S,B) over D for fixed S, B is equivalent to maximising the product of the numerators, 𝒩(D|S,B). (This considerably simplifies the parsing process):

$\mathcal{N}(D \mid S, B) = \prod_{j=1}^{m} \hat{F}(R_j \mid \langle \bar{w}_j, \bar{t}_j \rangle, \langle \bar{w}_{h_j}, \bar{t}_{h_j} \rangle) \quad (10)$

</Paragraph> </Section>
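A minimal sketch of the counts C and the maximum-likelihood estimate of equations (6) and (7) follows. The data structures are assumptions made for illustration: a training example is a reduced sentence (list of (word, tag) pairs) plus its dependency list.

```python
# Co-occurrence counts C and the MLE F-hat over reduced sentences.
from collections import defaultdict
from itertools import permutations

cooc = defaultdict(int)      # C(<a,b>, <c,d>)
rel = defaultdict(int)       # C(R, <a,b>, <c,d>)

def train(reduced_sentence, deps):
    """deps is a list of (j, h_j, R_j) with R_j a non-terminal triple."""
    # Count every ordered pair of tokens in the same reduced sentence;
    # a pair occurring twice is counted twice (see footnote 6).
    for (k, x), (l, y) in permutations(enumerate(reduced_sentence), 2):
        cooc[(x, y)] += 1
    for j, h, R in deps:
        rel[(R, reduced_sentence[j], reduced_sentence[h])] += 1

def f_hat(R, modifier, head):
    """MLE of F(R | modifier, head); None when the pair is unseen."""
    denom = cooc[(modifier, head)]
    return rel[(R, modifier, head)] / denom if denom else None

# Tiny worked example on Example 1's reduced sentence:
s_bar = [("Smith","NNP"), ("president","NN"), ("of","IN"), ("IBM","NNP"),
         ("announced","VBD"), ("resignation","NN"), ("yesterday","NN")]
deps = [(0, 4, ("NP","S","VP"))]          # Smith -> announced, <NP,S,VP>
train(s_bar, deps)
print(f_hat(("NP","S","VP"), ("Smith","NNP"), ("announced","VBD")))  # 1.0
```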
<Section position="3" start_page="186" end_page="187" type="sub_section"> <SectionTitle> 2.3 The Distance Measure </SectionTitle> <Paragraph position="0"> An estimate based on the identities of the two tokens alone is problematic. Additional context, in particular the relative order of the two words and the distance between them, will also strongly influence the likelihood of one word modifying the other. For example consider the relationship between 'sales' and the three tokens of 'of':

Example 2  Shaw, based in Dalton, Ga., has annual sales of about $1.18 billion, and has economies of scale and lower raw-material costs that are expected to boost the profitability of Armstrong's brands, sold under the Armstrong and Evans-Black names.

In this sentence 'sales' and 'of' co-occur three times. The parse tree in training data indicates a relationship in only one of these cases, so this sentence would contribute an estimate of 1/3 that the two words are related. This seems unreasonably low given that 'sales of' is a strong collocation: the latter two instances of 'of' are so distant from 'sales' that it is unlikely that there will be a dependency.</Paragraph> <Paragraph position="1"> This suggests that distance is a crucial variable when deciding whether two words are related. It is included in the model by defining an extra 'distance' variable, Δ, and extending C, F and F̂ to include this variable. For example, C(<a,b>, <c,d>, Δ) is the number of times <a,b> and <c,d> appear in the same sentence at a distance Δ apart. (11) is then maximised instead of (10):

$\mathcal{N}(D \mid S, B) = \prod_{j=1}^{m} \hat{F}(R_j \mid \langle \bar{w}_j, \bar{t}_j \rangle, \langle \bar{w}_{h_j}, \bar{t}_{h_j} \rangle, \Delta_{j,h_j}) \quad (11)$

A simple example of Δ_{j,h_j} would be Δ_{j,h_j} = h_j - j. However, other features of a sentence, such as punctuation, are also useful when deciding if two words are related. We have developed a heuristic 'distance' measure which takes several such features into account. The current distance measure Δ_{j,h_j} is the combination of 6 features, or questions (we motivate the choice of these questions qualitatively; section 4 gives quantitative results showing their merit):</Paragraph> <Paragraph position="2"> Question 1  Does the h_jth word precede or follow the jth word? English is a language with strong word order, so the order of the two words in surface text will clearly affect their dependency statistics.</Paragraph> <Paragraph position="3"> Question 2  Are the h_jth word and the jth word adjacent? English is largely right-branching and head-initial, which leads to a large proportion of dependencies being between adjacent words7. Table 1 shows just how local most dependencies are.</Paragraph> <Paragraph position="4"> [Table 1: distribution of dependencies by the distance between the head words involved. These figures count baseNPs as a single word, and are taken from WSJ training data. Table 2: distribution of dependencies by the number of verbs between the head words involved.]</Paragraph> <Paragraph position="5"> Question 3  Is there a verb between the h_jth word and the jth word? Conditioning on the exact distance between two words by making Δ_{j,h_j} = h_j - j leads to severe sparse data problems. But Table 1 shows the need to make finer distance distinctions than just whether two words are adjacent. Consider the prepositions 'to', 'in' and 'of' in the following sentence:

Example 3  Oil stocks escaped the brunt of Friday's selling and several were able to post gains, including Chevron, which rose 5/8 to 66 3/8 in Big Board composite trading of 2.4 million shares.

The prepositions' main candidates for attachment would appear to be the previous verb, 'rose', and the baseNP heads between each preposition and this verb. They are less likely to modify a more distant verb such as 'escaped'. Question 3 allows the parser to prefer modification of the most recent verb, effectively another, weaker preference for right-branching structures. Table 2 shows that 94% of dependencies do not cross a verb, giving empirical evidence that question 3 is useful.</Paragraph> <Paragraph position="6"> 7 For example in '(John (likes (to (go (to (University (of Pennsylvania)))))))' all dependencies are between adjacent words.</Paragraph> <Paragraph position="7"> Questions 4, 5 and 6
* Are there 0, 1, 2, or more than 2 'commas' between the h_jth word and the jth word? (All symbols tagged as a ',' or ':' are considered to be 'commas'.)
* Is there a 'comma' immediately following the first of the h_jth word and the jth word?
* Is there a 'comma' immediately preceding the second of the h_jth word and the jth word?

People find that punctuation is extremely useful for identifying phrase structure, and the parser described here also relies on it heavily. Commas are not considered to be words or modifiers in the dependency model, but they do give strong indications about the parse structure. Questions 4, 5 and 6 allow the parser to use this information; the sketch following this section illustrates how the six questions combine into a single distance value.</Paragraph> </Section>
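The sketch below combines the six questions into one feature tuple. The exact encoding and bucketing are assumptions made for illustration (the paper does not spell them out), and positions here index a tag sequence that still contains punctuation, since commas are removed from the reduced sentence itself.

```python
# Sketch of the heuristic distance measure Delta_{j,h}: the six questions
# above combined into a single hashable feature tuple.
COMMA_TAGS = {",", ":"}   # all symbols tagged ',' or ':' count as 'commas'
VERB_TAGS = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}

def distance_measure(j, h, tags):
    """Return the Delta_{j,h} tuple for word positions j and h."""
    lo, hi = min(j, h), max(j, h)
    between = tags[lo + 1:hi]
    commas = sum(t in COMMA_TAGS for t in between)
    return (
        h > j,                                   # Q1: head follows modifier?
        abs(h - j) == 1,                         # Q2: adjacent?
        any(t in VERB_TAGS for t in between),    # Q3: verb in between?
        min(commas, 3),                          # Q4: 0, 1, 2, or >2 commas
        lo + 1 < len(tags) and tags[lo + 1] in COMMA_TAGS,  # Q5: comma after first word
        tags[hi - 1] in COMMA_TAGS,              # Q6: comma before second word
    )

# Example: two words at positions 0 and 4 with a comma and a verb between them.
tags = ["NN", ",", "NN", "VBD", "NN"]
print(distance_measure(0, 4, tags))  # (True, False, True, 1, True, False)
```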
<Section position="4" start_page="187" end_page="187" type="sub_section"> <SectionTitle> 2.4 Sparse Data </SectionTitle> <Paragraph position="0"> The maximum likelihood estimator in (7) is likely to be plagued by sparse data problems: C(<w̄_j,t̄_j>, <w̄_{h_j},t̄_{h_j}>, Δ_{j,h_j}) may be too low to give a reliable estimate, or worse still it may be zero, leaving the estimate undefined. (Collins 95) describes how a backed-off estimation strategy is used for making prepositional phrase attachment decisions. The idea is to back off to estimates based on less context. In this case, less context means looking at the POS tags rather than the specific words.</Paragraph> <Paragraph position="1"> There are four estimates, E1, E2, E3 and E4, based respectively on: 1) both words and both tags; 2) w̄_j and the two POS tags; 3) w̄_{h_j} and the two POS tags; 4) the two POS tags alone. Each backed-off estimate requires counts summed over the omitted word, for example

$C(\langle \bar{w}_j, \bar{t}_j \rangle, \langle \bar{t}_{h_j} \rangle, \Delta) = \sum_{x \in \mathcal{V}} C(\langle \bar{w}_j, \bar{t}_j \rangle, \langle x, \bar{t}_{h_j} \rangle, \Delta) \quad (13)$

where 𝒱 is the set of all words seen in training data; the other definitions of C follow similarly.</Paragraph> <Paragraph position="2"> Estimates 2 and 3 compete: for a given pair of words in test data both estimates may exist and they are equally 'specific' to the test case example. (Collins 95) suggests the following way of combining them, which favours the estimate appearing more often in training data:

$E_{23} = \frac{c_2 E_2 + c_3 E_3}{c_2 + c_3} \quad (14)$

where c_2 and c_3 are the denominator counts of the two estimates. This gives three estimates: E1, E23 and E4, a similar situation to trigram language modeling for speech recognition (Jelinek 90), where there are trigram, bigram and unigram estimates. (Jelinek 90) describes a deleted interpolation method which combines these estimates to give a 'smooth' estimate, and the model uses a variation of this idea. If E1 exists, i.e. c1 > 0,

$E = \lambda_1 E_1 + (1 - \lambda_1) E_{23} \quad (15)$

Otherwise, if E23 exists, i.e. c23 > 0,

$E = \lambda_2 E_{23} + (1 - \lambda_2) E_4 \quad (16)$

and otherwise E = E4.</Paragraph> <Paragraph position="3"> (Jelinek 90) describes how to find λ values in (15) and (16) which maximise the likelihood of held-out data. We have taken a simpler approach, setting each λ directly as a simple function of the denominator count of the more 'specific' estimate (equation (17)). These λ values have the desired property of increasing as the denominator of the more 'specific' estimator increases. We think that a proper implementation of deleted interpolation is likely to improve results, although basing estimates on co-occurrence counts alone has the advantage of reduced training times.</Paragraph> </Section>
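The backed-off scheme can be sketched as follows. The lambda weights use the form count/(count + 1), which is only an assumption consistent with the stated property that λ grows with the denominator of the more specific estimate; the paper's exact equation (17) is not reproduced here.

```python
# Sketch of the backed-off estimation of section 2.4.
def smoothed_estimate(counts):
    """counts maps level -> (numerator, denominator) for E1, E2, E3, E4."""
    def mle(level):
        num, den = counts[level]
        return (num / den, den) if den > 0 else (None, 0)

    e1, c1 = mle("E1")
    e2, c2 = mle("E2")
    e3, c3 = mle("E3")
    e4, _ = mle("E4")

    # Combine the two competing estimates, weighting by training-data counts.
    # Counts nest, so whenever E1 exists, E23 exists as well.
    c23 = c2 + c3
    e23 = ((c2 * e2 if e2 is not None else 0.0) +
           (c3 * e3 if e3 is not None else 0.0)) / c23 if c23 else None

    if e1 is not None:                     # most specific estimate exists
        lam1 = c1 / (c1 + 1)               # assumed form of equation (17)
        return lam1 * e1 + (1 - lam1) * e23
    if e23 is not None:                    # back off to the combined estimate
        lam2 = c23 / (c23 + 1)
        return lam2 * e23 + (1 - lam2) * e4
    return e4                              # tags-only estimate always defined

print(smoothed_estimate({"E1": (1, 2), "E2": (3, 10), "E3": (0, 0), "E4": (50, 1000)}))
```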
<Section position="5" start_page="187" end_page="188" type="sub_section"> <SectionTitle> 2.5 The BaseNP Model </SectionTitle> <Paragraph position="0"> The overall model would be simpler if we could do without the baseNP model and frame everything in terms of dependencies. However the baseNP model is needed for two reasons. First, while adjacency between words is a good indicator of whether there is some relationship between them, this indicator is made substantially stronger if baseNPs are reduced to a single word. Second, it means that words internal to baseNPs are not included in the co-occurrence counts in training data. Otherwise, in a phrase like 'The Securities and Exchange Commission closed yesterday', pre-modifying nouns like 'Securities' and 'Exchange' would be included in co-occurrence counts, when in practice there is no way that they can modify words outside their baseNP.</Paragraph> <Paragraph position="1"> The baseNP model can be viewed as tagging the gaps between words with S(tart), C(ontinue), E(nd), B(etween) or N(ull) symbols, respectively meaning that the gap is at the start of a baseNP, continues a baseNP, is at the end of a baseNP, is between two adjacent baseNPs, or is between two words which are both not in baseNPs. We call the gap before the ith word G_i (a sentence with n words has n - 1 gaps). For example,

[John Smith] [the president] of [IBM] has announced [his resignation] [yesterday]
=> John C Smith B the C president E of S IBM E has N announced S his C resignation B yesterday

</Paragraph> <Paragraph position="2"> The baseNP model considers the words directly to the left and right of each gap, and whether there is a comma between the two words (we write c_i = 1 if there is a comma, c_i = 0 otherwise). Probability estimates are based on counts of consecutive pairs of words in unreduced training data sentences, where baseNP boundaries define whether gaps fall into the S, C, E, B or N categories. The probability of a baseNP sequence in an unreduced sentence S is then:

$P(B \mid S) = \prod_{i=2}^{n} \hat{P}(G_i \mid w_{i-1}, w_i, c_i) \quad (19)$

The estimation method is analogous to that described in the sparse data section of this paper. The method is similar to that described in (Ramshaw and Marcus 95; Church 88), where baseNP detection is also framed as a tagging problem.</Paragraph> </Section>
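A minimal sketch of the baseNP model as gap tagging follows. The add-one smoothing here is a placeholder for simplicity, not the paper's backed-off scheme, and the function names are illustrative.

```python
# Gap-tagging sketch for the baseNP model: the probability of a candidate
# segmentation is the product over gaps of P(G_i | left word, right word,
# comma flag), as in equation (19).
from collections import defaultdict

GAP_TAGS = ["S", "C", "E", "B", "N"]
gap_counts = defaultdict(int)      # (left, right, comma, tag) -> count
context_counts = defaultdict(int)  # (left, right, comma) -> count

def observe(left, right, comma, tag):
    gap_counts[(left, right, comma, tag)] += 1
    context_counts[(left, right, comma)] += 1

def p_gap(tag, left, right, comma):
    """Add-one smoothed estimate of P(G_i = tag | w_{i-1}, w_i, c_i)."""
    num = gap_counts[(left, right, comma, tag)] + 1
    den = context_counts[(left, right, comma)] + len(GAP_TAGS)
    return num / den

def p_basenp_sequence(words, commas, gap_tags):
    """P(B|S): product over the n-1 gaps of the gap-tag probabilities."""
    p = 1.0
    for i in range(1, len(words)):
        p *= p_gap(gap_tags[i - 1], words[i - 1], words[i], commas[i - 1])
    return p

observe("John", "Smith", 0, "C")   # from '[John Smith] ...' in training data
words = ["John", "Smith", "announced"]
commas = [0, 0]                    # no comma in either gap
print(p_basenp_sequence(words, commas, ["C", "E"]))
```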
<Section position="6" start_page="188" end_page="188" type="sub_section"> <SectionTitle> 2.6 Summary of the Model </SectionTitle> <Paragraph position="0"> The probability of a parse tree T, given a sentence S, is:

$P(T \mid S) = P(B, D \mid S) = P(B \mid S) \times P(D \mid S, B)$

The denominator in Equation (9) is not actually constant for different baseNP sequences, but we make this approximation for the sake of efficiency and simplicity. In practice this is a good approximation because most baseNP boundaries are very well defined, so parses which have high enough P(B|S) to be among the highest scoring parses for a sentence tend to have identical or very similar baseNPs. Parses are ranked by the following quantity9:

$P(B \mid S) \times \mathcal{N}(D \mid S, B) \quad (20)$

Equations (19) and (11) define P(B|S) and 𝒩(D|S,B). The parser finds the tree which maximises (20) subject to the hard constraint that dependencies cannot cross.</Paragraph> <Paragraph position="1"> 9 In fact we also model the set of unary productions, U, in the tree, which are of the form P -> <C1>. This introduces an additional term, P(U|B,S), into (20).</Paragraph> </Section> <Section position="7" start_page="188" end_page="188" type="sub_section"> <SectionTitle> 2.7 Some Further Improvements to the Model </SectionTitle> <Paragraph position="0"> This section describes two modifications which improve the model's performance.

* In addition to conditioning on whether dependencies cross commas, a single constraint concerning punctuation is introduced. If for any constituent Z in the chart Z -> <.. X Y ..> two of its children X and Y are separated by a comma, then the last word in Y must be directly followed by a comma, or must be the last word in the sentence. In training data 96% of commas follow this rule. The rule also has the benefit of improving efficiency by reducing the number of constituents in the chart.

* The model we have described thus far takes the single best sequence of tags from the tagger, and it is clear that there is potential for better integration of the tagger and parser. We have tried two modifications. First, the current estimation methods treat occurrences of the same word with different POS tags as effectively distinct types. Tags can be ignored when lexical information is available by defining

$C(a, c) = \sum_{b \in \mathcal{T}} \sum_{d \in \mathcal{T}} C(\langle a, b \rangle, \langle c, d \rangle) \quad (21)$

where 𝒯 is the set of all tags. Hence C(a,c) is the number of times that the words a and c occur in the same sentence, ignoring their tags. The other definitions in (13) are similarly redefined, with POS tags only being used when backing off from lexical information. This makes the parser less sensitive to tagging errors.</Paragraph> <Paragraph position="1"> Second, for each word w_i the tagger can provide the distribution of tag probabilities P(t_i|S) (given that the previous two words are tagged as in the best overall sequence of tags) rather than just the first best tag. The score for a parse in equation (20) then has an additional term, $\prod_{i=1}^{n} P(t_i \mid S)$, the product of the probabilities of the tags which it contains.</Paragraph> <Paragraph position="2"> Ideally we would like to integrate POS tagging into the parsing model rather than treating it as a separate stage. This is an area for future research.</Paragraph> </Section> </Section> <Section position="5" start_page="188" end_page="189" type="metho"> <SectionTitle> 3 The Parsing Algorithm </SectionTitle> <Paragraph position="0"> The parsing algorithm is a simple bottom-up chart parser. There is no grammar as such, although in practice any dependency with a triple of non-terminals which has not been seen in training data will get zero probability. Thus the parser searches through the space of all trees with non-terminal triples seen in training data. Probabilities of baseNPs in the chart are calculated using (19), while probabilities for other constituents are derived from the dependencies and baseNPs that they contain. A dynamic programming algorithm is used: if two proposed constituents span the same set of words, and have the same label, head, and distance from the head to the left and right end of the constituent, then the lower probability constituent can be safely discarded. Figure 4 shows how constituents in the chart combine in a bottom-up manner.</Paragraph> <Paragraph position="1"> [Figure 4: two constituents join to form a new constituent. Each operation gives two new probability terms: one for the baseNP gap tag between the two constituents, and the other for the dependency between the head words of the two constituents.]</Paragraph> <Paragraph position="2"> [Results table: (2) is model (1) with the punctuation rule described in section 2.7; (3) is model (2) with POS tags ignored when lexical information is present; (4) is model (3) with probability distributions from the POS tagger. LR/LP = labeled recall/precision. CBs is the average number of crossing brackets per sentence. 0 CBs, <= 2 CBs are the percentage of sentences with 0 or <= 2 crossing brackets respectively.]</Paragraph> </Section> </Paper>