File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/88/a88-1019_metho.xml
Size: 16,071 bytes
Last Modified: 2025-10-06 14:12:07
<?xml version="1.0" standalone="yes"?> <Paper uid="A88-1019"> <Title>Probability of Ending a Noun Phrase AT NN NNS VB IN AT NN NNS VB</Title> <Section position="2" start_page="137" end_page="138" type="metho"> <SectionTitle> 2. Lexical Disambiguation Rules </SectionTitle> <Paragraph position="0"> However, the proposed stochastic method is considerably simpler than what Marcus had in mind. His thesis parser used considerably more syntax than the proposed stochastic method.</Paragraph> <Paragraph position="1"> Consider the following pair described in [Marcus]: * Have/VB [the students who missed the exam] TAKE the exam today. (imperative) * Have/AUX [the students who missed the exam] TAKEN the exam today? (question) where it appears that the parser needs to look past an arbitrarily long noun phrase in order to correctly analyze &quot;have,&quot; which could be either a tenseless main verb (imperative) or a tensed auxiliary verb (question). Marcus' rather unusual example can no longer be handled by Fidditch, a more recent Marcus-style parser with very large coverage. In order to obtain such large coverage, Fidditch has had to take a more robust/modest view of lexical disambiguation.</Paragraph> <Paragraph position="2"> Whereas Marcus' Parsifal program distinguished patterns such as &quot;have NP tenseless&quot; and &quot;have NP past-participle,&quot; most of Fidditch's diagnostic rules are less ambitious: they look only for the start of a noun phrase and do not attempt to look past an arbitrarily long noun phrase. For example, Fidditch has the following lexical disambiguation rule: * (defrule n+prep! &quot; > [**n+prep] != n [npstarters]&quot;) which says that a preposition is more likely than a noun before a noun phrase. More precisely, the rule says that if a noun/preposition ambiguous word (e.g., &quot;out&quot;) is followed by something that starts a noun phrase (e.g., a determiner), then rule out the noun possibility. This type of lexical diagnostic rule can be captured with bigram and trigram statistics; it turns out that the sequence ...preposition determiner... is much more common in the Brown Corpus (43924 observations) than the sequence ...noun determiner... (1135 observations). Most lexical disambiguation rules in Fidditch can be reformulated in terms of bigram and trigram statistics in this way; a small sketch of this reformulation appears below.</Paragraph> <Paragraph position="3"> Moreover, it is worth doing so, because bigram and trigram statistics are much easier to obtain than Fidditch-type disambiguation rules, which are extremely tedious to program, test and debug.</Paragraph> <Paragraph position="4"> In addition, the proposed stochastic approach can naturally take advantage of lexical probabilities in a way that is not easy to capture with parsers that do not make use of frequency information.</Paragraph> <Paragraph position="5"> Consider, for example, the word &quot;see,&quot; which is almost always a verb, but does have an archaic nominal usage as in &quot;the Holy See.&quot; For practical purposes, &quot;see&quot; should not be considered noun/verb ambiguous in the same sense as truly ambiguous words like &quot;program,&quot; &quot;house&quot; and &quot;wind&quot;; the nominal usage of &quot;see&quot; is possible, but not likely.</Paragraph> <Paragraph position="6"> If every possibility in the dictionary must be given equal weight, parsing is very difficult.</Paragraph> <Paragraph position="7"> Dictionaries tend to focus on what is possible, not on what is likely.</Paragraph>
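The reformulation of Fidditch's n+prep! diagnostic as a bigram comparison (described above) can be made concrete in a few lines of Python. This is a minimal sketch: the two Brown Corpus bigram counts are the ones quoted above, while the table layout and the function name are illustrative assumptions, not part of Fidditch or of the proposed tagger.

# Minimal sketch: recasting the n+prep! diagnostic as a bigram comparison.
# The two counts are the Brown Corpus figures quoted in the text; the data
# structure and function name are illustrative assumptions.
bigram_freq = {
    ("IN", "AT"): 43924,   # ...preposition determiner...
    ("NN", "AT"): 1135,    # ...noun determiner...
}

def prefer_tag(candidates, next_tag, freq=bigram_freq):
    # Return the candidate tag most often followed by next_tag in the corpus.
    return max(candidates, key=lambda tag: freq.get((tag, next_tag), 0))

# A noun/preposition ambiguous word such as "out" before a determiner:
# the statistics favor the preposition reading, mirroring the rule that a
# preposition is more likely than a noun before a noun phrase.
print(prefer_tag(["NN", "IN"], "AT"))   # -> IN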
<Paragraph position="8"> Consider the trivial sentence, &quot;I see a bird.&quot; For all practical purposes, every word in the sentence is unambiguous. According to [Francis and Kucera], the word &quot;I&quot; appears as a pronoun (PPSS) in 5837 out of 5838 observations (~100%), &quot;see&quot; appears as a verb in 771 out of 772 observations (~100%), &quot;a&quot; appears as an article in 23013 out of 23019 observations (~100%) and &quot;bird&quot; appears as a noun in 26 out of 26 observations (100%). However, according to Webster's Seventh New Collegiate Dictionary, every word is ambiguous. In addition to the desired assignments of tags, the first three words are listed as nouns and the last as an intransitive verb. One might hope that these spurious assignments could be ruled out by the parser as syntactically ill-formed. Unfortunately, this is unlikely to work. If the parser is going to accept noun phrases of the form: * [NP [N city] [N school] [N committee] [N meeting]] then it can't rule out</Paragraph> <Paragraph position="9"> Similarly, the parser probably also has to accept &quot;bird&quot; as an intransitive verb, since there is nothing syntactically wrong with: * [S [NP [N I] [N see] [N a]] [VP [V bird]]] These part of speech assignments aren't wrong; they are just extremely improbable.</Paragraph> </Section> <Section position="3" start_page="138" end_page="138" type="metho"> <SectionTitle> 3. The Proposed Method </SectionTitle> <Paragraph position="0"> Consider once again the sentence, &quot;I see a bird.&quot; The problem is to find an assignment of parts of speech to words that optimizes both lexical and contextual probabilities, both of which are estimated from the Tagged Brown Corpus. The lexical probabilities are estimated from the following frequencies:</Paragraph> <Paragraph position="2"> The lexical probabilities are estimated in the obvious way. For example, the probability that &quot;I&quot; is a pronoun, Prob(PPSS|&quot;I&quot;), is estimated as freq(PPSS, &quot;I&quot;)/freq(&quot;I&quot;) or 5837/5838.</Paragraph> <Paragraph position="3"> The probability that &quot;see&quot; is a verb is estimated to be 771/772. The other lexical probability estimates follow the same pattern.</Paragraph> <Paragraph position="4"> The contextual probability, the probability of observing part of speech X given the following two parts of speech Y and Z, is estimated by dividing the trigram frequency XYZ by the bigram frequency YZ. Thus, for example, the probability of observing a verb before an article and a noun is estimated to be the ratio of freq(VB, AT, NN) over freq(AT, NN), or 3412/53091 = 0.064. The probability of observing a noun in the same context is estimated as the ratio of freq(NN, AT, NN) over 53091, or 629/53091 = 0.01. The other contextual probability estimates follow the same pattern.</Paragraph> <Paragraph position="5"> A search is performed in order to find the assignment of part of speech tags to words that optimizes the product of the lexical and contextual probabilities. Conceptually, the search enumerates all possible assignments of parts of speech to input words. In this case, there are four input words, three of which are two ways ambiguous, producing a set of 2*2*2*1 = 8 possible assignments of parts of speech to input words: I see a bird</Paragraph> <Paragraph position="7"> Each of the eight sequences is then scored by the product of the lexical probabilities and the contextual probabilities, and the best sequence is selected. In this case, the first sequence is by far the best.</Paragraph>
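To make the enumerate-and-score step concrete, here is a minimal Python sketch under the following assumptions: the lexical counts and the two trigram estimates are the ones quoted above; the alternative readings for &quot;I&quot; (proper noun NP), &quot;see&quot; (the nominal reading) and &quot;a&quot; (the preposition reading IN) are taken from the surrounding discussion, with the residual corpus counts assigned to them purely for illustration; and the floor used for trigrams not quoted in the text is an arbitrary placeholder.

import itertools
from math import log

LEX = {  # word -> {tag: estimated P(tag | word)}
    "I":    {"PPSS": 5837 / 5838, "NP": 1 / 5838},   # NP alternative, as in the search trace below
    "see":  {"VB": 771 / 772, "NN": 1 / 772},        # nominal reading ("the Holy See")
    "a":    {"AT": 23013 / 23019, "IN": 6 / 23019},  # IN = preposition reading
    "bird": {"NN": 26 / 26},
}

CONTEXT = {  # P(X | Y Z): part of speech X given the two following parts of speech Y, Z
    ("VB", "AT", "NN"): 3412 / 53091,   # roughly 0.064
    ("NN", "AT", "NN"): 629 / 53091,    # roughly 0.01
}
FLOOR = 1e-6  # assumed placeholder for contexts not quoted in the text

def score(tags, words):
    padded = list(tags) + ["", ""]      # blank parts of speech for words out of range
    total = 0.0
    for i, word in enumerate(words):
        total += log(LEX[word][tags[i]])                                           # lexical term
        total += log(CONTEXT.get((tags[i], padded[i + 1], padded[i + 2]), FLOOR))  # contextual term
    return total

words = ["I", "see", "a", "bird"]
best = max(itertools.product(*(LEX[w] for w in words)),
           key=lambda tags: score(tags, words))
print(best)   # -> ('PPSS', 'VB', 'AT', 'NN')

The paper's actual search avoids this full enumeration by abandoning paths that can no longer win, as described next.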
<Paragraph position="8"> In fact, it is not necessary to enumerate all possible assignments because the scoring function cannot see more than two words away.</Paragraph> <Paragraph position="9"> In other words, in the process of enumerating part of speech sequences, it is possible in some cases to know that some sequence cannot possibly compete with another and can therefore be abandoned. Because of this fact, only O(n) paths will be enumerated. Let us illustrate this optimization with an example: find all assignments of parts of speech to &quot;bird&quot; and score the partial sequence. Henceforth, all scores are to be interpreted as log probabilities.</Paragraph> </Section> <Section position="4" start_page="138" end_page="140" type="metho"> <SectionTitle> (-4.848072 &quot;NN&quot;) </SectionTitle> <Paragraph position="0"> Now, find assignments of &quot;I&quot; and score. Note, however, that it is no longer necessary to hypothesize that &quot;a&quot; might be a French preposition IN because all four paths, PPSS VB</Paragraph> <Paragraph position="2"> path and there is no way that any additional input could make any difference. In particular, the path PPSS VB IN NN scores less well than the path PPSS VB AT NN, and additional input will not help PPSS VB IN NN because the contextual scoring function has a limited window of three parts of speech, and that is not enough to see past the existing PPSS and VB.</Paragraph> <Paragraph position="3"> The search continues two more iterations, assuming blank parts of speech for words out of range.</Paragraph> <Paragraph position="4"> (-13.262333 &quot;&quot; &quot;&quot; &quot;PPSS&quot; &quot;VB&quot; &quot;AT&quot; &quot;NN&quot;) (-26.5196 &quot;&quot; &quot;&quot; &quot;NP&quot; &quot;VB&quot; &quot;AT&quot; &quot;NN&quot;) Finally, the result is: PPSS VB AT NN.</Paragraph> <Paragraph position="5"> (-12.262333 &quot;&quot; &quot;&quot; &quot;PPSS&quot; &quot;VB&quot; &quot;AT&quot; &quot;NN&quot;) The final result is: I/PPSS see/VB a/AT bird/NN. A slightly more interesting example is: &quot;Can Similar stochastic methods have been applied to locate simple noun phrases with very high accuracy. The program inserts brackets into a sequence of parts of speech, producing output such as:</Paragraph> <Paragraph position="7"> The proposed method is a stochastic analog of precedence parsing. Recall that precedence parsing makes use of a table that says whether to insert an open or close bracket between any two categories (terminal or nonterminal). The proposed method makes use of a table that gives the probabilities of an open and close bracket between all pairs of parts of speech. A sample is shown below for the five parts of speech: AT (article), NN (singular noun), NNS (non-singular noun), VB (uninflected verb), IN (preposition).</Paragraph> <Paragraph position="8"> The table says, for example, that there is no chance of starting a noun phrase after an article (all five entries on the AT row are 0) and that there is a large probability of starting a noun phrase between a verb and a noun (the entry in</Paragraph> <Paragraph position="10"> These probabilities were estimated from about 40,000 words (11,000 noun phrases) of training material selected from the Brown Corpus. The training material was parsed into noun phrases by laborious semi-automatic means (with considerable help from Eva Ejerhed). It took about a man-week to prepare the training material.</Paragraph>
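The following is a rough Python sketch of how such a table of open/close bracket probabilities could be estimated from the bracketed training material. The paper does not spell out the estimation procedure at this level of detail, so the input format (part of speech tags with literal bracket tokens interleaved), the function name and the tiny example sequence are all illustrative assumptions.

from collections import defaultdict

def estimate_bracket_probs(training):
    # training: token lists mixing tags and brackets, e.g.
    # ["[", "AT", "NN", "]", "VB", "[", "AT", "NN", "]"]
    # Returns {(left_tag, right_tag): (P(open between), P(close between))}.
    pairs = defaultdict(int)
    opens = defaultdict(int)
    closes = defaultdict(int)
    for seq in training:
        prev_tag, between = None, set()
        for tok in seq:
            if tok in ("[", "]"):
                between.add(tok)
            else:
                if prev_tag is not None:
                    pairs[(prev_tag, tok)] += 1
                    opens[(prev_tag, tok)] += "[" in between
                    closes[(prev_tag, tok)] += "]" in between
                prev_tag, between = tok, set()
    return {p: (opens[p] / pairs[p], closes[p] / pairs[p]) for p in pairs}

# Tiny illustrative run (not real training data):
table = estimate_bracket_probs([
    ["[", "AT", "NN", "]", "VB", "[", "AT", "NN", "]"],
])
print(table[("VB", "AT")])   # high probability of opening a noun phrase between VB and AT

Given such a table, each candidate bracketing of an input tag sequence can be scored by multiplying the appropriate open/close probabilities at each position, as described below.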
<Paragraph position="11"> The stochastic parser is given a sequence of parts of speech as input and is asked to insert brackets corresponding to the beginning and end of noun phrases. Conceptually, the parser enumerates all possible parsings of the input and scores each of them by the precedence probabilities. Consider, for example, the input sequence: NN VB. There are 5 possible ways to bracket this sequence. Each of these parsings is scored by multiplying 6 precedence probabilities, the probability of an open/close bracket appearing (or not appearing) in any one of the three positions (before the NN, after the NN or after the VB). The parsing with the highest score is returned as output.</Paragraph> <Paragraph position="12"> A small sample of the output is given in the appendix. The method works remarkably well considering how simple it is. There is some tendency to underestimate the number of brackets and run two noun phrases together as in [NP the time Fairchild]. The proposed method omitted only 5 of 243 noun phrase brackets in the appendix.</Paragraph> </Section> <Section position="5" start_page="140" end_page="141" type="metho"> <SectionTitle> 5. Smoothing Issues </SectionTitle> <Paragraph position="0"> Some of the probabilities are very hard to estimate by direct counting because of Zipf's Law (frequency is roughly proportional to inverse rank). Consider, for example, the lexical probabilities. We need to estimate how often each word appears with each part of speech.</Paragraph> <Paragraph position="1"> Unfortunately, because of Zipf's Law, no matter how much text we look at, there will always be a large tail of words that appear only a few times. In the Brown Corpus, for example, 40,000 words appear five times or less. If a word such as &quot;yawn&quot; appears once as a noun and once as a verb, what is the probability that it can be an adjective? It is impossible to say without more information. Fortunately, conventional dictionaries can help alleviate this problem to some extent. We add one to the frequency count of each possibility listed in the dictionary. For example, &quot;yawn&quot; happens to be listed in our dictionary as noun/verb ambiguous. Thus, we smooth the frequency counts obtained from the Brown Corpus by adding one to both possibilities. In this case, the probabilities remain unchanged: both before and after smoothing, we estimate &quot;yawn&quot; to be a noun 50% of the time, and a verb the rest. There is no chance that &quot;yawn&quot; is an adjective.</Paragraph> <Paragraph position="2"> In some other cases, smoothing makes a big difference. Consider the word &quot;cans.&quot; This word appears 5 times as a plural noun and never as a verb in the Brown Corpus. The lexicon (and its morphological routines), fortunately, gives both possibilities. Thus, the revised estimate is that &quot;cans&quot; appears 6/7 times as a plural noun and 1/7 times as a verb.</Paragraph>
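A minimal Python sketch of this dictionary-based add-one smoothing is given below. The two examples (&quot;yawn&quot; and &quot;cans&quot;) and their counts come from the text; the data structures, the function name and the particular verb tag chosen for &quot;cans&quot; are illustrative assumptions.

def smooth(corpus_counts, dictionary_tags):
    # Add one to the Brown Corpus count of every part of speech the
    # dictionary allows for the word, then renormalize.
    counts = dict(corpus_counts)
    for tag in dictionary_tags:
        counts[tag] = counts.get(tag, 0) + 1
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()}

# "yawn": one noun and one verb observation; the dictionary lists noun/verb.
print(smooth({"NN": 1, "VB": 1}, ["NN", "VB"]))   # 50% noun, 50% verb (unchanged)

# "cans": five plural-noun observations and no verb observations, but the
# lexicon (with its morphological routines) allows both readings.
print(smooth({"NNS": 5}, ["NNS", "VBZ"]))         # 6/7 plural noun, 1/7 verb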
<Paragraph position="3"> Proper nouns and capitalized words are particularly problematic; some capitalized words are proper nouns and some are not. Estimates from the Brown Corpus can be misleading. For example, the capitalized word &quot;Acts&quot; is found twice in the Brown Corpus, both times as a proper noun (in a title). It would be a mistake to infer from this evidence that the word &quot;Acts&quot; is always a proper noun. For this reason, capitalized words with small frequency counts (< 20) were thrown out of the lexicon.</Paragraph> <Paragraph position="4"> There are other problems with capitalized words.</Paragraph> <Paragraph position="5"> Consider, for example, a sentence beginning with the capitalized word &quot;Fall&quot;; what is the probability that it is a proper noun (i.e., a surname)? Estimates from the Brown Corpus are of little help here since &quot;Fall&quot; never appears as a capitalized word and it never appears as a proper noun. Two steps were taken to alleviate this problem. First, the frequency estimates for &quot;Fall&quot; are computed from the estimates for &quot;fall&quot; plus 1 for the proper noun possibility. Thus, &quot;Fall&quot; has frequency estimates of: ((1 . &quot;NP&quot;) (1 . &quot;JJ&quot;) (65 . &quot;VB&quot;) (72 . &quot;NN&quot;)) because &quot;fall&quot; has the estimates of: ((1 . &quot;JJ&quot;) (65 . &quot;VB&quot;) (72 . &quot;NN&quot;)). Second, a prepass was introduced which labels words as proper nouns if they are &quot;adjacent to&quot; other capitalized words (e.g., &quot;White House,&quot; &quot;State of the Union&quot;) or if they appear several times in a discourse and are always capitalized.</Paragraph> <Paragraph position="6"> The lexical probabilities are not the only probabilities that require smoothing. Contextual frequencies also seem to follow Zipf's Law.</Paragraph> <Paragraph position="7"> That is, for the set of all sequences of three parts of speech, we have plotted the frequency of each sequence against its rank on log-log paper and observed the classic (approximately) linear relationship with a slope of (almost) -1. It is clear that the contextual frequencies require smoothing. Zeros should be avoided.</Paragraph> </Section> </Paper>