File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/88/c88-1071_metho.xml
Size: 24,566 bytes
Last Modified: 2025-10-06 14:12:06
<?xml version="1.0" standalone="yes"?> <Paper uid="C88-1071"> <Title>SPEECH RECOGNITION AND THE FREQUENCY OF RECENTLY USED WORDS A MODIFIED MARKOV MODEL FOR NATURAL LANGUAGE</Title> <Section position="1" start_page="0" end_page="0" type="metho"> <SectionTitle> SPEECH RECOGNITION AND THE FREQUENCY OF RECENTLY USED WORDS A MODIFIED MARKOV MODEL FOR NATURAL LANGUAGE </SectionTitle> <Paragraph position="0"/> </Section> <Section position="2" start_page="0" end_page="0" type="metho"> <SectionTitle> Abstract </SectionTitle> <Paragraph position="0"> Speech recognition systems incorporate a language model which, at each stage of the recognition task, assigns a probability of occurrence to each word in the vocabulary. A class of Markov language models identified by Jelinek has achieved considerable success in this domain.</Paragraph> <Paragraph position="1"> A modification of the Markov approach, which assigns higher probabilities to recently used words, is proposed and tested against a pure Markov model.</Paragraph> <Paragraph position="2"> Parameter calculation and comparison of the two models both involve use of the LOB Corpus of tagged modern English.</Paragraph> </Section> <Section position="3" start_page="0" end_page="348" type="metho"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Speech recognition systems consist of two components. An acoustic component matches the most recent acoustic input to words in its vocabulary, producing a list of the most plausible word candidates together with a probability for each. The second component, which incorporates a language model, utilizes the string of previously identified words to estimate for each word in the vocabulary the probability that it will occur next. Each word candidate originally selected by the acoustic component is thus associated with two probabilities, the first based on its resemblance to the observed signal and the second based on the linguistic plausibility of that word occurring immediately after the previously recognized words. Multiplication of these two probabilities produces an overall probability for each word candidate.</Paragraph> <Paragraph position="1"> Our work focuses on the language model incorporated in the second component. The language model we use is based on a class of Markov models identified by Jelinek, the &quot;n-gram&quot; and &quot;3g-gram&quot; models [Jelinek 1985, 1983]. These models, whose parameters are calculated from a large training text, produce a reasonable non-zero probability for every word in the vocabulary during every stage of the speech recognition task. Our model incorporates both a Markov 3g-gram component and an added &quot;cache&quot; component which tracks short-term fluctuations in word frequency.</Paragraph> <Paragraph position="2"> We adopted the hypothesis that a word used in the recent past is much more likely to be used soon than either its overall frequency in the language or a Markov model would suggest.</Paragraph> <Paragraph position="3"> The cache component of our model estimates the probability of a word from its recent frequency of use.
The overall model uses a weighted average of the Markov and cache components in calculating word probabilities, where the relative weights assigned to each component depend on the part of speech (POS).</Paragraph> <Paragraph position="4"> For each POS, the overall model may therefore place more reliance on the cache component than on the Markov component, or vice versa; the relative weights are obtained empirically for each POS from a training text. This dependence on POS arises from the hypothesis that a content word, such as a particular noun or verb, will occur in bursts. Function words, on the other hand, would be spread more evenly across a text or a conversation; their short-term frequencies of use would vary less dramatically from their long-term frequencies. One of the aims of our research was to assess this hypothesis empirically. If it is correct, the relative weight calculated from the training text for the cache component for most content POSs will be higher than the cache weighting for most function POSs.</Paragraph> <Paragraph position="5"> We intend to compare the performance of a standard 3g-gram Markov model with that of our model (containing the same Markov model along with a cache component) in calculating the probability of 100 texts, each approximately 2000 words long. The texts are taken from the Lancaster-Oslo/Bergen (LOB) Corpus of modern English [Johansson et al. 1986, 1982]; the rest of the corpus is utilized as a training text which determines the parameters of both models. Comparison of the two sets of probabilities will allow one to assess the extent of improvement over the pure Markov model achieved by adding a cache component. Furthermore, the relative weights calculated from the training text for the two components of the combined model indicate those POSs for which short-term frequencies of word use differ drastically from long-term frequencies, and those for which word frequencies stay nearly constant over time.</Paragraph> </Section> <Section position="4" start_page="348" end_page="348" type="metho"> <SectionTitle> 2 A Natural Language Model with Markov and </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="348" end_page="348" type="sub_section"> <SectionTitle> Cache Components </SectionTitle> <Paragraph position="0"> The &quot;trigram&quot; Markov language model for speech recognition developed by F. Jelinek and his colleagues uses the context provided by the two preceding words to estimate the probability that the word Wi occurring at time i is a given vocabulary item W. Assume recursively that at time i we have just recognized the word sequence W0, ..., Wi-2, Wi-1. The trigram model approximates P(Wi = W | W0, ..., Wi-2, Wi-1) by f(Wi = W | Wi-2, Wi-1), where the frequencies f are calculated from a huge &quot;training text&quot; before the recognition task takes place.</Paragraph> <Paragraph position="1"> One adaptation of the trigram model employs trigrams of POSs to predict the POS of Wi, and frequency of words within each POS to predict Wi itself.
Thus, this &quot;3g-gram&quot; model gives P(Wi = W | W0, ..., Wi-2, Wi-1) ≈ SUM over gj in G of f(g(Wi) = gj | gi-2, gi-1) * f(Wi = W | g(Wi) = gj).</Paragraph> <Paragraph position="3"> Here G denotes the set of all parts of speech, gj denotes a particular part of speech, and g(Wi) denotes the part of speech category to which word Wi belongs (abbreviated to gi from now on); f denotes a frequency calculated from the training text.</Paragraph> <Paragraph position="4"> This &quot;3g-gram&quot; model was used by Derouault and Merialdo for French language modeling [Derouault and Merialdo 1986, 1984], and forms the Markov component of our own model. In practice many POS triplets will never appear in the training text but will appear during the recognition task, so Derouault and Merialdo use a weighted average of triplet and doublet POS frequencies plus a low arbitrary constant to prevent zero estimates for the probability of occurrence of a given POS: P(gi = gj | gi-2, gi-1) ≈ l1 * f(gj | gi-2, gi-1) + l2 * f(gj | gi-1) + c, where c is the low constant.</Paragraph> <Paragraph position="6"> The parameters l1 and l2 are not constant but can be made to depend either on the count of occurrences of the sequence gi-2, gi-1, or on the POS of the preceding word, gi-1. In either case these parameters must sum to 0.9999 and can be optimized iteratively; Derouault and Merialdo found that the two weighting methods performed equally well.</Paragraph> <Paragraph position="7"> The 3g-gram component of our model is almost identical to that of Derouault and Merialdo, although the 153 POSs we use are those of the LOB Corpus. We let l1 and l2 depend on the preceding POS gi-1. The cache component keeps track of the recent frequencies of words within each POS; it assigns high probabilities to recently used words. Now, let Cj(W,i) denote the cache-based probability of word W at time i for POS gj. If g(W) ≠ gj then Cj(W,i) = 0 at all times i, i.e. if W does not belong to POS gj, its cache-based probability for that POS is always 0. Similarly, let Mj(W) denote the Markov probability due to the rest of the pure 3g-gram Markov model. This is approximated by Mj(W) ≈ f(Wi = W | g(Wi) = gj), i.e. the frequency of word W among all words with POS gj in the training text.</Paragraph> <Paragraph position="8"> The final, combined model is then P(Wi = W) = SUM over gj in G of P(gi = gj | gi-2, gi-1) * [ kM,j * Mj(W) + kC,j * Cj(W,i) ].</Paragraph> <Paragraph position="10"> Here kM,j + kC,j = 1; kM,j denotes the weighting given to the &quot;frequency within POS&quot; component and kC,j the weighting of the &quot;cache-based probability&quot; component of POS gj. One would expect relatively &quot;insensitive&quot; POSs, whose constituent words do not vary much in frequency over time, to have high values of kM,j and low values of kC,j; the reverse should be true for &quot;sensitive&quot; POSs. As is described in the next section, approximate values of kC,j and kM,j were determined empirically for two POSs gj to see if these expectations were correct.</Paragraph> <Paragraph position="11"> The cache-based probabilities Cj(W,i) were calculated as follows. For each POS, a &quot;cache&quot; (just a buffer) with room for 200 words is maintained. Each new word is assigned to a single POS gj and pushed into the corresponding buffer. As soon as there are 5 words in a cache, it begins to output probabilities which correspond to the relative proportions of words it contains.
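As a concrete illustration, the following is a minimal sketch (in Python) of such a per-POS cache and of the combined probability computation defined above. It is not the authors' implementation: the class and function names are our own, the frequency tables are assumed to be supplied as dictionaries built from the training text, and the smoothed POS probabilities P(gi = gj | gi-2, gi-1) are assumed to come from the Derouault-Merialdo component.

from collections import Counter, deque

class POSCache:
    # Cache for one POS: holds the 200 most recent words of that POS and,
    # once it contains at least 5 words, returns relative-frequency estimates.
    def __init__(self, max_size=200, min_size=5):
        self.buffer = deque(maxlen=max_size)  # oldest words fall out automatically
        self.min_size = min_size

    def push(self, word):
        self.buffer.append(word)

    def prob(self, word):
        # Cache-based probability Cj(W, i): the proportion of the buffer equal to `word`.
        if len(self.buffer) < self.min_size:
            return 0.0
        return Counter(self.buffer)[word] / len(self.buffer)

def combined_word_prob(word, pos_probs, markov_word_freq, caches, k_markov, k_cache):
    # P(Wi = W) = SUM over gj of P(gj | gi-2, gi-1) * [kM,j * Mj(W) + kC,j * Cj(W, i)].
    # pos_probs:         dict gj -> smoothed P(gj | gi-2, gi-1)
    # markov_word_freq:  dict gj -> dict word -> f(W | gj) from the training text
    # caches:            dict gj -> POSCache
    # k_markov, k_cache: dicts gj -> weights, with k_markov[gj] + k_cache[gj] == 1
    total = 0.0
    for gj, p_gj in pos_probs.items():
        m_j = markov_word_freq.get(gj, {}).get(word, 0.0)
        c_j = caches[gj].prob(word) if gj in caches else 0.0
        total += p_gj * (k_markov[gj] * m_j + k_cache[gj] * c_j)
    return total

Note that a word is pushed only into the cache of the single POS assigned to it, so the quality of POS guessing (discussed in Sections 3.3 and 4) directly affects the cache component.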
The lower limit of 5 on the size of the cache before it starts producing probabilities, and the upper size limit of 200, are arbitrary; there are many possible heuristics for producing cache-based probabilities.</Paragraph> <Paragraph position="12"> 3 Implementation and Testing of the Combined</Paragraph> </Section> <Section position="2" start_page="348" end_page="348" type="sub_section"> <SectionTitle> Model 3.1 The LOB Corpus </SectionTitle> <Paragraph position="0"> The Lancaster-Oslo/Bergen Corpus of British English consists of 500 samples of about 2000 words each; each word in the corpus is tagged with exactly one of 153 POSs. The samples were extracted from texts published in Britain in 1961, and have been grouped by the LOB researchers into 15 categories spanning a wide range of English prose [Johansson et al. 1986, 1982]. We split the tagged LOB Corpus into two unequal parts, one of which served as a training text for our models and the other of which was used to test and compare them. The comprehensiveness of the LOB Corpus made it an ideal training text and a tough test of the robustness of the language model.</Paragraph> <Paragraph position="1"> Furthermore, the fact that it has been tagged by an expert team of grammarians and lexicographers freed us from having to devise our own tagging procedure.</Paragraph> </Section> <Section position="3" start_page="348" end_page="348" type="sub_section"> <SectionTitle> 3.2 Parameter Calculation </SectionTitle> <Paragraph position="0"> 400 sample texts form the training text used for parameter calculation; the remaining 100 samples form a testing text used for testing and comparison of the pure 3g-gram model with the combined model. Samples were allocated to the training text and the testing text in a manner that ensured that each had similar proportions of samples belonging to the 15 categories identified by the LOB researchers. All parameters for both the pure 3g-gram model and the combined model were calculated from the 400-sample training text.</Paragraph> <Paragraph position="1"> The two models share a POS prediction component which is estimated by the Derouault-Merialdo method. Triplet and doublet POS frequencies were obtained from 75% (300 of the 400 samples) of the training text; the remaining 25% (100 samples) gave the weights, l1(gi-1) and l2(gi-1), needed for smoothing between these two frequencies. These were computed iteratively using the Forward-Backward algorithm (Derouault and Merialdo [1986], Rabiner and Juang [1986]).</Paragraph> <Paragraph position="2"> Now the pure 3g-gram model is complete - it remains to find kC,j and kM,j for the combined model. These can be calculated by means of the Forward-Backward method from the 400 samples.</Paragraph> </Section> <Section position="4" start_page="348" end_page="348" type="sub_section"> <SectionTitle> 3.3 Testing the Combined Model </SectionTitle> <Paragraph position="0"> As described in 3.2, 80% of the LOB Corpus is used to find the best-fit parameters for a) the pure 3g-gram model and b) the combined model, made up of the 3g-gram model plus a cache component. These two models will then be tested on the remaining 20% of the LOB Corpus as follows. Each is given this portion of the LOB Corpus word by word, calculating the probability of each word as it goes along.
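A minimal sketch of this word-by-word evaluation is given below; it assumes each model is wrapped as a function returning a strictly positive probability for the next word given the words seen so far, and it accumulates log probabilities to avoid numerical underflow over a long text. The helper names, and the expression of the per-word improvement as a geometric-mean factor, are our own assumptions rather than details given in the paper.

import math

def sequence_log_prob(test_words, model_prob):
    # Sum of log probabilities assigned by `model_prob` to each word in turn.
    # model_prob(history, word) -> P(word | history), assumed strictly positive.
    log_prob = 0.0
    history = []
    for word in test_words:
        log_prob += math.log(model_prob(history, word))
        history.append(word)
    return log_prob

def per_word_improvement(log_prob_combined, log_prob_markov, n_words):
    # Average (geometric-mean) factor by which the combined model raises the
    # estimated probability of a word, relative to the pure 3g-gram model.
    return math.exp((log_prob_combined - log_prob_markov) / n_words)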
The probability of this sequence of about 200,000 words as estimated by either model is simply the product of the individual word probabilities; the increase achieved by the latter model (the combined model) over the former (the pure 3g-gram model) is the measure of the improvement due to the addition of the cache component.</Paragraph> <Paragraph position="1"> Note that in order to calculate word probabilities, both models must have guessed the POSs of the two preceding words.</Paragraph> <Paragraph position="2"> Thus every word encountered must be assigned a POS. There are three cases: a) the word did not occur in the tagged training text and therefore is not in the vocabulary; b) the word was in the training text, and had the same tag wherever it occurred; c) the word was in the training text, and had more than one tag (e.g. the word &quot;light&quot; might have been tagged as a noun, verb, and adjective).</Paragraph> <Paragraph position="3"> The heuristics employed to assign tags were as follows: a) in this case, the two previous POSs are substituted in the Derouault-Merialdo weighted-average formula and the program tries all 153 possible tags to find the one that maximizes the probability given by the formula.</Paragraph> <Paragraph position="4"> b) in this case, there is no choice; the tag chosen is the unique tag associated with the word in the training text.</Paragraph> <Paragraph position="5"> c) when the word has two or more possible tags, the tag chosen is the one which makes the largest contribution to the word's probability (i.e. which gives rise to the largest component in the summation of the 3g-gram formula given in Section 2).</Paragraph> <Paragraph position="6"> Thus, although the portion of the LOB Corpus used for testing is tagged, these tags were not employed in the implementation of either model; in both cases the heuristics given above guessed POSs. A separate part of the program compared actual tags with guessed ones in order to collect statistics on the performance of these heuristics.</Paragraph> </Section> </Section> <Section position="5" start_page="348" end_page="350" type="metho"> <SectionTitle> 4 Preliminary Results 1. The first results of our calculations are the values </SectionTitle> <Paragraph position="0"> l1(gi-1) and l2(gi-1) obtained iteratively to optimize the weighting between the POS triplet frequency f(gi | gi-2, gi-1) and the POS doublet frequency f(gi | gi-1) in the estimation of P(gi = gj | gi-2, gi-1). As one might expect, l1(gi-1) tends to be high relative to l2(gi-1) when gi-1 occurs often, because the triplet frequency is quite reliable in this case. For instance, the most frequent tag in the LOB Corpus is &quot;NN&quot;, singular common noun; we have l1(NN) = 0.61. The tag &quot;HVG&quot;, attached only to the word &quot;having&quot;, is fairly rare; we have l1(HVG) = 0.13. However, there are other factors to consider. Derouault and Merialdo state that for gi-1 equal to an article, l1 was relatively low, because we need not know the POS gi-2 to predict that gi is a noun or adjective. Thus doublet frequencies alone were quite reliable in this case. On the other hand, when gi-1 is a negation, knowing gi-2 was very important in making a prediction of gi, because of French phrases like &quot;il ne veut&quot; and &quot;je ne veux&quot;.</Paragraph> <Paragraph position="1"> Our results from English texts show somewhat different patterns. The tag &quot;AT&quot; for singular articles had an l1 that was neither high nor low, 0.47.
The tag &quot;CC&quot; for coordinating conjunctions, including &quot;but&quot;, had a high l1 value, 0.80. Adjectives (&quot;JJ&quot;) and adverbs (&quot;RB&quot;) had l1 values even higher than one would expect on the basis of their high frequencies of occurrence: 0.90 and 0.86 respectively.</Paragraph> <Paragraph position="2"> 2. We collected statistics on the success rate of the pure Markov component in guessing the POS of the latest word (using the tag actually assigned to the word in the LOB Corpus as the criterion). This rate has a powerful impact on the performance of both models, especially the one with a cache component; each incorrectly guessed POS leads to looking in the wrong cache and thus to a cache-based probability of 0. We are particularly interested in forming an idea of how fast this success rate will increase as we increase the size of the training text.</Paragraph> <Paragraph position="3"> Of the words that had occurred at least once in the training text, 83.9% had tags that were guessed correctly (16.1% incorrectly). Words that never occurred in the training text were assigned the correct tag only 22% of the time (78% incorrect). Apparently the information contained in the counts of POS triplets, doublets, and singlets is a good POS predictor when combined with some knowledge of the possible tags a word may have, but not nearly as good on its own.</Paragraph> <Paragraph position="4"> Among the words that appeared at least once in the training text, a surprisingly high proportion - 42.8% - had more than one possible POS. Of these, 66.7% had POSs that were guessed correctly. Thus it might appear that performance is degraded when the program must make a choice between possible tags. This analysis is faulty: a given word might have many POSs, and perhaps the correct one was not found in the training text at all. The most important statistic, therefore, is the proportion of words in the testing text whose tag was guessed correctly among the words that had also appeared with the correct tag in the training text. This proportion is 94.0%.</Paragraph> <Paragraph position="5"> It seems reasonable to regard this as an indication of the upper limit for the success rate of POS prediction with training texts of manageable size; it provides an estimate of the success rate when the two main sources of error (words found in the testing text but not the training text, and words found in both texts which are tagged in the testing text with a POS not attached to them anywhere in the training text) are eliminated.</Paragraph> <Paragraph position="6"> 3. We have not yet tested the full combined model (with a cache component and a Markov component) against the 3g-gram Markov model. However, we have examined the effect on the predictive power of the Markov model of including cache components for two POSs: singular common noun (label &quot;NN&quot; in the LOB Corpus) and preposition (label &quot;IN&quot; in the LOB Corpus). These two were chosen because they occur with high frequency in the Corpus, in which there are 148,759 occurrences of &quot;NN&quot; and 123,440 occurrences of &quot;IN&quot;, and because &quot;NN&quot; is a content word category and &quot;IN&quot; a function word category.
Thus they provide a means of testing the hypothesis outlined in the Introduction, that a cache component will increase predictive power for content POSs but not make much difference for function POSs.</Paragraph> <Paragraph position="7"> For both POSs, the expectation that the 200-word cache will often contain the current word was abundantly fulfilled. On average, if the current word was an NN-word, it was stored in the NN cache 25.8% of the time; if it was an IN-word, it was stored in the IN cache 64.7% of the time. The latter is no surprise - there are relatively few different prepositions - but the former figure is remarkably high, given the large number of different nouns. Note that the figure would be higher if we counted plurals as variants of the singular word (as we may do in future implementations).</Paragraph> <Paragraph position="8"> We have not yet obtained the best-fit weighting for the combined model. However, we tried 3 different combinations for the NN-words and the IN-words. If &quot;a&quot; is the weight for the cache component and &quot;b&quot; the weight for the Markov component, the 3 combinations (a, b) are (0.2, 0.8), (0.5, 0.5), and (0.9, 0.1); the pure Markov model corresponds to the weighting (0.0, 1.0).</Paragraph> <Paragraph position="9"> To assess the performance of each combination for NN-words and IN-words, we calculated i) the log product of the estimated probabilities for NN-words only under each of the 4 formulas and ii) the log product of the estimated probabilities for IN-words only under each of the 4 formulas.</Paragraph> <Paragraph position="10"> It is then straightforward to calculate the improvement per word obtained by using a cache instead of the pure Markov model (the difference between the two log products, divided by the number of words and converted back to a per-word factor).</Paragraph> <Paragraph position="11"> For NN-words, the (0.2, 0.8) weighting yielded an average multiple of 2.3 in the estimated probability of a word in the testing text over the probability as calculated by the pure Markov model; the (0.5, 0.5) weighting yielded a multiple of 2.0 per word, and the (0.9, 0.1) weighting actually decreased the probability by a factor of 1.5 per word.</Paragraph> <Paragraph position="12"> For IN-words, the (0.2, 0.8) weighting gave an average multiple of 5.1, the (0.5, 0.5) weighting a multiple of 7.5, and the (0.9, 0.1) weighting a multiple of 6.2.</Paragraph> <Paragraph position="13"> Conclusions The preliminary results listed above seem to confirm our hypothesis that recently-used words have a higher probability of occurrence than the 3g-gram model would predict. Surprisingly, if the above comparison of the POS categories &quot;NN&quot; and &quot;IN&quot; is a reliable guide, this increased probability is more dramatic in the case of the function-word category than the content-word category. Perhaps the smaller number of different prepositions makes the cache-based probabilities more reliable in this case.</Paragraph> <Paragraph position="14"> Since the cost of maintaining a 200-word cache, in terms of memory and time, is modest, and the increase in predictive power can be great, the approach outlined above should be considered as a simple way of improving on the performance of a 3g-gram language model for speech recognition.
If memory is limited, one would be wise to create caches only for POSs that occur with high frequency and ignore other POSs.</Paragraph> <Paragraph position="15"> Our immediate goal is to build caches for a larger number of POSs, and to obtain the best-fit weighting for each of them, in order to test the full power of the combined model.</Paragraph> <Paragraph position="16"> Eventually, we may explore the possibility of ignoring variations in the exact form of a word, merging the singular form of a noun with its plural, and different tenses and persons of a verb.</Paragraph> <Paragraph position="17"> This line of research has more general implications. The results above seem to suggest that at a given time, a human being works with only a small fraction of his vocabulary.</Paragraph> <Paragraph position="18"> Perhaps if we followed an individual's written or spoken use of language through the course of a day, it would consist largely of time spent in language &quot;islands&quot; or sublanguages, with brief periods of time during which he is in transition between islands. One might attempt to chart these &quot;islands&quot; by identifying groups of words which often occur together in the language. If this work is ever carried out on a large scale, it could lead to pseudo-semantic language models for speech recognition, since the occurrence of several words characteristic of an &quot;island&quot; makes the appearance of all words in that island more probable.</Paragraph> </Section> </Paper>