<?xml version="1.0" standalone="yes"?> <Paper uid="H89-1054"> <Title>Percent. Coverage of Corpus Static Cache Dynamic Cache</Title> <Section position="1" start_page="0" end_page="294" type="abstr"> <SectionTitle> Abstract </SectionTitle> <Paragraph position="0"> This article describes two complementary models that represent dependencies between words in local and non-local contexts. The local dependencies considered are sequences of part-of-speech categories for words. The non-local context of word dependency considered here is that of word recurrence, which is typical in a text. Both are models of phenomena that are to a reasonable extent domain independent, and thus are useful for doing prediction in systems using large vocabularies.</Paragraph> <Paragraph position="1"> Modeling Part of Speech Sequences A common method for modeling local word dependencies is by means of second order Markov models (also known as trigram models). In such a model the context for predicting word w_i at position i in a text consists of the two words w_{i-1}, w_{i-2} that precede it. The model is built from conditional probabilities: P(w_i | w_{i-1}, w_{i-2}). The parameters of a part of speech (POS) model are of the form: P(w_i | C_i) x P(C_i | C_{i-1}, C_{i-2}).</Paragraph> <Paragraph position="2"> That is, for word w_i a POS category C_i is first predicted, based on the POS categories of the two previous words. The word w_i is then predicted in terms of C_i. If the vocabulary consists of the set of N words {v_1, v_2, ..., v_N}, then w_i, w_{i-1} and w_{i-2} range over the elements of this set. For a model containing M parts of speech, {S_1, S_2, ..., S_M}, the variables C_i, C_{i-1} and C_{i-2} likewise range over the M elements.</Paragraph> <Paragraph position="3"> POS language models have been used in speech recognition systems (Dumouchel, Gupta, Lennig and Mermelstein, 1988; Shikano, 1987) and for phoneme-to-text transcription (Derouault and Merialdo, 1986).</Paragraph> <Paragraph position="4"> In these systems the parameters are obtained from the analysis of an annotated training corpus. To create the training corpus a set of POS categories is first defined. A word in the vocabulary may be associated with several POS categories depending on the roles it can play in a sentence. A suitably large corpus of training text is then manually analyzed and each word of the corpus is annotated with an unambiguous POS category according to its function in the text. The Brown Corpus has been analyzed this way, using a set of 87 main categories (Francis and Kucera, 1982). To obtain the parameters of a language model, frequency counts are made and normalized to produce the required sets of parameters. The problem of training a model, and the reliability of the resulting parameters, rests on a laborious manual annotation of a necessarily large amount of training text. To reduce this burden, a bootstrap method can be used (Derouault and Merialdo, 1986).</Paragraph> <Paragraph position="5"> First a relatively small amount of text is annotated, to create a partial model. The partial model is then used to automatically annotate more text, which is then corrected manually and used to re-train the model.</Paragraph> <Paragraph position="6"> Here, an alternative approach has been taken to the training problem, which is based on work by Jelinek (Jelinek, 1985a). In this approach, a POS language model is viewed as a Hidden Markov model (HMM). That is, a word sequence reflects an underlying sequence of parts of speech, which is hidden from the observer.</Paragraph>
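To make the decomposition above concrete, the following is a minimal illustrative sketch in Python of scoring a tagged word sequence under the POS model P(w_i | C_i) x P(C_i | C_{i-1}, C_{i-2}); the probability values, the BOS start symbol and all names are invented for illustration and are not taken from the paper.

# Minimal sketch (toy data) of scoring a tagged word sequence with the
# part-of-speech model P(w_i | C_i) * P(C_i | C_{i-1}, C_{i-2}).
from collections import defaultdict

# P(word | category) and P(category | two previous categories),
# filled with toy values purely for illustration.
p_word_given_cat = defaultdict(float, {
    ("the", "det"): 0.6, ("dog", "noun"): 0.01, ("barks", "verb"): 0.005})
p_cat_given_hist = defaultdict(float, {
    ("det", ("BOS", "BOS")): 0.3, ("noun", ("BOS", "det")): 0.5, ("verb", ("det", "noun")): 0.4})

def sequence_probability(words, cats):
    """Probability of a word sequence given an (already chosen) category sequence."""
    prob = 1.0
    history = ("BOS", "BOS")                    # two sentence-start placeholder categories
    for w, c in zip(words, cats):
        prob *= p_cat_given_hist[(c, history)]  # P(C_i | C_{i-1}, C_{i-2})
        prob *= p_word_given_cat[(w, c)]        # P(w_i | C_i)
        history = (history[1], c)               # slide the two-category context
    return prob

print(sequence_probability(["the", "dog", "barks"], ["det", "noun", "verb"]))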
<Paragraph position="7"> The advantage of considering the model in such terms is that the need for an annotated training corpus is eliminated, resulting in greater flexibility. The model can be trained with alternative sets of POS categories, and can accommodate special POS categories that are defined for specialized domains. Another advantage is that the method can be applied to other languages. Training a model requires the following: 1. A suitably large training corpus of text (that is not annotated).</Paragraph> <Paragraph position="8"> 2. A vocabulary of words occurring in the corpus, where each word is associated with a list of all the possible parts of speech it can assume.</Paragraph> <Paragraph position="9"> 3. Estimates of the frequency of occurrence P(v_1), ..., P(v_N) of the words in the vocabulary. These are used to set initial probability values in the model.</Paragraph> <Paragraph position="11"> States in the HMM correspond to POS categories, and are labeled by the category they represent. The observations generated at a state are those words that can function as the POS category associated with the state. The probability P(C_i = S_x | C_{i-1} = S_y) labels the transition from state S_y to state S_x, and the probability P(w_i = v_j | C_i = S_x) labels the output of word v_j at state S_x.</Paragraph> <Paragraph position="13"> For a word that cannot function as the category S_x, the output probability at S_x is zero. This provides a strong source of constraint on the estimation process. Each state is connected to all other states, which permits the modeling of any sequence of parts of speech.</Paragraph> <Paragraph position="14"> The HMM used for the work is based on the implementation of Rabiner, Levinson and Sondhi (Rabiner et al., 1983). A first order HMM was used as a starting point because the resources required to train it are significantly less than for a second order model, which facilitates experimentation. Twenty-one parts of speech were used, corresponding to traditional POS categories such as determiner, adverb, preposition, noun, noun-plural, verb-progressive, verb-past-participle, etc.</Paragraph> <Paragraph position="15"> In the model described here, the elements of the output matrix have been assigned to word equivalence classes rather than individual words. This is due to the observation that in unrestricted domains, the number N of different word types that can occur is so large that it precludes practical estimation of the required number of parameters P(w_i | C_i). Words are partitioned into L equivalence classes W_1, ..., W_L. All words that can function as a noun only are in one equivalence class, all words that can function as either a noun or adjective are in another class, and so on. A total of L = 129 equivalence classes were used in the model. Each equivalence class has an index in the output matrix. Words have an uneven distribution within these classes, e.g. in the dictionary that was used the word &quot;more&quot; is the only member of the class noun-or-comparative-adjective-or-adverb. Prior to training, initial values of probabilities in all the matrices must be chosen. The transition matrix is set so that all the transitions from a state are equiprobable. Word occurrence probabilities P(v_j) are used to obtain the initial output matrix probabilities. They are first converted to probabilities of word equivalence classes P(W_k). The probability of each equivalence class W_k is then divided equally among the POS categories that are in the equivalence class, to give weights F(W_k, C_i). This reflects the assumption that all words in an equivalence class can initially function equiprobably as any POS category of the class. The output matrix elements for each state are constructed using the various F(W_k, C_i). For each state, the elements are then normalized to sum to unity.</Paragraph>
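The initialization just described can be sketched as follows; the word frequencies, equivalence classes and names below are invented for illustration and are not the paper's, but the steps (pooling P(v_j) into class probabilities P(W_k), splitting each class equally over its member categories to get F(W_k, C_i), then normalizing per state) follow the text.

# Sketch of the output-matrix initialization described above, using made-up
# word frequencies and equivalence classes (names are illustrative only).
word_freq = {"the": 0.05, "dog": 0.002, "more": 0.003, "run": 0.004}
# Each equivalence class is the tuple of POS categories its words can assume.
word_class = {"the": ("det",), "dog": ("noun",),
              "more": ("noun", "adj", "adv"), "run": ("noun", "verb")}

# P(W_k): total probability mass of each equivalence class.
class_prob = {}
for w, f in word_freq.items():
    class_prob[word_class[w]] = class_prob.get(word_class[w], 0.0) + f

# F(W_k, C_i): each class's mass split equally over its member categories.
weights = {}
for k, p in class_prob.items():
    for cat in k:
        weights.setdefault(cat, {})[k] = p / len(k)

# Normalize per state (POS category) so each row of the output matrix sums to 1.
output_matrix = {
    cat: {k: v / sum(row.values()) for k, v in row.items()}
    for cat, row in weights.items()
}
print(output_matrix["noun"])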
<Paragraph position="16"> The process of training an HMM is to statistically infer parameters that maximize the likelihood of the training text, and the Baum-Welch algorithm (Baum, 1972) was used for this purpose. Typically, the model is trained with several iterations over the training data, until the parameter estimates converge. Then, for any given test sentence, the Viterbi algorithm (Viterbi, 1967) can be used to find the most likely state sequence through the model, which maximizes the probability of seeing the test sentence. The state sequence is used as the annotation for the sentence.</Paragraph> <Paragraph position="17"> A text corpus containing 1,180,000 words was used for training and testing the model. The corpus is a collection of electronic mail messages concerning the design of the Common Lisp programming language. It contains 23,190 different words, of which 8,300 occur only once. A dictionary of approximately 30,000 words was also available, each word tagged with its possible parts of speech. An annotated corpus vocabulary was constructed by intersecting the dictionary with the corpus. This resulted in 90% coverage of the total number of words in the corpus, and 45% coverage of the total word types that appear in the corpus. Words that did not have vocabulary entries giving their parts of speech were marked with a special POS category called unlabeled. A total of 21 POS categories were used initially, resulting in 441 transitions. The output matrix contained 129 elements per state (corresponding to the number of word equivalence classes in the annotated corpus vocabulary). The model was trained initially on 3000 sentences from the corpus, using 10 iterations. Preliminary results are good, even for a first order model. An example of annotation of a sentence is shown in Figure 2. [Figure 2: the sentence &quot;I THINK THAT IT IS A FAR ...&quot; annotated with the tags subj-pro verb conj pro verb det adj noun adj noun noun adverb.] The annotation provided by the Viterbi algorithm is shown in bold on the line below the words of the sentence. The other possible parts of speech present in the dictionary are shown below the annotation. It can be seen that the major dependencies have been inferred correctly; the main difference from the preferred annotation lies in the substitution of adjectival for adverbial categories.</Paragraph>
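As a rough illustration of the decoding step that produces such annotations, the following toy Viterbi sketch recovers the most likely state (category) path for a sequence of word equivalence classes; all parameter values and names are invented here, and the paper's actual implementation follows Rabiner, Levinson and Sondhi.

# Minimal Viterbi sketch (toy parameters) for recovering the most likely POS
# state sequence for an observed sequence of word equivalence classes.
def viterbi(obs, states, init, trans, emit):
    """obs: list of observation symbols; returns the most likely state path."""
    best = [{s: (init[s] * emit[s][obs[0]], [s]) for s in states}]
    for o in obs[1:]:
        col = {}
        for s in states:
            # best predecessor for state s given observation o
            p, path = max(
                (best[-1][prev][0] * trans[prev][s] * emit[s][o], best[-1][prev][1])
                for prev in states
            )
            col[s] = (p, path + [s])
        best.append(col)
    return max(best[-1].values())[1]

states = ("det", "noun", "verb")
init = {"det": 0.6, "noun": 0.3, "verb": 0.1}
trans = {"det": {"det": 0.05, "noun": 0.9, "verb": 0.05},
         "noun": {"det": 0.1, "noun": 0.3, "verb": 0.6},
         "verb": {"det": 0.5, "noun": 0.4, "verb": 0.1}}
emit = {"det": {"W_det": 0.95, "W_noun": 0.025, "W_noun_or_verb": 0.025},
        "noun": {"W_det": 0.05, "W_noun": 0.7, "W_noun_or_verb": 0.25},
        "verb": {"W_det": 0.05, "W_noun": 0.15, "W_noun_or_verb": 0.8}}
print(viterbi(["W_det", "W_noun", "W_noun_or_verb"], states, init, trans, emit))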
<Section position="1" start_page="291" end_page="294" type="sub_section"> <SectionTitle> Modeling Word Recurrence </SectionTitle> <Paragraph position="0"> The previous method is useful for modeling sequences of parts of speech. Words were grouped into equivalence classes, and it was noted that there was an uneven distribution of words in different classes. The classes for topic independent words such as the, of, in, etc. are relatively small (some contain only a single word) and as a result these words are easily predicted given their equivalence class. The equivalence classes for topic dependent words such as nouns, adjectives and verbs are very much larger. This section describes a method which addresses one aspect of the problem of predicting topic dependent words. The method is concerned with modeling word recurrence, which is typical in discourse. The use of previous discourse history as a means of prediction has been recognized for some time. Barnett (Barnett, 1973) describes a system which allocates higher scores to &quot;content&quot; words if they have already been mentioned. In the MINDS spoken language dialogue system (Young, Hauptmann, Ward, Smith and Werner, 1989) extensive use is made of previous dialogue history, which is integrated with high level knowledge sources. These include dialogue knowledge, user goals, plans and focus.</Paragraph> <Paragraph position="1"> Our model makes use of previous history at the lower level of word transition probabilities, where automatic statistical estimation can be performed. Word recurrence takes place at greater distances than the two-word context used in the previous model. To account for this, a long range memory called a word cache is used. A word cache may be described as static or dynamic depending on whether its contents are fixed.</Paragraph> <Paragraph position="2"> A frequency ordered list of words occurring in a corpus represents a static word cache. In this type of word cache the overall probability of occurrence of a word related to a specific topic tends to be small, as a large corpus of text may be composed of material on a large number of diverse topics. The contents of a dynamic word cache may be some function of the previous text history. In a small section of the corpus concerned with a topic, words related to the topic tend to be repeated. A word's probability within such a section may then be higher than its overall probability. A dynamic word cache can track this recurrence effect. Dynamic caches can be further split into two kinds. A token cache contains the previous n words (i.e. word tokens) seen in a text, and acts as a window containing the most recent words encountered. A type cache contains the previous n different words (i.e. word types) found, and can be implemented as a linked list of words. When a word is encountered that is already in the cache, the word is simply moved to the front of the list. When a new word type is seen it is placed at the front of the list. If the cache is full, the least recently used word is removed from the tail of the list. An interesting comparison between a static and a dynamic word cache is given by the amount of coverage they provide. A static word cache containing the n overall most frequent words in the corpus of Common Lisp mail messages was compared with an equally sized dynamic type cache. Table 1 shows the coverage given over the whole corpus by the two kinds of cache, for various n. For cache sizes from 90 to 4000 words the dynamic type cache gave slightly better coverage than the optimum frequency ordered cache. This characteristic was also observed when using subsections of the corpus. The dynamic cache does not give 100% coverage when its size is equal to the vocabulary, because at the outset it is empty.</Paragraph> <Paragraph position="3"> Words entering the cache for the first time are not covered. The dynamic type cache has the advantage that it is adaptable over different domains whereas the optimum contents of a static cache must be determined by prior analysis. A dynamic token cache was also compared, and it gave inferior performance to both of the other kinds of cache because space in the cache is occupied by multiple tokens of the same word type.</Paragraph>
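A minimal sketch of such a dynamic type cache follows, using an order-preserving dictionary as the linked list; the class name and interface are ours, not the paper's, but the update policy (move a seen type to the front, evict the least recently used type from the tail when full) is the one described above.

# Sketch of the dynamic type cache: an LRU list of the n most recently seen
# word types.  Names and interface are illustrative only.
from collections import OrderedDict

class TypeCache:
    def __init__(self, size):
        self.size = size
        self.words = OrderedDict()   # least recently used type sits at the front

    def observe(self, word):
        """Update the cache with the next word token in the text."""
        if word in self.words:
            self.words.move_to_end(word)          # refresh: now the most recently used
        else:
            if len(self.words) >= self.size:
                self.words.popitem(last=False)    # evict the least recently used type
            self.words[word] = True

    def __contains__(self, word):
        return word in self.words

cache = TypeCache(size=3)
for token in ["the", "cache", "tracks", "the", "recent", "words"]:
    covered = token in cache     # words entering the cache for the first time are not covered
    cache.observe(token)
    print(token, covered)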
<Paragraph position="4"> Dynamic type caches have been considered as a means of vocabulary personalization (Jelinek, 1985b).</Paragraph> <Paragraph position="5"> In conjunction with interactive word entry, they have also been viewed as a way of obtaining high coverage in a speech recognition system using a very large vocabulary (Merialdo, 1988). The dynamic cache can also be considered as a means for improving word prediction. For this purpose, it is necessary to quantify its effect and allow direct comparison with an alternative. A first order (bigram) model of word dependency was compared with a similar model into which a dynamic cache had been incorporated. A first order model is composed of parameters of the form P(w_i | w_{i-1}). Both w_i and w_{i-1} range over all words in the vocabulary.</Paragraph> <Paragraph position="6"> If (v_x, v_y) is such a word pair, the conditional probability P(w_i = v_x | w_{i-1} = v_y) denotes the corresponding model parameter. A dynamic cache D is a function of previous history, which can be evaluated at i - 1 for any choice v_x of w_i, and its inclusion is modeled as P(w_i | w_{i-1}, D_{i-1}). D is binary valued and indicates for any word type v_x whether or not it is currently in the cache. This results in two sets of probabilities, P(v_x | v_y, D = true) and P(v_x | v_y, D = false), which are estimated from the following ratios of frequency counts in the training text.</Paragraph> <Paragraph position="8"> P(v_x | v_y, D = true) = N(v_x follows v_y, and v_x is in the cache) / N(v_y is seen, and v_x is in the cache); P(v_x | v_y, D = false) = N(v_x follows v_y, and v_x is not in the cache) / N(v_y is seen, and v_x is not in the cache).</Paragraph> <Paragraph position="9"> The dynamic cache is likewise used to select probabilities when doing prediction. The corpus of Common Lisp mail messages was divided in the ratio 80% - 20% for training and testing respectively. Experiments were done with cache sizes ranging from 128 to 4096 words. The criterion of average rank of the correct word was used to compare the &quot;dynamic&quot; model based on P(w_i | w_{i-1}, D) to the &quot;static&quot; model based on ordinary bigrams P(w_i | w_{i-1}). When a word pair (v_x, v_y) that had zero probability in the training text occurred in the test text, unigrams were used to rank v_x. The dynamic model used P(v_x | D) whereas the static model used P(v_x). Several runs were made using different sections of the corpus for training and testing. For any run, less than 5% of words in the test text were absent from the training text. Over different runs, the average rank of the correct word in the static model ranged from 520 to 670. In each run, the dynamic model consistently produced a lower average rank, ranging from 7% to 17% less than that of the static model. The performance of the model varied by less than 1% for cache sizes between 350 and 750.</Paragraph> <Paragraph position="10"> Another method of combining the bigram and unigram probabilities is by means of an interpolated estimator (Jelinek and Mercer, 1980). An interpolated estimator is a weighted linear combination of other estimates. Optimum weights are derived by the Baum-Welch algorithm, and their values give an indication of the overall utility of each component. Parameters of both the static and dynamic models were used in an interpolated estimator. The corpus was divided in the proportion 60% - 20% - 20% respectively for training the model, obtaining the weights, and testing.
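The count ratios given above can be sketched as follows; the code and names are ours, not the paper's, and for brevity the cache here is simply the set of all previously seen word types (a fixed-size type cache, as sketched earlier, would plug in the same way).

# Sketch of estimating the cache-conditioned bigram probabilities
# P(v_x | v_y, D) from a training text, following the count ratios above.
from collections import defaultdict

def train_cache_bigrams(tokens):
    cache = set()                  # simplification: unbounded cache of seen word types
    pair_in, pair_out = defaultdict(int), defaultdict(int)   # N(v_x follows v_y, v_x in/not in cache)
    ctx_total, ctx_in = defaultdict(int), defaultdict(int)   # N(v_y seen); N(v_y seen and v_x in cache)
    for prev, cur in zip(tokens, tokens[1:]):
        cache.add(prev)            # the cache holds every word type seen so far
        ctx_total[prev] += 1
        for w in cache:            # naive pass over the cache, fine for a sketch
            ctx_in[(prev, w)] += 1
        (pair_in if cur in cache else pair_out)[(prev, cur)] += 1

    def prob(vx, vy, vx_in_cache):
        if vx_in_cache:
            num, denom = pair_in[(vy, vx)], ctx_in[(vy, vx)]
        else:
            num, denom = pair_out[(vy, vx)], ctx_total[vy] - ctx_in[(vy, vx)]
        return num / denom if denom else 0.0
    return prob

toks = "the cache model favours words the text has already used in the text".split()
p = train_cache_bigrams(toks)
print(p("text", "the", True), p("text", "the", False))   # higher probability when in the cache

The weights of such an interpolated estimator can be obtained with the standard EM (Baum-Welch style) re-estimation on held-out data; the following sketch, in our own notation, assumes each held-out event supplies the probabilities of the four components being combined.

# Sketch of EM re-estimation of linear interpolation weights on held-out data.
# Each event is a tuple of the four component probabilities, e.g.
# (P(v_x | v_y, D), P(v_x | v_y), P(v_x | D), P(v_x)); values here are toy numbers.
def estimate_weights(events, iterations=20):
    k = len(events[0])
    lambdas = [1.0 / k] * k                       # start from uniform weights
    for _ in range(iterations):
        expected = [0.0] * k
        for comps in events:
            total = sum(l * c for l, c in zip(lambdas, comps))
            if total == 0.0:
                continue
            for j, (l, c) in enumerate(zip(lambdas, comps)):
                expected[j] += l * c / total      # posterior share of component j
        norm = sum(expected)
        lambdas = [e / norm for e in expected]
    return lambdas

events = [(0.02, 0.01, 0.004, 0.001), (0.0, 0.005, 0.01, 0.002), (0.03, 0.0, 0.02, 0.001)]
print(estimate_weights(events))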
The use of the interpolated estimator contributed a further reduction in average rank of a few percent. A typical set of values obtained for the weights of each component is shown below: P_int(v_x | v_y) = 0.3 P(v_x | v_y, D) + 0.25 P(v_x | v_y) + 0.35 P(v_x | D) + 0.1 P(v_x). POS categories provide reasonable models of local word order in English. Further work is necessary to understand the sensitivity of the model to factors such as the granularity of POS classifications. States representing sentence boundaries and punctuation marks would also be useful. Our model would benefit from further refinement of some POS categories. The category auxiliary-verb is currently absent, and the words &quot;might&quot; and &quot;will&quot; are in the equivalence class labeled noun-or-verb. Accordingly, they are classed as exhibiting the same behaviour as words like &quot;thought&quot;, &quot;feel&quot; and &quot;structure&quot;. It may also be advantageous to assign unique equivalence classes to common words, retaining the same POS categories, but allowing them to assume different parameter values in the output matrix. This enables the modeling of words that function in unique ways. Moreover, common words tend to be topic independent and their estimation is not impaired by lack of data. Higher order conditioning is also an obvious area for further work. Due to the limitations of local context and the simplicity of the model, it cannot resolve all kinds of ambiguities found in text. Consider, for instance, the disambiguation of the word &quot;that&quot; in the following sentences: &quot;I think that function is more important than form&quot;.</Paragraph> <Paragraph position="11"> &quot;I think that function is more important than this one&quot;.</Paragraph> <Paragraph position="12"> Traditional parts of speech have been used deliberately, to enable the model to interface to other tools of computational linguistics. In particular, a morphological analyzer is being integrated into the model to aid POS category predictions for words that are not in the vocabulary. The output from the model can also be used as input to other types of linguistic processing.</Paragraph> <Paragraph position="13"> Results indicate that word prediction can be improved by making use of word recurrence, which can be considered as a simple focus-of-attention strategy. For speech recognition it would be a suitable component in a language model that must cover two or more application areas, where each area has its own set of commonly used topic dependent words. The improvements provided by the model depend on the extent to which the usage of a word is repeatedly clustered, resulting in a non-uniform distribution throughout a text. Unlike in the part of speech model, common &quot;function words&quot; do not contribute to the word recurrence model. The two models can be viewed as addressing complementary topic-independent and topic-dependent aspects, and could be integrated into a combined model. The word recurrence results would benefit from verification on other corpora, as a corpus consisting of electronic mail is not the ideal choice for such experiments, and must be used with caution. The phenomenon of word recurrence is subsumed by the more general interaction of topic related words, which may be used to predict each other as well as exhibiting recurrence. This would be an interesting direction for future effort.</Paragraph> </Section> </Section> </Paper>