<?xml version="1.0" standalone="yes"?>
<Paper uid="C04-1014">
  <Title>Modeling of Long Distance Context Dependency</Title>
  <Section position="3" start_page="0" end_page="21" type="metho">
    <SectionTitle>
2 Ngram Modeling
</SectionTitle>
    <Paragraph position="0"> Let , where 's are the words that make up the hypothesis, the</Paragraph>
    <Paragraph position="2"> probability of the word string, , can be computed by using the chain rule:  = , an ngram model estimates the log probability of the word string, log , by re-writing Equation (2.2): )(SP</Paragraph>
    <Paragraph position="4"> )|()()( (2.1) By taking a log function to both sides of Equation (2.1), we have the log probability of the word string, log : )(SP</Paragraph>
    <Paragraph position="6"> where is the string length, is the i -th word in the string .</Paragraph>
    <Paragraph position="8"> So, the classical task of statistical language modeling becomes how to effectively and efficiently predict the next word, given the previous words, that is to say, to estimate expressions of the form . For convenience, is often written as , where h , is called history.</Paragraph>
    <Paragraph position="10"> From the ngram model as in Equation (2.3), we have:</Paragraph>
    <Paragraph position="12"/>
    <Paragraph position="14"> Traditionally, simple statistical models, known as ngram models, have been widely used in speech recognition. Within an ngram model, the probability of a word occurring next is estimated based on the previous words. That is to say,</Paragraph>
    <Paragraph position="16"/>
    <Paragraph position="18"> For example, in bigram model (n=2) the probability of a word is assumed to depend only on the previous word: where</Paragraph>
    <Paragraph position="20"> is the mutual information of the word string pair , and</Paragraph>
    <Paragraph position="22"> the mutual information of the word string pair . is the distance of the two word strings in the word string pair and is equal to 1 when the two word strings are adjacent.</Paragraph>
    <Paragraph position="24"> And the probability can be estimated by using maximum likelihood estimation (MLE) principle:</Paragraph>
    <Paragraph position="26"> Where represents the number of times the sequence occurs in the training text. In practice, due to the data sparseness problem, some smoothing technique (e.g. Good Turing in [Chen and Goodman 1999]) is applied to get more accurate estimation.</Paragraph>
    <Paragraph position="27"> )(*C For a pair ( over a distance where and ), BA d A B are word strings, the mutual information reflects the degree of preference relationship between the two strings over a distance . Several properties of the mutual information are apparent: )d,,( BAMI d Obviously, an ngram model assumes that the probability of the next word is independent of the word string in the history. The difference between bigram, trigram and other ngram models is the value of n. The parameters of an ngram model are thus the probabilities:  A B have. Therefore, we can use the mutual information to measure the preference relationship degree of a word string pair.</Paragraph>
    <Paragraph position="28"> where , and i . That is to say, the mutual information of the next word with the history is assumed equal to the summation of that of the next word with the first word in the history and that of the next word with the rest word string in the history. Then we can re-write Equation (3.3) by using Equation (3.4),</Paragraph>
    <Paragraph position="30"> Using an alternative view of equivalence, an ngram model is one that partitions the data into equivalence classes based on the last n-1 words in the history. Viewed in this way, a bigram induces a partition based on the last word in the history. A trigram model further refines this partition by considering the next-to-last word and so on.</Paragraph>
    <Paragraph position="32"> As the word trigram model is most widely used in current research, we will mainly consider the word trigram-based model. By re-writing Equation (2.2), the word trigram model estimates the log probability of the string</Paragraph>
    <Paragraph position="34"/>
    <Paragraph position="36"/>
    <Paragraph position="38"> Obviously, the first item in equation (3.7) contributes to the log probability of the normal word ngram within an N-words window while the second item is the mutual information which contributes to the long distance context dependency of the next word with the previous words outside the n-words window of the normal word ngram model.</Paragraph>
    <Paragraph position="40"> Compared with the normal word ngram model, the novel MI-Ngram model also incorporates the long distance context dependency by computing the mutual information of the distance dependent word pairs. That is, the MI-Ngram model incorporates the word occurrences beyond the scope of the normal ngram model.</Paragraph>
    <Paragraph position="41"> Since the number of possible distance-dependent word pairs may be very huge, it is impossible for the MI-Ngram model to incorporate all the possible distance-dependent word pairs. Therefore, for the MI-Ngram model to be practically useful, how to select a reasonable number of word pairs becomes most important. Here two approaches are used (Zhou G.D., et al 1998):  )|(log One approach is to restrict the window size of possible word pairs by computing and comparing the conditional perplexities (Shannon C.E. 1951) of the long distance word bigram models for different distances.</Paragraph>
    <Paragraph position="42"> Conditional perplexity is a measure of the average number of possible choices there are for a conditional distribution. The conditional perplexity of a conditional distribution with the conditional entropy is defined to  From Equation (3.8), we can see that the first three items are the values computed by the normal word trigram model as shown in Equation (2.9) and the forth item contributes to summation of the mutual information of the next word with the words in the history . Therefore, we call Equation (3.8) as a MI-Ngram model and rewrite it as:  For a large enough corpus, the conditional perplexity is usually an indication of the amount of information conveyed by the model: the lower the conditional perplexity, the more information it conveys and thus a better model. This is because the model captures as much as it can of that information, and whatever uncertainty remains shows up in the conditional perplexity. Here, the corpus is the XinHua corpus, which has about 57M(million) characters or 29M words. For all the experiments, 80% of the corpus is used for  training while the remaining 20% is used for testing.</Paragraph>
    <Paragraph position="43"> Table 1 shows that the conditional perplexity is lowest for d = 1 and increases significantly as we move through d = 2, 3, 4, 5 and 6. For d = 7, 8, 9, the conditional perplexity increases slightly while further increasing d almost does not increase the conditional perplexity. This suggests that significant information exists only in the last 6 words of the history. In this paper, we restrict the maximum window size to 10.</Paragraph>
    <Paragraph position="44">  Obviously, Equation (3.12) takes the joint probability into consideration. That is, those frequently occurring word pairs are more important and have much more potential to be incorporated into the MI-Ngram model than less frequently occurring word pairs.</Paragraph>
  </Section>
  <Section position="4" start_page="21" end_page="21" type="metho">
    <SectionTitle>
4 Experimentation
</SectionTitle>
    <Paragraph position="0"> We have evaluated the new MI-Ngram model in an experimental speaker-dependent continuous Mandarin speech recognition system (Zhou G.D. et al 1999). For base syllable recognition, 14 cepstral and 14 deltacepstral coefficients, energy(normalized) and delta-energy are used as feature parameters to form a feature vector with dimension 30, while for tone recognition, the pitch period and the energy together with their first order and second order delta coefficients are used to form a feature vector with dimension 6. All the acoustic units are modeled by semi-continuous HMMs (Rabiner 1993). For base syllable recognition, 138 HMMs are used to model 100 context-dependent INITIALs and 38 context-independent FINALs while 5 HMMs are used to model five different tones in Mandarin Chinese. 5,000 short sentences are used for training and another 600 sentences (6102 Chinese characters) are used for testing. All the training and testing data are recorded by one same speaker in an office-like laboratory environment with a sampling frequency of 16KHZ.</Paragraph>
    <Paragraph position="1"> As a reference, the base syllable recognition rate and the tone recognition rate are shown in Table 2 and Table 3, respectively. As the word trigram model is most widely used in current research, all the experiments have been done using a MI-Trigram model which is trained on the XINHUA news corpus of 29 million words(automatically segmented) while the lexicon contains about 28000 words. As a result, the perplexities and Chinese character recognition rates of different MI-Trigram models with the same window size of 10 and different numbers of distance-dependent word pairs are shown in Table 4.</Paragraph>
    <Paragraph position="2">  tone 1 tone 2 tone 3 tone 4 tone 5  Table 4 shows that the perplexity and the recognition rate rise quickly as the number of the long distance-dependent word pairs in the MI-Trigram model increase from 0 to 800,000, and then rise slowly. This suggests that the best 800,000 word pairs carry most of the long distance context dependency and should be included in the MI-Ngram model. It also shows that the recognition rate of the MI-Trigram model with 800,000 word pairs is 1.9% higher than the pure word trigram model (the MI-Trigram model with 0 long distance-dependent word pairs). That is to say, about 20% of errors can be corrected by incorporating only 800,000 word pairs to the MI-Trigram model compared with the pure word trigram model.</Paragraph>
    <Paragraph position="3"> It is clear that MI-Ngram modeling has much better performance than normal word ngram modeling. One advantage of MI-Ngram modeling is that its number of parameters is just a little more than that of word ngram modeling. Another advantage of MI-Ngram modeling is that the number of the word pairs can be reasonable in size without losing too much of its modeling power. Compared to ngram modeling, MI-Ngram modeling also captures the long distance dependency of word pairs using the concept of mutual information.</Paragraph>
  </Section>
class="xml-element"></Paper>