<?xml version="1.0" standalone="yes"?> <Paper uid="H94-1016"> <Title>On Using Written Language Training Data for Spoken Language Modeling</Title> <Section position="3" start_page="0" end_page="94" type="metho"> <SectionTitle> 2. THE WSJ CORPUS </SectionTitle> <Paragraph position="0"> The November 1993 ARPA Continuous Speech Recognition (CSR) evaluation was based on speech and language taken from the Wall Street Journal (WSJ). The standard language model was estimated from about 35 million words of training text extracted from the WSJ from 1987 to 1989.</Paragraph> <Paragraph position="1"> The text was normalized (preprocessed) with a model for what words people use to read open text. For example, &quot;$234.56&quot; was always assumed to be read as &quot;two hundred thirty four dollars and fifty six cents&quot;. &quot;March 13&quot; was always normalized as &quot;March thirteenth&quot; - not &quot;March the thirteenth&quot;, nor &quot;March thirteen&quot;. And so on.</Paragraph> <Paragraph position="2"> The original processed text contains about 160,000 unique words. However, many of these are due to misspellings.</Paragraph> <Paragraph position="3"> Therefore, the text corpus was limited to those sentences that consisted only of the most likely 64,000 words. While this vocabulary is still quite large, the restriction has two beneficial effects.</Paragraph> <Paragraph position="4"> First, it greatly reduces the number of misspellings in the texts. Second, it allows implementations to use 2-byte data fields to represent the words rather than having to use 4 bytes.</Paragraph> <Paragraph position="5"> The &quot;standard&quot; recognition vocabulary was defined as the most likely 20,000 words in the corpus. The standard language model was then defined as a trigram language model estimated specifically for these 20K words. This standard model, provided by Lincoln Laboratory, was to be used for the controlled portion of the recognition tests. In addition, participants were encouraged to generate an improved language model by any means (other than examining the test data).</Paragraph> </Section> <Section position="4" start_page="94" end_page="95" type="metho"> <SectionTitle> 3. RECOGNITION LEXICON </SectionTitle> <Paragraph position="0"> We find that, typically, over 2% of the word occurrences in a development set are not included in the standard 20K-word vocabulary. Naturally, words that are not in the vocabulary cannot be recognized accurately. (At best, we might try to detect that there are one or more unknown words at this point in a sentence, then attempt to recognize the phoneme sequence, and then guess a possible letter sequence for that phoneme sequence. Unfortunately, in English, even if we could recognize the phonemes perfectly, there are many valid ways to spell a particular phoneme sequence.) However, in addition to this word not being recognized, we often see that one or two words adjacent to the missing word are also misrecognized. This is because the recognizer, in choosing a word from its vocabulary, now also has the wrong context for the following or preceding words. In general, we find that the word error rate increases by about 1.5 to 2 times the number of out-of-vocabulary (OOV) words.</Paragraph> <Paragraph position="1"> One simple way to decrease the percentage of OOV words is to increase the vocabulary size. But which words should be added? The obvious solution is to add words in order of their relative frequency within the full text corpus. 
There are several problems that might result from this: 1. The vocabulary might have to be extremely large before the OOV rate is reduced significantly.</Paragraph> <Paragraph position="2"> 2. If the word error rate for the vast majority of the words that are already in the smaller vocabulary increased by even a small amount, it might offset any gain obtained from reducing the OOV rate.</Paragraph> <Paragraph position="3"> 3. The language model probabilities for these additional words would be quite low, which might prevent them from being recognized anyway.</Paragraph> <Paragraph position="4"> We did not have phonetic pronunciations for all of the 64K words. We sent a list of the (approximately 34K) words for which we had no pronunciations to Boston University. They found pronunciations for about half (18K) of these words in their (expanded Moby) dictionary. When we added these words to our WSJ dictionary, we had a total of 50K words that we could use for recognition.</Paragraph> <Paragraph position="5"> The following table shows the percentage of OOV words as a function of the vocabulary size. The measurement was done on the WSJ1 Hub1 &quot;20K&quot; development test, which has 2,464 unique words and a total count of 8,227 word tokens. Because of the unavailability of phonetic pronunciations (mentioned above), the final vocabulary size is the one given in the second column.</Paragraph> <Paragraph position="6"> We were somewhat surprised to see that the percentage of OOV words was reduced to only 0.17% when the lexicon included the most likely 40K words - especially given that many of the most likely words were not available because we did not have phonetic pronunciations for them. Thus, it was not necessary to increase the vocabulary above 40K words.</Paragraph> <Paragraph position="7"> The second worry was that increasing the vocabulary too much might increase the word error rate due to the increased number of choices. For example, normally, if we double the vocabulary, we might expect an increase in word error rate of about 40%! So we performed an experiment in which we used the standard 20K language model for the 5K development data. We found, to our surprise, that the error rate increased only slightly, from 8.7% to 9.3%. Therefore, we felt confident that we could increase the vocabulary as needed.</Paragraph> <Paragraph position="8"> We considered possible explanations for the small increase in error due to a larger vocabulary. We realized that the answer was in the language model. Normally, when we simply increase the vocabulary, the new words have the same probability in the language model as the old words.</Paragraph> <Paragraph position="9"> In this case, however, all the new words that were added had lower probabilities (at least in the unigram model) than the existing words. Consider the two ways in which we might falsely substitute a new word for an old one. If the new word were acoustically similar to one of the words in the test (and therefore similar to a word in the original vocabulary), then the original word would still be recognized correctly because it would always have a higher language model probability. If, on the other hand, the new word were acoustically very different from the word being spoken, then we might expect that our acoustic models would prevent the new word from being chosen over the old word. 
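The vocabulary-size study above is straightforward to reproduce in outline. The following Python sketch is only an illustration, not the system used here: the file names, tokenization, and pronunciation list are hypothetical stand-ins. It selects the N most frequent training words that have phonetic pronunciations and measures the OOV rate of a development transcription against that list.

```python
from collections import Counter

def build_lexicon(training_tokens, words_with_pronunciations, size):
    """Take the `size` most frequent training words that also have a phonetic
    pronunciation, since words without one cannot be put in the recognizer."""
    counts = Counter(training_tokens)
    lexicon = set()
    for word, _ in counts.most_common():
        if word in words_with_pronunciations:
            lexicon.add(word)
            if len(lexicon) == size:
                break
    return lexicon

def oov_rate(dev_tokens, lexicon):
    """Percentage of development-set word tokens that fall outside the lexicon."""
    oov = sum(1 for w in dev_tokens if w not in lexicon)
    return 100.0 * oov / len(dev_tokens)

if __name__ == "__main__":
    # Hypothetical file names; both files are assumed to hold upper-cased,
    # whitespace-separated word tokens.
    train = open("wsj_lm_training.txt").read().split()         # ~35M tokens
    dev = open("hub1_20k_dev_transcripts.txt").read().split()  # 8,227 tokens
    prons = {line.split()[0] for line in open("dictionary_50k.txt")}
    for size in (20000, 30000, 40000):
        lex = build_lexicon(train, prons, size)
        print(f"{size // 1000}K lexicon: OOV rate = {oov_rate(dev, lex):.2f}%")
```

Filtering by available pronunciations before cutting off at N mirrors the constraint described above: the effective 40K lexicon is drawn from the 50K words for which pronunciations existed.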
While the argument makes some sense, we did not expect the loss for increasing the vocabulary from 5K words to 20K words to be so small.</Paragraph> <Paragraph position="10"> Finally, the third question is whether the new words would be recognized when they did occur, since (as mentioned above) their language model probabilities were generally low. In fact, we found that, even though the error rate for these new words was higher than for the more likely words, we were still able to recognize about 50% to 70% of them correctly, presumably based largely on the acoustic model. Thus, the net effect of this was to reduce the word error rate by about 1% to 1.5%, absolute.</Paragraph> </Section> <Section position="5" start_page="95" end_page="95" type="metho"> <SectionTitle> 4. MODELING SPOKEN LANGUAGE </SectionTitle> <Paragraph position="0"> Another effect that we worked on was the difference between the processed text, as defined by the preprocessor, and the words that people actually used when reading WSJ text. In the pilot WSJ corpus, the subjects were prompted with texts that had already been &quot;normalized&quot;, so that there was no ambiguity about how to read a sentence. However, in the WSJ1 corpus, subjects were instructed to read the original texts and to say whatever seemed most appropriate to them.</Paragraph> <Paragraph position="1"> Since the WSJ1 prompting texts were not normalized to deterministic word sequences, subjects showed considerable variability in their reading of the prompting text.</Paragraph> <Paragraph position="2"> However, the standard language model was derived from the normalized text produced by the preprocessor. This resulted in a mismatch between the language model and the actual word sequences that were spoken. While the preprocessor was quite good at predicting what people said most of the time, there were several cases where people used different words than predicted. For example, the preprocessor predicted that strings like &quot;$234&quot; would be read as &quot;two hundred thirty four dollars&quot;. But in fact, most people read this as &quot;two hundred AND thirty four dollars&quot;. For another extreme example, the preprocessor's prediction of &quot;10.4&quot; was &quot;ten point four&quot;, but the subject (in the WSJ1 development data) read this as &quot;ten and four tenths&quot;. There were many other similar examples.</Paragraph> <Paragraph position="3"> The standard model for the tests was the &quot;nonverbalized punctuation&quot; (NVP) model, which assumes that the readers never speak any of the punctuation words. The other model that had been defined was the &quot;verbalized punctuation&quot; (VP) model, which assumed that all of the punctuation was read out loud. This year, the subjects were instructed that they were free to read the punctuation out loud or not, in whatever way they felt most comfortable. It turns out that people did not verbalize most punctuation. However, they regularly verbalized quotation marks, in many different ways that all differed from those predicted by the standard preprocessor. There were also several words that were read differently by subjects. For example, subjects pronounced abbreviations such as &quot;CORP.&quot; and &quot;INC.&quot; as they appeared, while the preprocessor assumed that all abbreviations would be read as full words.</Paragraph> <Paragraph position="4"> We used two methods to model the ways people actually read text. 
The simpler approach was to include the text of the acoustic training data in the language model training.</Paragraph> <Paragraph position="5"> That is, we simply added the 37K sentence transcriptions from the acoustic training to the 2M sentences of training text. The advantage of this method is that it modeled what people actually said. The system was definitely more likely to recognize words or sequences that were previously impossible. The problem with this method was that the amount of transcribed speech was quite small (about 50 times smaller than the original training text). We tried repeating the transcriptions several times, but we found that the effect was not as strong as we would have liked.</Paragraph> <Paragraph position="6"> A more powerful approach was to simulate the effects of the different word choices by simple rules, which were applied to all of the 35M words of language model training text. We chose to use the following rules:</Paragraph> </Section> <Section position="6" start_page="95" end_page="95" type="metho"> <SectionTitle> AND A HALF; AND A QUARTER </SectionTitle> <Paragraph position="0"> Thus, for example, if a sentence contains the pattern &quot;hundred twenty&quot;, we repeated the same sentence with &quot;hundred AND twenty&quot;.</Paragraph> <Paragraph position="1"> The result was that about one fifth of the sentences in the original corpus had some change reflecting a difference in the way subjects read the original text. Thus, the added sentences carried the same weight as an equal amount of the original training text.</Paragraph> <Paragraph position="2"> We found that this preprocessing of the text was sufficient to cover most of those cases where the readers said things differently than the predictions. The recognition results showed that the system now usually recognized the new word sequences and abbreviations correctly.</Paragraph> </Section> <Section position="7" start_page="95" end_page="96" type="metho"> <SectionTitle> 5. INCREASING THE LANGUAGE MODEL TRAINING </SectionTitle> <Paragraph position="0"> While 35M words may seem like a lot of data, it is not enough to cover all of the trigrams that are likely to occur in the testing data. So we considered other sources for additional language modeling text. The only easily accessible data available was an additional three years (1990-1992) of WSJ data from the TIPSTER corpus produced by the Linguistic Data Consortium (LDC).</Paragraph> <Paragraph position="1"> However, there were two problems with using this data.</Paragraph> <Paragraph position="2"> First, since the test data was known to come from 1987-1989, we were concerned that the new text might actually hurt performance due to differences in topics between the two periods. Second, this text had not been normalized with the preprocessor, and we did not have access to the preprocessor that had been used to transform the raw text into word sequences.</Paragraph> <Paragraph position="3"> We decided to use the new text with minimal processing.</Paragraph> <Paragraph position="4"> The text was filtered to remove all tables, captions, numbers, etc. We replaced each initial double quote (&quot;) with &quot;QUOTE&quot; and the matching token with &quot;UNQUOTE&quot; or &quot;ENDQUOTE&quot;, which were the most common ways these marks were read. No other changes were made. We just used the raw text as it was. One benefit of this was that abbreviations were left as they appeared in the text rather than expanded. 
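Both of the text transformations described in this section can be sketched in a few lines of Python. Only the rule spelled out above (inserting &quot;AND&quot; after &quot;hundred&quot;) and the quote substitution are shown; the full rule set, the token casing, and the function names here are illustrative assumptions rather than the actual preprocessing code.

```python
import re

# Rule-based spoken variants for the normalized LM training text.  Each rule
# that changes a sentence produces an extra copy of it, so the variant reading
# carries full training weight alongside the original.
SPOKEN_VARIANT_RULES = [
    (re.compile(r"\bHUNDRED (?!AND\b)"), "HUNDRED AND "),  # "hundred twenty" -> "hundred AND twenty"
    # further rules (e.g., for "AND A HALF", "AND A QUARTER") would be added here
]

def expand_sentence(sentence):
    """Return the original sentence plus one modified copy per rule that fires."""
    variants = [sentence]
    for pattern, replacement in SPOKEN_VARIANT_RULES:
        modified = pattern.sub(replacement, sentence)
        if modified != sentence:
            variants.append(modified)
    return variants

def replace_quotes(sentence):
    """Minimal treatment of the raw 1990-92 text: the opening double quote of
    each pair becomes QUOTE and the matching quote becomes UNQUOTE."""
    tokens, inside_quote = [], False
    for tok in sentence.replace('"', ' " ').split():
        if tok == '"':
            tokens.append("UNQUOTE" if inside_quote else "QUOTE")
            inside_quote = not inside_quote
        else:
            tokens.append(tok)
    return " ".join(tokens)

# Example:
#   expand_sentence("TWO HUNDRED TWENTY DOLLARS")
#     -> ["TWO HUNDRED TWENTY DOLLARS", "TWO HUNDRED AND TWENTY DOLLARS"]
#   replace_quotes('HE SAID " NO COMMENT " YESTERDAY')
#     -> 'HE SAID QUOTE NO COMMENT UNQUOTE YESTERDAY'
```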
Any numbers, dates, dollar amounts, etc., were just considered &quot;unknown&quot; words, and did not contribute to the training. We assumed that we had sufficient examples of numbers in the original text.</Paragraph> <Paragraph position="5"> We found that adding this additional language training data reduced the error by about 7%, relative, indicating that the original 35 million words were not sufficient for the models we were using. Thus, the addition of plain text, even though it was from a different three years and had many gaps due to apparent unknown words, still improved the recognition accuracy considerably.</Paragraph> </Section> <Section position="8" start_page="96" end_page="96" type="metho"> <SectionTitle> 6. RESULTS </SectionTitle> <Paragraph position="0"> The following table shows the benefit of the enlarged 40K lexicon and the enhanced language model training on the OOV rate and the word error for the development test and the evaluation test.</Paragraph> <Paragraph position="1"> Surprisingly, the addition of three years' LM training (from a period post-dating the test data) improved performance even on the utterances that were completely inside the vocabulary. Evidently, even the common trigrams are poorly trained with only the 35-million-word WSJ0 corpus. Overall, our modifications to the lexicon and grammar training reduced the word error by 14-22%.</Paragraph> </Section> <Section position="9" start_page="96" end_page="96" type="metho"> <SectionTitle> 7. Spontaneous Dictation </SectionTitle> <Paragraph position="0"> Another area we investigated was spontaneous dictation.</Paragraph> <Paragraph position="1"> The subjects were primarily former or practicing journalists with some experience at dictation. They were instructed to dictate general and financial news stories that would be appropriate for a newspaper like the WSJ. In general, the journalists chose topics of recent interest. This meant that the original language model was often out of date for the subject matter. As a result, the percentage of OOV words increased (to about 4%), and the language model taken from WSJ text was less appropriate.</Paragraph> <Paragraph position="2"> The OOV words in the spontaneous data were more likely to be proper nouns from recent events that were not covered by the LM training material. To counter this, we added all (1,028) of the new words that were found in the spontaneous portion of the acoustic training data in WSJ1. These were mostly topical names (e.g., Hillary Rodham, NAFTA, etc.).</Paragraph> <Paragraph position="3"> In order to account for some of the differences between the read text and the spontaneous text, and to have language model probabilities for the new words, we added the training transcriptions of the spontaneous dictation (about 8K sentences) to the LM training as well.</Paragraph> <Paragraph position="4"> New weights for the new language model, HMM, and Segmental Neural Network were all optimized on spontaneous development test data. The table below shows that the OOV rate remains near 1% even after the enlargement to a 41K lexicon.</Paragraph> <Paragraph position="5"> As can be seen, increasing the vocabulary size from 20K to 40K significantly reduced the OOV rate. It is important to point out that in this case, we did not have the benefit of a word frequency list for spontaneous speech, and that the source of speech had an unlimited vocabulary. So the reduction in OOV rate is certainly a fair - if not pessimistic - estimate of the real benefit from increasing the vocabulary. 
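A minimal sketch of the lexicon enlargement just described, assuming the spontaneous training transcriptions are available as plain tokenized sentences (the file names and the base 40K word list are illustrative, not the actual resources used):

```python
def augment_lexicon(base_lexicon, spontaneous_sentences):
    """Add every word seen in the spontaneous training transcriptions that is
    missing from the base recognition lexicon (about 1K words in this case,
    mostly topical proper names)."""
    new_words = set()
    for sentence in spontaneous_sentences:
        for word in sentence.split():
            if word not in base_lexicon:
                new_words.add(word)
    return base_lexicon | new_words, new_words

if __name__ == "__main__":
    base = set(open("lexicon_40k.txt").read().split())
    spont = open("wsj1_spontaneous_transcripts.txt").read().splitlines()
    lexicon_41k, added = augment_lexicon(base, spont)
    # The same ~8K transcriptions are also appended to the LM training text so
    # the added words receive nonzero n-gram probabilities.
    print(f"added {len(added)} words; new lexicon size = {len(lexicon_41k)}")
```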
Adding the few new words observed in the spontaneous speech also helped somewhat, but not nearly as much. The sample of only 8,000 sentences is clearly not sufficient to find all the new words that people might use. Presumably, if the sample of spontaneous speech were large enough to derive word frequencies, then we could choose a much better list of 40K words with a lower OOV rate.</Paragraph> <Paragraph position="5"> Overall, the 41K trigram reduces the word error by 23% over the 20K standard trigram on the November '93 CSR S9 evaluation test. We estimate that more than half of this gain was due to the decreased percentage of OOV words, and the remainder was due to the increased language model training, including specific examples of spontaneous dictation.</Paragraph> </Section> </Paper>