<?xml version="1.0" standalone="yes"?>
<Paper uid="W97-0118">
  <Title>The Effects of Corpus Size and Homogeneity on Language Model Quality</Title>
  <Section position="4" start_page="183" end_page="183" type="metho">
    <SectionTitle>
ARTICLES FROM PRACTICAL PC NOV 91--FE
ELECTRONIC INFORMATION RESOURCES AND THE HI
WHAT PERSONAL COMPUTER
WHAT PERSONAL COMPUTER: THE ULTIMATE GUIDE
MISCELLANEOUS ARTICLES ABOUT DESK-TOP PUBLI
ACCOUNTANCY
ACCOUNTANCY
IDEAS IN ACTION PROGRAMMES (03) -- AN ELECT
MULTIMEDIA IN THE 1990S
PEOPLE IN ORGANISATIONS
RESULTS OF PROSTATECTOMY SURVEY -- AN ELECTR
ECOVER BIO-DEGRADABLE HOUSEHOLD CLEANING PR
MEDICAL CONSULTATIONS -- AN ELECTRONIC TRAN
MEDICAL CONSULTATIONS -- AN ELECTRONIC TRAN
SPOKEN MATERIAL FROM RESPONDENT PAMELA2 --
MEDICAL CONSULTATIONS -- AN ELECTRONIC TRAN
ROCKWELL/THE BETTER ALTERNATIVE TO THE FLAT
THE WEEKLY LAW REPORTS 1992 VOLUME 3 [P
AUCTION ROOMS -- AN ELECTRONIC TRANSCRIPTIO
</SectionTitle>
    <Paragraph position="0"> Each line shows the filename, the title of the text, the number of common words and the value for r. At first glance, the results appear to be intuitively satisfying. Of the top ten texts, six have titles that are clearly related to computing, including all of the top five. The remaining four could arguably be classified as Commerce &amp; Finance (which was identified as the second most similar domain to email). However, a suitable tide is no guarantee of suitable contents. As far as can reasonably be expected, the tides constitute a fair and accurate reflection of the contents of each text. Of course, the whole point of this  approach is to develop techniques that do not rely on ambiguous manual annotations such as title or domain, so the presence of suitable floes is merely an initial indication of success.</Paragraph>
    <Paragraph position="1"> One way of evaluating this result is to go through the list and calculate the mean rank of the 61 &amp;quot;Computergram International&amp;quot; texts, which are typical of the sort of texts this technique should identify as being similar to the email corpus. If the technique is working perfectly, the mean rank should be 31. If it is completely random, the mean rank would be 2062. It transpires that the mean rank is 959.85 (std dev = 524.44). Clearly, this result is better than chance, but far from significant. One of the main reasons for this was a tendency to sometimes give high scores to texts that were actually too short to constitute reliable sainples (the BNC attempts to maintain a standard sample size but this is not always possible). A logical modification was therefore to ignore those texts for which the number of common words was below a certain threshold. A number of threshoIds were investigated, and the optimum value (determined empirically) was around 1,370 words. However, even with this modification, the mean rank remained as high as 818.41 (std dev = 407.81). It is possible to reduce this value still further, but only by compromising the overall recall value (i.e. genuine texts are eliminated along with the &amp;quot;noise&amp;quot;).</Paragraph>
    <Paragraph position="2"> However, there is a more fundamental limitation to the above methodology. The rank correlation statistic compares differences in rank, ignoring absolute value (which can be significant). To illustrate, consider a case where the word &amp;quot;of' is ranked 3 in one corpus and 6 is another. This is a very important difference. Conversely, ff &amp;quot;banana&amp;quot; is ranked 10,000 in one corpus and 100,000 in another, this is a very insignificant difference. But the difference of ranks for &amp;quot;of&amp;quot; = 3, for &amp;quot;banana&amp;quot; = 90,000. Clearly this technique is missing something important. Consequently, it was decided to investigate an alternative measure: the Loglikellhood Ratio Statistic.</Paragraph>
    <Paragraph position="3"> The Logllkelihood Ratio, G 2, is a mathematically well-grounded and accurate method for calculating how &amp;quot;surprising&amp;quot; an event is (Dunning, 1993). This is true even when the event has only occurred once (as is often the case with linguistic phenomena). It is an effective measure for the determination of domain-specific terms (e.g. Daille, 1995) and can be also used as a measure of corpus similarity. In the case where two corpora are being compared, it is possible to calculate the G 2 statistic either for single words (using a ~ contingency table) or for a vocabulary of N words (an N&gt;&lt;2 table). The analysis of the 4,000+ BNC fries was therefore repeated using the Loglikelihood (instead of rank correlation) as the similarity measure. This produced the following top and bottom 10 texts:</Paragraph>
    <Paragraph position="5"/>
  </Section>
  <Section position="5" start_page="183" end_page="188" type="metho">
    <SectionTitle>
MEDICAL CONSULTATIONS -- AN ELECTRONIC TRANSCRIPTION 23226
MEDICAL CONSULTATIONS -- AN ELECTRONIC TRANSCRIPTION 23226
MEDICAL CONSULTATIONS -- AN ELECTRONIC TRANSCRIPTION 23228
STAFF MEETING -- AN ELECTRONIC TRANSCRIPTION 23226
MEDICAL CONSULTATIONS -- AN ELECTRONIC TRANSCRIPTION 23230
MEDICAL CONSULTATIONS -- AN ELECTRONIC TRANSCRIPTION 23231
SPOKEN MATERIAL FROM RESPONDENT 716 -- AN ELECTRONIC 23227
BRISTOL UNIVERSITY -- AN ELECTRONIC TRANSCRIPTION 23231
GUARDIAN, ELECTRONIC EDITION OF 19891210; APPSCI MAT 23232
</SectionTitle>
    <Paragraph position="0"> Each fine shows the filename, the title of the text, the length of the contingency table and the value for G 2.</Paragraph>
    <Paragraph position="1"> These are sorted in ascending order since comparing two identical documents would produce a G 2 of zero.</Paragraph>
    <Paragraph position="2">  ! A brief inspection of the titles of the documents at the top of the list would indicate that the metric has not produced an improvemenL Moreover, it transpires that the mean rank of the CI texts is now 1171.98, with std dev = 178.54. However, as before, the number of common words is very small for some of the texts. Therefore, the filter was applied to ignore eases where there were fewer than 1,370 words in eornmon. This produced a mean rank of I25.15 (std. dev. = 75.62), which is significantly lower than that produced by the rank correlation (mean rank = 818.41, std. dev. = 75.621).</Paragraph>
    <Paragraph position="3"> So despite the absence of apparently suitable candidates in the top 10, the overall accuracy of this technique (measured by the mean rank of the 61 CI texts) is higher. The G 2 statistic appears to be more suitable for this type of data since it uses the actual frequency values for the words in the wfls, rather than just their ranks. Other independent sources indicate that the G 2 produces results that appear to correspond reasonably well with human judgement (Dallle, 1995).</Paragraph>
    <Paragraph position="4"> However, both the rank correlation and Loglikelihood Ratio both make use only of unigrarn information. Clearly, much of the information that humans use to measure textual similarity is found not (solely) in the individual word frequencies (unigrarns), but rather in the way they combine (n-grams). The logical next step is therefore to compare word bigrams (or trigrarns) instead of just unigrarn data. A variation on this would be to compare texts using the Loglikelihood applied to bigrams that are not necessarily adjacent, i.e. counting occurrences of wordl and word2 within a limiting distance of each other. Indeed, such methods have been previously used for actually building the LMs themselves, and have been successfully applied to both speech (Rose &amp; Lee, 1994) and handwriting data (Rose &amp; Evett, 1995). Counting words within a limited window would be smoother than using strict bigrarns and eousequently less affected by the problems caused by sparse data (which are inevitable when small, individual text files are compared). Another interesting possibility is to use the LM itself as the similarity metric. From an information theoretic point of view, entropy is a measure of a eorpus's homogeneity, and the cross-entropy between two corpora is a measure of their similarity (Charniak, 1993). After all, when a LM is applied to a test text to produce a perplexity score, this value is a measure of the cross-entropy which reflects how well the LM predicts the words in the text. So if a LM is trained on text that is very similar to the test text, then it should predict the test data well and the perplexity should be low. Conversely, ff the test text is very different from the training text, then the perplexity will be high. The perplexity score can therefore be used to measure textual similarity. Moreover, it has the advantage doing so by considering (typically) uuigram, bigrarn and trigram data. Indeed, this method has already been successfully used within the development of a similarity-based Interact search agent, and preliminary findings indicate that perplexity is indeed an effective corpus similarity measure (Rose &amp; Wyard, 1997).</Paragraph>
    <Paragraph position="5"> However, the use of such an approach is not entirely beyond question. Firstly, the LM is being used as the representation of a training text against which similarity is to be judged, and yet it is, by definition, undertrained and therefore degenerate. Secondly, the method by which similarity is measured should ideally be independent to the method by which success is evaluated. To use perplexity both as a similarity metric and an evaluation metric implies a certain amount of circular reasoning. However, the use of such iterafive techniques is not totally without precedent within the LM eornmunity. Several research groups have reported the successful improvement of LMs using techniques that iteratively tune the LM parameters using new samples of training data (e.g. Jelinek, 1990). So, this approach may transpire to be</Paragraph>
    <Paragraph position="7"> sufficient!y well principled to merit further investigation, i 111 4. Language model quality A LM is built by collecting trigram, bigram &amp; uuigram data from a training corpus. However, it is not always desirable to store all of this data. Thresholds can be set such that some of the lower frequency n-grams are discarded. For example, a trigram cut-off of 5 implies that all the trigrams with frequencies of 5 or fewer in the training data are not used in building the model. Setting lower thresholds allows the model  to focus on more frequent events, and produces a proportionately smaller model. The LMs described in this paper were built using the CMU SLM toolkit (Rosenfeld, 1994) which facilitated the construction of a variety of LMs representing a range of different settings for each of the pertinent parameters. The first of these was the Email LM. This was constructed using a vocabulary of 20,000 words that was derived directly from the ernail training data. The bigram and trigram cutoffs were both set to zero. The second LM was buik from the whole of the BNC, using the same vocabulary as the Email LM (in order to ensure consistency). So although their n-grams had been based on general English rather than Email, their vocabulary was derived from the Email data. For comparison therefore, a third BNC LM was built, using a vocabulary derived directly from the BNC (rather than email). This allowed the comparative evaluation of the contribution of vocabulary vs. n-grams to the LM effectiveness (measured using both perplexity and word error rate). Due to memory constraints it was not possible to build the BNC models with cut-offs lower than 2-2. The fourth LM investigated was the 20k WSJ LM that is available from the Abbot ftp site at Cambridge University.</Paragraph>
    <Paragraph position="8"> The standard measure by which LMs are assessed is by calculating their perplexity using a sample of test data. This process is usually performed off-line, i.e. independently of the speech reeogniser for which the models are intended. For the models described above, testing was performed using the CMU toolkit, by applying each LM to a sample of 10,000 words from the transcriptions of a database of video mail messages, developed by Cambridge University as part of their &amp;quot;Video Mail Retrieval using Voice&amp;quot; project (Jones et al., 1994). Evidently, this data is not actually spoken email, but its domain and genre are nevertheless closely related to email. Unfortunately, it was not possible to calculate the PP of the WSJ LM due to the absence of a readily available version in the correct format.</Paragraph>
    <Paragraph position="9"> A second evaluation method is to integrate the I,M with the speech reeogniser and test the combined system using recorded speech data. The models can be interchanged between trials, allowing comparative evaluation by measuring the word error rate (WER) produced by each model. More precisely, the error rates are measured using two standard metrics, percentage correct and accuracy:</Paragraph>
    <Paragraph position="11"> where: H is the number of correct transcriptions (words in the utterance that are found in the transcription), D is the number of deletions (words in the utterance that are missing from the transcription), S is the number of substitutions (words in the utterance that are replaced by an incorrect word in the transcription), and I is the number of insertions (extra words in the transcription). Accuracy is more critical than %correct in that it directly penalises insertions. Deletions &amp; substitutions reduce the value of H, since H = N - (D+S).</Paragraph>
    <Paragraph position="12"> As mentioned above, the VMR database is a collection of speech data with transcriptions (of which the latter were used in the above evaluation). The speech part contains audio files for 15 speakers, of which 10 were used in the current investigation. The Abbot recogniser was run using each combination of the 10 speakers' data files (as input) and each of the four LMs: email, BNC with email vocabulary, BNC and the WSI LM. The output transcriptions were assessed for %correct and accuracy using the HResults program, which is part of HTK - the Hidden Markov Model Toolkit (Young &amp; Woodland, 1993).</Paragraph>
    <Paragraph position="13"> Table 2 shows the results of this investigation. The results for %correct and accuracy show the combined effect of the recogniser and LM. The contribution of the LM depends on its vocabulary and perplexity. As  the LM changes, it produces different behaviour in the combined system and therefore different types of errors (e.g. insertions, deletions &amp; substitutions). The net effect is that the email LM produces the highest %correct and also the highest accuracy. It is around 5% better (on both measures) than the WSJ LM. This is significant, considering the tiny corpus from which it was derived (2 million vs. 227 million in the case of WSJ). In between these two extremes are the two BNC LMs - the one with the email vocabulary performs slightly better (-0.5%) than the one with the BNC vocabulary.</Paragraph>
    <Paragraph position="14">  The result for the PP testing is highly revealing. As described earlier, a corpus of low homogeneity should produce a LM of higher PP than a corpus of high homogeneity. This is indeed shown to be the case, since the PP for email is 261.58 (homogeneity = 0.362), whereas the PP for the BNC is 227.54 (homogeneity = 0.687). These PP values are calculated using the 10K test data sample from the transcriptions of the VMR project. The higher PP value for email would tend to indicate that this is the poorer LM. However, it is clear that when used on the real spoken data, the email LM provides the lowest error rates. Initial explanations for this centred on the vocabulary, since a higher incidence of out-of-vocabulary (OOV) words can produce a lower PP but a higher WER. However, the email LM performs better (by 0.88% ,</Paragraph>
    <Paragraph position="16"> correct) than the BNC/email LM even though both share the same vocabulary. Two explanations for this are possible. Firstly, there may be n-grams in the email corpus that are simply not found in the BNC (even though the BNC is 50 times larger). Secondly, the email LM may be better because it &amp;quot;wastes&amp;quot; less probability mass on n-grams that never actually occur in the test data. This implies that quality, not quantity, is a major factor in training effective LMs. Further PP testing, possibly using the complete transcriptions of the VMR data is necessary to clarify this issue.</Paragraph>
    <Paragraph position="17"> Evidently, the choice of vocabulary also makes an important contribution. The BNC LM with the email vocabulary performs better (by 0.64% correct) than the BNC LM with the BNC vocabulary, so clearly the email vocabulary provides better coverage of the test data. In fact, it is possible to directly compare the OOV rates with the performances shown above: the BNC LM with the ernail vocabulary has an OOV rate of 1.16% on the VMR data, and a %correct of 54.04. By contrast, the BNC LM with the BNC vocabulary has an OOV rate of 1.69% and a %correct of 53.40. These figures suggest that an increase in OOV rate of 0.56% leads to a reduction in %correct of 0.64%, or, in other words, a 1% increase in OOV rate produces a reduction in %correct of around 1.14%. Interestingly, this figure correlates extremely well with the results of a similar experiment performed by Rosenfeld (1995), who found that a 1% increase in the OOV rate can lead to a 1.2% increase in the word error rate.</Paragraph>
  </Section>
class="xml-element"></Paper>