<?xml version="1.0" standalone="yes"?> <Paper uid="A00-1044"> <Title>Named Entity Extraction from Noisy Input: Speech and OCR</Title> <Section position="4" start_page="316" end_page="316" type="metho"> <SectionTitle> START-OF-SENTENCE and END-OF-SENTENCE </SectionTitle> <Paragraph position="0"> states. In addition to generating the word, states may also generate features of that word.</Paragraph> </Section> <Section position="5" start_page="316" end_page="317" type="metho"> <SectionTitle> START-OF-SENTENCE END-OF-SENTENCE </SectionTitle> <Paragraph position="0"/> </Section> <Section position="6" start_page="317" end_page="317" type="metho"> <SectionTitle> 3 Effect of Textual Clues </SectionTitle> <Paragraph position="0"> The output of each of the speech recognizers is in SNOR (speech normalized orthographic representation) format, a format which is largely unpunctuated and in all capital letters (apostrophes and periods after spoken letters are preserved). When a typical NE extraction system runs on ordinary English text, it uses punctuation and capitalization as features that contribute to its decisions. In order to learn how much degradation in performance is caused by the absence of these features from SNOR format, we performed the following experiment.</Paragraph> <Paragraph position="1"> We took a corpus that had full punctuation and mixed case and preprocessed it to make three new versions: one with all upper case letters but punctuation preserved, one with original case but punctuation marks removed, and one with both case and punctuation removed. We then partitioned all four versions of the corpus into a training set and a held-out test set, using the same partition in all four versions, and measured IdentiFinder's performance.</Paragraph> <Paragraph position="2"> The corpus we used for this experiment was the transcriptions of the second 100 hours of the Broadcast News acoustic modelling data, comprising 114 episodes. We partitioned this data to form a training set of 98 episodes (640,000 words) and a test set of 16 episodes (130,000 words). Because the test transcriptions were created by humans, they have a 0% word error rate. The results are shown in Table 3-1.</Paragraph> <Paragraph position="3"> The removal of case information has the greater effect, reducing performance by 2.3 points, while the loss of punctuation reduces performance by 1.4 points. The loss from removing both features is 3.4 points, less than the sum of the individual degradations. This suggests that there are some events where both mixed case and punctuation are required to lead IdentiFinder to the correct answer.</Paragraph> <Paragraph position="4"> (Table 3-1: IdentiFinder performance (F-measure) on Broadcast News data.)</Paragraph> <Paragraph position="5"> It should be noted that because the data are transcriptions of speech, no version of the corpus contains all the textual clues that would appear in newspaper text like the MUC-7 New York Times data. In particular, numbers are written out in words as they would be spoken, not represented using digits, and abbreviations such as &quot;Dr.&quot;, &quot;Jr.&quot; or &quot;Sept.&quot; are expanded out to their full spoken form. We conclude that the degradation in performance in going from newspaper text to SNOR recognizer output is at least 3.4 points in the 0% WER case, and probably more, given these other missing textual clues.</Paragraph> </Section>
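A minimal sketch of the corpus preprocessing described in Section 3, in Python. The punctuation set, whitespace handling, and sample sentence are assumptions for illustration; SNOR's special treatment of periods after spoken letters is not reproduced here.

    import string

    # Punctuation to strip; apostrophes are kept, roughly mirroring SNOR, which
    # preserves apostrophes (periods after spoken letters are not handled here).
    PUNCT = set(string.punctuation) - {"'"}

    def strip_punct(text):
        """Remove punctuation marks, keeping apostrophes."""
        return "".join(ch for ch in text if ch not in PUNCT)

    def make_conditions(text):
        """Return the four corpus versions used in the Section 3 experiment."""
        return {
            "mixed case + punctuation": text,                       # original text
            "upper case + punctuation": text.upper(),               # case removed
            "mixed case, no punctuation": strip_punct(text),        # punctuation removed
            "upper case, no punctuation": strip_punct(text).upper() # SNOR-like: both removed
        }

    if __name__ == "__main__":
        sample = "Dr. Smith, of I.B.M., didn't visit Washington."
        for condition, version in make_conditions(sample).items():
            print(f"{condition}: {version}")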
<Section position="7" start_page="317" end_page="318" type="metho"> <SectionTitle> 4 Effect of Word Errors </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="317" end_page="318" type="sub_section"> <SectionTitle> 4.1 Optical Character Recognition (OCR) </SectionTitle> <Paragraph position="0"> The OCR experiments were performed using the system described in Makhoul et al. (1998).</Paragraph> <Paragraph position="1"> Recognition was performed at the character level, rather than the word level, so the vocabulary is not closed (unlike the ASR results discussed in subsequent sections). Figure 4-1 shows IdentiFinder's performance under four conditions of varying word error rate (WER): 1. original text (no OCR, 0% WER); 2. OCR from high-quality (laser-printed) text images (2.7% WER); 3. OCR on degraded images (13.7% WER); 4. OCR on degraded images, processed with a weak character language model (19.1% WER).</Paragraph> <Paragraph position="2"> For the second and third conditions, 1.3M characters of Wall Street Journal text were used for OCR language model training; the fourth condition used a much weaker character language model, which accounts for the poorer performance.</Paragraph> <Paragraph position="3"> The interpolated line has been fit to the performance of the OCR-based systems, with a slope indicating 0.6 points of F-measure lost for each percentage point increase in word error. The line has been extrapolated to 0% WER; the actual 0% WER condition is 95.4, which only slightly exceeds the projected value.</Paragraph> <Paragraph position="4"> (Footnote: These figures do not reflect the best possible performance of the OCR system: for example, when testing on degraded data, it would be usual to include representative data in training. This was not a concern for this experiment, however, which focussed on name finding performance.)</Paragraph> </Section> <Section position="2" start_page="318" end_page="318" type="sub_section"> <SectionTitle> 4.2 Automatic Speech Recognition </SectionTitle> <Paragraph position="0"> Figure 4-2 shows NE performance on the output of all speech systems in the 1998 Hub-4 evaluations (Przybocki, et al., 1999). These experiments were run in co-operation with NIST. The interpolated line has been fit to the errorful transcripts, and then extrapolated out to 0% WER speech. As can be seen, the line fits the data extremely well, and has a slope of 0.7 points of F-measure lost for each additional 1% of word error rate. The fact that the extrapolated line slightly overestimates the actual performance at 0% WER (marked by a triangle in the figure) indicates that the degradation may be sub-linear in the range 0-15% WER.</Paragraph> <Paragraph position="1"> (Figure 4-2: NE performance as a function of word error rate, in cooperation with NIST.)</Paragraph> </Section> </Section>
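The degradation analysis in Section 4 amounts to a least-squares line fit over (WER, F-measure) points, extrapolated back to 0% WER. A minimal sketch follows; the data points are invented placeholders, not the measurements behind Figures 4-1 and 4-2, and numpy's polyfit simply stands in for whatever fitting procedure the authors used.

    import numpy as np

    # Hypothetical (word error rate %, NE F-measure) pairs -- placeholders only.
    points = [(3.0, 94.0), (14.0, 87.5), (19.0, 84.0)]

    wer = np.array([p[0] for p in points])
    f_measure = np.array([p[1] for p in points])

    # Least-squares fit of a straight line F = slope * WER + intercept.
    slope, intercept = np.polyfit(wer, f_measure, deg=1)

    print(f"F-measure lost per 1% WER: {-slope:.2f} points")
    # Extrapolating the fitted line back to 0% WER gives the projected
    # performance on perfect transcripts, to compare with the measured value.
    print(f"Projected F-measure at 0% WER: {intercept:.1f}")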
<Section position="8" start_page="318" end_page="318" type="metho"> <SectionTitle> 5 Out of Vocabulary Rates for Names </SectionTitle> <Paragraph position="0"> It is generally agreed that out-of-vocabulary (OOV) words do not have a major impact on the word error rate achieved by large vocabulary speech recognizers doing transcription. The reason is that speech lexicons are designed to include the most frequent words, thus ensuring that OOV words will represent only a small fraction of the words in any test set. However, we have seen that the OOV rate for words that are part of named-entities can be as much as a factor of ten greater than the baseline OOV rate for non-name words. This could make OOV a major problem for NE extraction from speech.</Paragraph> <Paragraph position="1"> To explore this, we measured the percentage of names in the Broadcast News data that contain at least one OOV word as a function of lexicon size. For this purpose, we built lexicons simply by ordering the words of the 1998 Hub-4 Language Modeling data according to frequency and truncating the list at various lengths. The percentage of in-vocabulary events of each type as a function of lexicon size is shown in Table 5-1.</Paragraph> <Paragraph position="2"> Most modern speech recognizers employ a vocabulary of roughly 60,000 words; using a larger lexicon introduces more errors from acoustic perplexity than it fixes through enlarged vocabulary. It is clear from the table that the only name category that might suffer a significant OOV problem with a 60K vocabulary is PERSON. One might imagine that a more carefully constructed lexicon could reduce the OOV rate for PERSONs while still staying within the 60,000-word limit. However, even if a cleverly designed 60K lexicon succeeded in having the name coverage of the frequency-ordered 120K-word lexicon (which contains roughly 40,000 more proper names than the 60K lexicon), it would reduce the PERSON OOV rate by only 4% absolute.</Paragraph> <Paragraph position="3"> Given that PERSONs account for roughly 50% of the named-entities in broadcast news, the maximum gain in F-measure available from doubling the lexicon size is 2 points. Moreover, this gain would require that every PERSON name added to the vocabulary be recognized properly -- an unlikely prospect, since most of these words will not appear in the acoustic training for the recognizer. For these reasons, we conclude that the OOV problem is not a major factor in determining NE performance from speech.</Paragraph> </Section>
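The measurement in Section 5 can be sketched in a few lines: order the language-model words by frequency, truncate at several sizes, and count how many annotated names contain at least one out-of-vocabulary word. The helper names (build_lexicons, name_oov_rate) and the tiny in-line data are illustrative assumptions; the real experiment used the 1998 Hub-4 Language Modeling word list and the annotated Broadcast News names.

    from collections import Counter

    def build_lexicons(corpus_tokens, sizes):
        """Frequency-ordered lexicons truncated at the requested sizes."""
        by_freq = [word for word, _ in Counter(corpus_tokens).most_common()]
        return {size: set(by_freq[:size]) for size in sizes}

    def name_oov_rate(names, lexicon):
        """Fraction of names containing at least one out-of-vocabulary word."""
        if not names:
            return 0.0
        oov = sum(1 for name in names
                  if any(word not in lexicon for word in name.split()))
        return oov / len(names)

    if __name__ == "__main__":
        # Tiny stand-ins for the LM word list and the annotated PERSON names;
        # the real lexicons ranged from tens of thousands to 120K+ words.
        lm_tokens = "THE THE THE PRESIDENT SAID IN WASHINGTON THE PRESIDENT YELTSIN".split()
        person_names = ["BORIS YELTSIN", "MADELEINE ALBRIGHT"]

        for size, lexicon in build_lexicons(lm_tokens, [3, 5]).items():
            rate = name_oov_rate(person_names, lexicon)
            print(f"{size}-word lexicon: {rate:.0%} of PERSON names contain an OOV word")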
<Section position="9" start_page="318" end_page="321" type="metho"> <SectionTitle> 6 Effect of Training Set Size </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="318" end_page="320" type="sub_section"> <SectionTitle> 6.1 Automatic Speech Recognition </SectionTitle> <Paragraph position="0"> We have measured NE performance in the context of speech as a function of training set size and found that performance increases logarithmically with the amount of training data for 15% WER test data as well as for 0% WER input; however, the growth rate is slower for 15% WER test data. We constructed small training sets of various sizes by randomly selecting sets of 6, 12, 25, and 49 episodes from the second 100 hours of annotated Broadcast News training data. We also defined a training set of 98 episodes from the second 100 hours, as well as sets containing the full 98 episodes plus some or all of the first 100 hours of Broadcast News training. Our largest training set contained 1.2 million words, and our smallest a mere 30,000 words. All training data were converted to SNOR format.</Paragraph> <Paragraph position="1"> For each training set, we trained a separate IdentiFinder model and evaluated it on two versions of the 1998 Hub4-IE data -- the 0% WER transcription created by a human, and an ASR transcript with 15% WER. The results are plotted in Figure 6-1. The slopes of the interpolated lines predict that IdentiFinder's performance on 15% WER speech will increase by 1.5 points for each additional doubling of the training data, while performance goes up 1.8 points per doubling of the training for perfect speech input.</Paragraph> <Paragraph position="2"> Possibly, the difference in slope of the two lines arises because the real value of increasing the training set lies in increasing the number of distinct rare names that appear. Once an example is in the training, IdentiFinder is able to extract it and use it in test. However, when the test data is recognizer output, the rare names are less likely to appear in the test, either because they don't appear in the speech lexicon or because they are poorly trained in the speech model and misrecognized. If they don't appear in the test, IdentiFinder can't make full use of the additional training, and thus performance on errorful input increases more slowly than it does on error-free input text.</Paragraph> <Paragraph position="3"> (Figure 6-1: NE performance as a function of training data for speech.)</Paragraph> </Section> <Section position="2" start_page="320" end_page="321" type="sub_section"> <SectionTitle> 6.2 Optical Character Recognition </SectionTitle> <Paragraph position="0"> A similar relationship between training size and performance is seen for the OCR test condition.</Paragraph> <Paragraph position="1"> The training was partitioned by document into equal-sized sets; each set was used to train a separate model, which was then evaluated on the different word error conditions, and performance was averaged across each partition size to produce the data points in Figure 6-2. While the graph shows a logarithmic improvement, as with the ASR experiments, the rate of improvement is substantially less, roughly 0.9 points of F-measure per doubling of the training data. This may be explained by the difference in difficulty between the two tests: even with only 77.5k words of training, the 0% WER performance exceeds that of the ASR system trained on 1.2M words.</Paragraph> </Section> </Section> <Section position="10" start_page="321" end_page="321" type="metho"> <SectionTitle> 7 Effect of Lists </SectionTitle> <Paragraph position="0"> Like most NE extraction systems, IdentiFinder can use lists of strings known to be names to estimate the probability that a word will be a name, given that it appears on a particular list.</Paragraph> <Paragraph position="1"> We trained two models on 1.2 million words of SNOR data, one with lists and one without. We tested on the human transcription (0% WER) and the ASR (15% WER) versions of the 1998 evaluation transcripts. Table 7-1 shows the results. We see that on human-constructed transcripts, lists improve the performance by a full point, while on recognizer-produced output, performance goes up by only 0.3 points.</Paragraph> </Section>
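The paper does not describe how IdentiFinder encodes list membership internally, but the general idea of Section 7, treating presence on a known-name list as an additional word feature alongside the word itself (cf. the word features generated by states in Section 2), can be sketched as follows. The list contents, the feature strings, and the function word_features are illustrative assumptions only.

    # Toy gazetteers; in practice these lists are large and externally supplied.
    PERSON_LIST = {"CLINTON", "YELTSIN", "ALBRIGHT"}
    LOCATION_LIST = {"WASHINGTON", "MOSCOW", "BAGHDAD"}

    def word_features(word):
        """Illustrative features a state could generate alongside the word itself.
        With SNOR input there is no case or punctuation, so list membership is
        one of the few textual clues left."""
        features = []
        if word in PERSON_LIST:
            features.append("ON_PERSON_LIST")
        if word in LOCATION_LIST:
            features.append("ON_LOCATION_LIST")
        if not features:
            features.append("NO_LIST_MATCH")
        return features

    if __name__ == "__main__":
        for w in "PRESIDENT CLINTON VISITED MOSCOW".split():
            print(w, word_features(w))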
<Section position="11" start_page="321" end_page="322" type="metho"> <SectionTitle> 8 Related Work and Future Work </SectionTitle> <Paragraph position="0"> To our knowledge, no other information extraction technology has been applied to OCR material.</Paragraph> <Paragraph position="1"> For audio materials, three related efforts were benchmarked on NE extraction from broadcast news. Palmer, et al. (1999) employ an HMM very similar to that reported for IdentiFinder (Bikel et al., 1997, 1999). Renals et al. (1999) report on a rule-based system and an HMM integrated with a speech recognizer. Appelt and Martin (1999) report on the TEXTPRO system, which recognises names using manually written finite state rules.</Paragraph> <Paragraph position="2"> Of these, the Palmer system and TEXTPRO report results on five different word error rates. Both degrade linearly, losing about 0.7 points of F-measure with each 1% increase in WER from ASR. None report the effect of training set size, capitalization, punctuation, or out-of-vocabulary items.</Paragraph> <Paragraph position="3"> Of the four systems, IdentiFinder represents state-of-the-art performance. Of all the systems evaluated, those with the simple architecture of ASR followed by information extraction performed markedly better than the system where extraction was more integrated with ASR.</Paragraph> <Paragraph position="4"> In general, these results compare favorably with results reported in the Message Understanding Conference (Chinchor, et al., 1998). The highest NE score in MUC-7 was 93.39; for 0% WER input, our best score was 90.5 on text without case and punctuation, a condition which costs about 3.4 points.</Paragraph> </Section> </Paper>