<?xml version="1.0" standalone="yes"?>
<Paper uid="P98-1031">
  <Title>Named Entity Scoring for Speech Input</Title>
  <Section position="6" start_page="203" end_page="204" type="evalu">
    <SectionTitle>
3. Experiments and Results
</SectionTitle>
    <Paragraph position="0"> To validate our scoring algorithm, we developed a small test set consisting of the Broadcast News development test for the 1996 HUB4 evaluation (Garofolo 97). The reference transcription (179,000 words) was manually annotated with NE information (6150 entities). We then performed a number of scoring experiments on two sets of transcription/NE hypotheses generated automatically from the same speech data. The first set was the output of a commonly available speech recognition system, automatically tagged for NE by our system Alembic (Aberdeen 95). The second set, made available to us by BBN, was produced by the BYBLOS speech recognizer and the IdentiFinder™ NE extractor (Bikel 97; Kubala 97, 98). In both cases, the NE taggers were run on the reference transcription as well as on the corresponding recognizer's output.</Paragraph>
    <Paragraph position="1"> These data were scored using the original MUC scorer as well as our own scorer run in two modes: the three-component mode described above, with an extent threshold of 1, and a &quot;MUC mode&quot;, intended to be backward-compatible with the MUC scorer. We show the results in Table 2.</Paragraph>
    <Paragraph position="2"> First, we note that when the underlying texts are identical (columns A and I), our new scoring algorithm in MUC mode produces the same result as the MUC scorer. In normal mode, the scores for the reference text are, of course, higher, because there are no content errors. Not surprisingly, we note lower NE performance on recognizer output. Interestingly, for both the Alembic system (S+A) and the BBN system (B+I), the degradation is less than we might expect.</Paragraph>
    <Paragraph position="3"> Our scorer is configurable in a variety of ways. In particular, the extent and content components can be combined into a single component, which is judged to be correct only if the individual extent and content are correct. In this mode, and with the extent threshold described above set to zero, the scorer effectively replicates the MUC algorithm.</Paragraph>
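As a concrete illustration of this configuration, the sketch below shows how a per-entity judgment could combine the components as just described. It is a minimal sketch under our own assumptions: the names (Entity, judge_entity, combine_extent_content, extent_threshold) are ours for illustration and do not reflect the scorer's actual interface.

    # Hypothetical sketch of the per-entity judgment described above; the class
    # and option names are illustrative, not the scorer's actual interface.
    from dataclasses import dataclass

    @dataclass
    class Entity:
        ne_type: str   # e.g. "ORGANIZATION"
        start: int     # word offset of the first word of the extent
        end: int       # word offset of the last word of the extent
        content: str   # the entity string itself

    def judge_entity(ref: Entity, hyp: Entity,
                     extent_threshold: int = 1,
                     combine_extent_content: bool = False) -> dict:
        """Judge one aligned reference/hypothesis entity pair, per component."""
        type_ok = ref.ne_type == hyp.ne_type
        # Extent is correct if both boundaries fall within the threshold (in words).
        extent_ok = (abs(ref.start - hyp.start) <= extent_threshold and
                     abs(ref.end - hyp.end) <= extent_threshold)
        content_ok = ref.content == hyp.content

        if combine_extent_content:
            # Combined component: correct only if extent and content are both correct.
            return {"type": type_ok, "extent+content": extent_ok and content_ok}
        return {"type": type_ok, "extent": extent_ok, "content": content_ok}

Under these assumptions, the three-component mode reported above corresponds to the defaults (separate components, extent_threshold=1), while calling judge_entity with extent_threshold=0 and combine_extent_content=True approximates the MUC-style judgment.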
    <Paragraph position="4"> Given the recognizer word error rates shown, one might predict that the NE performance on recognizer output would be no better than the NE performance on the reference text times the word recognition rate (a rate that counts substitution errors only once). One might thus expect scores around 0.31 (i.e., 0.65 × 0.47) for the Alembic system and 0.68 (i.e., 0.85 × 0.80) for the BBN system. However, NE performance is well above these levels for both systems, in both scoring modes.</Paragraph>
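To make the estimate explicit, here is a minimal sketch using only the figures quoted above (reference-text NE scores of 0.65 and 0.85, word recognition rates of 0.47 and 0.80); the variable names are ours.

    # Naive prediction: NE score on recognizer output is bounded by the NE score
    # on the reference text multiplied by the word recognition rate.
    systems = {
        "S+A (Alembic)": {"ref_ne_score": 0.65, "word_recognition_rate": 0.47},
        "B+I (BBN)":     {"ref_ne_score": 0.85, "word_recognition_rate": 0.80},
    }

    for name, s in systems.items():
        predicted = s["ref_ne_score"] * s["word_recognition_rate"]
        print(f"{name}: predicted NE score ~ {predicted:.2f}")
    # S+A (Alembic): predicted NE score ~ 0.31
    # B+I (BBN): predicted NE score ~ 0.68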
    <Paragraph position="5"> We also wished to determine how sensitive the NE score was to the alignment phase. To explore this, we compared the SCLite and phonetic alignment algorithms, run on the S+A data, with increasing levels of extent tolerance, as shown in Table 3. As we expected, the NE scores converged as the extent tolerance was relaxed.</Paragraph>
    <Paragraph position="6"> This suggests that in the case where a phonetic alignment algorithm is unavailable (as is currently the case for languages other than English), robust scoring results might still be achieved by relaxing the extent tolerance.</Paragraph>
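The following hypothetical sketch (our own function name and made-up word offsets, not the scorer's interface or data) illustrates why relaxing the extent tolerance lets the two alignments agree: a one-word boundary shift introduced by a coarser alignment counts as an extent error at tolerance 0 but is forgiven at tolerance 1.

    # Hypothetical illustration of relaxed extent matching under two alignments.
    def extent_ok(ref, hyp, tolerance):
        """ref and hyp are (start, end) word offsets of one entity's extent."""
        return (abs(ref[0] - hyp[0]) <= tolerance and
                abs(ref[1] - hyp[1]) <= tolerance)

    reference_extent = (10, 12)
    phonetic_extent = (10, 12)   # boundaries recovered by the finer alignment
    sclite_extent = (11, 12)     # a one-word shift from the coarser alignment

    for tol in (0, 1, 2):
        agree = (extent_ok(reference_extent, phonetic_extent, tol) ==
                 extent_ok(reference_extent, sclite_extent, tol))
        print(tol, agree)
    # The two alignments disagree at tolerance 0 but agree at tolerances 1 and 2,
    # which is the convergence behaviour observed in Table 3.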
  </Section>
class="xml-element"></Paper>