<?xml version="1.0" standalone="yes"?>
<Paper uid="P03-2028">
  <Title>Spoken Interactive ODQA System: SPIQA</Title>
  <Section position="5" start_page="0" end_page="0" type="evalu">
    <SectionTitle>
3 Evaluation Experiments
</SectionTitle>
    <Paragraph position="0"> Questions consisting of 69 sentences read aloud by seven male speakers were transcribed by our ASR system. The question transcriptions were processed with a screening filter and input into the ODQA engine. Each question consisted of about 19 morphemes on average. The sentences were grammatically correct, formally structured, and had enough information for the ODQA engine to extract the correct answers. The mean word recognition accuracy obtained by the ASR system was 76%.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Screening filter
</SectionTitle>
      <Paragraph position="0"> Screening was performed by removing recognition errors using a confidence measure as a threshold and then summarizing it within an 80% to 100% compaction ratio. In this summarization technique, the word significance and linguistic score for summarization were calculated using text from Mainichi newspapers published from 1994 to 2001, comprising 13.6M sentences with 232M words. The SDCFG for the word concatenation score was calculated using the manually parsed corpus of Mainichi newspapers published from 1996 to 1998, consisting of approximately 4M sentences with 68M words. The number of non-terminal symbols was 100. The posterior probability of each transcribed word in a word graph obtained by ASR was used as the confidence score.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 DDQ module
</SectionTitle>
      <Paragraph position="0"> The word generality score A G was computed using the same Mainichi newspaper text described above, while the SDCFG for the dependency ambiguity score A D for each phrase was the same as that used in (C. Hori et. al., 2003). Eighty-two types of interrogative sentences were created as disambiguating queries for each noun and noun-phrase in each question and evaluated by the DDQ module. The linguistic score L indicating the appropriateness of interrogative sentences was calculated using 1000 questions and newspaper text extracted for three years. The structural ambiguity score A D was calculated based on the SDCFG, which was used for the screening filter.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Evaluation method
</SectionTitle>
      <Paragraph position="0"> The DQs generated by the DDQ module were evaluated in comparison with manual disambiguation queries. Although the questions read by the seven speakers had sufficient information to extract exact answers, some recognition errors resulted in a loss of information that was indispensable for obtaining the correct answers. The manual DQs were made by five subjects based on a comparison of the original written questions and the transcription results given by the ASR system. The automatic DQs were categorized into two classes: APPROPRIATE when they had the same meaning as at least one of the five manual DQs, and INAPPRO-PRIATE when there was no match. The QA performance in using recognized (REC) and screened questions (SCRN) were evaluated by MRR (Mean Reciprocal Rank) (http://trec.nist.gov/data/qa.html).</Paragraph>
      <Paragraph position="1"> SCRN was compared with the transcribed question that just had recognition errors removed (DEL). In addition, the questions reconstructed manually by merging these questions and additional information requested the DQs generated by using SCRN, (DQ) were also evaluated. The additional information was extracted from the original users' question without recognition errors. In this study, adding information by using the DQs was performed only once.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.4 Evaluation results
</SectionTitle>
      <Paragraph position="0"> Table 2 shows the evaluation results in terms of the appropriateness of the DQs and the QA-system MRRs. The results indicate that roughly 50% of the DQs generated by the DDQ module based on the screened results were APPROPRIATE. The MRR for manual transcription (TRS) with no recognition errors was 0.43. In addition, we could improve the MRR from 0.25 (REC) to 0.28 (DQ) by using the DQs only once. Experimental results revealed the potential of the generated DQs in compensating for the degradation of the QA performance due to recognition errors.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>