File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/evalu/04/c04-1110_evalu.xml
Size: 5,942 bytes
Last Modified: 2025-10-06 13:59:09
<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1110"> <Title>Semantic Similarity Applied to Spoken Dialogue Summarization</Title> <Section position="6" start_page="0" end_page="0" type="evalu"> <SectionTitle> 5 Evaluation </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.1 Evaluation Metrics and Baselines </SectionTitle> <Paragraph position="0"> We reformulated the problem in terms of standard information retrieval evaluation metrics:</Paragraph> <Paragraph position="1"> Precision = N_agree / N_r, Recall = N_agree / N_P, F-measure = (2 * Precision * Recall) / (Precision + Recall)</Paragraph> <Paragraph position="2"> N_agree is the number of cases where the individual scoring method and the Gold Standard agree. N_r is computed according to the definition given in Section 3.</Paragraph> <Paragraph position="3"> N_P is the total number of utterances marked as relevant in the Gold Standard. For comparison, three baseline systems were implemented. The first is the RANDOM baseline, in which utterances (their number depending on the compression rate) were selected by chance. The second baseline is based on the TF*IDF scoring metric. A large corpus is required for this method to be fully effective. We therefore computed TF*IDF scores for every word on the basis of 2431 Switchboard dialogues (ca. 19.3 MB of ASCII text). An average TF*IDF score for each utterance of the 20 dialogues in our corpus was then computed by adding the individual scores of all words in the utterance and normalizing by the number of words. The LEAD baseline is based on the intuition that the most important utterances tend to occur at the beginning of the discourse. While this observation holds for the news domain, the LEAD baseline is not necessarily effective for the genre of spontaneous dialogue. However, given the Switchboard data collection setup, the dialogues usually start directly with the discussion of the topic. This hypothesis was also supported by evidence from our own annotation experiments.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.2 Results </SectionTitle> <Paragraph position="0"> Experiments were performed using the semantic similarity package V0.05 (Pedersen, 2002) and WordNet 1.7.1. We employed Gold Standard 2 (see Section 4). Three of the methods, namely res, lin, and jcn, require an information content file (ICF). A method for computing the information content of concepts from large text corpora is given in Resnik (1995). The ICF contains a list of synsets along with their part of speech and frequency counts. We compare the results obtained with two different ICFs: a WordNet-based ICF with pre-computed frequency values, provided with the installation of the similarity package (WD ICF), and an ICF generated specifically from the 2431 Switchboard dialogues with the help of utilities distributed together with the similarity package (SW ICF).</Paragraph>
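The scores discussed in this section were computed with Pedersen's WordNet::Similarity Perl package and the two ICFs described above. As a rough illustration of how an information content file enters the computation, the following Python sketch uses NLTK's WordNet interface, which implements the same res, lin, jcn and lch measures; NLTK's Brown-corpus IC file stands in for the WD/SW ICFs, and the utterance_score helper with its first-sense choice is a simplifying assumption, not the authors' implementation.

# Minimal sketch: noun-based utterance similarity with WordNet measures (NLTK).
# Assumes the nltk 'wordnet' and 'wordnet_ic' data packages are installed.
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

# Information content from the Brown corpus; the paper used a WordNet-based
# ICF (WD ICF) and a Switchboard-derived ICF (SW ICF) instead.
ic = wordnet_ic.ic('ic-brown.dat')

def utterance_score(nouns_a, nouns_b, measure='jcn'):
    """Average pairwise noun-sense similarity between two utterances (illustrative)."""
    scores = []
    for a in nouns_a:
        for b in nouns_b:
            syns_a = wn.synsets(a, pos=wn.NOUN)
            syns_b = wn.synsets(b, pos=wn.NOUN)
            if not syns_a or not syns_b:
                continue  # words without WordNet noun senses contribute nothing
            s1, s2 = syns_a[0], syns_b[0]  # first sense only, for simplicity
            if measure == 'res':
                scores.append(s1.res_similarity(s2, ic))
            elif measure == 'lin':
                scores.append(s1.lin_similarity(s2, ic))
            elif measure == 'jcn':
                scores.append(s1.jcn_similarity(s2, ic))
            elif measure == 'lch':
                scores.append(s1.lch_similarity(s2))  # WordNet structure only, no ICF
    return sum(scores) / len(scores) if scores else 0.0

# Example call (hypothetical nouns): utterance_score(['dog', 'park'], ['animal', 'garden'])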
<Paragraph position="1"> Figures 1 and 2 show the performance of all methods in terms of F-measure. The results of the semantic similarity methods that make use of the information content file generally improve when the Switchboard-based ICF is used. The improvements are especially pronounced for the jcn and lin measures, while this does not seem to be the case for the res measure (depending on the specific compression rate).</Paragraph> <Paragraph position="2"> The summarization methods perform best for compression rates in the interval [20,30]. Given these rates and the Switchboard-based ICF, the competing methods display the following performance (in descending order): jcn, res, lin, lesk, lch, tf*idf, lead, random. For the default ICF the picture is slightly different: res, jcn and lesk, lch, lin, tf*idf, lead, random (see Table 8). lch, which relies on the WordNet structure only, performs worse than the rest of the similarity metrics, which incorporate some corpus evidence. A direct comparison of our evaluation with alternative results, e.g., Zechner's (2002), is problematic. Though Zechner's results are also based on Switchboard, he employs a different evaluation scheme: the evaluation is broken down to the word level, and the results are compared with multiple human annotations instead of a Gold Standard.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.3 Statistical Significance and Error Analysis </SectionTitle> <Paragraph position="0"> To determine whether there is a significant difference between the summarization approaches pairwise, we use a paired t-test (as the parent distribution is unknown). The null hypothesis states that there is no difference between the two distributions.</Paragraph> <Paragraph position="1"> On consulting the t-test tables, we obtain the significance values presented in Table 9, given a compression rate of 25% and the Switchboard ICF. These results indicate that there is no statistically significant difference in performance among the res, lin, jcn and lesk methods. However, all of them significantly outperform the LEAD, TF*IDF and RANDOM baselines.</Paragraph> <Paragraph position="2"> The maximum Recall of the semantic similarity-based summarization methods in the current implementation is limited to about 90%, given COMP = 100%. This means that if the system compiled a 100% &quot;summary&quot;, it would still miss 10% of all utterances marked as relevant. The reason is that the algorithm operates on concepts created by mapping nouns to their WordNet senses. Thus, relevant utterances which contain no nouns on the surface, but instead contain, for example, anaphoric expressions realized as pronouns, are missed in the input. Resolving anaphoric expressions in the pre-processing stage may eliminate this error source.</Paragraph> </Section> </Section> </Paper>
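The pairwise significance test described in Section 5.3 can be reproduced in outline with SciPy's paired (related-samples) t-test, as in the sketch below; the two score lists are hypothetical placeholders for per-dialogue F-measures of two competing methods at a fixed compression rate, not the paper's data.

# Minimal sketch of the paired t-test used to compare two summarizers.
from scipy.stats import ttest_rel

# Hypothetical per-dialogue F-measure scores (illustrative only).
f_scores_jcn = [0.61, 0.58, 0.64, 0.55, 0.60]
f_scores_lead = [0.42, 0.47, 0.39, 0.44, 0.41]

# Null hypothesis: no difference between the two score distributions.
t_stat, p_value = ttest_rel(f_scores_jcn, f_scores_lead)
print(t_stat, p_value)  # reject the null hypothesis if p_value < chosen alpha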