<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1068"> <Title>Filtering Speaker-Specific Words from Electronic Discussions</Title> <Section position="6" start_page="0" end_page="0" type="concl"> <SectionTitle> 5 Conclusion </SectionTitle> <Paragraph position="0"> In this paper, we have identified features of electronic discussions that influence clustering performance, and presented a filtering mechanism that removes adverse influences. The effect of our filtering mechanism was evaluated by means of two experiments: coarse-level clustering and simple information retrieval. Our results show that filtering out the signature words of dominant speakers has a positive effect on clustering and retrieval performance.</Paragraph> <Paragraph position="1"> Although these experiments were performed at a coarser level of granularity than that of our target domain, our results indicate that filtering signature words is a promising pre-processing step for clustering electronic discussions.</Paragraph> <Paragraph position="2"> From a more qualitative perspective, we clearly saw the benefit of the filtering mechanism in the example in Section 3.3 (Tables 2 and 3): when a generation component is used to describe the contents of clusters, the inclusion of author-specific words is uninformative and even confusing.</Paragraph> <Paragraph position="3"> Our approach to filtering is general in the sense that we do not target specific parts of electronic discussions (e.g. the last few lines of a posting) for filtering. We have experimented with a more naive approach that removes all web and email addresses from a posting (they account for a significant portion of a signature). However, this simple heuristic yielded only a small improvement in clustering performance. More importantly, it clearly does not generalise to deal with the problem of identifying and removing author-specific terminology.</Paragraph> </Section> </Paper>