<?xml version="1.0" standalone="yes"?> <Paper uid="W01-0909"> <Title>A cross-comparison of two clustering methods</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Many applications in Natural Language Processing (NLP), such as topic segmentation and identification or text classification, require semantic knowledge about topics in order to be feasible or efficient. As this kind of knowledge is difficult to build manually, we developed a system, SEGAPSITH (Ferret and Grau, 1998a; 1998b), to acquire it automatically. In this field, there are two classes of approaches.</Paragraph> <Paragraph position="1"> Supervised learning requires knowing a priori which topics have to be learned and possessing a tagged corpus as a training set. It is the approach generally adopted by existing systems, such as those participating in TREC or TDT.</Paragraph> <Paragraph position="2"> However, we wanted to design a system that works in an open domain, without any restriction on the subjects to be represented and, thus, to be recognized in texts. SEGAPSITH is therefore grounded in unsupervised, incremental learning based on a conceptual clustering method.</Paragraph> <Paragraph position="3"> After a thematic segmentation that divides a text into segments made of lemmatized words, i.e. thematic units, the system aggregates sufficiently similar thematic units. Aggregation consists of grouping all the words of the different similar units and assigning each word a weight according to its number of occurrences. This weight represents the importance of the word relative to the described topic. 
The incremental aspect allows us to augment the topic knowledge by processing successive corpora without reconsidering the knowledge already acquired.</Paragraph> <Paragraph position="4"> In such an approach, an important problem is the validation of the learned classes.</Paragraph> <Paragraph position="5"> As we do not possess an existing classification that matches the granularity level of our classes, we decided to carry out this evaluation by applying a second classification method to the same data and comparing the results. This second method is entropy-based and requires knowing the number of classes to form.</Paragraph> <Paragraph position="6"> So, if both results are similar enough, although the two methods differ, we can conclude that the learned classes are quite relevant and that both methods are effective.</Paragraph> <Paragraph position="7"> After applying the second method, we possess two sets composed of the same number of classes. Each class groups thematic units and is described by a set of words. We therefore established different criteria to compare them, based either on the words as class descriptors or on the thematic units they gather. After presenting the two methods, we describe our tests and the first results we obtained.</Paragraph> </Section> </Paper>
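The aggregation step and a word-based comparison criterion described above can be sketched as follows. This is a minimal illustrative sketch, not SEGAPSITH's implementation: the function names are invented, and Jaccard overlap is used as one plausible word-descriptor criterion, not necessarily the exact measure the paper defines later.

```python
from collections import Counter

def aggregate(thematic_units):
    """Merge similar thematic units (lists of lemmatized words) into a
    single class: each word receives a weight equal to its number of
    occurrences across the units, reflecting its importance for the
    topic. (Illustrative sketch of the aggregation step.)"""
    weights = Counter()
    for unit in thematic_units:
        weights.update(unit)
    return weights

def descriptor_overlap(class_a, class_b):
    """A possible word-based comparison criterion between two classes:
    Jaccard overlap of their word descriptors (hypothetical choice)."""
    a, b = set(class_a), set(class_b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Two thematic units judged similar enough to be aggregated:
units = [["market", "stock", "price"], ["stock", "price", "trade"]]
weights = aggregate(units)   # e.g. weights["stock"] == 2
```

Words shared by many aggregated units accumulate high weights and thus act as the strongest descriptors of the learned topic, which is what makes a word-based overlap criterion meaningful for comparing the two clusterings.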