<?xml version="1.0" standalone="yes"?> <Paper uid="W03-1118"> <Title>Text Categorization Using Automatically Acquired Domain Ontology</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1. Introduction </SectionTitle> <Paragraph position="0"> A domain ontology, consisting of the important concepts in a domain and the relationships among those concepts, is useful in a variety of applications (Gruber, 1993). However, evaluating the quality of a domain ontology is not straightforward. Reusing an ontology across several applications can be a practical method for evaluating it.</Paragraph> <Paragraph position="1"> Since text categorization is a general tool for information retrieval, knowledge management, and knowledge discovery, in this paper we test the ability of a domain ontology to categorize news clips.</Paragraph> <Paragraph position="2"> Traditional IR methods use the keyword distribution from a training corpus to classify testing documents. However, using only the keywords in a training set cannot guarantee satisfactory results, since different authors may use different keywords. We believe that news clip events are categorized by concepts, not just keywords. Previous work shows that the latent semantic indexing (LSI) method and the n-gram method give good results for Chinese news categorization (Wu et al., 1998).</Paragraph> <Paragraph position="3"> However, the indices of LSI and n-grams are semantically less meaningful. The implicit rules acquired by these methods can be understood by computers, but not by humans. Thus, manual editing for exceptions and personalization is not possible, and it is difficult to further reuse these indices for knowledge management.</Paragraph> <Paragraph position="4"> With a good domain ontology we can identify the concept structure of the sentences in a document. Our idea is to compile the concepts found in the documents of a training set and use these concepts to understand the documents in a testing set. 
However, building a rigorous domain ontology is laborious and time-consuming. Previous work suggests that ontology acquisition is an iterative process, which includes keyword collection and structure reorganization; the ontology is revised, refined, and accumulated by a human editor at each iteration (Noy and McGuinness, 2001). For example, to find the hyponyms of a keyword, the human editor must observe sentences containing that keyword and its related hyponyms (Hearst, 1992). The editor then deduces rules for finding more hyponyms of the keyword. At each iteration the editor refines the rules to obtain better-quality keyword-hyponym pairs. To speed up this labor-intensive approach, semi-automatic approaches have been designed in which a human editor only has to verify the results of the acquisition (Maedche and Staab, 2000).</Paragraph> <Paragraph position="5"> Information Map (InfoMap), a knowledge representation framework from our previous work (Hsu et al., 2001), was designed to integrate various kinds of linguistic, common-sense, and domain knowledge. InfoMap performs natural language understanding and has been applied with good results to many application domains, such as question answering (QA), knowledge management, and organizational memory (Wu et al., 2002). An important characteristic of InfoMap is that it extracts events from a sentence by capturing the topic words, usually subject-verb pairs or hypernym-hyponym pairs, that are defined in the domain ontology.</Paragraph> <Paragraph position="6"> We review the InfoMap ontology framework in Section 2. The ontology acquisition process and extraction rules are introduced in Section 3. We describe ontology-based text categorization in Section 4. Experimental results are reported in Section 5. We conclude our work in Section 6.</Paragraph> </Section></Paper>