<?xml version="1.0" standalone="yes"?> <Paper uid="W03-1118"> <Title>Text Categorization Using Automatically Acquired Domain Ontology</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2. Information Map </SectionTitle> <Paragraph position="0"> InfoMap can serve as a domain ontology as well as an inference engine. InfoMap is designed for NLP applications; its basic function is to identify the event structure of a sentence. We shall briefly describe InfoMap in this section. Figure 1 gives an example ontology of the Central News Agency (CNA), the target domain in our experiment.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 InfoMap Structure Format </SectionTitle> <Paragraph position="0"> As a domain ontology, InfoMap consists of domain concepts and their related sub-concepts, such as categories, attributes, and activities. The relationships between a concept and its associated sub-concepts form a tree-like taxonomy. InfoMap also defines references that connect nodes from different branches, which serves to integrate these hierarchical concepts into a network. InfoMap not only classifies concepts, but also connects them by defining the relationships among them.</Paragraph> <Paragraph position="1"> In InfoMap, concept nodes represent concepts and function nodes represent the relationships between concepts. The root node of a domain is the name of the domain. Following the root node, important topics are stored in hierarchical order. These topics have sub-categories that list related sub-topics in a recursive fashion. Figure 1 is a partial view of the domain ontology of the CNA.</Paragraph> <Paragraph position="2"> Under each domain there are several topics, and each topic might have sub-concepts and associated attributes. Note that in this example the domain ontology is automatically acquired from a domain corpus; hence its quality is poor. 
Figure 2 shows the skeleton structure of a concept in InfoMap.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Event Structure </SectionTitle> <Paragraph position="0"> Since concepts that are semantically related are often clustered together, one can use InfoMap to discern the main event structure in a natural language sentence. We call the process of identifying the event structure the firing mechanism; it matches words in a sentence to both concepts and relationships in InfoMap.</Paragraph> <Paragraph position="1"> Suppose keywords of a concept A and its sub-concept B (or its hyponyms) appear in a sentence. It is likely that the author is describing an event &quot;B of A&quot;. For example, when the words &quot;tire&quot; and &quot;car&quot; appear in a sentence, the sentence is normally about the tire of a car (not tire in the sense of fatigue). Therefore, a word pair with a semantic relationship gives more concrete information than two words without one. Of course, certain syntactic constraints also need to be satisfied. This can be extended to a noun-verb pair or a combination of noun, verb, and adjective. We call such words in a sentence an event structure. This mechanism seems to be especially effective for Chinese sentences.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.3 Domain Speculation </SectionTitle> <Paragraph position="0"> With the help of domain ontologies, one can categorize a piece of text into a specific domain by categorizing each individual sentence within the text. There are many different ways to use a domain ontology to categorize text: it can be used as a dictionary, as a keyword list, and as a structure for identifying NL events. Take a single sentence for example. 
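The firing mechanism described in Section 2.2 can be sketched as follows. This is a minimal illustration only: the toy ontology, the sentence, and the pair-based matching are hypothetical stand-ins, not the actual InfoMap implementation.

```python
# Minimal sketch of the firing mechanism of Section 2.2.
# concept -> set of sub-concepts (attributes, parts, hyponyms); toy data.
ontology = {
    "car": {"tire", "engine"},
    "computer": {"connector", "keyboard"},
}

def fire_events(words):
    """Return 'B of A' event structures for word pairs in the sentence
    that have a concept/sub-concept relationship in the ontology."""
    events = []
    for a in words:
        subs = ontology.get(a, set())
        for b in words:
            if b in subs:
                events.append((b, a))  # fires the event "B of A"
    return events

# "tire" and "car" co-occur, so the pair fires the event "tire of car"
print(fire_events(["the", "tire", "of", "my", "car"]))
```

The pair disambiguates word senses: "tire" alone is ambiguous, but the fired pair ("tire", "car") pins down the part-of reading, which is the intuition behind the event structure.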
We first use InfoMap as a dictionary to perform word segmentation (necessary for Chinese sentences), in which ambiguity can be resolved by checking the domain topics in the ontology.</Paragraph> <Paragraph position="1"> After words are segmented, we can examine the distribution of these words in the ontology and effectively identify the densest cluster. Thus, we can use InfoMap to identify the domains of the sentences and their associated keywords. Section 4.1 will further elaborate on this.</Paragraph> </Section> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3. Automatic Ontology Acquisition </SectionTitle> <Paragraph position="0"> Automatic domain ontology acquisition from a domain corpus has three steps: 1. Identify the domain keywords. 2. Find the related concepts.</Paragraph> <Paragraph position="1"> 3. Merge the correlated activities.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Domain Keyword Identification </SectionTitle> <Paragraph position="0"> The first step of automatic domain ontology acquisition is to identify domain keywords.</Paragraph> <Paragraph position="1"> Identifying Chinese unknown words is difficult since word boundaries are not marked in Chinese text. According to an inspection of a 5-million-word Chinese corpus (Chen et al., 1996), 3.51% of the words are not listed in the CKIP lexicon (a Chinese lexicon with more than 80,000 entries).</Paragraph> <Paragraph position="2"> We use reoccurrence frequency and fan-out numbers to characterize words and their boundaries, based on a PAT-tree (Chien, 1999).</Paragraph> <Paragraph position="3"> We then adopt the TF/IDF classifier to choose domain keywords. The domain keywords serve as the seed topics in the ontology. 
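The TF/IDF-based selection of seed keywords can be sketched as follows. The counts and words are toy examples; the actual classifier (defined formally in Section 4.2) operates over the full domain corpora.

```python
import math

def select_keywords(tf, df, num_domains, top_k=2):
    """Rank candidate words by tf*idf and keep the top_k as seed
    domain keywords (sketch of Section 3.1).
    tf: word -> frequency in the domain corpus
    df: word -> number of domains containing the word"""
    def weight(w):
        return tf[w] * math.log(num_domains / df[w])
    return sorted(tf, key=weight, reverse=True)[:top_k]

# "stock" is frequent and domain-specific, so it ranks first;
# "today" appears in every domain, so its idf (and weight) is low
tf = {"stock": 50, "market": 30, "today": 40}
df = {"stock": 1, "market": 2, "today": 10}
print(select_keywords(tf, df, num_domains=10))
```

Words that occur in every domain get idf = log(1) = 0, so general-purpose words are filtered out automatically, leaving domain-specific seeds.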
We then apply SOAT to automatically obtain related concepts.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 SOAT </SectionTitle> <Paragraph position="0"> To build the domain ontology for a new domain, we need to collect domain keywords and concepts by finding relationships among keywords. We adopt a semi-automatic ontology acquisition tool (SOAT; Wu et al., 2002) to construct a new ontology from a domain corpus.</Paragraph> <Paragraph position="1"> Given a domain corpus, SOAT can build a prototype of the domain ontology.</Paragraph> <Paragraph position="2"> InfoMap uses two major kinds of relationships among concepts: taxonomic relationships (category and synonym) and non-taxonomic relationships (attribute and action). SOAT defines rules, which consist of patterns of keywords and variables, to capture these relationships. The extraction rules in SOAT are morphological rules constructed from part-of-speech (POS) tagged phrase structures.</Paragraph> <Paragraph position="3"> Here we briefly introduce the SOAT process. Input: a POS-tagged domain corpus. Output: a domain ontology prototype. Steps: 1. Select a keyword (usually the name of the domain) in the corpus as the seed to form a potential root set R. 2. Begin the following recursive process: 2.1 Pick a keyword A as the root from R. 2.2 Find a new related keyword B of the root A by the extraction rules and add it into the domain ontology according to the rules. 2.3 If there are no more related keywords, remove A from R. 2.4 Put B into the potential root set. Repeat step 2 until either R becomes empty or the total number of nodes reaches a threshold.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3 Morphological Rules </SectionTitle> <Paragraph position="0"> To find the related words of a keyword, we check the context of the sentences in which the keyword appears. 
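The recursive SOAT loop of Section 3.2 can be sketched roughly as follows. The rule-matching step is stubbed out with a hypothetical `find_related` function and a toy relation table; the real SOAT extraction rules operate on POS-tagged phrase structures.

```python
from collections import deque

def soat(seed, find_related, max_nodes=100):
    """Sketch of the SOAT acquisition loop (Section 3.2): grow an
    ontology from a seed keyword until the potential root set R is
    empty or the node count reaches a threshold. find_related(A)
    stands in for the morphological extraction rules."""
    ontology = {seed: []}          # node -> children
    R = deque([seed])              # 1. potential root set
    while R and max_nodes > len(ontology):
        a = R[0]                   # 2.1 pick a root A from R
        related = [b for b in find_related(a) if b not in ontology]
        if not related:
            R.popleft()            # 2.3 no more related keywords: drop A
            continue
        b = related[0]             # 2.2 add B under A in the ontology
        ontology[a].append(b)
        ontology[b] = []
        R.append(b)                # 2.4 put B into the potential root set
    return ontology

# toy relation table standing in for the extraction rules
links = {"finance": ["stock", "bank"], "stock": ["stock price"]}
print(soat("finance", lambda a: links.get(a, [])))
```

Because every new node B re-enters R, the loop expands the ontology breadth-first from the seed until no rule fires or the size threshold is hit.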
We can then find attributes or hyponyms of the keyword. For example, a noun in front of a keyword (say, computer) may form a more specific concept (say, quantum computer). A noun (say, connector) followed by &quot;of&quot; and a keyword may be an attribute of the keyword (say, connector of computer). See (Wu et al., 2002) for details.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.4 Ontology Merging </SectionTitle> <Paragraph position="0"> Ontologies can be created by merging different resources. One NLP resource that we merge into our domain ontology is the noun-verb event frame (NVEF) database (Tsai and Hsu, 2002).</Paragraph> <Paragraph position="1"> NVEF is a collection of permissible noun-verb sense pairs that appear in general-domain corpora.</Paragraph> <Paragraph position="2"> The noun is the subject or object of the verb.</Paragraph> <Paragraph position="3"> This noun-verb sense-pair collection is domain independent. We can use the nouns as domain keywords and find their correlated verbs. Adding these verbs to the domain ontology makes the ontology more suitable for NLP. The correlated verbs are added under the action function node.</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4. Ontology-Based Text Categorization </SectionTitle> <Paragraph position="0"> To incorporate the domain ontology into text categorization, we have to adjust both the training and testing processes. Section 4.1 describes how to make use of the ontology and the event structure during the training process. Section 4.2 describes how to use the ontology to perform domain speculation. 
Section 4.3 describes how to categorize news clippings.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Feature and Threshold Selection </SectionTitle> <Paragraph position="0"> With the event structures matched (fired) in the domain ontology, we have more features with which to index a text. To select useful features and a proper threshold, we apply the Microsoft Decision Tree algorithm to determine a path's relevance, since this algorithm can extract human-interpretable rules (Soni et al., 2000).</Paragraph> <Paragraph position="1"> Features of the event structure include the event structure score, node score, fired node level, and node type. During the training process, we record all features of the event structures fired by the news clippings in the domain-categorized training corpus. The decision tree shows that a threshold of 0.85 is sufficient for evaluating event structure scores. We use the event structure score to determine whether a path is relevant. According to Figure 3, if the threshold of true probability is 85%, then the event structure score (Pathscore in the figure) should be 65.75. The relevance of a path p is true if p falls in a node of the decision tree whose ratio of true instances is greater than the threshold l.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Domain Speculation </SectionTitle> <Paragraph position="0"> The goal of domain speculation is to categorize a sentence S into a domain D_j according to the combined score of the keywords and the event structure in S. We first calculate the similarity score of S and D_j. The keyword score and the event structure score are calculated independently.</Paragraph> <Paragraph position="2"> We use the TF/IDF classifier (Salton, 1989) to calculate the Keyword_Score of a sentence as follows. First, we use a segmentation module to split a Chinese sentence into words. 
The TF/IDF classifier represents a domain j as a weighted vector (w_j1, w_j2, ..., w_jn), where n is the number of words in this domain and w_jk = tf_jk * idf_k.</Paragraph> <Paragraph position="4"> Here tf_jk is the term frequency (i.e., the number of times word k occurs in domain j). Let DF_k be the number of domains in which word k appears and |D| the total number of domains. idf_k, the inverse document frequency, is given by idf_k = log(|D| / DF_k).</Paragraph> <Paragraph position="8"> This weighting function assigns high values to domain-specific words, i.e., words which appear frequently in one domain and infrequently in others. Conversely, it assigns low weights to words appearing in many domains. The similarity between a domain j and a sentence is then computed as the inner product of the domain's weight vector and the sentence's word vector.</Paragraph> <Paragraph position="10"> The event structure score is calculated by the InfoMap engine. First, find all the nodes in the ontology that match the words in the sentence.</Paragraph> <Paragraph position="11"> Then determine whether there is any concept-attribute pair or hypernym-hyponym pair. Finally, assign a score to each fired event structure according to the string length of the words that match the nodes in the ontology. The selected event structure is the one with the highest score.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 News Categorization </SectionTitle> <Paragraph position="0"> Upon receiving a news clipping C, we split it into sentences S_1, S_2, ..., S_n.</Paragraph> <Paragraph position="2"> The sentences are scored and categorized according to domains. Thus, every sentence S_i has an individual score for each domain D_j, Score(S_i, D_j).</Paragraph> <Paragraph position="4"> We add up these scores of every sentence in the text according to domain, giving us total domain scores for the entire text. 
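The keyword scoring and per-domain aggregation described above can be sketched as follows. This is a minimal version assuming the inner-product similarity and the standard idf_k = log(|D|/DF_k); the toy counts and the exact weighting of the real system are assumptions.

```python
import math

def tfidf_weights(domain_tf, df, num_domains):
    """w_jk = tf_jk * idf_k with idf_k = log(|D| / DF_k)."""
    return {w: tf * math.log(num_domains / df[w])
            for w, tf in domain_tf.items()}

def keyword_score(words, weights):
    """Inner product of the sentence's word vector and the
    domain's weight vector (Keyword_Score sketch)."""
    return sum(weights.get(w, 0.0) for w in words)

def categorize(sentences, domain_weights):
    """Sum per-sentence scores by domain; pick the highest (Sec. 4.3)."""
    totals = {d: 0.0 for d in domain_weights}
    for s in sentences:
        for d, wts in domain_weights.items():
            totals[d] += keyword_score(s, wts)
    return max(totals, key=totals.get)

# toy corpus statistics: "today" occurs in both domains, so idf = 0
df = {"stock": 1, "game": 1, "today": 2}
domains = {
    "finance": tfidf_weights({"stock": 5, "today": 2}, df, 2),
    "sports": tfidf_weights({"game": 4, "today": 3}, df, 2),
}
clipping = [["stock", "today"], ["stock", "game"]]
print(categorize(clipping, domains))
```

Note how "today" contributes nothing to either domain total, so the categorization is driven entirely by the domain-specific words, mirroring the behavior of the weighting function described above.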
The domain with the highest score is the domain into which the text is categorized.</Paragraph> <Paragraph position="5"> The advantage of an ontology compared to other implicit knowledge representation mechanisms is that it can be read, interpreted, and edited by humans. Noise and errors can be detected and corrected, especially in an automatically acquired ontology, in order to obtain a better ontology. Another advantage of allowing human editing is that the resulting ontology can be shared by various applications, for example from a QA system to a knowledge management system. In contrast, the implicit knowledge represented in LSI or other representations is difficult to port from one application to another.</Paragraph> <Paragraph position="6"> In this section, we show how the human-editing feature improves news categorization. First, we identify a common error type: ambiguity. Then, depending on the degree of categorization ambiguity, the system can report possible errors in certain concepts of the domain ontology to a human editor as clues.</Paragraph> <Paragraph position="7"> Consider the following common error type: event structure ambiguity. Some event structures are located in several domains due to noise in the training data. We define two formulas to find such event structures. The ambiguity of an event structure E, Ambiguity(E), is proportional to the number of domains in which it appears and inversely proportional to its event score, where S_1, ..., S_n are the sentences that fire event E.</Paragraph> <Paragraph position="12"> We also measure the similarity between every two event structures by calculating their co-occurrence multiplied by the global categorization ambiguity factor. When this similarity exceeds a threshold, the system suggests that the human editor refine the ontology.</Paragraph> </Section> </Section> </Paper>