<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1164"> <Title>Automated Alignment and Extraction of Bilingual Ontology for Cross-Language Domain-Specific Applications</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 2 Methodologies </SectionTitle> <Paragraph position="0"> Figure 1 shows the block diagram for ontology construction. There are two major processes in the proposed system: bilingual ontology alignment and domain ontology extraction.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Bilingual Ontology Alignment </SectionTitle> <Paragraph position="0"> In this approach, the cross-lingual ontology is constructed by aligning the words in WordNet to their corresponding words in HowNet.</Paragraph> <Paragraph position="1"> The hierarchical taxonomy is essentially a conversion of HowNet. One important aspect of HowNet is its methodology for defining lexical entries. In HowNet, each lexical entry is defined as a combination of one or more primary features and a sequence of secondary features. The primary features indicate the entry's category, namely the &quot;is-a&quot; relation, which forms a hierarchical taxonomy. Based on the category, the secondary features make the entry's sense more explicit, but they are non-taxonomic. In total, 1,521 primary features are divided into six upper categories: Event, Entity, Attribute, Attribute Value, Quantity, and Quantity Value. These primary features are organized into a hierarchical taxonomy.</Paragraph> <Paragraph position="2"> First, the Sinorama (Sinorama 2001) database is adopted as the bilingual parallel corpus for estimating the conditional probabilities of the words in WordNet given the words in HowNet. Second, a bottom-up algorithm is used for relation mapping.</Paragraph> <Paragraph position="3"> In WordNet, a word may be associated with many synsets, each corresponding to a different sense of the word. To find a relation between two different words, all the synsets associated with each word are considered (Fellbaum 1998). In HowNet, each word is composed of primary features and secondary features; the primary features indicate the word's category. The purpose of this approach is to increase the coverage of relational and structural information by aligning the two language-dependent ontologies, WordNet and HowNet, through their semantic features.</Paragraph> <Paragraph position="4"> Figure 1: Ontology construction framework</Paragraph> <Paragraph position="5"> The relation &quot;is-a&quot; defined in WordNet corresponds to the primary feature defined in HowNet. Equation (2) shows the mapping between the words in HowNet and the synsets in WordNet. Given a Chinese word $CW_i$, the probability that the word is related to a synset $synset_k$ can be obtained via its corresponding English synonyms $EW_{i,j}$:

$$\Pr(synset_k \mid CW_i) = \sum_{j} \Pr(synset_k \mid EW_{i,j}) \Pr(EW_{i,j} \mid CW_i) \quad (2)$$

where {entity, event, act, play} is the concept set in the root nodes of HowNet and WordNet. Finally, the Chinese concept $CW_i$ is aligned to $synset_k$ as long as the probability $\Pr(synset_k \mid CW_i)$ is greater than zero.</Paragraph>
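To make Equation (2) concrete, here is a minimal Python sketch of the alignment step. It assumes two precomputed probability tables with hypothetical names not taken from the paper: `translation_prob`, holding Pr(EW_ij | CW_i) as estimated from the Sinorama parallel corpus, and `synset_prob`, holding Pr(synset_k | EW_ij), e.g. spread uniformly over the synsets WordNet lists for each English word.

```python
from collections import defaultdict

# Minimal sketch of the alignment step in Equation (2).
# Assumed, hypothetical inputs (not from the paper):
#   translation_prob[cw][ew] -- Pr(EW_ij | CW_i), estimated from the
#                               Sinorama parallel corpus
#   synset_prob[ew][synset]  -- Pr(synset_k | EW_ij), e.g. spread
#                               uniformly over the WordNet synsets of ew

def align_word(cw, translation_prob, synset_prob):
    """Return Pr(synset_k | CW_i) for every synset reachable through
    the English translations of the Chinese word `cw`."""
    scores = defaultdict(float)
    for ew, p_trans in translation_prob.get(cw, {}).items():
        for synset, p_synset in synset_prob.get(ew, {}).items():
            # Equation (2): sum over the English synonyms EW_ij
            scores[synset] += p_synset * p_trans
    # Align the Chinese concept to every synset whose probability
    # is greater than zero.
    return {s: p for s, p in scores.items() if p > 0.0}

# Toy usage with made-up numbers:
translation_prob = {"糖尿病": {"diabetes": 0.9, "diabetes_mellitus": 0.1}}
synset_prob = {"diabetes": {"diabetes.n.01": 1.0},
               "diabetes_mellitus": {"diabetes.n.01": 1.0}}
print(align_word("糖尿病", translation_prob, synset_prob))
# -> {'diabetes.n.01': 1.0}
```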
<Paragraph position="6"> Figure 2(a) shows the concept tree generated by aligning WordNet and HowNet.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Domain Ontology Extraction </SectionTitle> <Paragraph position="0"> The domain ontology is constructed in two phases: 1) extract the ontology from the cross-language ontology using the island-driven algorithm, and 2) integrate the terms and axioms defined in a medical encyclopaedia into the domain ontology.</Paragraph> <Paragraph position="1"> 2.2.1 Extraction by the island-driven algorithm
An ontology provides the consistent concepts and world representations necessary for clear communication within a knowledge domain. Even in domain-specific applications, the vocabulary can be expected to be large. Synonym pruning is an effective alternative to word sense disambiguation.</Paragraph> <Paragraph position="2"> This paper proposes a corpus-based statistical approach to extracting the domain ontology. The steps are as follows:
Step 1 Linearization: This step decomposes the tree structure of the universal ontology shown in Figure 2(a) into a vertex list, i.e., an ordered node sequence starting at the leaf nodes and ending at the root node.</Paragraph> <Paragraph position="3"> Step 2 Concept extraction from the corpus: A node is defined as an operative node when the Tf-idf value of its word $W_i$ in the domain corpus is higher than that in the corresponding contrastive (out-of-domain) corpus, that is, when $\mathrm{Tfidf}_D(W_i) > \mathrm{Tfidf}_C(W_i)$, where $\mathrm{Tfidf}_D(W_i) = tf_{i,D} \cdot \log(N_D / n_{i,D})$ and $\mathrm{Tfidf}_C(W_i) = tf_{i,C} \cdot \log(N_C / n_{i,C})$. Here $tf_{i,D}$ and $tf_{i,C}$ are the frequencies of word $W_i$ in the domain documents and the contrastive (out-of-domain) documents, respectively, and $N_D$ ($N_C$) and $n_{i,D}$ ($n_{i,C}$) are the total number of documents and the number of documents containing $W_i$ in the domain (contrastive) collection, respectively. The nodes with bold circles in Figure 2(a) represent the operative nodes.</Paragraph> <Paragraph position="4"> Step 3 Relation expansion using the island-driven algorithm: Some domain concepts remain inoperative after the previous steps because of data sparseness. From observations during ontology construction, most inoperative concept nodes have operative hypernym and hyponym nodes. Therefore, the island-driven algorithm is adopted to activate these inoperative concept nodes when their ancestors and descendants are all operative. The nodes with gray backgrounds in Figure 2(a) are the activated operative nodes.</Paragraph> <Paragraph position="5"> Step 4 Domain ontology extraction: The final step merges the linear vertex lists back into a hierarchical tree. However, some noisy concepts that do not belong to the domain are operative; these noisy concept nodes should be filtered out. Finally, the domain ontology is extracted, and the result is shown in Figure 2(b).</Paragraph> <Paragraph position="6"> After the above steps, a dummy node is added as the root node of the domain concept tree.</Paragraph>
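As a concrete reading of Steps 2 and 3, the following Python sketch assumes each vertex list is a leaf-to-root sequence of words and that per-corpus counts have already been collected; all names and data structures are hypothetical, not from the paper.

```python
import math

# Hedged sketch of Steps 2 and 3, under assumed data structures.
# A corpus is summarized as
#   {"tf": {word: term frequency},
#    "df": {word: number of documents containing word},
#    "n_docs": total number of documents}.

def tfidf(word, corpus):
    """Tf-idf of `word` in `corpus`: tf * log(N / n); 0.0 if unseen."""
    n = corpus["df"].get(word, 0)
    if n == 0:
        return 0.0
    return corpus["tf"].get(word, 0) * math.log(corpus["n_docs"] / n)

def operative_nodes(vertex_list, domain, contrastive):
    """Step 2: a node is operative when its domain Tf-idf exceeds its
    contrastive (out-of-domain) Tf-idf."""
    return {w for w in vertex_list
            if tfidf(w, domain) > tfidf(w, contrastive)}

def island_driven_expand(vertex_list, operative):
    """Step 3 (simplified): activate an inoperative node when its
    immediate descendant and ancestor in the leaf-to-root vertex list
    are both operative. The paper requires all ancestors and
    descendants to be operative; this single-pass neighbour check is
    a simplification."""
    activated = set(operative)
    for i in range(1, len(vertex_list) - 1):
        word = vertex_list[i]
        if (word not in activated
                and vertex_list[i - 1] in activated
                and vertex_list[i + 1] in activated):
            activated.add(word)
    return activated
```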
<Paragraph position="7"> In practice, domain-specific terminologies and axioms should be derived and introduced into the ontology for domain-specific applications. There are two approaches to adding them: manual editing by ontology engineers, or extraction from a domain encyclopaedia.</Paragraph> <Paragraph position="8"> For the medical domain, we obtained 1,213 axioms from a medical encyclopaedia, covering terminologies related to diseases, syndromes, and clinical information. Figure 3 shows an example of such an axiom. In this example, the disease &quot;diabetes&quot; is tagged with level &quot;A&quot;, which indicates that the disease is frequent in occurrence, and the degrees attached to the corresponding syndromes represent the causality between the disease and the syndromes. Each axiom also provides two fields, &quot;department of the clinical care&quot; and &quot;the category of the disease&quot;, for medical information retrieval and other medical applications.</Paragraph> <Paragraph position="9"> Figure 3: One example of the axioms</Paragraph> </Section> </Section> </Paper>
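For illustration, the axiom of Figure 3 could be encoded as a record like the following Python sketch. The paper does not specify a storage format, so the field names are hypothetical, and all values other than the disease name &quot;diabetes&quot; and its level &quot;A&quot; are invented placeholders.

```python
from dataclasses import dataclass, field

@dataclass
class DiseaseAxiom:
    disease: str
    level: str                  # occurrence tag; "A" marks a frequent disease
    syndromes: dict[str, float] = field(default_factory=dict)
    department: str = ""        # "department of the clinical care" field
    category: str = ""          # "the category of the disease" field

# The Figure 3 example: "diabetes" is tagged level "A" (frequent).
# Syndrome degrees, department, and category below are invented
# placeholders -- the paper does not give the actual values.
diabetes_axiom = DiseaseAxiom(
    disease="diabetes",
    level="A",
    syndromes={"thirst": 0.8, "polyuria": 0.7},  # degree = causality strength
    department="endocrinology",
    category="metabolic disease",
)
```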