<?xml version="1.0" standalone="yes"?> <Paper uid="W01-1006"> <Title>Semi-Automatic Practical Ontology Construction by Using a Thesaurus, Computational Dictionaries, and Large Corpora</Title> <Section position="3" start_page="1" end_page="1" type="metho"> <SectionTitle> 2 Ontology Design </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 2.1 Basic Principles </SectionTitle> <Paragraph position="0"> Although no formal principles exist to determine the structure or content of our ontology, we can suggest some principles underlying our methodology. Firstly, an ontology for natural language processing (NLP) must provide concepts for representing word meanings in the lexicon and store selectional constraints of concepts, which enable inferences using the network of an ontology (Onyshkevych, 1997).</Paragraph> <Paragraph position="1"> These inferences can assist in metaphor and metonymy processing, as well as word sense disambiguation. For these reasons, an ontology becomes an essential knowledge source for high-quality NLP, although it is difficult and time-consuming to construct. Secondly, an ontology should be easily shared by any application and in any domain (Gruber, 1993; Karp et al., 1999; Kent, 1999). Two or more different ontologies in the same domain can produce a semantic mismatch problem between concepts. 
Further, if you wish to apply an existing ontology to a new application, it will often be necessary to convert the structure of the ontology to a new one.</Paragraph> <Paragraph position="2"> (The Sejong electronic dictionary has been developed by several Korean linguistics researchers, funded by the Ministry of Culture and Tourism, Republic of Korea: http://www.sejong.or.kr.)</Paragraph> <Paragraph position="3"> Thirdly, an ontology must support language-independent features, because constructing ontologies for each language is inefficient.</Paragraph> <Paragraph position="4"> Fourthly, an ontology must allow users to easily understand, search, and browse it.</Paragraph> <Paragraph position="5"> Therefore, we define a suitable ORL to support these principles.</Paragraph> </Section> <Section position="2" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 2.2 Ontology Representation Language </SectionTitle> <Paragraph position="0"> Many knowledge representation languages are built specifically to share knowledge among different knowledge representation systems.</Paragraph> <Paragraph position="1"> Five types of ORLs were reviewed: FRAMEKIT (Nirenburg et al., 1992), Ontolingua (Gruber, 1993), CycL (Lenat et al., 1990), XOL (Karp et al., 1999), and Ontology Markup Language (OML) (Kent, 1999).</Paragraph> <Paragraph position="2"> According to their semantics, FRAMEKIT and XOL adopt frame representation, CycL and Ontolingua use an extended first-order predicate calculus, and OML is based on conceptual graphs (CGs). Except for FRAMEKIT and CycL, these ORLs have not yet been applied to build any large-scale ontology.</Paragraph> <Paragraph position="3"> Among this variety of ORLs, we chose the simplified OML, which is based on Extensible Markup Language (XML) and CGs, as the ORL of our LIP ontology. 
Since XML has a well-established syntax, it is reasonably simple to parse, and it is widely used because many software tools exist for parsing and manipulating it, and because it offers a human-readable representation. We intend to leave room for improvement by adopting the semantics of CGs, because the present design of our LIP ontology serves the specific purpose of disambiguating word senses. In the future, however, we must extend its structure and content to build an interlingual meaning representation during semantic analysis in machine translation. Sowa's CGs (1984) are a widely used knowledge representation language, consisting of logic structures with a graph notation and several features integrated from semantic nets and frame representation. Globally, many research teams are working on the extension and application of CGs in many domains.</Paragraph> </Section> </Section> <Section position="4" start_page="1" end_page="3" type="metho"> <SectionTitle> 3 Ontology Construction </SectionTitle> <Paragraph position="0"> Many ontologies are developed for purely theoretical purposes, or are constructed as language-dependent computational resources, such as WordNet and EDR. However, they are seldom constructed as language-independent computational resources.</Paragraph> <Paragraph position="1"> To construct a language-independent and practical ontology, we developed two strategies. First, we introduced into the LIP ontology the same number and grain size of concepts as the Kadokawa thesaurus, along with its taxonomic hierarchy. The thesaurus has 1,110 Kadokawa semantic categories and a 4-level hierarchy as a taxonomic relation (see Figure 1). This approach is a moderate shortcut for constructing a practical ontology whose results can easily be utilized, since some resources are readily available, such as the bilingual dictionaries of COBALT-J/K and COBALT-K/J. In these bilingual dictionaries, nominal and verbal words are already annotated with concept codes from the Kadokawa thesaurus. 
By using the same sense inventories as these MT systems, we can easily apply and evaluate our LIP ontology without additional lexicographic work. In addition, the Kadokawa thesaurus has proven useful as a foundation for building lexical disambiguation knowledge in the COBALT-J/K and COBALT-K/J MT systems (Li et al., 2000).</Paragraph> <Paragraph position="2"> The second strategy for constructing a practical ontology is to extend the hierarchy of the Kadokawa thesaurus by inserting additional semantic relations into it. The additional semantic relations can be classified as case relations and other semantic relations. Thus far, case relations have occasionally been used to disambiguate lexical ambiguities in the form of valency information and case frames, but other semantic relations have not, because the problem of discriminating them from each other makes them difficult to recognize. We define a total of 30 semantic relation types for WSD by referring mainly to the Sejong electronic dictionary and the Mikrokosmos ontology (Mahesh, 1996), as shown in Table 1. These semantic relation types cannot express all possible semantic relations existing among concepts, but experimental results demonstrated their usefulness for WSD.</Paragraph> <Paragraph position="3"> Two approaches are used to obtain the additional semantic relations that will be inserted into the LIP ontology. The first imports relevant semantic information from existing computational dictionaries. The second applies a semi-automatic corpus analysis method (Li et al., 2000). The two approaches are explained in Sections 3.1 and 3.2, respectively.</Paragraph> <Paragraph position="4"> Figure 2 displays the overall construction flow of the LIP ontology. First, we build an initial LIP ontology by importing the existing Kadokawa thesaurus. Each concept inserted into the initial ontology has a Kadokawa code, a Korean name, an English name, a timestamp, and a concept definition. 
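As a concrete illustration of such a concept entry, the sketch below serializes one concept to XML with Python's standard library. The element names, the sample Kadokawa code, and the Yale-romanized Korean name are illustrative assumptions, not the actual LIP ontology schema.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

def make_concept(code, korean, english, definition):
    """Build one ontology concept entry (hypothetical element names)."""
    concept = ET.Element("Concept", code=code)
    ET.SubElement(concept, "KoreanName").text = korean
    ET.SubElement(concept, "EnglishName").text = english
    ET.SubElement(concept, "Timestamp").text = datetime.now(timezone.utc).isoformat()
    ET.SubElement(concept, "Definition").text = definition
    return concept

# "062" follows the Kadokawa code for "fish" in Figure 1; the Yale
# romanization of the Korean name is our guess, for illustration only.
entry = make_concept("062", "mwulkoki", "fish", "an aquatic vertebrate animal")
print(ET.tostring(entry, encoding="unicode"))
```

Because concepts are keyed by their Kadokawa code, the human-readable names stay purely decorative, as the text notes.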
Although concepts can be uniquely identified by their Kadokawa concept codes, their Korean and English names are inserted for the readability and convenience of the ontology developer. [Figure 1 shows the 4-level Kadokawa hierarchy: the root concept branches into nature (0), character (1), change (2), action (3), feeling (4), human (5), disposition (6), society (7), institute (8), and things (9), with finer categories such as animal (06) and fish (062) below them. Table 1 lists the semantic relation types, including has-member, has-element, contains, material-of, headed-by, operated-by, controls, owner-of, represents, symbol-of, name-of, producer-of, composer-of, inventor-of, make, and measured-in.]</Paragraph> <Section position="1" start_page="1" end_page="3" type="sub_section"> <SectionTitle> 3.1 Dictionary Resources Utilization </SectionTitle> <Paragraph position="0"> Case relations between concepts can be primarily derived from semantic information in the Sejong electronic dictionary and the bilingual dictionaries of the MT systems COBALT-J/K and COBALT-K/J.</Paragraph> <Paragraph position="1"> We obtained 7,526 case frames from the verb and adjective sub-dictionaries, which contain 3,848 entries. By automatically converting lexical words in the case frames into Kadokawa concept codes using COBALT-K/J (see Figure 3), we extracted a total of 6,224 case relation instances.</Paragraph> <Paragraph position="2"> The bilingual dictionaries, which contain 20,580 verb and adjective entries, have 16,567 instances of valency information. By semi-automatically converting syntactic relations into semantic relations using specific rules and human intuition (see Figure 4), we generated 15,956 case relation instances. 
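The rule-based conversion from syntactic relations to semantic relations might look roughly like the following sketch. The rules, concept-code prefixes, and relation names here are hypothetical stand-ins for the actual Figure 5 rules, which we do not have.

```python
# Hypothetical mapping rules: a (syntactic relation, governor
# concept-code prefix) pair selects a semantic relation.
RULES = [
    ("subject",  "3", "agent"),    # action-class governor -> agent
    ("object",   "3", "theme"),
    ("locative", "",  "location"), # default, any governor code
]

def to_semantic(syntactic_rel, governor_code):
    """Return the semantic relation of the first matching rule, else None."""
    for syn, prefix, sem in RULES:
        if syn == syntactic_rel and governor_code.startswith(prefix):
            return sem
    return None  # unmatched cases are left for human judgment

print(to_semantic("subject", "38"))  # a hypothetical action-class code
```

Instances that no rule covers fall through to `None`, mirroring the paper's semi-automatic design in which human intuition resolves the remainder.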
The specific rules, as shown in Figure 5, are inferred from training samples, which are explained in Section 4.1. The obtained instances may overlap each other, but each instance is inserted only once into the initial LIP ontology.</Paragraph> <Paragraph position="3"> The Sejong electronic dictionary has sub-dictionaries for nouns, verbs, pronouns, adverbs, and other categories. The Yale Romanization is used to represent Korean lexical words.</Paragraph> </Section> <Section position="2" start_page="3" end_page="3" type="sub_section"> <SectionTitle> 3.2 Corpus Analysis </SectionTitle> <Paragraph position="0"> For the automatic construction of a sense-tagged corpus, we used COBALT-J/K, a high-quality practical MT system developed by POSTECH in 1996. The entire system has been used successfully at POSCO (Pohang Iron and Steel Company), Korea, to translate patent materials on iron and steel subjects. We slightly modified COBALT-J/K so that it produces Korean translations from Japanese texts with all nominal and verbal words tagged with the specific concept codes of the Kadokawa thesaurus. As a result, a Korean sense-tagged corpus of 250,000 sentences can be obtained from Japanese texts. Unlike English, the Korean language has almost no syntactic constraints on word order as long as a verb appears in the final position. We therefore defined 12 local syntactic patterns (LSPs) using syntactically related words in a sentence. Frequently co-occurring words in a sentence may have no syntactic relation to a homograph but may still control its meaning.</Paragraph> <Paragraph position="1"> Such words are retrieved as unordered co-occurring words (UCWs). Case relations are obtained from LSPs, and co-occurrence information (CCI), which is composed of LSPs and UCWs, can be extracted by partial parsing and scanning. 
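UCW extraction by scanning can be sketched minimally as follows, under the simplifying assumptions that a sentence is a list of words and that every word co-occurring with the homograph in the sentence counts; the toy corpus is invented for illustration.

```python
from collections import Counter

def collect_ucw(sentences, homograph):
    """Count unordered co-occurring words (UCWs) for a homograph:
    every other word in a sentence containing it, regardless of position."""
    counts = Counter()
    for sent in sentences:
        words = sent.split()
        if homograph in words:
            counts.update(w for w in words if w != homograph)
    return counts

corpus = [  # toy sentences standing in for the sense-tagged corpus
    "pay bank money loan",
    "river bank water fish",
    "bank money account",
]
print(collect_ucw(corpus, "bank").most_common(2))
```

In the real pipeline the counts would be kept per sense tag, so each UCW's distribution over homograph senses can then be scored.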
To select the most probable concept types, Shannon's entropy model is adopted to define the noise of a concept type when discriminating a homograph. Even after this concept-type discrimination, many co-occurring concept types, which must be further selected, remain in each LSP and UCW.</Paragraph> <Paragraph position="2"> To solve this problem, some statistical processing was automatically applied (Li et al., 2000). Finally, manual processing was performed to generate the ontological relation instances from the generalized CCI, similar to the previous valency information. The results obtained include approximately 3,701 case relations and 1,650 other semantic relations from 9,245 CCI, along with their frequencies.</Paragraph> <Paragraph position="3"> The obtained instances are inserted into the initial LIP ontology. Table 2 shows the number of relation instances imported into the LIP ontology from the Kadokawa thesaurus, the computational dictionaries, and the corpus.</Paragraph> </Section> </Section> <Section position="5" start_page="3" end_page="3" type="metho"> <SectionTitle> 4 Ontology Application </SectionTitle> <Paragraph position="0"> The LIP ontology is applicable to many NLP applications. In this paper, we propose to use the ontology to disambiguate word senses. All approaches to WSD make use of the words in a sentence to mutually disambiguate each other.</Paragraph> <Paragraph position="1"> The distinctions between various approaches lie in the source and type of knowledge provided by the lexical units in a sentence.</Paragraph> <Paragraph position="2"> Our WSD approach is a hybrid method that combines the advantages of corpus-based and knowledge-based methods. We use the LIP ontology as an external knowledge source and previously secured dictionary information as context information. Figure 6 shows our overall WSD algorithm. 
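The entropy-based noise measure used for concept-type discrimination can be sketched as follows: for a given LSP or UCW pattern, the Shannon entropy of its sense distribution is low when the pattern selects one homograph sense and high when it co-occurs evenly with many senses. The frequency vectors are invented for illustration.

```python
import math

def noise(sense_freqs):
    """Shannon entropy of a pattern's sense distribution: a pattern that
    co-occurs evenly with many senses is a noisy (weak) discriminator."""
    total = sum(sense_freqs)
    probs = [f / total for f in sense_freqs if f > 0]
    return -sum(p * math.log2(p) for p in probs)

assert noise([10, 0, 0]) == 0.0         # pattern selects one sense: no noise
assert abs(noise([5, 5]) - 1.0) < 1e-9  # evenly split over two senses: 1 bit
```

Patterns whose noise exceeds a threshold would then be discarded before the statistical and manual filtering steps described above.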
First, we apply the previously secured dictionary information to select the correct senses of some ambiguous words with high precision, and then use the LIP ontology to disambiguate the remaining ambiguous words.</Paragraph> <Paragraph position="3"> The following are detailed descriptions of the procedure for applying the LIP ontology to WSD work.</Paragraph> <Section position="1" start_page="3" end_page="3" type="sub_section"> <SectionTitle> 4.1 Measure of Concept Association </SectionTitle> <Paragraph position="0"> To measure concept association, we use an association ratio based on the information-theoretic concept of mutual information (MI), which is a natural measure of the dependence between two variables (Church and Hanks, 1989). Resnik (1995) suggested a measure of semantic similarity in an IS-A taxonomy, based on the notion of information content. However, his method differs from ours in that we consider all semantic relations in the ontology, not taxonomic relations only. To implement this idea, we bind the source concept (SC) and the semantic relation (SR) into one entity, since SR is mainly influenced by SC, not by the destination concept (DC). Therefore, if the two entities <SC, SR> and DC have probabilities P(<SC, SR>) and P(DC), then their mutual information is defined as: I(<SC, SR>, DC) = log2 [ P(<SC, SR>, DC) / ( P(<SC, SR>) P(DC) ) ]. We extract training data in the form of <SC (governor), SR, DC (dependent), frequency> and calculate MI between the LIP ontology concepts. We performed a slight modification on COBALT-K/J and COBALT-J/K to enable them to produce sense-tagged valency information instances with the specific concept codes of the Kadokawa thesaurus. After producing the instances, we converted syntactic relations into semantic relations using the specific rules (see Figure 5) and human intuition. 
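The MI estimate from <SC, SR, DC, frequency> tuples can be sketched as follows; the concept codes, relation names, and counts are invented for illustration.

```python
import math

def mutual_info(freq, sc_sr, dc):
    """I(<SC,SR>, DC) = log2( P(<SC,SR>, DC) / (P(<SC,SR>) * P(DC)) ),
    with probabilities estimated from training-tuple frequencies."""
    n = sum(freq.values())
    p_joint = freq.get((sc_sr, dc), 0) / n
    p_left = sum(f for (left, _), f in freq.items() if left == sc_sr) / n
    p_right = sum(f for (_, right), f in freq.items() if right == dc) / n
    return math.log2(p_joint / (p_left * p_right))

# Toy counts over (<SC, SR>, DC) pairs; codes are illustrative only.
freq = {
    (("38", "agent"), "5"): 8,
    (("41", "theme"), "06"): 4,
}

print(round(mutual_info(freq, ("38", "agent"), "5"), 3))  # log2(1.5) = 0.585
```

Positive MI indicates that the <SC, SR> entity and DC co-occur more often than chance, i.e. that the selectional constraint is well attested.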
As a result, we extracted sufficient training data from two raw corpora: KIBS (Korean Information Base System, '94-'97), a large-scale Korean corpus of 70 million words, and a Japanese raw corpus of 810,000 sentences. During this process, more specific semantic relation instances are obtained than the instances obtained in Section 3. Since such specific instances reflect the context of a practical situation, they are also imported into the LIP ontology. Table 3 shows the final number of semantic relations inserted into the LIP ontology.</Paragraph> </Section> <Section position="2" start_page="3" end_page="3" type="sub_section"> <SectionTitle> 4.2 Locate the Least Weighted Path from One Ontology Concept to Another </SectionTitle> <Paragraph position="0"> If we regard MI as a weight between ontology concepts, we can treat the LIP ontology as a graph with weighted edges. All edge weights are non-negative: MI values are converted into penalties by the formula Pe below, where c indicates a constant, the maximum MI between concepts of the LIP ontology.</Paragraph> <Paragraph position="1"> Pe(<SC, SR>, DC) = c - I(<SC, SR>, DC). So we use the score function S to locate the least weighted path from one concept to the other, defined as the sum of the edge penalties along a path C1, R1, C2, R2, ..., Cn: S = Pe(<C1, R1>, C2) + ... + Pe(<Cn-1, Rn-1>, Cn). Here C and R indicate concepts and semantic relations, respectively. By applying this formula, we can verify how well selectional constraints between concepts are satisfied. In addition, if there is no direct semantic relation between two concepts, this formula provides a relaxation procedure, which enables it to approximate their semantic relation. This characteristic enables us to obtain hints toward resolving metaphor and metonymy expressions. For example, when there is no direct semantic relation between concepts such as &quot;school&quot; and &quot;inform,&quot; the inferring process is as follows. 
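A minimal sketch of this inference as a penalty-weighted shortest-path search over a toy graph; all MI values, the constant c, the relation names, and the edges are hypothetical.

```python
import heapq

C = 5.0  # assumed constant: maximum MI between concepts

def penalty(mi):
    """Convert MI to a non-negative edge penalty: Pe = c - I."""
    return C - mi

# Edges as (source, relation, destination) -> MI; a toy ontology fragment.
mi_edges = {
    ("school", "is-a", "facility"): 3.0,
    ("facility", "has-member", "social-human"): 4.0,
    ("social-human", "agent-of", "inform"): 4.5,
    ("school", "location-of", "inform"): 1.0,
}

def least_weighted_path(start, goal):
    """Dijkstra over penalty weights; returns (total penalty, concept path)."""
    graph = {}
    for (src, rel, dst), mi in mi_edges.items():
        graph.setdefault(src, []).append((dst, rel, penalty(mi)))
    heap = [(0.0, start, [start])]
    seen = set()
    while heap:
        cost, node, path = heapq.heappop(heap)
        if node == goal:
            return cost, path
        if node in seen:
            continue
        seen.add(node)
        for dst, rel, w in graph.get(node, []):
            heapq.heappush(heap, (cost + w, dst, path + [dst]))
    return float("inf"), []

print(least_weighted_path("school", "inform"))
```

With these toy weights, the indirect path through facility and social-human accumulates a lower total penalty (3.5) than the weakly attested direct edge (4.0), so the search prefers the multi-step inference.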
The concept &quot;school&quot; is a &quot;facility&quot;, and a &quot;facility&quot; has &quot;social human&quot; as its members. The concept &quot;inform&quot; has &quot;social human&quot; as its agent. Figure 8 presents an example of the best path between these concepts, which is shown with bold lines. To locate the best path, the search mechanism of our LIP ontology applies the following heuristics. Firstly, taxonomic relations must be treated differently from other semantic relations, because they inherently lack frequencies between parent and child concepts. So we experimentally assign a fixed weight to those edges.</Paragraph> <Paragraph position="2"> Secondly, the weight given to an edge is sensitive to the context of prior edges in the path. Therefore, our mechanism restricts the number of times that a particular relation can be traversed in one path. Thirdly, the mechanism avoids excessive changes in gradient.</Paragraph> </Section> </Section> </Paper>