File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/abstr/98/j98-1003_abstr.xml

Size: 31,311 bytes

Last Modified: 2025-10-06 13:49:08

<?xml version="1.0" standalone="yes"?>
<Paper uid="J98-1003">
  <Title>Topical Clustering of MRD Senses Based on Information Retrieval Techniques</Title>
  <Section position="2" start_page="0" end_page="72" type="abstr">
    <SectionTitle>
1. Introduction
</SectionTitle>
    <Paragraph position="0"> Word sense disambiguation (WSD) has been found useful in many natural language processing (NLP) applications, including information retrieval (Krovetz and Croft 1992; McRoy 1992), machine translation (Brown et al. 1991; Dagan, Itai, and Schwall 1991; Dagan and Itai 1994), and speech synthesis (Yarowsky 1992). WSD has received increasing attention in recent literature on computational linguistics (Lesk 1986; Schi.itze 1992; Gale, Church, and Yarowsky 1992; Yarowsky 1992, 1995; Bruce and Wiebe 1995; Luk 1995; Ng and Lee 1996; Chang et al. 1996). Given a polysemous word in running text, the task of WSD involves examining contextual information to determine the intended sense from a set of predetermined candidates. It is a nontrivial task to divide the senses of a word and determine this set, for word sense is an abstract concept frequently based on subjective and subtle distinctions in topic, register, dialect, collocation, part of speech, and valency (McRoy 1992). Various approaches to word sense division have been proposed in the literature on WSD, including (1) sense numbers in every-day dictionaries (Lesk 1986; Cowie, Guthrie, and Guthrie 1992), (2) automatic or hand-crafted clusters of dictionary senses (Dolan 1994; Bruce and Wiebe 1995; Luk * Department of Computer Science, National Tsing Hua University, Hsinchu 30043, Taiwan, ROC. E-mail: dr818314@cs.nthu.edu.tw; jschang@cs.nthu.edu.tw. (~ 1998 Association for Computational Linguistics Computational Linguistics Volume 24, Number 1 1995), (3) thesaurus categories (Yarowsky 1992; Chen and Chang 1994), (4) translation in another language (Gale, Church, and Yarowsky 1992; Dagan, Itai, and Schwall 1991; Dagan and Itai 1994), (5) automatically induced clusters with sublexical representation (Schiitze 1992), and (6) hand-crafted lexicons (McRoy 1992).</Paragraph>
    <Paragraph position="1"> This paper is motivated by the observation that directly using dictionary senses for sense division offers several advantages. Sense distinction according to a dictionary is readily available from machine-readable dictionaries (MRDs) such as the Longman Dictionary of Contemporary English (LDOCE) (Proctor 1978). A dictionary such as the LDOCE has broad coverage of word senses, useful for WSD. Furthermore, indicative words and concepts for each sense are directly available in numbered definitions and examples. Lesk (1986) describes the first MRD-based WSD method that relies on the extent of overlap between words in a dictionary definition and words in the local context of the word to be disambiguated. The author reports that WSD performance ranges from 50% to 70% and his method works well for senses strongly associated with specific collocations, such as ice-cream cone and pine cone.</Paragraph>
    <Paragraph position="2"> Unfortunately, using MRDs as the knowledge source for sense division and disambiguation leads to some problems. Zernik (1992) notes that the dictionary dichotomy of senses is inadequate for WSD, because it is defined along grammatical, not semantic, lines. Furthermore, as pointed out in Dolan (1994), the sense division in an MRD is frequently too fine-grained for the purpose of WSD. A WSD system based on dictionary senses often faces unnecessary and difficult &amp;quot;forced-choices.&amp;quot; Dolan proposes a heuristic algorithm for forming unlabeled clusters of closely related senses in the LDOCE to eliminate distinctions that are unnecessarily fine for WSD. Regrettably, the proposed algorithm was only described in a few examples and was not developed further. Lacking an automatic method, recent WSD works (Bruce and Wiebe 1995; Luk 1995; Yarowsky 1995) still resort to human intervention to identify and group closely related senses in an MRD.</Paragraph>
    <Paragraph position="3"> Using thesaurus categories directly as a coarse sense division may seem to be a viable alternative (Yarowsky 1995). However, typical thesauri, such as Roget's Thesaurus (1987), suffer sense gaps and, occasionally, are too fine-grained. Yarowsky (1992) reports that there are uses not listed in Roget's for 3 of 12 nouns in his WSD study, while uses which a native speaker might consider as a single sense are often encoded in several Roget's categories.</Paragraph>
    <Paragraph position="4"> As an alternative approach to word sense division, this paper presents an algorithm capable of automatically clustering senses in an MRD based on topical information in a thesaurus. We refer to the algorithm as TopSense (Topical clustering of Senses). The current implementation of TopSense uses the topical information in the Longman Lexicon of Contemporary English (LLOCE) (McArthur 1992) to cluster LDOCE senses. The method makes use of none of the idiosyncratic information in either the LLOCE or the LDOCE. Therefore, the TopSense algorithm is quite general and is expected to produce comparable results for other MRDs and thesauri. TopSense is tested on 20 words extensively investigated in recent WSD literature (Schi~tze 1992; Yarowsky 1992; Luk 1995). According to the experimental results, the automatically derived topical clusters can be used to good effect without any human intervention as a coarse sense division for WSD.</Paragraph>
    <Paragraph position="5"> The rest of the paper is organized as follows. Section 2 starts out with a description of the MRDs and thesauri used in the computational lexicography and WSD literature, followed by some observations to justify the topic-based approach to word sense division. Section 3 describes the LinkSense algorithm for linking senses between an MRD and a thesaurus. Section 4 shows how the TopSense algorithm based on the IR model may be used to cluster the senses in an MRD. Examples are given in both  Chen and Chang Topical Clustering Sections 3 and 4 to illustrate how the algorithms work. Section 4 also describes an implementation of the algorithms for the LDOCE and the LLOCE and reports the evaluation results for both algorithms based on a 20-word test set. Section 5 analyzes the experimental results to demonstrate the strengths and limitations of the method.</Paragraph>
    <Paragraph position="6"> The implication of TopSense to WSD and other issues related to lexical semantics are also touched upon. Section 6 compares the proposed method with other approaches in the computational linguistics literature. Finally, conclusions are made and directions for further research are pointed out in Section 7.</Paragraph>
    <Paragraph position="7"> 2. Word Senses in Machine-Readable Dictionaries and Thesauri In this section, we look at two knowledge sources of word sense division, which are currently widely available, namely, the dictionary and the thesaurus. A good-sized dictionary usually has a large vocabulary and broad coverage of word senses, both of which are useful for WSD. However, a dictionary's division of senses for a given word is often too fine for the task of WSD. On the other hand, a thesaurus organizes word senses into a fixed set of coarse semantic categories, making it more appropriate for WSD. However, thesauri tend to have a smaller vocabulary and a narrower coverage of word senses. To get the best of both worlds of dictionary and thesaurus, we propose to cluster MRD definitions to yield a broad-coverage sense division with the granularity of a thesaurus. Therefore, a short description of MRDs and thesauri is in order.</Paragraph>
    <Section position="1" start_page="62" end_page="62" type="sub_section">
      <SectionTitle>
2.1 Fine-Grained Senses in an MRD
</SectionTitle>
      <Paragraph position="0"> Interest in MRD-based research has increased over the years; in particular, the LDOCE and Webster's Seventh New Collegiate Dictionary (W7) (1967) have drawn much attention.</Paragraph>
      <Paragraph position="1"> Much of the MRD-based research has focused on the analysis and exploitation of the sense definitions in MRDs (Amsler 1984a, 1984b, 1987; Alshawi 1987; Alshawi, Boguraev, and Carter 1989; Vossen, Meijs, and denBroeder 1989). In these works, the definitions are analyzed using either a parser (Montemagni and Vanderwende 1992) or a pattern matcher (Ahlswede and Evens 1988) into semantic relations. These relations are then used for various tasks, ranging from the interpretation of a noun sequence (Vanderwende 1994) or a prepositional phrase (Ravin 1990), to resolving structural ambiguity (Jenson and Binot 1987), to merging dictionary senses for WSD (Dolan 1994). Besides the definition itself, there is an abundance of information listed in a dictionary entry, including part of speech, subcategory, examples, collocations, and typical arguments, which is potentially useful for WSD. In this regard, the LDOCE is particularly appropriate since it uses a reduced, controlled vocabulary of some 2,000 words to define over 60,000 word senses representing a comprehensive vocabulary and broad coverage of word senses.</Paragraph>
      <Paragraph position="2"> It is arguable that the dictionary division of senses for a given word is too finegrained, thus inadequate for WSD. For instance, it might not always be necessary or easy to distinguish between two LDOCE senses bank.l.n.1 (river bank) and bank.l.n.5 (sandbank) shown in Table 1. Hence, dictionary senses can be used to good effect for WSD only if such closely related senses are merged and treated as one. There is more than one way to merge dictionary senses. In the following sections, we describe one such approach, under which MRD senses are merged according to the sense granularity of a typical thesaurus.</Paragraph>
    </Section>
    <Section position="2" start_page="62" end_page="66" type="sub_section">
      <SectionTitle>
2.2 Coarse Senses in Thesauri: WordNet, Roget's, and LLOCE
</SectionTitle>
      <Paragraph position="0"> One of the most potentially valuable aspects of the thesaurus, as a knowledge source for word sense division, is the organization of word senses into a limited number of  land along the side of a river, lake, etc.</Paragraph>
      <Paragraph position="1"> earth which is heaped up in a field or garden, often making a border or division. a mass of snow, clouds, mud, etc.</Paragraph>
      <Paragraph position="2"> a slope made at bends in a road or race-track, so that they are safer for cars to go round.</Paragraph>
      <Paragraph position="3"> = SANDBANK. (a high underwater bank of sand in a river, harbour, etc.) (of a car or aircraft) to move with one side higher than the other, esp. when making a turn a row, esp. of OARs in an ancient boat or KEYs on a TYPEWRITER. a place in which money is kept and paid out on demand, and where related activities go on.</Paragraph>
      <Paragraph position="4"> (usu. in comb.) a place where something is held ready for use, esp. ORGANIC products of human origin for medical use.</Paragraph>
      <Paragraph position="5"> (a person who keeps) a supply of money or pieces for payment or use in a game of chance.</Paragraph>
      <Paragraph position="6"> to put or keep (money) in a bank.</Paragraph>
      <Paragraph position="7"> \[esp. with\] to keep one's money (esp. in the stated bank)  D 447-594 Intellect: the exercise of the mind E 595-816 Volition: the exercise of the will F 817-990 Emotion, religion and morality coarse semantic categories. We briefly describe the on-line thesauri, WordNet (Miller et al. 1993), Roget's Thesaurus, and LLOCE, which have been used as word sense divisions in the computational linguistics literature. WordNet is organized as a set of hierarchical, conceptual taxonomies of nouns, verbs, adjectives, and adverbs called synsets. The synsets are too fine-grained from the WSD perspective; WordNet contains 24,825 noun synsets for 32,264 distinct nouns with a total of 43,136 senses in its noun taxonomy alone. It would be difficult to acquire WSD knowledge for making such fine distinctions even from a substantial body of training materials. Roget's Thesaurus arranges words in a three-layer hierarchy and organizes over 30,000 distinct words into some 1,000 categories on the bottom layer. These categories are divided into 39 middle-layer sections that are further organized as 6 top-layer classes. Each category is given a three-digit reference code. To make the hierarchical structure explicit, an uppercase letter from A to F is added to the reference code to denote the top-layer class for each category, as indicated in Table 2. Similarly, the middle layer is denoted with a lowercase reference letter. The sections related to class B (Space) are shown in Table 3. Therefore, the reference code for each category is denoted by an uppercase class letter, a lowercase section letter, and a three-digit category number.  categorization scheme* For instance, the word bank listed under Category 209 in Roget's will be prefixed an additional letter B to denote the class Space, plus a lowercase letter b to denote the section Dimensions; the reference code 209 is replaced with Bb209. Figure 1 shows the information for the word bank in Roget's.</Paragraph>
      <Paragraph position="8"> WordNet and Roget's to a lesser degree present word senses that are too fine-grained for WSD. Often, uses that a native speaker might consider as a single sense are encoded in several Roget's categories or WordNet synsets. For instance, a single LDOCE sense bank.4.n.1 shown in Table 1 corresponds to two WordNet synsets Depository financial institution and Bank building and two Roget's categories, 799 (Treasurer) and 784 (Lending). Similarly, the Roget's lists two categories 234 (Edge) and 344 (Land) for a concept treated as one word sense, bank.l.n.1 in the LDOCE. Table 4 provides further details.</Paragraph>
      <Paragraph position="9"> The LLOCE is a hierarchical thesaurus that organizes word senses primarily according to subject matter. The LLOCE contains over 23,000 different senses for some 15,000 distinct words. The coarser senses in LLOCE are organized into approximately 2,500 topical word sets. These sets are divided into 129 topics and these topics are further organized as fourteen subjects. The subjects are denoted with alphabetical reference letters from A to N (see Table 5). Thus the LLOCE subject, topic, and topical set constitutes a three-level hierarchy, in which each subject contains 7 to 12 topics and each topic contains 10 to 50 sets of related words. Table 6 displays the topics related</Paragraph>
      <Paragraph position="11"> Life and living things Body; its function and welfare People and family Buildings, houses, home, clothes, belongings, personal care Food, drink, and farming Feeling, emotions, attitudes, and sensations Thought, communication, language, and grammar Substance, materials, objects, and equipment Arts/Crafts, science/technology, industry/education Numbers, measurement, money, and commerce Entailment, sports and games Space and time Movement, location, travel, and transportation General and abstract terms to subject L (Space and time). Each topical set is given a three-digit reference code; however, this code does not explicitly reflect the topic. To make use of the information related to a topic, we have designated a lowercase letter to each topic. Therefore, each set is denoted by an uppercase &amp;quot;subject&amp;quot; letter, a lowercase &amp;quot;topic&amp;quot; letter, and a three-digit &amp;quot;topical set&amp;quot; number. For instance, the word bank listed under L99 in the LLOCE will be given an additional reference letter d to denote the topic Geography; the reference code L99 is replaced with Ld099. The LLOCE also provides cross-references between sets and topics to indicate various intersense relations not captured within the same topic. For instance, topic Ld (Geography) has a cross-reference to topic Me (Place). Figure 2 shows LLOCE's topical classification and cross-references related to the word bank.</Paragraph>
      <Paragraph position="12"> The LLOCE, and, to a lesser degree, Roget's, are based on coarse, topical semantic classes, making them more appropriate for WSD than the finer-grained WordNet synsets. The 129 topics in the LLOCE or 990 categories in Roget's appear to be suffi- null sun, moon, star, left, right, etc.</Paragraph>
      <Paragraph position="13"> light, dark, ray, color, white, black, etc.</Paragraph>
      <Paragraph position="14"> weather, sky, rain, snow, rain, ice, etc.</Paragraph>
      <Paragraph position="15"> stream, sea, lake, flood, to flow, etc.</Paragraph>
      <Paragraph position="16"> time, history, frequent, permanent, etc.</Paragraph>
      <Paragraph position="17"> start, stop, late, last, etc.</Paragraph>
      <Paragraph position="18"> ancient, modem, future, age, etc.</Paragraph>
      <Paragraph position="19"> day, night, second, minute, etc.</Paragraph>
      <Paragraph position="20"> now, soon, always, ever, after, etc.</Paragraph>
      <Paragraph position="21"> &amp;quot;L9 &gt; O,o , .... ..... 1/2- ......</Paragraph>
      <Paragraph position="22"> ... bank ...... bank ..</Paragraph>
      <Paragraph position="24"> LLOCE's topical organization of word sense.</Paragraph>
      <Paragraph position="25"> cient for representing the distinction we would want to make for the task of WSD. Roget's has been used as the sense division in two recent WSD works (Yarowsky 1992; Luk 1995) more or less as is, except for a small number of senses added to fill gaps. We contend that a sense division based on the LLOCE topics will offer more or less the same kind of granularity, suitable for WSD. For instance, in Yarowsky (1992), the senses of star are divided into three Roget's categories, which roughly correspond to five LDOCE star senses labeled with LLOCE topics. In the same study, six Roget's categories are sufficient to distinguish the senses of slug. These six categories correspond to five relevant LLOCE topics. Table 7 provides further details.</Paragraph>
    </Section>
    <Section position="3" start_page="66" end_page="68" type="sub_section">
      <SectionTitle>
2.3 Combining Word Sense Information from an MRD and a Thesaurus
</SectionTitle>
      <Paragraph position="0"> It should be clear by now that combining a dictionary and a thesaurus leads to a broad-coverage sense division with a suitable granularity for WSD. The obvious way to combine the two would be to disambiguate and link a sense definition D of a headword h in the dictionary to an entry relevant to D in the thesaurus. This amounts to a special case of WSD with respect to thesaurus senses. There is no simple solution  to the general WSD problem for unrestricted text, but we will show that this special case of disambiguating MRD definitions is significantly easier, for several reasons. First, the words used in a definition sentence are limited primarily to a small set; in the case of the LDOCE, the controlled vocabulary consists of some 2,000 words.</Paragraph>
      <Paragraph position="1"> For instance, in the first five LDOCE senses of bank shown in Table 1, all defining words are in the controlled vocabulary, except for the word SANDBANK, shown in capital letters. Obtaining WSD information for this small set of words obviously is much easier than it would be for a large, open set.</Paragraph>
      <Paragraph position="2"> Second, dictionary definitions adhere to rather rigid patterns under which only words with predictable semantic relations show up. A dictionary definition, in general, begins with a genus term (that is, conceptual ancestor of the sense), followed by a set of differentiae that are words semantically related to the sense to provide the specifics. The semantic relations between the sense, the genus, and differentiae are reflected in what are termed categorical, functional, and situational clusters in McRoy (1992).</Paragraph>
      <Paragraph position="3"> The semantic relations and clusters have been shown to be very effective knowledge sources for such NLP tasks as WSD (McRoy 1992) and interpretation of noun sequences (Vanderwende 1994). For instance, in the first four definitions of bank in Table 1, the genus terms land, earth, mass, and slope are categorically related to the respective bank senses. On the other hand, the differentiae river, lake, field, garden, bend, road, and race-track have a LocationOf situational relation with bank. Other differentiae, snow, cloud, and mud, are related functionally to bank.l.n.3 through the MakeOf relation.</Paragraph>
      <Paragraph position="4"> Third, for the most part these relations are captured implicitly in a typical thesaurus. The LLOCE and Roget's conveniently contain information on the relations in the form of word lists under a topic (category) or cross-referencing to other topics.</Paragraph>
      <Paragraph position="5"> Therefore, an MRD sense definition can be effectively disambiguated based on the word lists and cross-references in a thesaurus. A simple heuristic relying on the similarity between a sense's defining keywords and thesaurus word lists suffices to link an MRD sense to its relevant sense in the thesaurus. For instance, the differentiae (land, side, river, lake) of bank.l.n.1 is sufficiently similar to the word list of Ld-topic (Geography) to warrant the link between LDOCE sense bank.l.n.1 and LLOCE sense bank-Ld099.</Paragraph>
      <Paragraph position="6"> The topics and cross-references of LLOCE in general capture the Generic~Specific relation; therefore, a sense definition is often disambiguated through the genus. Thus, the task of linking MRD and thesaurus senses is closely related to the extraction and  Chen and Chang Topical Clustering disambiguation of the genus. For instance, in the above example, linking bank.l.n.1 to bank-Ld099 has, as a by-product, the disambiguation of the genus land to land-Ld084 (Geography) rather than land-Ce078 (Social organization in groups and place). Details of extraction and disambiguation of the genus can be found in previous works (Guthrie et al. 1990; Klavans, Chodorow, and Wacholder 1990; Copestake 1990; Ageno et al.</Paragraph>
      <Paragraph position="7"> 1992). Disambiguated genus and differentiae terms can be used to construct a better taxonomy of word senses.</Paragraph>
      <Paragraph position="8"> Since the dictionary usually has broader coverage of word senses than the thesaurus, not all MRD senses of a headword h correspond to one of h's predefined senses in the thesaurus. For instance, LDOCE sense bank.l.n.3 (a mass of cloud, snow, or mud, etc.) corresponds to LLOCE topic Hb (Object generally) rather than any of the predefined LLOCE senses for bank. Therefore, such entries represent sense gaps in the thesaurus and should be left unlinked. Nevertheless, the linked entries are enough training material for topical clustering of MRD senses, as described in Section 4.</Paragraph>
      <Paragraph position="9"> 3. Linking an MRD to a Thesaurus This section describes how to establish a link between an MRD sense and its relevant word sense in a thesaurus, if such a link exists. We start with the preprocessing steps for the sense definition, which are necessary for the algorithm to obtain good results. Then we describe the linking algorithm step by step. Finally, we show illustrative examples to give some idea how the proposed algorithm works for the LLOCE and Roget's.</Paragraph>
    </Section>
    <Section position="4" start_page="68" end_page="70" type="sub_section">
      <SectionTitle>
3.1 Preprocessing Steps
</SectionTitle>
      <Paragraph position="0"> Although only simple words are usually used in sense definitions, most of these words are also highly ambiguous. For instance, the two instances of lies listed in the two following LDOCE sense definitions differ in meaning: couch.2.n.2 a bed-like piece of furniture on which a person lies when being examined by a doctor.</Paragraph>
      <Paragraph position="1"> lie detector an instrument that is supposed to show when a person is telling lies. Notably, their parts-of-speech are also different. Determining the part of speech of each instance allows us to limit the range of possible meanings. The first instance of lies is a verb that means &amp;quot;to be in a flat resting position&amp;quot; or &amp;quot;to tell a lie.&amp;quot; On the other hand, the second instance is a nominal with a unique meaning &amp;quot;a false statement purposely made to deceive.&amp;quot; By tagging the definition with part-of-speech information, the degree of sense ambiguity in the definition can be reduced, thereby increasing the chance of successful linking.</Paragraph>
      <Paragraph position="2"> Part-of-Speech Tagging. Various methods for POS tagging have been proposed in recent years. For simplicity, we adopted the method proposed by Church (1988) to tag definition sentences. Experiments indicated an average error rate for tagging of less than 10%. Tagging errors have limited negative impact, because words in the LLOCE are organized primarily according to topic, not part of speech. The POS information is used to remove function words, as well as to look up words in the LLOCE with matching POS. The part-of-speech preprocessing phase is mandatory for the algorithm to exclude some inappropriate candidates for topics. See Table 8 for some examples of tagged LDOCE definition sentences.</Paragraph>
      <Paragraph position="3">  land/n side/n river/n lake/n earth/n heap/v field/n garden/n border/n division/n mass/n snow/n clouds/n mud/n slope/n bend/n road/n race-track/n cars/n Removal of Stopwords. In general, function words in the definition are only marginally relevant to the meaning being defined. This is also true of words used in many definitions. For this reason, IR systems commonly exclude stopwords from the process of indexing and query. This also applies to our situation of retrieving topics relevant to the meaning of a sense based on the words in its definition. The list of all the stop-words is specifically designed to remove pronouns, determiners, prepositions, and conjunctions. Table 9 shows that the meaning of some definitions of bank is found to be quite intact, even after stopwords are removed.</Paragraph>
      <Paragraph position="4"> Calculating Similarity between Definition and Thesaurus Class. When viewing the definition of a headword h as a set of words, it becomes easy to compare and measure their similarity to thesaurus word classes containing h. By word classes, we mean any supersets of synonym sets in a thesaurus that capture the semantic relations and semantic clusters that are effective for disambiguation as described in Section 2.3. The word classes are so chosen that they contain enough words to overlap with the sense definition in question. But each class should not be so big as to cover more than one thesaurus sense for h, blurring the distinction we want to make in the first place. Topics in the LLOCE and categories or sections in Roget's are good choices for such classes. Similarity between the defining keywords and a class of words reflects how closely the definition is related to the class. As a simple heuristic, the intended meaning of a dictionary definition D for h is disambiguated in favor of a relevant sense T for h in a thesaurus class C with the highest similarity to D. When such a sense T is found, we say that the dictionary sense D is linked to the thesaurus sense T or that D is linked to the thesaurus class C (containing T.) For a headword h, let DEFh denote the definitions of h and let CLASSh be the word classes in a thesaurus that contain h. For a definition D E DEFh, our problem amounts to finding C E CLASSh that is relevant to D. With these terms, the unweighted Dice coefficient can be adopted to measure similarity between a definition D and a class C as follows: Sim(D, C) = ZdEKEYD 2 X W d X In(d, C) \]KEYD\[ + \[C\[ ' where KEYD = the set of words in definition D E DEFh, \[KEYD\[ = number of words in  Chen and Chang Topical Clustering 1 KEYD, C E CLASSh -= a relevant class to h in the thesaurus, w k ~- degree of ambiguity of k' In(a, B) ---- 1, when a E B, and In(a, B) = 0, when a C/ B.</Paragraph>
      <Paragraph position="5"> The above similarity measure may be improved by taking into consideration specific features of a particular thesaurus. For instance, the cross-reference features in the LLOCE or the intersense relations in Roget's are very effective in reflecting semantic relatedness; thus, they should be included in this similarity measure. Let REFc represent the cross-referenced classes for the word class C in the thesaurus. Thus, we have</Paragraph>
      <Paragraph position="7"> in REFc.</Paragraph>
    </Section>
    <Section position="5" start_page="70" end_page="70" type="sub_section">
      <SectionTitle>
3.2 The LinkSense Algorithm
</SectionTitle>
      <Paragraph position="0"> We sum up the above description and outline the procedure for labeling senses on a dictionary entry as follows:</Paragraph>
    </Section>
    <Section position="6" start_page="70" end_page="72" type="sub_section">
      <SectionTitle>
3.3 Illustrative Examples: Linking LDOCE to LLOCE and Roget's
</SectionTitle>
      <Paragraph position="0"> Two examples are given in this subsection to illustrate how LinkSense works to establish linkage between a typical dictionary and thesaurus. Example 1 shows, step by step, how LinkSense links up an LDOCE sense, interest.l.n.2 (a share in a company business etc.) with the relevant LLOCE sense interest-Je (Banking). 2 Example 2 is intended to show that LinkSense is quite general and applies to thesauri other than the LLOCE.</Paragraph>
      <Paragraph position="1"> The same LDOCE senses will be shown to links to a relevant Roget's sense interest-Ei (Possessive relation).</Paragraph>
      <Paragraph position="2">  Computational Linguistics Volume 24, Number 1</Paragraph>
      <Paragraph position="4"> All three keywords appear in three different topics but only the following classes are relevant to ccaSSinterest: De (share), Je (share), Cc (company) Thus, we have  Sim'(D, Fj) = O, Sim'(D, Ka) = 0.</Paragraph>
      <Paragraph position="5"> = 0.133, Step 6: The LDOCE sense, interest.l.n.3 is linked to the LLOCE sense, interest-Je. Example 2 Linking an LDOCE sense interest.l.n.3 to its relevant Roget's sense. Step 1-3: The first three steps are independent of the thesaurus used, therefore  the same results as in Example 1 should be obtained.</Paragraph>
    </Section>
    <Section position="7" start_page="72" end_page="72" type="sub_section">
      <SectionTitle>
3.4 Performance Evaluation of LinkSense
</SectionTitle>
      <Paragraph position="0"> An experiment involving the LDOCE and the LLOCE was carried out to assess the effectiveness of the LinkSense algorithm (see Table 10). To evaluate the performance of algorithms, we define the ratios of applicability A and precision P as follows: #(all labeled definitions) A =  Nearly half of the nominal LDOCE senses for a set of highly polysemous words are linked to their relevant LLOCE sense and topics, with a surprisingly high precision rate of 93%. For the other hail LinkSense does not find sufficiently high similarity to warrant a link. That is due primarily (approximately two-thirds) to sense gaps in the LLOCE, rather than inconsistency among the LDOCE definitions.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>