<?xml version="1.0" standalone="yes"?> <Paper uid="W97-0807"> <Title>Integration of Hand-Crafted and Statistical Resources in Measuring Word Similarity</Title> <Section position="3" start_page="0" end_page="45" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> This paper proposes a new approach for word similarity measurement, as has been variously used in such NLP applications as smoothing \[Dagan et al., 1994; Grishman and Sterling, 1994\] and word clustering \[Charniak, 1993; Hindle, 1990; Pereira et al., 1993; Tokunaga et al., 1995\].</Paragraph> <Paragraph position="1"> Previous methods for word similarity measurement can be divided into two categories: statistics-based approaches and hand-crafted thesaurus-based approaches.</Paragraph> <Paragraph position="2"> In statistics-based approaches, and namely the &quot;vector space model&quot;, each word is generally represented by a vector consisting of co-occurrence statistics (such as frequency) with respect to other words \[Charniak, 1993\].</Paragraph> <Paragraph position="3"> The similarity between two given words is then computationally measured using two vectors representing those words. One typical implementation computes the relative similarity as the cosine of the angle between two vectors, a method which is also commonly used in information retrieval and text categorization systems to measure the similarity between documents \[Frankes and Baeza-Yates, 1992\]. Since it is based on mathematical methods, this type of similarity measurement has been popular. Besides this, since the similarity is computed based on given co-occurrence data, word similarity can easily be adjusted according to the domain. However, data sparseness is an inherent problem. This fact was observed in our preliminary experiment, despite using statistical information taken from news articles as many as 4 years. Furthermore, in this approach, vectors require O(N 2) memory space, given that N is the number of words, and therefore, large data sizes can prove prohibitive. Note that even if one statically stores possible word similarity combinations, O(N 2) space is required.</Paragraph> <Paragraph position="4"> The other category of word similarity approaches uses semantic resources, that is, hand-cra/ted thesauri (such as the Roget's thesaurus \[Chapman, 1984\] or Word-Net \[Miller et al., 1993\] in the case of English, and Bunruigoihyo \[National Language Research Institute, 1996\] or EDR \[EDR, 1995\] in the case of Japanese), based on the intuitively feasible assumption that words located near each other within the structure of a thesaurus have similar meaning. Therefore, the similarity between two given words is represented by the length of the path between them in the thesaurus structure \[Kurohashi and Nagao, 1994; Li et al., 1995; Uramoto, 1994\]. Unlike the former approach, the required memory space can be restricted to O(N) because only a list of semantic codes for each word is required. For example, the commonly used Japanese Bunrsigoihyo thesaurus \[National Language Research Institute, 1996\] represents each semantic code with only 8 digits. However, computationally speaking, the relation between the similarity (namely the semantic length of the path), and the physical length of the path is not clear 1. 
The other category of word similarity approaches uses semantic resources, that is, hand-crafted thesauri (such as Roget's thesaurus [Chapman, 1984] or WordNet [Miller et al., 1993] in the case of English, and Bunruigoihyo [National Language Research Institute, 1996] or EDR [EDR, 1995] in the case of Japanese), based on the intuitively feasible assumption that words located near each other within the structure of a thesaurus have similar meanings. The similarity between two given words is therefore represented by the length of the path between them in the thesaurus structure [Kurohashi and Nagao, 1994; Li et al., 1995; Uramoto, 1994]. Unlike the former approach, the required memory space can be restricted to O(N), because only a list of semantic codes for each word needs to be stored. For example, the commonly used Japanese Bunruigoihyo thesaurus [National Language Research Institute, 1996] represents each semantic code with only 8 digits. However, computationally speaking, the relation between the similarity (namely the semantic length of the path) and the physical length of the path is not clear.¹ Furthermore, since most thesauri aim at a general word hierarchy, the similarity between words used in specific domains (technical terms) cannot be measured to the desired level of accuracy.

¹ Most researchers heuristically define functions between the similarity and the physical path length [Kurohashi and Nagao, 1994; Li et al., 1995; Uramoto, 1994].

In this paper, we aim at integrating the advantages of the two methodological types above or, more precisely, at realizing statistics-based word similarity based on the length of the thesaurus path. The crucial concern in this process is how to determine the statistics-based length of each branch in a thesaurus. We tentatively use the Bunruigoihyo thesaurus, in which each word corresponds to a leaf in the tree structure. Let us take figure 1, which shows a fragment of the thesaurus. In this figure, the wi's denote words and the xi's denote the statistics-based length (SBL, for short) of each branch i. Let the statistics-based (vector space model) word similarity between w1 and w2 be vsm(w1, w2). We hope to estimate this similarity by the length of the path through branches 3 and 4, and thus derive the equation x3 + x4 = vsm(w1, w2). Intuitively speaking, any combination of x3 and x4 which satisfies this equation can constitute the SBLs for branches 3 and 4. Formalizing equations for other pairs of words in the same manner, we can derive the simultaneous equations shown in figure 2. That is, we can assign the SBL for each branch by solving for each xi (a small sketch of this computation appears at the end of this section). This method is expected to excel in the following aspects.

First, this method allows us to measure statistics-based word similarity while retaining the optimal required memory space, O(N). One may argue that statistics-based automatic thesaurus construction (for example, the method proposed by Tokunaga et al. [Tokunaga et al., 1995]) provides the same advantage, and without the human overhead. However, it has been empirically observed that the topology of such a structure (especially at its higher levels) is not necessarily reasonable when based solely on statistics [Frakes and Baeza-Yates, 1992]. To avoid this problem, we introduce hand-crafted thesauri into our framework, because their topology (for example, MAMMAL being a hyperclass of HUMAN) allows for higher levels of sophistication based on human knowledge.

Second, since each SBL reflects statistics taken from the co-occurrence data of the whole word set, the statistics of individual words can complement each other, and thus the data sparseness problem tends to be minimized. Let us take figure 1 again, and assume that the statistics for w4 are sparse or completely missing. In previous statistics-based approaches, the similarity between w4 and other words could not be reasonably measured, or could not be measured at all. In our method, however, a similarity value such as vsm(w1, w4) can still be reasonably estimated, because the SBLs x1, x2 and x3 can be well defined with sufficient statistics.

In section 2, we elaborate on the methodology of our word similarity measurement. We then evaluate the method through an experiment in section 3, and apply it to the task of word sense disambiguation in section 4.

[Figure 1: a fragment of the thesaurus]
[Figure 2: simultaneous equations associated with figure 1]
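To make the branch-length estimation concrete, the following is a minimal sketch of one plausible way to solve a figure-2-style system. The tree shape (two inner nodes joined at the root, with leaves w1, w2 and w3, w4), the branch numbering, and the vsm values are all invented here, since the figures themselves are not reproduced, and the least-squares solver merely stands in for whatever procedure the paper specifies in section 2.

```python
import numpy as np

# Assumed toy tree after figure 1: branches 1 and 2 hang off the root,
# branches 3 and 4 lead to leaves w1 and w2, branches 5 and 6 to w3 and w4.
# Each word pair contributes one equation: the sum of the SBLs x_i on the
# path between the two words equals their vector-space similarity vsm.
paths = {
    ("w1", "w2"): [3, 4],
    ("w3", "w4"): [5, 6],
    ("w1", "w3"): [3, 1, 2, 5],
    ("w1", "w4"): [3, 1, 2, 6],
    ("w2", "w3"): [4, 1, 2, 5],
    ("w2", "w4"): [4, 1, 2, 6],
}
vsm = {  # invented similarity values, purely for illustration
    ("w1", "w2"): 0.80, ("w3", "w4"): 0.70,
    ("w1", "w3"): 0.30, ("w1", "w4"): 0.25,
    ("w2", "w3"): 0.35, ("w2", "w4"): 0.30,
}

n_branches = 6
A = np.zeros((len(paths), n_branches))  # one row per word pair
b = np.zeros(len(paths))
for row, (pair, branches) in enumerate(paths.items()):
    A[row, [br - 1 for br in branches]] = 1.0  # mark branches on the path
    b[row] = vsm[pair]

# The system is generally over-constrained (and here under-constrained too:
# x1 and x2 only ever appear together), so we solve in the least-squares
# sense and accept the minimum-norm solution. Note that nothing forces the
# resulting SBLs to be non-negative.
x, *_ = np.linalg.lstsq(A, b, rcond=None)
for i, sbl in enumerate(x, start=1):
    print(f"x{i} (SBL of branch {i}) = {sbl:+.3f}")
```

The point of the sketch is only the shape of the computation: every word pair adds one equation, and each branch on its path picks up statistical support from that pair, which is how well-attested words can compensate for sparse ones, as argued above.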