<?xml version="1.0" standalone="yes"?> <Paper uid="W97-0807"> <Title>Integration of Hand-Crafted and Statistical Resources in Measuring Word Similarity</Title> <Section position="4" start_page="45" end_page="45" type="metho"> <SectionTitle> 2 Methodology </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="45" end_page="45" type="sub_section"> <SectionTitle> 2.1 Overview </SectionTitle> <Paragraph position="0"> Our word similarity measurement proceeds in the follow- null ing way: 1. compute the statistics-based similarity of every combination of given words, 2. set up a simultaneous equation through use of the thesaurus and previously computed word similarity, and find solutions for the statistics-based length (SBL) of the corresponding thesaurus branch (see figures 1 and 2), 3. the similarity between two given words is measured by the sum of SBLs included in the path between those words.</Paragraph> <Paragraph position="1"> We will elaborate on each step in the following sections.</Paragraph> </Section> <Section position="2" start_page="45" end_page="45" type="sub_section"> <SectionTitle> 2.2 Statistics-based word similarity </SectionTitle> <Paragraph position="0"> In the vector space model, each word w~ is represented by a vector comprising statistical factors of co-occurrence.</Paragraph> <Paragraph position="1"> This can be expressed by equation (1), where ~z is the vector for the word in question, and t,j is the co-occurrence statistics of w~ and w:.</Paragraph> <Paragraph position="2"> =< t,1, t,2, ..., t,j, ... > (1) With regard to t~3, we adopted TF.IDF, commonly used in information retrieval systems \[Frankes and Baeza-Yates, 1992\]. Based on this notion, t,~ is calculated as in equation (2), where \]~ is the frequency of w, collocating with w3, f3 is the frequency of w3, and T is the total number of collocations within the overall co-occurrence data.</Paragraph> <Paragraph position="4"> We then compute the similarity between words a and b bj the cosine of the angle between the two vectors g and b. This is realized by equation (3), where vsm is the similarity between a and b, based on the vector space model. ~.~ vsm(a, b) = i~llgl (3) It should be noted that our framework is independent of the implementation of the similarity computation, which has been variously proposed by different researchers \[Charniak, 1993; Frankes and Baeza-Yates, 1992\].</Paragraph> </Section> <Section position="3" start_page="45" end_page="45" type="sub_section"> <SectionTitle> 2.3 Resolution of the simultaneous equa- </SectionTitle> <Paragraph position="0"> tion The simultaneous equation used in our method is expressed by equation (4), where A is a matrix comprising only the values 0 and 1, and B is a list ofvsm's (see equation (3)) for any possible combinations of given words. X is a list of variables, which represents the statistics-based length (SBL) for the corresponding branch in the thesaurus.</Paragraph> <Paragraph position="2"> Here, let the i-th similarity in B be vsm(a,b), and let path(a, b) denote the path between words a and b in the thesaurus. Each equation contained in the simultaneous equation is represented by equation (5), where x~ is the statistics-based length (SBL) for branch 3, and a, 3 is either 0 or 1 as in equation (6).</Paragraph> <Paragraph position="4"> By finding the solutions for X, we can assign SBLs to branches. However, the set of similarity values outnumbers the variables. 
However, the set of similarity values outnumbers the variables. For example, the Bunruigoihyo thesaurus contains about 55,000 noun entries, and therefore the number of similarity values for those nouns is about 1.5 x 10^9 (= C(55,000, 2)). On the other hand, the number of branches is only about 53,000. As such, a great many equations are redundant, and the time complexity of solving the simultaneous equation becomes a crucial problem. To counter this problem, we randomly divide the overall equation set into equal parts, each of which can be solved in reasonable time. Thereafter, we approximate the solution for each variable by averaging the solutions derived from the individual subsets. Let us take figure 3, in which the number of subsets is given as two without loss of generality. In this figure, x_{i1} and x_{i2} denote the solutions for branch i derived individually from subsets 1 and 2, and x_i is approximated by the average of x_{i1} and x_{i2} (that is, (x_{i1} + x_{i2}) / 2). To generalize this notion, let x_{ij} denote the solution associated with branch i in subset j. The approximate solution for branch i is given by equation (7), where n is the number of divisions of the equation set.
x_i = \frac{1}{n} \sum_{j=1}^{n} x_{ij} \quad (7)
</Paragraph> </Section>
<Section position="4" start_page="45" end_page="45" type="sub_section"> <SectionTitle> 2.4 Word similarity using SBL </SectionTitle>
<Paragraph position="0"> Let us reconsider figure 1. In this figure, the similarity between w_1 and w_2, for example, is measured by the sum of x_3 and x_4. In general, the similarity between words a and b using SBL (sbl(a, b), hereafter) is realized by equation (8), where x_i is the SBL for branch i, and path(a, b) is the path comprising the thesaurus branches located between a and b.
sbl(a, b) = \sum_{branch_i \in path(a, b)} x_i \quad (8)
</Paragraph> </Section> </Section>
<Section position="5" start_page="45" end_page="75" type="metho"> <SectionTitle> 3 Experimentation </SectionTitle>
<Paragraph position="0"> We conducted experiments on noun entries in the Bunruigoihyo thesaurus. Co-occurrence data was extracted from the RWC text base RWC-DB-TEXT-95-1 [Real World Computing Partnership, 1995]. This text base consists of 4 years' worth of Mainichi Shimbun [Mainichi Shimbun, 1991-1994] newspaper articles, which were automatically annotated with morphological tags. The total number of morphemes is about 100 million. Instead of conducting full parsing on the texts, several heuristics were used in order to obtain dependencies between nouns and verbs in the form of tuples (frequency, noun, postposition, verb). Among these tuples, only those which included the postposition wo (typically marking the accusative case) were used. Further, tuples with nouns appearing in the Bunruigoihyo thesaurus were selected. When the noun was a compound noun, it was transformed into the maximal leftmost substring contained in the Bunruigoihyo thesaurus. As a result, 419,132 tuples remained, consisting of 23,223 noun types and 9,151 verb types. In regard to resolving the simultaneous equations, we used the mathematical analysis tool "MATLAB".</Paragraph>
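As a concrete sketch of how the subset-averaging of equation (7) and the path sum of equation (8) could be realized, the following Python fragment substitutes numpy's least-squares solver for the MATLAB routines actually used; the dense matrices and all names are our own assumptions, not the paper's implementation.

```python
import numpy as np

def solve_sbl(A, B, n_subsets, seed=0):
    """Randomly split the equation set AX = B into n_subsets equal parts,
    solve each part by least squares, and average the per-branch
    solutions, as in equation (7)."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(B))        # random division of the equations
    solutions = []
    for rows in np.array_split(order, n_subsets):
        x, *_ = np.linalg.lstsq(A[rows], B[rows], rcond=None)
        solutions.append(x)
    return np.mean(solutions, axis=0)      # x_i = (1/n) * sum_j x_ij

def sbl(x, path_branches):
    """Similarity as the sum of the SBLs on path(a, b) (equation (8))."""
    return sum(x[i] for i in path_branches)
```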
<Paragraph position="1"> What we evaluated here is the degree to which the simultaneous equation was successfully approximated through the use of the technique described in section 2; in other words, the extent to which the (original) statistics-based word similarity can be reproduced by our framework. We conducted this evaluation in the following way. Let the statistics-based similarity between words a and b be vsm(a, b), and the similarity based on SBL be sbl(a, b). Here, let us assume the inequality "vsm(a, b) > vsm(c, d)" for words a, b, c and d. If this inequality is maintained by our method, that is, if "sbl(a, b) > sbl(c, d)", the similarity measurement is taken to be successful. The accuracy is then estimated by the ratio of the number of successful measurements to the total number of trials. Since the resolution of the equations is time-consuming, we tentatively generalized the 23,223 nouns into 303 semantic classes (represented by the first 4 digits of the semantic code given in the Bunruigoihyo thesaurus), reducing the total number of equations to 45,753. Figure 4 shows the relation between the number of equations used and the accuracy: we divided the overall equation set into n equal subsets (see section 2.3), and progressively increased the number of subsets used in the computation. When the whole set of equations was provided, the accuracy reached about 72%. We also estimated the lower bound of this evaluation, that is, we conducted the same trials using the Bunruigoihyo thesaurus alone. In this case, if word a is located more closely to b than c is to d in the thesaurus and "vsm(a, b) > vsm(c, d)", that trial measurement is taken to be successful. We found that the lower bound was roughly 56%, and therefore our framework outperformed this method.</Paragraph> </Section>
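The evaluation criterion above amounts to checking whether sbl preserves the pairwise ordering induced by vsm. A small sketch of that check, with hypothetical names of our own:

```python
from itertools import combinations

def ordering_accuracy(word_pairs, vsm_sim, sbl_sim):
    """For every two word pairs p = (a, b) and q = (c, d) with
    vsm(a, b) > vsm(c, d), count the trial as successful when
    sbl(a, b) > sbl(c, d) also holds. vsm_sim and sbl_sim map
    a word pair to its similarity value."""
    successes = trials = 0
    for p, q in combinations(word_pairs, 2):
        if vsm_sim[p] == vsm_sim[q]:
            continue                      # no strict ordering to preserve
        trials += 1
        if (vsm_sim[p] > vsm_sim[q]) == (sbl_sim[p] > sbl_sim[q]):
            successes += 1
    return successes / trials if trials else 0.0
```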
<Section position="6" start_page="75" end_page="75" type="metho"> <SectionTitle> 4 An application </SectionTitle>
<Paragraph position="0"> We further evaluated our word similarity technique in the task of word sense disambiguation (WSD). In this task, the system takes as input sentences containing sense-ambiguous words, and interprets them by choosing the most plausible meaning for each based on the context [4]. The WSD technique used in this paper was proposed by Kurohashi et al. [Kurohashi and Nagao, 1994] and enhanced by Fujii et al. [Fujii et al., 1996], and disambiguates Japanese sense-ambiguous verbs by use of an example database [5]. Figure 5 shows a fragment of the database associated with the Japanese verb tsukau, whose senses include "to employ", "to operate" and "to spend". The database specifies the case frame(s) associated with each verb sense. In Japanese, a complement of a verb consists of a noun phrase (case filler) and its case marker suffix, for example ga (nominative), ni (dative) or wo (accusative). The database lists several case filler examples for each case. Given an input, the system identifies the verb sense on the basis of the similarity between the input and the examples for each verb sense contained in the database. Let us take the following input: enjinia ga fakkusu wo tsukau.</Paragraph>
<Paragraph position="1"> In this example, one may consider enjinia ("engineer") and fakkusu ("facsimile") to be semantically similar to gakusei ("student") and konpyuutaa ("computer"), respectively, listed for the "to operate" sense of tsukau. As a result, tsukau is interpreted as "to operate". To formalize this notion, the system computes the plausibility score for each verb sense candidate, and chooses the sense that maximizes the score. The score is computed as the weighted average of the similarities of the input case fillers with respect to the corresponding example case fillers listed in the database for the sense under evaluation. Formally, this is expressed by equation (9), where Score(s) is the score for verb sense s, n_c denotes the case filler for case c, and E_{s,c} denotes the set of case filler examples for case c of sense s (for example, E_{s,c} = {kare, kigyou} for the ga case of the "to employ" sense in figure 5). sim(n_c, e) stands for the similarity between n_c and an example case filler e.
Score(s) = \frac{\sum_c CCD(c) \cdot \max_{e \in E_{s,c}} sim(n_c, e)}{\sum_c CCD(c)} \quad (9)
CCD(c) expresses the weight factor of case c, using the notion of case contribution to verb sense disambiguation (CCD) proposed by Fujii et al. [Fujii et al., 1996]. Intuitively, the CCD of a case becomes greater when the example sets of its case fillers are more disjoint across different verb senses. For the case fillers in figure 5, for example, CCD(ACC) is greater than CCD(NOM) (see Fujii et al.'s paper for details).</Paragraph>
<Paragraph position="2"> [4] In most WSD systems, candidate word senses are predefined in a dictionary. [5] There have also been different approaches proposed for this task, based on statistics [Charniak, 1993].</Paragraph>
<Paragraph position="3"> One may notice that the critical component of this task is the computation of the similarity between case fillers (nouns) in equation (9). This is exactly where our word similarity measurement can be applied. In this experiment, we compared the following three methods of word similarity measurement, as sketched below:
* the Bunruigoihyo thesaurus (BGH): the similarity between case fillers is measured as a function of the length of the path between them in the thesaurus. In this experiment, we used the function proposed by Kurohashi et al. [Kurohashi and Nagao, 1994], as shown in table 1.
* the vector space model (VSM): we replace sim(n_c, e) in equation (9) with vsm(n_c, e) computed by equation (3).
* our method based on statistics-based length (SBL): we simply replace sim(n_c, e) in equation (9) with sbl(n_c, e) computed by equation (8).</Paragraph>
[Table 1: the relation between the length of the path between two nouns n1 and n2 in the Bunruigoihyo thesaurus (len(n1, n2)) and their similarity (sim(n1, n2))]
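To show how equation (9) drives sense selection, here is a minimal Python sketch under our reading of the description; in particular, aggregating the per-example similarities with max is our assumption, and all names are hypothetical. Any of the three methods above can be plugged in as the sim argument.

```python
def score(examples, input_fillers, ccd, sim):
    """Plausibility score for one verb sense (equation (9)).
    examples: case -> example case fillers E_{s,c} for this sense;
    input_fillers: case -> input case filler n_c;
    ccd: case -> weight CCD(c); sim: similarity function sim(n_c, e)."""
    numerator = denominator = 0.0
    for case, n_c in input_fillers.items():
        e_sc = examples.get(case, [])
        if not e_sc:
            continue
        numerator += ccd[case] * max(sim(n_c, e) for e in e_sc)
        denominator += ccd[case]
    return numerator / denominator if denominator else 0.0

def disambiguate(sense_db, input_fillers, ccd, sim):
    """Choose the sense s maximizing Score(s); sense_db maps each sense
    to its per-case example sets."""
    return max(sense_db,
               key=lambda s: score(sense_db[s], input_fillers, ccd, sim))
```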
<Paragraph position="4"> We collected sentences (as test/training data) from the EDR Japanese corpus [EDR, 1995] [6]. Since Japanese sentences have no lexical segmentation, the input had to be both morphologically and syntactically analyzed prior to the sense disambiguation process. We experimentally used the Japanese morph/syntax parser "QJP" [Kameda, 1996] for this process. Based on the analysis by the QJP parser, we removed sentences with missing verb complements (in most cases due to ellipsis or zero anaphora). The EDR corpus also provides sense information for each word based on the EDR dictionary, which we used as a means of checking the correct interpretation. Our derived corpus contains ten verbs frequently appearing in the EDR corpus, which are summarized in table 2. In table 2, the column "English gloss" gives typical English translations of the Japanese verbs, the column "# of sentences" denotes the number of sentences in the corpus, and "# of senses" denotes the number of verb senses, based on the EDR dictionary.</Paragraph>
<Paragraph position="5"> [6] The EDR corpus was originally collected from news articles.</Paragraph>
<Paragraph position="6"> For each of the ten verbs, we conducted four-fold cross validation: that is, we divided the corpus into four equal parts, and conducted four trials, in each of which a different one of the four parts was used as test data and the remaining parts were used as training data (the database). Table 2 also shows the precision of each method. The precision is the ratio of the number of correct interpretations to the number of outputs. The column "control" denotes the precision of a naive WSD technique, in which the system systematically chooses the verb sense appearing most frequently in the database [Gale et al., 1992].</Paragraph>
<Paragraph position="7"> The precision of the three similarity calculation methods did not differ greatly, and the use of the length of the path in the Bunruigoihyo thesaurus (BGH) slightly outperformed the other methods on the whole. However, since the overall precision is biased toward frequently appearing verbs (such as tsukau and ukeru), our word similarity measurement is not necessarily inferior to the other methods. In fact, for the disambiguation of verbs such as motomeru, where BGH is surpassed by VSM, SBL maintains a precision level roughly equivalent to that of VSM. Besides this, as we pointed out in section 1, SBL allows us to reduce the data size from O(N^2) to O(N) in our framework, where N is the number of word entries.</Paragraph> </Section> </Paper>