File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/99/w99-0903_metho.xml
Size: 17,950 bytes
Last Modified: 2025-10-06 14:15:32
<?xml version="1.0" standalone="yes"?> <Paper uid="W99-0903"> <Title>Dual Distributional Verb Sense Disambiguation with Small Corpora and Machine Readable Dictionaries*</Title> <Section position="3" start_page="17" end_page="17" type="metho"> <SectionTitle> 2 Dual Distributional Similarity </SectionTitle> <Paragraph position="0"> We attempt to have the system learn to disambiguate the appearances of a polysemous verb with its senses defined in a dictionary, using the co-occurrences of syntactically related words in a POS-tagged corpus. We consider two major word classes, V and N, for the verbs and nouns, and a single relation between them, in our experiments the relation between a transitive main verb and the head noun of its direct object. Thus, a noun is represented by a vector of the verbs that take the noun as their object, and a verb by a vector of the nouns that appear as the verb's object. Commonly used corpus-based models depend on co-occurrence patterns of words to determine similarity. If word w1's co-occurrence patterns are similar to word w2's patterns, then w1 is contextually similar to w2. Note that contextually similar words do not have to be synonyms or belong to the same semantic category. We call a word whose similarity is being computed a target word, and a word occurring in the co-occurrence pattern of the target word a co-occurred word. The overlap of words between the co-occurrence patterns of two target words determines their similarity. However, with a small training corpus, it is difficult to trust a similarity that depends on co-occurrence statistics. The reason is that when two words have no overlap in their co-occurrence patterns, we cannot tell whether the two words are truly dissimilar or whether the similarity was missed because of sparse data. To distinguish the two cases, we expand the co-occurrences of the target word to the co-occurrences of the words that co-occur with the target word. 
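As a rough illustration of the representation just described, the sketch below builds noun vectors and verb vectors from a handful of verb-object pairs and shows how two nouns with no shared verbs can still be linked through the objects of their verbs. The pairs, the function names, and the set-based representation are illustrative assumptions, not data or code from the experiments.

```python
from collections import defaultdict

# Hypothetical (verb, head noun of direct object) pairs, not corpus data.
PAIRS = [
    ("attend", "conference"), ("hold", "conference"),
    ("end", "meeting"), ("start", "meeting"),
    ("attend", "event"), ("end", "event"),
    ("hold", "party"), ("start", "party"),
]


def build_vectors(pairs):
    """Return (verbs_of_noun, nouns_of_verb) as dicts of sets."""
    verbs_of_noun = defaultdict(set)
    nouns_of_verb = defaultdict(set)
    for verb, noun in pairs:
        verbs_of_noun[noun].add(verb)
        nouns_of_verb[verb].add(noun)
    return verbs_of_noun, nouns_of_verb


def direct_overlap(n1, n2, verbs_of_noun):
    """Verbs that take both nouns as object (the unitary comparison)."""
    return verbs_of_noun[n1].intersection(verbs_of_noun[n2])


def expanded_overlap(n1, n2, verbs_of_noun, nouns_of_verb):
    """Objects shared between the verbs of n1 and the verbs of n2."""
    objects1 = set().union(*(nouns_of_verb[v] for v in verbs_of_noun[n1]))
    objects2 = set().union(*(nouns_of_verb[v] for v in verbs_of_noun[n2]))
    return objects1.intersection(objects2) - {n1, n2}


VERBS_OF_NOUN, NOUNS_OF_VERB = build_vectors(PAIRS)
```

With these toy pairs, the direct overlap of conference and meeting is empty, while the expanded overlap contains event and party, which is the kind of indirect evidence the expansion is meant to expose.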
According to the co-occurrence patterns of the co-occurred words, it is possible to cluster the co-occurred words roughly. We can then overcome the problem of data sparseness by applying co-occurred clusters, rather than individual co-occurred words, to the similarity of target words.</Paragraph> <Paragraph position="1"> Dual distributional similarity is an extension of the word similarity measure that reflects the distributions of the words co-occurring with the target word as well as the distribution of the target word itself.</Paragraph> <Paragraph position="2"> [Figure: dual distributional similarity, in comparison with the unitary distributional similarity. Labels: target words, co-occurred words, words co-occurring with the co-occurred words.] A simple comparison of the co-occurrence patterns of conference and meeting fails to find the similarity between the two nouns because there is no overlap in the co-occurrence patterns. However, the dual distributional similarity measure can find that the two nouns are similar even if their co-occurrence patterns do not overlap. First, since the co-occurred verbs attend, end, hold, and start, which occur with conference and meeting, share several objects such as event, reply, and party, we can find that the co-occurred verbs are similar. Then, since conference and meeting share similar verbs, they are similar even though they do not share any verbs.</Paragraph> </Section> <Section position="4" start_page="17" end_page="118" type="metho"> <SectionTitle> 3 The WSD System Using a Corpus and an MRD </SectionTitle> <Paragraph position="0"> The architecture of the WSD system using a corpus and an MRD is given in Figure 2. Our system consists of two parts: the knowledge acquisition system and the sense disambiguation system. The knowledge acquisition system in turn consists of two parts: one acquires selectional restriction examples from a POS-tagged corpus, and the other acquires each verb's sense indicators and noun clustering cues from an MRD. 
The sense disambiguation system assigns an appropriate sense to an ambiguous verb by computing the similarity between its object in a sentence and its sense indicators. The overall process for verb sense disambiguation is as follows: * Extract all selectional restriction examples from a POS-tagged corpus.</Paragraph> <Paragraph position="1"> * Extract each polysemous verb's sense indicators from an MRD.</Paragraph> <Paragraph position="2"> * For a target verb, compute similarities between its object and its sense indicators using the selectional restriction examples acquired from the corpus and clustering cues from the MRD.</Paragraph> <Paragraph position="3"> * Determine the sense of the most similar sense indicator as the verb's disambiguated sense.</Paragraph> <Section position="1" start_page="118" end_page="118" type="sub_section"> <SectionTitle> 3.1 Context for verb sense disambiguation </SectionTitle> <Paragraph position="0"> Presumably verbs differ in their selectional restrictions because the different actions they denote are normally performed with different objects. Thus we can distinguish verb senses by distinguishing selectional restrictions. Yarowsky (1993) determined various disambiguating behaviors based on syntactic category; for example, verbs derive more disambiguating information from their objects than from their subjects, and adjectives derive almost all disambiguating information from the nouns they modify.</Paragraph> <Paragraph position="1"> We use the verb-object relation for verb sense disambiguation. 
For example, consider the sentences Susan opened the meeting and Susan opened the door.</Paragraph> <Paragraph position="2"> In deciding which of the senses of open in Table 1 applies in each of the two sentences, the fact that meeting and door respectively appear as the direct object of open gives strong evidence.</Paragraph> </Section> <Section position="2" start_page="118" end_page="118" type="sub_section"> <SectionTitle> 3.2 Lexical knowledge acquisition </SectionTitle> <Paragraph position="0"> In previous work using MRDs for word sense disambiguation, the words in definition texts are used as sense indicators. However, the MRD definitions alone do not contain enough information to allow reliable disambiguation. To overcome this problem, we use the MRD usage examples, treated as sense-tagged examples, together with the definitions for acquiring sense indicators. We acquire all objects in the MRD definitions and usage examples of a polysemous verb as its sense indicators. We use objects as sense indicators for the same reason that we use the verb-object relation for verb sense disambiguation. These sense indicators are very useful for verb sense disambiguation because the objects in usage examples are typical and are very often used with that sense of the verb.</Paragraph> <Paragraph position="1"> The entries of wear in OALD and of ipta (wear) and ssuta (write) in a Korean dictionary, together with the sense indicator sets acquired from them, are shown in Table 2.</Paragraph> <Paragraph position="2"> We also acquire another kind of information from the dictionary definitions. 
Dictionary definitions of nouns are normally written in such a way that one can identify, for each headword (the word being defined), a &quot;genus term&quot; (a word more general than the headword), and the two are related via an IS-A relation (Bruce, 1992; Klavans, 1990; Richardson, 1997).</Paragraph> <Paragraph position="3"> We use the IS-A relation as a noun clustering cue.</Paragraph> <Paragraph position="4"> For example, consider the following definitions in OALD.</Paragraph> <Paragraph position="5"> hat: covering for the head with a brim, worn out of doors. cap 1: soft covering for the head without a brim. bonnet. shoe 1: covering for the foot, esp. one that does not reach above the ankle.</Paragraph> <Paragraph position="6"> Here covering is the common genus term of the headwords hat, cap 1, and shoe 2. That is, we can say that &quot;hat IS-A covering&quot;, &quot;cap 1 IS-A covering&quot;, and &quot;shoe 2 IS-A covering&quot;, and assign these three nouns to the same cluster, covering. In the definition of cap 1, bonnet is a synonym of cap 1. We also use the synonyms of a headword as additional clustering cues.</Paragraph> <Paragraph position="7"> Our mechanism for finding the genus terms is based on the observation that in a Korean dictionary the genus term is typically the tail noun of the defining phrase, as in the following entry: ilki (diary): nalmata kyekkun il, sayngkakul cekun kilok (record) (daily record of events, thoughts, etc.).</Paragraph> <Paragraph position="8"> Because these clustering cues are neither complete nor consistent, we use parent and sibling clusters without multi-step inference for acquired IS-A relations. We acquire word co-occurrences within syntactic relations from a POS-tagged Korean corpus in order to learn word similarity. To acquire word co-occurrences within syntactic relations, we need the required parsing information. Postpositions in Korean mark the syntactic relations of the preceding head components in a sentence. 
For example, the postpositions ka and i usually mark the subjective relation, and ul and lul the objective relation. Given the sentence kunye-ka (she) phyenci-lul (letter) ssu-ta (write), we know that kunye (she) is the subject head and phyenci (letter) is the direct object head according to the postpositions ka and lul. We call this way of guessing the syntactic relation from postpositions the Postposition for Syntactic Relation heuristic. When there are multiple verbs in a sentence, we must determine which verb the object component is related to. For such attachment ambiguity, we apply the Left Association heuristic, which corresponds to Right Association in English. This heuristic states that the object component prefers to be attached to the leftmost verb in Korean. With the two heuristics, we can accurately acquire word co-occurrences within syntactic relations from the POS-tagged corpus without parsing (Cho, 1997).</Paragraph> </Section> <Section position="3" start_page="118" end_page="118" type="sub_section"> <SectionTitle> 3.3 Dual distributional sense disambiguation </SectionTitle> <Paragraph position="0"> In our system, verb sense disambiguation is the clustering of an ambiguous verb's objects using its sense indicators as seeds. As noted above, a noun is represented by a vector of the verbs that take the noun as their object, and a verb by a vector of the nouns that appear as the verb's object. We call the former a noun distribution and the latter a verb distribution. The noun distribution d(n) gives, for a noun n occurring as a direct object, the probability that the verb is $v_1, v_2, \ldots, v_{|V|}$: $d(n) = (p(v_1 \mid n), p(v_2 \mid n), \ldots, p(v_{|V|} \mid n))$ with $p(v_i \mid n) = freq(v_i, n) / \sum_{j} freq(v_j, n)$,</Paragraph> <Paragraph position="2"> where $|V|$ is the number of verbs used as transitive verbs in the training corpus, and freq(v, n) is the frequency with which verb v takes noun n as its direct object.</Paragraph> <Paragraph position="3"> A verb distribution is a vector of the nouns that appear as the verb's direct object. We define the verb distribution d(v) as a vector with one component per noun,
We define the verb</Paragraph> <Paragraph position="5"> where IN\[ is the number of nouns appeared as transitive verb's direct object.</Paragraph> <Paragraph position="6"> The process of object clustering is as follows: 1. Cluster the objects according to clustering cues acquired from the MRD.</Paragraph> <Paragraph position="7"> 2. Cluster the objects excepted from Step 1 using the dual distribution.</Paragraph> <Paragraph position="8"> 3. Cluster the objects excepted from Step 2 to the MRD's first sense of the polysemous verb. implicit in MRD definition We define cluster cluster(w) and synonym set synonym(w) of a word w using IS-A relations implicit in the MRD definition. The criteria of clustering word wl and word w2 as same cluster are as follows:</Paragraph> <Paragraph position="10"> To compute the similarities between nouns we use the relative entropy or Kullback-Leibler(KL) distance as metric to compare two noun distributions. The relative entropy is an information-theoretic measure of how two probability distributions differ. Given two probability distributions p and q, their relative entropy is defined as where we define Ologq deg- = 0 and otherwise plogo~ = c~. This quantity is always non-negative, and D(pllq ) = 0 iff p = q. Note that relative entropy is not a metric (in the sense in which the term is used in mathematics): it is not symmetric in p and q, and it does not satisfy a triangle equality. Nevertheless, informally, the relative entropy is used as the &quot;distance&quot; between two probability distribution in many previous works(Pereira, 1993; Resnik, 1997). The relative entropy can be applied straightforwardly to the probabilistic treatment of selectional restriction. As noted above, the noun distribution d(n) is verb vi's condition probability given by noun n. 
Given two noun distributions $d(n_1)$ and $d(n_2)$, the similarity between them is quantified as:</Paragraph> <Paragraph position="12"> The noun distributions p and q easily contain zero probabilities because of sparse data in a small training corpus (footnote 3). In such cases, the similarity of the distributions is not reliable because $0 \log \frac{0}{q} = 0$ and $p \log \frac{p}{0} = \infty$. This can be seen from the results of the sense disambiguation experiments using only noun distributions (see Section 4.2). The verb distributions play a complementary role when the noun distributions have zero probabilities. For all verbs where $p(v_i \mid n_2) = 0$ and $p(v_i \mid n_1) > 0$, or the reverse case: 1. Execute the OR operation over the distributions of all verbs $v_i$ in the noun distribution $d(n_1)$ for which $p(v_i \mid n_2) = 0$ and $p(v_i \mid n_1) > 0$, producing a new distribution $d_{v_1}$.</Paragraph> <Paragraph position="13"> $d_{v_1} = \bigvee_{i} d(v_i)$, for $p(v_i \mid n_1) > 0$ and $p(v_i \mid n_2) = 0$. 2. Execute the OR operation over the distributions of all verbs $v_i$ in the noun distribution $d(n_2)$ for which $p(v_i \mid n_2) > 0$ and $p(v_i \mid n_1) = 0$, producing a new distribution $d_{v_2}$.</Paragraph> <Paragraph position="15"> We use a stop verb list to discard from Steps 1 and 2 verbs that take too many nouns as objects, such as hata (do), which do not contribute to the disambiguation process. Footnote 3: Even in a large corpus (Church, 1993), the noun distributions actually do not have many verbs in common.</Paragraph> <Paragraph position="16"> The verb distribution has binary values, 1 or 0, according to its object distribution in the training corpus. Thus, the inner product $D_{verb}(d(v_1), d(v_2))$ computed with $d_{v_1}$ and $d_{v_2}$ gives the number of objects common to the two distributions. We can thus compute the similarity of the co-occurred verbs in the two noun distributions from the number of common objects. 
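The OR operation and the inner product described above can be sketched as follows. This is a minimal sketch assuming that a binary verb distribution is stored as a set of object nouns (equivalent to a 0/1 vector); the helper names and toy data are hypothetical, and the stop verb list is reduced to a single entry for illustration.

```python
def or_distribution(noun_a, noun_b, noun_dist, verb_objects, stop_verbs):
    """OR together the object sets of verbs seen with noun_a but not noun_b."""
    merged = set()
    for verb, prob in noun_dist[noun_a].items():
        if verb in stop_verbs:
            continue  # verbs such as hata (do) carry no useful signal
        if prob > 0 and noun_dist[noun_b].get(verb, 0.0) == 0.0:
            merged.update(verb_objects[verb])
    return merged


def common_objects(dv1, dv2):
    """Inner product of two binary vectors: the number of shared objects."""
    return len(dv1.intersection(dv2))


# Hypothetical noun distributions (verb probabilities given the noun).
NOUN_DIST = {
    "conference": {"attend": 0.4, "hold": 0.4, "hata": 0.2},
    "meeting": {"end": 0.5, "start": 0.5},
}
# Hypothetical binary verb distributions, stored as object sets.
VERB_OBJECTS = {
    "attend": {"conference", "event"},
    "hold": {"conference", "party"},
    "end": {"meeting", "event"},
    "start": {"meeting", "party"},
    "hata": {"conference", "meeting", "event", "party"},
}
STOP_VERBS = {"hata"}

dv1 = or_distribution("conference", "meeting", NOUN_DIST, VERB_OBJECTS, STOP_VERBS)
dv2 = or_distribution("meeting", "conference", NOUN_DIST, VERB_OBJECTS, STOP_VERBS)
```

With this toy data, dv1 and dv2 share the objects event and party, so common_objects(dv1, dv2) is 2 even though the two nouns have no verb in common. Without the stop verb list, hata would add meeting to dv1 and inflate the count.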
Even if the two noun distributions do not share any verbs, they are similar if the verbs they co-occur with are similar.</Paragraph> <Paragraph position="17"> Combining the similarities of the noun distributions and the verb distributions, we compute the total similarity between the noun distributions.</Paragraph> <Paragraph position="19"> The constants $\alpha$ and $\beta$ are determined experimentally (0.71 for $\alpha$ and 0.29 for $\beta$).</Paragraph> </Section> </Section> <Section position="5" start_page="118" end_page="118" type="metho"> <SectionTitle> 4 Experimental Evaluation </SectionTitle> <Paragraph position="0"> We used the KAIST corpus, which contains 573,193 eojeols (footnote 4) and is considered a small corpus for the present task. As the dictionary, we used the Grand Korean Dictionary, which contains 144,532 entries.</Paragraph> <Paragraph position="1"> The system was tested on a total of 948 examples of 10 polysemous verbs extracted from the corpus: kamta, kelta, tayta, tulta, ttaluta, ssuta, chita, thata, phwulta, and phiwuta (although we confined the test to transitive verbs, the system is applicable to intransitive verbs or adjectives). For this set of verbs, the average number of senses per verb is 6.7. We selected the test verbs considering their frequencies in the corpus, the number of senses in the dictionary, and the usage rate of each sense.</Paragraph> <Paragraph position="2"> We tested the system on two test sets from the KAIST corpus. The first set, named C23, consists of 229,782 eojeols, and the second set, named C57, consists of 573,193 eojeols. The experimental results are tabulated in Table 3. As a baseline against which to compare results, we computed the percentage of words that are correctly disambiguated if we choose the most frequently occurring sense in the training corpus for each verb, which resulted in 42.4% correct disambiguation. Columns 3-5 illustrate the effect of adding the dual distribution and the MRD information. 
When the dual distribution is used, we see significant improvements of about 22% in recall and about 12% in precision. In particular, for the smaller corpus (C23), the improvement in recall is as large as 25%. This shows that the dual distribution is effective in overcoming the problem of sparse data, especially for a small corpus. Moreover, by using both the dual distribution and the MRD information, our system achieved improvements of about 16% in recall and about 25% in precision. Footnote 4: An eojeol is the smallest meaningful unit consisting of content words (nouns, verbs, adjectives, etc.) and functional words (postpositions, auxiliaries, etc.).</Paragraph> <Paragraph position="3"> [Table 3 column headers: measure, corpus, noun dis., dual dis., dual dis.]</Paragraph> <Paragraph position="4"> The average performance of our system is 86.3%, which is slightly behind the performance reported in previous work on English. Most previous works have reported accuracies of &quot;70%-92%&quot; for particular words. However, our system uses unsupervised learning with a small POS-tagged corpus, and we do not restrict a word's sense set to either binary senses (Yarowsky, 1995; Karov, 1998) or the dictionary's homograph level (Wilks, 1997).</Paragraph> <Paragraph position="5"> Thus, our system is appropriate as a practical WSD system as well as a bootstrapping WSD system starting with a small corpus.</Paragraph> </Section> </Paper>