<?xml version="1.0" standalone="yes"?> <Paper uid="E06-1016"> <Title>Determining Word Sense Dominance Using a Thesaurus</Title> <Section position="4" start_page="122" end_page="123" type="metho"> <SectionTitle> 3 Co-occurrence Information </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="122" end_page="122" type="sub_section"> <SectionTitle> 3.1 Word-Category Co-occurrence Matrix </SectionTitle> <Paragraph position="0"> The strength of association between a particular category of the target word and its co-occurring words can be very useful--calculating word sense dominance being just one application. To this end we create the word-category co-occurrence matrix (WCCM) in which one dimension is the list of all words (w1BNw2BNBMBMBM) in the vocabulary, and the other dimension is a list of all categories</Paragraph> <Paragraph position="2"> A particular cell, mij, pertaining to word wi and category cj, is the number of times wi occurs in a predetermined window around any c-term of cj in a text corpus. We will refer to this particular WCCM created after the first pass over the text as the base WCCM. A contingency table for any particular word w and category c (see below) can be easily generated from the WCCM by collapsing cells for all other words and categories into one and summing up their frequencies. The application of a suitable statistic will then yield the strength of association between the word and the category.</Paragraph> <Paragraph position="4"> Even though the base WCCM is created from unannotated text, and so is expected to be noisy, we argue that it captures strong associations reasonably accurately. This is because the errors in determining the true category that a word co-occurs with will be distributed thinly across a number of other categories (details in Section 3.2).</Paragraph> <Paragraph position="5"> Therefore, we can take a second pass over the corpus and determine the intended sense of each word using the word-category co-occurrence frequency (from the base WCCM) as evidence. We can thus create a newer, more accurate, bootstrapped WCCM by populating it just as mentioned earlier, except that this time counts of only the co-occurring word and the disambiguated category are incremented. The steps of word sense disambiguation and creating new bootstrapped WCCMs can be repeated until the bootstrapping fails to improve accuracy significantly.</Paragraph> <Paragraph position="6"> The cells of the WCCM are populated using a large untagged corpus (usually different from the target text) which we will call the auxiliary corpus. In our experiments we use a subset (all except every twelfth sentence) of the British National Corpus World Edition (BNC) (Burnard, 2000) as the auxiliary corpus and a window size of A65 words. The remaining one twelfth of the BNC is used for evaluation purposes. Note that if the target text belongs to a particular domain, then the creation of the WCCM from an auxiliary text of the same domain is expected to give better results than the use of a domain-free text.</Paragraph> </Section> <Section position="2" start_page="122" end_page="123" type="sub_section"> <SectionTitle> 3.2 Analysis of the Base WCCM </SectionTitle> <Paragraph position="0"> The use of untagged data for the creation of the base WCCM means that words that do not really co-occur with a certain category but rather do so with a homographic word used in a different sense will (erroneously) increment the counts corresponding to the category. 
<Section position="2" start_page="122" end_page="123" type="sub_section"> <SectionTitle> 3.2 Analysis of the Base WCCM </SectionTitle> <Paragraph position="0"> The use of untagged data for the creation of the base WCCM means that words that do not really co-occur with a certain category, but rather with a homographic word used in a different sense, will (erroneously) increment the counts corresponding to the category. Nevertheless, the strength of association, calculated from the base WCCM, of words that truly and strongly co-occur with a certain category will be reasonably accurate despite this noise.</Paragraph> <Paragraph position="1"> We demonstrate this through an example. Assume that category c has 100 c-terms and that each c-term has 4 senses, only one of which corresponds to c while the rest are randomly distributed among other categories. Further, let there be 5 sentences in the auxiliary text for each c-term-sense pair. If the window size is the complete sentence, then words in 2,000 sentences will increment co-occurrence counts for c. Observe that 500 of these sentences truly correspond to category c, while the other 1,500 pertain to about 300 other categories; thus, on average, only 5 sentences correspond to each category other than c. Therefore in the 2,000 sentences, words that truly co-occur with c will likely occur a large number of times, while the rest will be spread out thinly over the 300 or so other categories.</Paragraph> <Paragraph position="2"> We therefore claim that the application of a suitable statistic, such as odds ratio, will result in significantly large association values for word-category pairs where the word truly and strongly co-occurs with the category, and the effect of noise will be insignificant. Word-category pairs having low strength of association will likely be affected adversely by the noise, since the amount of noise may be comparable to the actual strength of association. In most natural language applications, the strength of association is evidence for a particular proposition. In that case, even if association values from all pairs are used, evidence from less-reliable, low-strength pairs will contribute little to the final cumulative evidence compared to more-reliable, high-strength pairs. Thus even though the base WCCM is less accurate when generated from untagged text, it can still provide association values suitable for most natural language applications. The experiments described in Section 6 substantiate this.</Paragraph> </Section> <Section position="3" start_page="123" end_page="123" type="sub_section"> <SectionTitle> 3.3 Measures of Association </SectionTitle> <Paragraph position="0"> The strength of association between a sense or category of the target word and its co-occurring words may be determined by applying a suitable statistic to the corresponding contingency table. Association values are calculated from the observed co-occurrence frequencies. We provide experimental results using the Dice coefficient (Dice), cosine (cos), pointwise mutual information (pmi), odds ratio (odds), Yule's coefficient of colligation (Yule), and the phi coefficient (phi).</Paragraph> </Section> </Section>
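As a reference for the measures just listed, the sketch below computes each from the 2x2 contingency counts. The add-half smoothing (eps) is an assumption introduced here to avoid zero cells; the paper does not specify how (or whether) it smooths.

```python
import math

def association(n_wc, n_wnc, n_nwc, n_nwnc, eps=0.5):
    """Measures of association from a 2x2 contingency table.
    n_wc: word with category; n_wnc: word without category;
    n_nwc: category without word; n_nwnc: neither."""
    n11, n10, n01, n00 = (x + eps for x in (n_wc, n_wnc, n_nwc, n_nwnc))
    n = n11 + n10 + n01 + n00
    nw, nc = n11 + n10, n11 + n01             # marginals for word, category
    odds = (n11 * n00) / (n10 * n01)
    return {
        "dice": 2 * n11 / (nw + nc),
        "cos": n11 / math.sqrt(nw * nc),
        "pmi": math.log((n11 * n) / (nw * nc)),
        "odds": odds,
        # Yule's coefficient of colligation: a monotone transform of odds.
        "yule": (math.sqrt(odds) - 1) / (math.sqrt(odds) + 1),
        "phi": (n11 * n00 - n10 * n01)
               / math.sqrt(nw * nc * (n10 + n00) * (n01 + n00)),
    }
```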
<Section position="5" start_page="123" end_page="124" type="metho"> <SectionTitle> 4 Word Sense Dominance </SectionTitle> <Paragraph position="0"> We examine each occurrence of the target word in a given untagged target text to determine the dominance of each of its senses. For each occurrence t' of a target word t, let T' be the set of words (tokens) co-occurring within a predetermined window around t'; let T be the union of all such T', and let X_t be the set of all such T'. (Thus |X_t| is equal to the number of occurrences of t, and |T| is equal to the total number of words (tokens) in the windows around occurrences of t.)</Paragraph> <Paragraph position="1"> We now describe four methods of determining sense dominance: D_{I,W}, D_{I,U}, D_{E,W}, and D_{E,U}. D_{I,W} is based on the assumption that the more dominant a particular sense is, the greater the strength of its association with the words that co-occur with it. For example, if most occurrences of bank in the target text correspond to 'river bank', then the strength of association of 'river bank' with all of bank's co-occurring words will be larger than the corresponding sum for any other sense. The dominance of a sense s is $D_{I,W}(s) = \frac{\sum_{w \in T} A(s,w)}{\sum_{s' \in senses(t)} \sum_{w \in T} A(s',w)}$, where A is any one of the measures of association from Section 3.3. Metaphorically, words that co-occur with the target word give a weighted vote to each of its senses. The weight is proportional to the strength of association between the sense and the co-occurring word. The dominance of a sense is the ratio of the total votes it gets to the sum of the votes received by all the senses.</Paragraph> <Paragraph position="2"> A slightly different assumption is that the more dominant a particular sense is, the greater the number of co-occurring words having their highest strength of association with that sense (as opposed to any other). This leads to the following methodology. Each co-occurring word casts an equal, unweighted vote. It votes for that sense (and no other) of the target word with which it has the highest strength of association. The dominance D_{I,U} of a sense is the ratio of the votes it gets to the total votes cast for the word (the number of co-occurring words).</Paragraph> <Paragraph position="3"> D_{I,W} and D_{I,U} determine dominance without explicitly disambiguating the senses of the target word's occurrences. We now describe alternative approaches that may be used for explicit sense disambiguation of the target word's occurrences, and that thereby determine sense dominance as the proportion of occurrences of each sense. D_{E,W} relies on the hypothesis that the intended sense of any occurrence of the target word has the highest strength of association with its co-occurring words. Metaphorically, words that co-occur with the target word give a weighted vote to each of its senses, just as in D_{I,W}; however, the votes from the co-occurring words in a single occurrence are summed to determine the intended sense (the sense with the most votes) of the target word. The process is repeated for all occurrences of the target word. If each word that co-occurs with the target word votes as described for D_{I,U}, then the following hypothesis forms the basis of D_{E,U}: in a particular occurrence, the sense that gets the maximum votes from its neighbors is the intended sense. In both D_{E,W} and D_{E,U}, the dominance of a sense is the proportion of occurrences of that sense.</Paragraph> <Paragraph position="4"> The degree of dominance provided by all four methods has the following properties: (i) The dominance values are in the range 0 to 1; a score of 0 implies lowest possible dominance, while a score of 1 means that the dominance is highest. (ii) The dominance values for all the senses of a word sum to 1.</Paragraph> </Section>
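The four methods can be summarized compactly as below. This is a sketch under two assumptions: assoc(s, w) returns the association value A(s, w) computed from the WCCM, and senses is the list of the target word's thesaurus categories; neither name comes from the paper, and a non-empty target text is assumed.

```python
from collections import Counter

def dominance(occurrences, senses, assoc, implicit=True, weighted=True):
    """occurrences is X_t: one list T' of co-occurring tokens per
    occurrence of the target word. Returns {sense: dominance}."""
    T = [w for T_prime in occurrences for w in T_prime]
    votes = Counter()
    if implicit:
        for w in T:
            if weighted:        # D_{I,W}: each word casts weighted votes
                for s in senses:
                    votes[s] += assoc(s, w)
            else:               # D_{I,U}: one unweighted vote, best sense only
                votes[max(senses, key=lambda s: assoc(s, w))] += 1
    else:
        for T_prime in occurrences:   # explicit: disambiguate each occurrence
            if weighted:        # D_{E,W}: sum weighted votes within T'
                best = max(senses,
                           key=lambda s: sum(assoc(s, w) for w in T_prime))
            else:               # D_{E,U}: sense with the most unweighted votes
                tally = Counter(max(senses, key=lambda s: assoc(s, w))
                                for w in T_prime)
                best = tally.most_common(1)[0][0]
            votes[best] += 1
    total = sum(votes.values())
    return {s: votes[s] / total for s in senses}   # values sum to 1
```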
<Section position="6" start_page="124" end_page="124" type="metho"> <SectionTitle> 5 Pseudo-Thesaurus-Sense-Tagged Data </SectionTitle> <Paragraph position="0"> To evaluate the four dominance methods we would ideally like sentences with target words annotated with senses from the thesaurus. Since human annotation is both expensive and time-intensive, we present an alternative approach of artificially generating thesaurus-sense-tagged data, following the ideas of Leacock et al. (1998). Around 63,700 of the 98,000 word types in the Macquarie Thesaurus are monosemous, that is, listed under just one of the 812 categories. This means that on average around 77 c-terms per category are monosemous. Pseudo-thesaurus-sense-tagged (PTST) data for a non-monosemous target word t (for example, brilliant) used in a particular sense or category c of the thesaurus (for example, 'intelligence') may be generated as follows. Identify the monosemous c-terms (for example, clever) belonging to the same category as c. Pick sentences containing the monosemous c-terms from an untagged auxiliary text corpus:</Paragraph> <Paragraph position="1"> Hermione had a clever plan.</Paragraph> <Paragraph position="2"> In each such sentence, replace the monosemous word with the target word t. In theory the c-terms of a thesaurus category are near-synonyms or at least strongly related words, making the replacement of one by another acceptable. For the sentence above, we replace clever with brilliant. This results in (artificial) sentences with the target word used in a sense corresponding to the desired category. Clearly, many of these sentences will not be linguistically well formed, but the non-monosemous c-term used in a particular sense is likely to have co-occurring words similar to those of the monosemous c-term of the same category.2 This justifies the use of pseudo-thesaurus-sense-tagged data for the purpose of evaluation.</Paragraph> <Paragraph position="3"> We generated PTST test data for the head words in the SENSEVAL-1 English lexical sample space3 using the Macquarie Thesaurus and the held-out subset of the BNC (every twelfth sentence).</Paragraph> <Paragraph position="4"> 2 Strong collocations are an exception to this, and their effect must be countered by considering larger window sizes. Therefore, we do not use a window size of just one or two words on either side of the target word, but rather windows of ±5 words in our experiments.</Paragraph> <Paragraph position="5"> 3 SENSEVAL-1 head words have a wide range of possible senses, and the availability of alternative sense-tagged data may be exploited in the future.</Paragraph> </Section>
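A sketch of this substitution procedure follows; category_terms (the thesaurus index), is_monosemous, and corpus are hypothetical stand-ins for the Macquarie Thesaurus lookup and the held-out portion of the BNC.

```python
import re

def generate_ptst(target, category, category_terms, is_monosemous, corpus):
    """Yield (sentence, category) pairs in which a monosemous c-term of
    `category` has been replaced by the (non-monosemous) target word."""
    mono = {t for t in category_terms[category] if is_monosemous(t)}
    for sentence in corpus:
        for term in mono:
            pattern = rf"\b{re.escape(term)}\b"
            if re.search(pattern, sentence):
                # e.g., "Hermione had a clever plan." ->
                #       "Hermione had a brilliant plan.", tagged 'intelligence'
                yield re.sub(pattern, target, sentence), category
```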
<Section position="7" start_page="124" end_page="126" type="metho"> <SectionTitle> 6 Experiments </SectionTitle> <Paragraph position="0"> We evaluate the four dominance methods, following McCarthy et al. (2004), through the accuracy of a naive sense disambiguation system that always gives out the predominant sense of the target word. In our experiments, the predominant sense is determined by each of the four dominance methods individually. We used the following setup to study the effect of sense distribution on performance.</Paragraph> <Section position="1" start_page="125" end_page="125" type="sub_section"> <SectionTitle> 6.1 Setup </SectionTitle> <Paragraph position="0"> For each target word for which we have PTST data, the two most dominant senses are identified, say s1 and s2. If the numbers of sentences annotated with s1 and s2 are x and y, respectively, where x ≥ y, then all y sentences of s2 and the first y sentences of s1 are placed in a data bin. The bin thus contains an equal number of PTST sentences for the two most dominant senses of each target word. Our data bin contained 17,446 sentences for 27 nouns, verbs, and adjectives. We then generate different test data sets d_α from the bin, where α takes the values 0, 0.1, 0.2, ..., 1, such that the fraction of sentences annotated with s1 is α and the fraction annotated with s2 is 1 − α. Thus the data sets have different dominance values even though they have the same number of sentences (half as many as in the bin).</Paragraph> <Paragraph position="1"> Each data set d_α is given as input to the naive sense disambiguation system. If the predominant sense is correctly identified for all target words, then the system achieves the highest accuracy, whereas if it is falsely determined for all target words, then the system achieves the lowest accuracy. The value of α determines these upper and lower bounds. If α is close to 0.5, then even if the system correctly identifies the predominant sense, the naive disambiguation system cannot achieve accuracies much higher than 50%. On the other hand, if α is close to 0 or 1, then the system may achieve accuracies close to 100%. A disambiguation system that randomly chooses one of the two possible senses for each occurrence of the target word acts as the baseline. Note that no matter what the distribution of the two senses (α), this system will achieve an accuracy of 50%.</Paragraph> </Section> <Section position="2" start_page="125" end_page="126" type="sub_section"> <SectionTitle> 6.2 Results </SectionTitle> <Paragraph position="0"> [Figure 4: accuracies of the four dominance methods; the accompanying legend lists mean distances below upper bound, e.g., D_{I,W} (odds), base WCCM: .08; D_{E,W} (odds), bootstrapped WCCM: .02.] The highest accuracies achieved using the four dominance methods, and the measures of association that worked best with each, are shown in Figure 4. The table below the figure shows the mean distance below upper bound (MDUB) for all α values considered. Measures that perform almost identically are grouped together, and the MDUB values listed are averages. The window size used was ±5 words around the target word. Each data set d_α, which corresponds to a different target text in Figure 2, was processed in less than 1 second on a 1.3 GHz machine with 16 GB of memory. The weighted voting methods, D_{E,W} and D_{I,W}, perform best, with MDUBs of just .02 and .03, respectively. Yule's coefficient, odds ratio, and pmi give near-identical, maximal accuracies for all four methods, with a slightly greater divergence in D_{I,W}, where pmi does best. The phi coefficient performs best for the unweighted methods. Dice and cosine do only slightly better than the baseline. In general, the results from the method-measure combinations are symmetric across α = 0.5, as they should be.</Paragraph> <Paragraph position="1"> Marked improvements in accuracy were achieved as a result of bootstrapping the WCCM (Figure 5). Most of the gain was provided by the first iteration itself, whereas further iterations resulted in only marginal improvements; all bootstrapped results reported in this paper pertain to just one iteration. Also, the bootstrapped WCCM is 72% smaller than the base WCCM, and 5 times faster at processing the data sets, since the base WCCM has many non-zero cells even though the corresponding word and category never actually co-occurred (as mentioned in Section 3.2).</Paragraph> </Section>
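The bootstrapping pass referred to above can be sketched as a second counting pass in which each c-term occurrence is first disambiguated using evidence from the base WCCM. Here term_categories (mapping a word to the categories listing it) and assoc (association values from the base WCCM) are assumed interfaces, and ties are broken arbitrarily.

```python
from collections import defaultdict

def bootstrap_wccm(sentences, term_categories, assoc, window=5):
    """Second pass: only the disambiguated category's counts are
    incremented, yielding a sparser, more accurate WCCM."""
    new_wccm = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, tok in enumerate(tokens):
            cats = term_categories.get(tok)
            if not cats:
                continue
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            context = [tokens[j] for j in range(lo, hi) if j != i]
            # Intended category: the one with the most association evidence.
            best = max(cats, key=lambda c: sum(assoc(c, w) for w in context))
            for w in context:
                new_wccm[w][best] += 1
    return new_wccm
```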
<Section position="3" start_page="126" end_page="126" type="sub_section"> <SectionTitle> 6.3 Discussion </SectionTitle> <Paragraph position="0"> Considering that this is a completely unsupervised approach, the accuracies achieved using the weighted methods are not only well above the baseline but also remarkably close to the upper bound. This is especially true for α values close to 0 and 1. The lower accuracies for α near 0.5 are understandable, as the amounts of evidence for the two senses of the target word are nearly equal.</Paragraph> <Paragraph position="1"> Odds, pmi, and Yule perform almost equally well for all methods. Since the number of times two words co-occur is usually much less than the number of times they occur individually, pmi tends to approximate the logarithm of the odds ratio. Also, Yule is a derivative of odds. Thus all three measures perform similarly when the co-occurring words give an unweighted vote for the most appropriate sense of the target, as in D_{I,U} and D_{E,U}. For the weighted voting schemes, D_{I,W} and D_{E,W}, the effect of scale change is slightly higher in D_{I,W}, as the weighted votes are summed over the complete text to determine dominance. In D_{E,W}, the small number of weighted votes summed to determine the sense of each occurrence of the target word may be the reason why performance using pmi, Yule, and odds does not differ markedly. The Dice coefficient and cosine gave below-baseline accuracies for a number of sense distributions. This suggests that the normalization4 inherent in the Dice and cosine measures, which takes into account the frequencies of the individual events, may not be suitable for this task.</Paragraph> <Paragraph position="2"> The accuracies of the dominance methods remain the same if the target text is partitioned by target word and each of the pieces is given individually to the disambiguation system. The average number of sentences per target word in each data set d_α is 323. Thus the results shown above correspond to an average target text size of only 323 sentences.</Paragraph> <Paragraph position="3"> We repeated the experiments on the base WCCM after filtering out (setting to 0) cells with frequency less than 5, to investigate the effect on accuracy and the gain in computation time (which is proportional to the size of the WCCM). There were no marked changes in accuracy, but a 75% reduction in the size of the WCCM. Using a window equal to the complete sentence, as opposed to ±5 words on either side of the target, resulted in a drop in accuracies.</Paragraph> <Paragraph position="4"> 4 If two events individually occur a large number of times, then they must occur together much more often to get substantial association scores through pmi or odds, as compared to cosine or the Dice coefficient.</Paragraph> </Section> </Section> </Paper>