<?xml version="1.0" standalone="yes"?> <Paper uid="W04-0833"> <Title>Simple Features for Statistical Word Sense Disambiguation</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Naive Bayes for Learning Context Words </SectionTitle> <Paragraph position="0"> In our approach, a large window and a smaller sub-window are centered around the target word. We account for all words within the sub-window but use a POS filter as well as a short stop-word list to filter out non-content words from the context. The filter retains only open-class words, i.e. nouns, adjectives, adverbs, and verbs, and rejects words tagged otherwise.</Paragraph> <Paragraph position="1"> [Figure 1: Window and sub-window sizes for the word bank.n. The best accuracy is achieved with a window and sub-window size of around 450 and 50 characters respectively, while for example 50 and 25 provide very low accuracy.]</Paragraph> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Changing the context window size </SectionTitle> <Paragraph position="0"> Figure 1 shows the effect of selecting different window and sub-window sizes for the word bank.n. It is clear that precision is very sensitive to the selected window size. Other words also show such variation in their precision results.</Paragraph> <Paragraph position="1"> The system decides on the best window sizes for every word by examining possible window size values ranging from 25 to 750 characters [1].</Paragraph> <Paragraph position="2"> Table 1 shows the optimal window sizes selected for a number of words from different word classes. The baseline is considered individually for every word as the ratio of the most common sense in the training samples. 
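The per-word window-size search described above can be sketched as a plain grid search over candidate sizes, scored on the held-out validation split. This is an illustrative sketch under stated assumptions, not the authors' code: `train_and_evaluate` is a hypothetical stand-in for training the classifier with a given (window, sub-window) pair and returning its validation accuracy.

```python
def select_window_sizes(word_samples, train_and_evaluate,
                        candidates=range(25, 775, 25)):
    """Return the (window, sub_window) pair with the best validation accuracy.

    Sizes are in characters, scanned from 25 to 750 as in the text.
    `train_and_evaluate(samples, window, sub_window)` is assumed to train on
    the training portion and score on the 15% validation split.
    """
    best = (None, None)
    best_acc = -1.0
    for window in candidates:
        for sub_window in candidates:
            if sub_window > window:   # the sub-window must fit inside the window
                continue
            acc = train_and_evaluate(word_samples, window, sub_window)
            if acc > best_acc:
                best_acc, best = acc, (window, sub_window)
    return best, best_acc
```

For a word behaving like bank.n in Figure 1, an accuracy surface peaked near (450, 50) would make the search return exactly that pair.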
We used the Senseval-3 training set for the English lexical sample task for training. It includes a total of 7860 tagged samples for 57 ambiguous words.</Paragraph> <Paragraph position="3"> 15% of this data was used for validation, while the rest was used for training.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Approximate Smoothing </SectionTitle> <Paragraph position="0"> During the testing phase, given the context of the target word, the score of every sense is computed using the Naive Bayes formula:</Paragraph> <Paragraph position="1"> score(sense_i) = log p(sense_i) + SUM_k log p(word_k | sense_i) </Paragraph> <Paragraph position="2"> where word_k is every word inside the context window (recall that these are all the words in the sub-window, and the filtered words in the large window).</Paragraph> <Paragraph position="3"> [Footnote 1: For technical reasons, character is used instead of word as the unit, making sure no word is cut at the extremities.]</Paragraph> <Paragraph position="4"> Various smoothing algorithms could be used to discount the probability of seen words and distribute the freed mass among unseen words. However, tuning various smoothing parameters is delicate, as it involves keeping an appropriate amount of held-out data. Instead, we implemented an approximate smoothing method, which seems to perform better than Ng's (Ng, 1997) approximate smoothing. In our simple approximate smoothing, the probability of seen words is not discounted to compensate for that of unseen words. A proper value to assign to unseen words was found experimentally: p(unseen word) = 10^-10 for a relatively large training data set and 10^-9 for a small set resulted in the highest accuracy on our 15% validation set. The intuition is that, with a small training set, more unseen words are likely to be seen during the testing phase, and in order to prevent the accumulated score penalty from becoming relatively high, a lower probability value is selected. 
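A minimal sketch of this scoring scheme, assuming log-domain Naive Bayes with a fixed probability for unseen words as described here (the threshold function follows the conditions given in the text). The function and variable names are illustrative, not the paper's implementation:

```python
import math

def unseen_prob(total_words_seen, num_instances, window_chars):
    """Fixed unseen-word probability, per the thresholds given in the text:
    10^-10 for relatively large training sets, 10^-9 otherwise."""
    if total_words_seen > 4300 or num_instances > 230 or window_chars > 400:
        return 1e-10
    return 1e-9

def score_sense(prior, cond_prob, context_words, p_unseen):
    """log p(sense) plus the sum of log p(word | sense) over context words.

    `cond_prob` maps seen words to their (undiscounted) conditional
    probabilities; unseen words all receive the fixed `p_unseen`.
    """
    score = math.log(prior)
    for w in context_words:
        score += math.log(cond_prob.get(w, p_unseen))
    return score
```

Because each unseen word adds log(p_unseen) to the score, a smaller fixed value makes every unseen word a heavier penalty, which is why a lower value suits larger training sets, where fewer test words are unseen.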
Additionally, the selection should not result in large differences in the computed scores of different senses of the target word.</Paragraph> <Paragraph position="5"> A simple function assigns 10^-10 under any of the following conditions: the total number of words seen is larger than 4300, the number of training instances is greater than 230, or the context window size is larger than 400 characters. The function returns 10^-9 otherwise.</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Maximum Entropy learning of syntax and semantics </SectionTitle> <Paragraph position="0"> Syntactic structures as well as the semantics of the words around the ambiguous word are strong clues for sense resolution for many words. However, deriving and using exact syntactic information introduces its own difficulties. So we tried to use approximate syntactic structures by learning the following features in a context window bounded by the last punctuation before and the first punctuation after the ambiguous word: 1. Article_Bef: If there is an article before the target, its string token is taken as the value of this feature.</Paragraph> <Paragraph position="1"> 2. POS, POS_Bef, POS_Aft: The part of speech of the target, and the part of speech of the word before (after) it, if any.</Paragraph> <Paragraph position="2"> [Table: Performance of both systems for the words on which Max Entropy has performed better than Naive Bayes. (WS = optimal window size; SW = optimal sub-window size; Diff = average absolute difference between the distribution of training and test samples; Accuracy: Base = baseline; Bys = Naive Bayes; Ent = Max Entropy.)]</Paragraph> <Paragraph position="3"> 3. Prep_Bef, Prep_Aft: The last preposition before, and the first preposition after, the target, if any.</Paragraph> <Paragraph position="4"> 4. Sem_Bef, Sem_Aft: The general semantic category of the noun before (after) the target. 
The category, which can be 'animate', 'inanimate', or 'abstract', is computed by traversing the hypernym synsets of WordNet for all the senses of that noun. The first semantic category observed is returned, or 'inanimate' is returned as the default value. The first three items are taken from Mihalcea's work (Mihalcea, 2002) and are useful features for most of the words. The ranges of all these features are closed sets, so Maximum Entropy is not biased by the distribution of training samples among senses, which is a side-effect of Naive Bayes learners (see Section 5.2). The following is an example of the features extracted for sample miss.v.bnc.00045286:</Paragraph> </Section> </Paper>
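The Sem_Bef/Sem_Aft computation above can be sketched as a hypernym-chain walk. A real implementation would query WordNet itself (e.g. through NLTK's corpus reader); the toy `senses`/`hypernyms` mappings below are hypothetical stand-ins for that hierarchy, used only to illustrate the traversal order and the 'inanimate' default.

```python
# Map from a few high-level hypernym nodes to the three general categories.
# The node names here are illustrative placeholders, not actual WordNet ids.
CATEGORY_ROOTS = {
    "living_thing": "animate",
    "physical_object": "inanimate",
    "abstraction": "abstract",
}

def semantic_category(noun, senses, hypernyms):
    """Return 'animate', 'inanimate', or 'abstract' for a noun.

    `senses` maps a noun to its list of synset ids; `hypernyms` maps a
    synset id to its parent synset id (None-terminated chain).  The first
    category found while climbing any sense's hypernym chain is returned;
    'inanimate' is the default when no chain reaches a category root.
    """
    for synset in senses.get(noun, []):
        node = synset
        while node is not None:              # climb the hypernym chain
            if node in CATEGORY_ROOTS:
                return CATEGORY_ROOTS[node]
            node = hypernyms.get(node)
    return "inanimate"
```

Checking senses in order and stopping at the first category hit mirrors the "first semantic category observed is returned" rule in the text.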