<?xml version="1.0" standalone="yes"?> <Paper uid="W04-3204"> <Title>Unsupervised WSD based on automatically retrieved examples: The importance of bias</Title> <Section position="3" start_page="1" end_page="1" type="metho"> <SectionTitle> 2 Previous work </SectionTitle> <Paragraph position="0"> As we have already mentioned, there is little work on this very promising area. In (Leacock et al., 1998), the method to obtain sense-tagged examples using monosemous relatives is presented. In this work, they retrieve the same number of examples per each sense, and they give preference to monosemous relatives that consist in a multiword containing the target word. Their experiment is evaluated on 3 words (a noun, a verb, and an adjective) with coarse sense-granularity and few senses. The results showed that the monosemous corpus provided precision comparable to hand-tagged data.</Paragraph> <Paragraph position="1"> In another related work, (Mihalcea, 2002) generated a sense tagged corpus (GenCor) by using a set of seeds consisting of sense-tagged examples from four sources: SemCor, WordNet, examples created using the method above, and hand-tagged examples from other sources (e.g., the Senseval-2 corpus). By means of an iterative process, the system obtained new seeds from the retrieved examples. An experiment in the lexical-sample task showed that the method was useful for a subset of the Senseval-2 testing words (results for 5 words are provided).</Paragraph> </Section> <Section position="4" start_page="1" end_page="5" type="metho"> <SectionTitle> 3 Experimental Setting for Evaluation </SectionTitle> <Paragraph position="0"> In this section we will present the Decision List method, the features used to represent the context, the two hand-tagged corpora used in the experiment and the word-set used for evaluation.</Paragraph> <Section position="1" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 3.1 Decision Lists </SectionTitle> <Paragraph position="0"> The learning method used to measure the quality of the corpus is Decision Lists (DL). This algorithm is described in (Yarowsky, 1994). In this method, the sense s k with the highest weighted feature f</Paragraph> <Paragraph position="2"> lected, according to its log-likelihood (see Formula 1). For our implementation, we applied a simple smoothing method: the cases where the denominator is zero are smoothed by the constant 0.1 .</Paragraph> <Paragraph position="4"/> </Section> <Section position="2" start_page="1" end_page="2" type="sub_section"> <SectionTitle> 3.2 Features </SectionTitle> <Paragraph position="0"> In order to represent the context, we used a basic set of features frequently used in the literature for WSD tasks (Agirre and Martinez, 2000). We distinguish two types of features: Local features: Bigrams and trigrams, formed by the word-form, lemma, and part-of-speech of the surrounding words. Also the content lemmas in a +-4 word window around the target. null Topical features: All the content lemmas in the context.</Paragraph> <Paragraph position="1"> The PoS tagging was performed using TnT (Brants, 2000) We have analyzed the results using local and topical features separately, and also using both types together (combination).</Paragraph> </Section> <Section position="3" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 3.3 Hand-tagged corpora </SectionTitle> <Paragraph position="0"> Semcor was used as training data for our supervised system. 
<Section position="3" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 3.3 Hand-tagged corpora </SectionTitle> <Paragraph position="0"> Semcor was used as training data for our supervised system. This corpus offers tagged examples for many words, and has been widely used for WSD.</Paragraph> <Paragraph position="1"> It was necessary to use an automatic mapping between the WordNet 1.6 senses in Semcor and the WordNet 1.7 senses used in testing (Daude et al., 2000).</Paragraph> <Paragraph position="2"> For evaluation, the test part of the Senseval-2 English lexical-sample task was chosen. The advantage of this corpus was that we could focus on a word-set with enough examples for testing. Besides, it is a different corpus, so the evaluation is more realistic than one based on cross-validation.</Paragraph> <Paragraph position="3"> The test examples whose senses were multiwords or phrasal verbs were removed, because they can be efficiently detected with other methods in a preprocessing step. It is important to note that the training part of the Senseval-2 lexical-sample task was not used in the construction of the systems, as our goal was to test the performance we could achieve with minimal resources (i.e. those available for any word). We only relied on the Senseval-2 training bias in preliminary experiments on local/topical features (cf. Table 4), and to serve as a reference for unsupervised performance (cf. Table 5).</Paragraph> </Section> <Section position="4" start_page="2" end_page="4" type="sub_section"> <SectionTitle> 3.4 Word-set </SectionTitle> <Paragraph position="0"> The experiments were performed on the 29 nouns available for the Senseval-2 lexical-sample task. We separated these nouns into two sets, depending on the number of examples they have in Semcor: Set A contained the 16 nouns with more than 10 examples in Semcor, and Set B the remaining low-frequency words.</Paragraph> <SectionTitle> 4 Building the monosemous relatives web corpus </SectionTitle> <Paragraph position="1"> In order to build this corpus, we acquired 1000 Google snippets for each monosemous word in WordNet 1.7. Then, for each word sense of the ambiguous words, we gathered the examples of its monosemous relatives (see below). This method is inspired by (Leacock et al., 1998), and has been shown to be effective in experiments on topic signature acquisition (Agirre and Lopez, 2004). That paper also shows that it is possible to gather examples based on monosemous relatives for nearly all noun senses in WordNet. (The automatically acquired corpus will be referred to interchangeably as the web-corpus or the monosemous-corpus.)</Paragraph> <Paragraph position="2"> The basic assumption is that, for a given word sense of the target word, if we had a monosemous synonym of that word sense, then the examples of the synonym should be very similar to the target word sense, and could therefore be used to train a classifier for the target word sense. The same applies, to a lesser extent, to other monosemous relatives, such as direct hyponyms, direct hypernyms, siblings, indirect hyponyms, etc. The expected reliability decreases with the distance in the hierarchy from the monosemous relative to the target word sense.</Paragraph> <Paragraph position="3"> The monosemous-corpus was built using the simplest technique: we collected examples from the web for each of the monosemous relatives. The relatives have an associated number (type), which correlates roughly with the distance to the target word and indicates their relevance: the higher the type, the less reliable the relative. A sample of monosemous relatives for the different senses of church, together with the sense inventory of church in WordNet 1.7, is shown in Figure 1.</Paragraph> <Paragraph position="4"> Distant hyponyms receive a type number equal to their distance to the target sense. Note that we assigned a higher type value to direct hypernyms than to direct hyponyms, as the latter are more useful for disambiguation. We also decided to include siblings, but with a high type value (3).</Paragraph>
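The sketch below, based on NLTK's WordNet interface (a newer WordNet release than the 1.7 used in the paper, so the relatives it returns will differ), illustrates the type scheme just described: monosemous synonyms get type 0, direct hyponyms 1, direct hypernyms 2, siblings 3, and more distant hyponyms their distance to the target sense. The function names and traversal depth are illustrative assumptions, not the authors' code.

```python
from nltk.corpus import wordnet as wn

def is_monosemous(lemma_name, pos):
    return len(wn.synsets(lemma_name, pos=pos)) == 1

def monosemous_relatives(synset, pos=wn.NOUN, max_depth=3):
    """Return (lemma, type) pairs for one sense: synonyms 0, direct hyponyms 1,
    direct hypernyms 2, siblings 3, distant hyponyms get their distance as type."""
    relatives = []

    def add(syn, rel_type):
        for lemma in syn.lemmas():
            name = lemma.name()
            if is_monosemous(name, pos):
                relatives.append((name.replace('_', ' '), rel_type))

    add(synset, 0)                                  # monosemous synonyms
    for hypo in synset.hyponyms():
        add(hypo, 1)                                # direct hyponyms
        stack = [(h, 2) for h in hypo.hyponyms()]   # distant hyponyms: type = distance
        while stack:
            syn, dist = stack.pop()
            if dist > max_depth:
                continue
            add(syn, dist)
            stack.extend((h, dist + 1) for h in syn.hyponyms())
    for hyper in synset.hypernyms():
        add(hyper, 2)                               # direct hypernyms
        for sibling in hyper.hyponyms():
            if sibling != synset:
                add(sibling, 3)                     # siblings
    return relatives

# e.g. relatives for the first WordNet sense of 'church'
print(monosemous_relatives(wn.synsets('church', pos=wn.NOUN)[0]))
```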
<Paragraph position="5"> In the following subsections we describe the method used to construct the corpus step by step. First we explain the acquisition of the highest possible number of examples per sense; then we explain different ways to limit the number of examples per sense for better performance; finally we examine the effect of training on local or topical features with this kind of corpora.</Paragraph> </Section> <Section position="5" start_page="4" end_page="5" type="sub_section"> <SectionTitle> 4.1 Collecting the examples </SectionTitle> <Paragraph position="0"> The examples are collected following these steps: 1: We query Google with the monosemous relatives for each sense, and we extract the snippets as returned by the search engine. All snippets returned by Google are used (up to 1000). The list of snippets is sorted in reverse order, because the top hits are usually titles and incomplete sentences that are not so useful.</Paragraph> <Paragraph position="1"> 2: We extract the sentences (or fragments of sentences) around the target search term. (We use the offline XML interface kindly provided by Google for research.) Some of the sentences are discarded, according to the following criteria: length shorter than 6 words, more non-alphanumeric characters than half the number of words, or more words in uppercase than in lowercase.</Paragraph> <Paragraph position="2"> [Table 1: Examples acquired from the web for the three senses of church following the Semcor bias, and total examples in Semcor.]</Paragraph> <Paragraph position="3"> 3: The automatically acquired examples contain a monosemous relative of the target word. In order to use these examples to train the classifiers, the monosemous relative (which can be a multiword term) is substituted by the target word. In the case of the monosemous relative being a multiword that contains the target word (e.g. Protestant Church for church), we can choose not to substitute, because Protestant, for instance, can be a useful feature for the first sense of church. In these cases, we decided not to substitute and to keep the original sentence, as our preliminary experiments on this corpus suggested (although the differences were not significant). 4: For a given word sense, we collect the desired number of examples (see the following section) in order of type: we first retrieve all examples of type 0, then type 1, etc., up to type 3, until the necessary examples are obtained. We did not collect examples from type 4 upwards, and we did not make any distinction between the relatives within each type. (Leacock et al., 1998) give preference to multiword relatives containing the target word, which could be an improvement in future work.</Paragraph> <Paragraph position="4"> On average, we acquired roughly 24,000 examples for each of the target words used in this experiment.</Paragraph> </Section>
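As an illustration of steps 2 and 3, the sketch below applies the three discard criteria and the substitution rule. The tokenization and the reading of "words in uppercase" are simplifying assumptions, and this is not the authors' code.

```python
import re

def keep_sentence(sentence):
    """Discard criteria from step 2: shorter than 6 words, more non-alphanumeric
    characters than half the number of words, or more uppercase than lowercase words."""
    words = sentence.split()
    if len(words) < 6:
        return False
    non_alnum = sum(1 for ch in sentence if not ch.isalnum() and not ch.isspace())
    if non_alnum > len(words) / 2:
        return False
    upper = sum(1 for w in words if w.isupper())   # one reading of "words in uppercase"
    lower = sum(1 for w in words if w.islower())
    return upper <= lower

def substitute_relative(sentence, relative, target):
    """Step 3: replace the monosemous relative with the target word, except when the
    relative is a multiword that already contains the target (e.g. 'Protestant Church'
    for 'church'), in which case the sentence is kept unchanged."""
    if len(relative.split()) > 1 and target.lower() in relative.lower().split():
        return sentence
    return re.sub(re.escape(relative), target, sentence, flags=re.IGNORECASE)
```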
<Section position="6" start_page="5" end_page="5" type="sub_section"> <SectionTitle> 4.2 Number of examples per sense (bias) </SectionTitle> <Paragraph position="0"> Previous work (Agirre and Martinez, 2000) has reported that the distribution of the number of examples per word sense (bias for short) has a strong influence on the quality of the results. That is, the results degrade significantly whenever the training and testing samples have different sense distributions.</Paragraph> <Paragraph position="1"> As we are extracting examples automatically, we have to decide how many examples we will use for each sense. In order to test the impact of bias, different settings have been tried: No bias: we take an equal number of examples for each sense.</Paragraph> <Paragraph position="2"> Web bias: we take all examples gathered from the web.</Paragraph> <Paragraph position="3"> Automatic ranking: the number of examples is given by a ranking obtained following the method described in (McCarthy et al., 2004).</Paragraph> <Paragraph position="4"> They used a thesaurus automatically created from the BNC corpus with the method from (Lin, 1998), coupled with WordNet-based similarity measures.</Paragraph> <Paragraph position="5"> Semcor bias: we take a number of examples proportional to the bias of the word senses in Semcor.</Paragraph> <Paragraph position="6"> For example, Table 1 shows the number of examples per type (0, 1, ...) that are acquired for church following the Semcor bias. The last column gives the number of examples in Semcor.</Paragraph> <Paragraph position="7"> We have to note that the first three methods do not require any hand-labeled data, and that the fourth relies on Semcor.</Paragraph> <Paragraph position="8"> The way to apply the bias is not straightforward in some cases. In our first approach for the Semcor bias, we assigned 1,000 examples to the major sense in Semcor, and gave the other senses their proportion of examples (when available). But in some cases the distribution of the Semcor bias and that of the examples actually available on the web did not fit. The problem arises when there are not enough examples on the web to meet the expectations of a certain word sense.</Paragraph> <Paragraph position="9"> We therefore tried another distribution. We computed, for each word, the minimum ratio of examples that were available for a given target bias and a given number of examples extracted from the web. We observed that this last approach reflected the original bias better, at the cost of having fewer examples. Table 2 presents the different distributions of examples for authority. There we can see the Senseval-testing and Semcor distributions, together with the total number of examples on the web; the Semcor proportional distribution (Pr) and minimum ratio (MR); and the automatic distribution. [Table caption fragment: ... (minimum ratio) columns correspond to different ways to apply the Semcor bias.] The table illustrates how the proportional Semcor bias produces a corpus where the percentage of some of the senses is different from that in Semcor, e.g. the first sense only gets 33.7% of the examples, in contrast to the 60% it had in Semcor. [Table caption fragment: ... sense distributions. Minimum-ratio is applied for the Semcor and automatic bias.]</Paragraph>
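A minimal sketch of the minimum-ratio idea follows, assuming the target bias and the web counts are given as per-sense dictionaries (the sense labels and numbers below are hypothetical): the scaling factor is the smallest available/desired quotient over the senses, so the resulting corpus keeps the target proportions as closely as the scarcest sense allows.

```python
def minimum_ratio_counts(target_bias, available):
    """target_bias: sense -> desired number of examples (e.g. 1,000 for the majority
    sense, the rest proportional to their Semcor frequency).
    available: sense -> number of examples actually gathered from the web.
    Returns sense -> number of examples to keep so that the target bias is preserved
    as closely as the web data allows."""
    ratio = min(available[s] / target_bias[s] for s in target_bias if target_bias[s] > 0)
    ratio = min(ratio, 1.0)   # never ask for more than the target itself
    return {s: int(target_bias[s] * ratio) for s in target_bias}

# Hypothetical numbers, only to show the effect: if one sense is scarce on the web,
# all senses are scaled down so the proportions stay close to the target bias.
print(minimum_ratio_counts({"s1": 600, "s2": 300, "s3": 100},
                           {"s1": 20000, "s2": 5000, "s3": 50}))
# -> {'s1': 300, 's2': 150, 's3': 50}
```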
<Paragraph position="11"> We can also see that the distributions of senses in Semcor and Senseval-test show important differences, although the main sense is the same. For the web and automatic distributions, the first sense is different; and in the case of the web distribution, the first hand-tagged sense only accounts for 0.5% of the examples retrieved from the web. Similar distribution discrepancies can be observed for most of the words in the test set. The Semcor MR column shows how the minimum ratio reflects the proportion of examples in Semcor better than the simpler proportional approach (Semcor Pr). For the automatic bias we only used the minimum ratio.</Paragraph> <Paragraph position="12"> To conclude this section, Table 3 shows the number of examples acquired automatically following the web bias, the Semcor bias with minimum ratio, and the automatic bias with minimum ratio.</Paragraph> </Section> <Section position="7" start_page="5" end_page="5" type="sub_section"> <SectionTitle> 4.3 Local vs. topical features </SectionTitle> <Paragraph position="0"> Previous work on automatic acquisition of examples (Leacock et al., 1998) has reported lower performance when using local collocations formed by PoS tags or closed-class words. We performed an early experiment comparing the results using local features, topical features, and a combination of both.</Paragraph> <Paragraph position="1"> In this case we used the web corpus with the Senseval training bias, distributed according to the MR approach, and always substituting the target word. The recall (per word and overall) is given in Table 4.</Paragraph> <Paragraph position="2"> In this setting, we observed that local collocations achieved the best precision overall, but the combination of all features obtained the best recall. The table does not show the precision/coverage figures due to space constraints, but local features achieve 58.5% precision for 96.7% coverage overall, while topical features and the combination of features have full coverage.</Paragraph> <Paragraph position="3"> There were clear differences in the results per word, showing that estimating the best feature-set per word would improve performance. For the corpus-evaluation experiments, we chose to work with the combination of all features.</Paragraph> </Section> </Section> </Paper>