<?xml version="1.0" standalone="yes"?> <Paper uid="W04-0858"> <Title>Word Sense Disambiguation by Web Mining for Word Co-occurrence Probabilities</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 System Description </SectionTitle> <Paragraph position="0"> This section presents various aspects of the system in roughly the order in which they are executed. The following definitions will simplify the description.</Paragraph> <Paragraph position="1"> Head Word: One of the 57 words that are to be disambiguated.</Paragraph> <Paragraph position="2"> Example: One or more contiguous sentences, illustrating the usage of a head word.</Paragraph> <Paragraph position="3"> Context: The non-head words in an example.</Paragraph> <Paragraph position="4"> Feature: A property of a head word in a context. For instance, the feature tag hp1 NNP is the property of having (or not having) a proper noun (NNP is the part-of-speech tag for a proper noun) immediately following the head word (hp1 represents the location head plus one).</Paragraph> <Paragraph position="5"> Feature Value: Features have values, which depend on the specific example. For instance, tag hp1 NNP is a binary feature that has the value 1 (true: the following word is a proper noun) or 0 (false: the following word is not a proper noun).</Paragraph> <Paragraph position="6"> Feature Vector: Each example is represented by a vector. 
Features are the dimensions of the vector space and a vector of feature values specifies a point in the feature space.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Association for Computational Linguistics </SectionTitle> <Paragraph position="0"> SENSEVAL-3: Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text, Barcelona, Spain, July 2004</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Preprocessing </SectionTitle> <Paragraph position="0"> The NRC WSD system first assigns part-of-speech tags to the words in a given example (Brill, 1994), and then extracts a nine-word window of tagged text, centered on the head word (i.e., four words before and after the head word). Any remaining words in the example are ignored (usually most of the example is ignored). The window is not allowed to cross sentence boundaries. If the head word appears near the beginning or end of the sentence, where the window may overlap with adjacent sentences, special null characters fill the positions of any missing words in the window.</Paragraph> <Paragraph position="1"> In rare cases, a head word appears more than once in an example. In such cases, the system selects a single window, giving preference to the earliest occurring window with the fewest nulls. Thus each example is converted into one nine-word window of tagged text. Windows from the training examples for a given head word are then used to build the feature set for that head word.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Syntactic Features </SectionTitle> <Paragraph position="0"> Each head word has a unique set of feature names, describing how the feature values are calculated.</Paragraph> <Paragraph position="1"> Feature Names: Every syntactic feature has a name of the form matchtype position model. 
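The window extraction in Section 2.1 can be sketched as follows. This is a minimal illustration, not the NRC implementation: the function names and the NULL sentinel are invented here, and sentences are represented as lists of (word, tag) pairs.

```python
# Sketch of the nine-word window extraction (Section 2.1).  Positions
# that would fall outside the head word's sentence are padded with NULL.

NULL = ("<NULL>", "<NULL>")  # illustrative sentinel for missing positions

def extract_window(sentence, head_index, radius=4):
    """Return a (2*radius + 1)-token window centered on the head word,
    padded with NULL where the window runs past the sentence."""
    window = []
    for i in range(head_index - radius, head_index + radius + 1):
        if 0 <= i < len(sentence):
            window.append(sentence[i])
        else:
            window.append(NULL)
    return window

def best_window(sentence, head_word, radius=4):
    """If the head word occurs more than once, prefer the earliest
    occurrence whose window contains the fewest NULLs."""
    candidates = [extract_window(sentence, i, radius)
                  for i, (w, _) in enumerate(sentence) if w == head_word]
    # min() is stable, so ties go to the earliest-occurring window
    return min(candidates, key=lambda win: win.count(NULL))
```

Because `min` is stable, the tie-breaking matches the paper's preference for the earliest occurring window.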
There are three matchtypes, ptag, tag, and word, in order of increasingly strict matching. A ptag match is a partial tag match, which counts similar part-of-speech tags, such as NN (singular noun), NNS (plural noun), NNP (singular proper noun), and NNPS (plural proper noun), as equivalent. A tag match requires an exact match between the part-of-speech tags of the word and the model. A word match requires that the word and the model are exactly the same, letter-for-letter, including upper and lower case.</Paragraph> <Paragraph position="2"> There are five positions, hm2 (head minus two), hm1 (head minus one), hd0 (head), hp1 (head plus one), and hp2 (head plus two). Thus syntactic features use only a five-word sub-window of the nine-word window.</Paragraph> <Paragraph position="3"> The syntactic feature names for a head word are generated by all of the possible legal combinations of matchtype, position, and model. For ptag names, the model can be any partial tag. For tag names, the model can be any tag. For word names, the model names are not predetermined; they are extracted from the training windows for the given head word. For instance, if a training window contains the head word followed by &quot;of&quot;, then one of the features will be word hp1 of.</Paragraph> <Paragraph position="4"> For word names, the model names are not allowed to be words that are tagged as nouns, verbs, or adjectives. These words are reserved for use in building the semantic features.</Paragraph> <Paragraph position="5"> Feature Values: The syntactic features are all binary-valued. Given a feature with a name of the form matchtype position model, the feature value for a given window depends on whether there is a match of matchtype between the word in the position position and the model model. For instance, the value of tag hp1 NNP depends on whether the given window has a word in the position hp1 (head plus one) with a tag (part-of-speech tag) that matches NNP (proper noun). 
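The three matchtypes and five positions described in Section 2.2 can be sketched as a single evaluation function. This is an illustrative reconstruction: the partial-tag grouping below covers only the noun tags named in the paper's example, whereas a full system would presumably group verb and adjective tags the same way.

```python
# Sketch of syntactic feature evaluation (Section 2.2).
# A window is a list of nine (word, tag) pairs; feature names have the
# form "matchtype position model".

PTAG_GROUPS = {"NN": "NN", "NNS": "NN", "NNP": "NN", "NNPS": "NN"}  # nouns only, for illustration
POSITIONS = {"hm2": 2, "hm1": 3, "hd0": 4, "hp1": 5, "hp2": 6}  # offsets into the 9-word window

def feature_value(window, name):
    """Return 1 or 0 for the binary syntactic feature `name`."""
    matchtype, position, model = name.split(" ", 2)
    word, tag = window[POSITIONS[position]]
    if matchtype == "word":
        return int(word == model)        # exact, case-sensitive string match
    if matchtype == "tag":
        return int(tag == model)         # exact part-of-speech tag match
    if matchtype == "ptag":              # partial tag: similar tags are equivalent
        return int(PTAG_GROUPS.get(tag, tag) == PTAG_GROUPS.get(model, model))
    raise ValueError("unknown matchtype: " + matchtype)
```

For example, `feature_value(window, "word hp1 of")` is 1 exactly when the head word is immediately followed by "of", matching the paper's example.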
Similarly, the feature word hp1 of has the value 1 (true) if the given window contains the head word followed by &quot;of&quot;; otherwise, it has the value 0 (false).</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.3 Semantic Features </SectionTitle> <Paragraph position="0"> Each head word has a unique set of feature names, describing how the feature values are calculated.</Paragraph> <Paragraph position="1"> Feature Names: Most of the semantic features have names of the form position model. The position names can be pre (preceding) or fol (following).</Paragraph> <Paragraph position="2"> They refer to the nearest noun, verb, or adjective that precedes or follows the head word in the nine-word window.</Paragraph> <Paragraph position="3"> The model names are extracted from the training windows for the head word. For instance, if a training window contains the word &quot;compelling&quot;, and this word is the nearest noun, verb, or adjective that precedes the head word, then one of the features will be pre compelling.</Paragraph> <Paragraph position="4"> A few of the semantic features have a different form of name, avg position sense. In names of this form, position can be pre (preceding) or fol (following), and sense can be any of the possible senses (i.e., classes, labels) of the head word.</Paragraph> <Paragraph position="5"> Feature Values: The semantic features are all real-valued. For feature names of the form position model, the feature value depends on the semantic similarity between the word in position position and the model word model.</Paragraph> <Paragraph position="6"> The semantic similarity between two words, w1 and w2, is estimated by their Pointwise Mutual Information,</Paragraph> <Paragraph position="7"> PMI(w1, w2) = log ( p(w1, w2) / ( p(w1) p(w2) ) ).</Paragraph> <Paragraph position="8"> We estimate the probabilities in this equation by issuing queries to the Waterloo MultiText System (Clarke et al., 1995; Clarke and Cormack, 2000; Terra and Clarke, 2003). 
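The "nearest noun, verb, or adjective" rule that names the pre and fol features might be sketched as below. This is an assumption-laden illustration: the function name is invented, and content words are identified simply by Penn Treebank tag prefixes (NN, VB, JJ), which the paper does not spell out.

```python
# Sketch of semantic feature naming (Section 2.3): find the nearest
# noun, verb, or adjective preceding / following the head word.

CONTENT_PREFIXES = ("NN", "VB", "JJ")  # noun, verb, adjective tag families

def nearest_content_word(window, head_pos=4):
    """Return (pre, fol): the nearest content words before and after the
    head word in the window, or None where no such word exists."""
    pre = fol = None
    for i in range(head_pos - 1, -1, -1):       # scan leftward from the head
        if window[i][1].startswith(CONTENT_PREFIXES):
            pre = window[i][0]
            break
    for i in range(head_pos + 1, len(window)):  # scan rightward from the head
        if window[i][1].startswith(CONTENT_PREFIXES):
            fol = window[i][0]
            break
    return pre, fol
```

In a training window, `pre` and `fol` found this way would become the model parts of the `pre model` and `fol model` feature names.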
Laplace smoothing is applied to the PMI estimates, to avoid division by zero. PMI(w1, w2) has a value of zero when the two words are statistically independent. A high positive value indicates that the two words tend to co-occur, and hence are likely to be semantically related. A negative value indicates that the presence of one of the words suggests the absence of the other. Past work demonstrates that PMI is a good estimator of semantic similarity (Turney, 2001; Terra and Clarke, 2003) and that features based on PMI can be useful for supervised learning (Turney, 2003).</Paragraph> <Paragraph position="9"> The Waterloo MultiText System allows us to set the neighbourhood size for co-occurrence (i.e., how near w1 and w2 must be for them to count as co-occurring in p(w1, w2)). In preliminary experiments with the ELS data from Senseval-2, we got good results with a neighbourhood size of 20 words.</Paragraph> <Paragraph position="10"> For instance, if w1 is the noun, verb, or adjective that precedes the head word and is nearest to the head word in a given window, then the value of pre compelling is PMI(w1, compelling). If there is no preceding noun, verb, or adjective within the window, the value is set to zero.</Paragraph> <Paragraph position="11"> In names of the form avg position sense, the feature value is the average of the feature values of the corresponding features. 
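A smoothed PMI estimate of the kind described above might look like the following. This is a sketch under stated assumptions: the paper does not give its exact smoothing formula, so a simple add-k scheme stands in for it here, and raw counts are passed as plain arguments instead of being obtained from Waterloo MultiText corpus queries.

```python
import math

# Sketch of the PMI estimate with smoothing (Section 2.3).
# c_xy: co-occurrence count of the two words within the neighbourhood,
# c_x, c_y: individual counts, n: corpus size, k: add-k smoothing constant.

def pmi(c_xy, c_x, c_y, n, k=1.0):
    """log( p(x, y) / (p(x) p(y)) ), with add-k smoothing so that a zero
    co-occurrence count never causes log(0) or division by zero."""
    p_xy = (c_xy + k) / (n + k)
    p_x = (c_x + k) / (n + k)
    p_y = (c_y + k) / (n + k)
    return math.log(p_xy / (p_x * p_y))
```

As the text describes, the value is positive when the words co-occur more often than independence predicts, and negative when seeing one word suggests the absence of the other.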
For instance, the value of avg pre argument 1 10 02 is the average of the values of all of the pre model features, such that model was extracted from a training window in which the head word was labeled with the sense argument 1 10 02.</Paragraph> <Paragraph position="12"> The idea here is that, if a testing example should be labeled, say, argument 1 10 02, and w1 is a noun, verb, or adjective that is close to the head word in the testing example, then PMI(w1, w2) should be relatively high when w2 is extracted from a training window with the same sense, argument 1 10 02, but relatively low when w2 is extracted from a training window with a different sense. Thus avg position argument 1 10 02 is likely to be relatively high, compared to other avg position sense features.</Paragraph> <Paragraph position="13"> All semantic features with names of the form position model are normalized by converting them to percentiles. The percentiles are calculated separately for each feature vector; that is, each feature vector is normalized internally, with respect to its own values, not externally, with respect to the other feature vectors. The pre features are normalized independently from the fol features. The semantic features with names of the form avg position sense are calculated after the other features are normalized, so they do not need any further normalization. Preliminary experiments with the ELS data from Senseval-2 supported the merit of percentile normalization, which was also found useful in another application where features based on PMI were used for supervised learning (Turney, 2003).</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.4 Weka Configuration </SectionTitle> <Paragraph position="0"> Table 1 shows the commands that were used to execute Weka (Witten and Frank, 1999). The default parameters were used for all of the classifiers. Five base classifiers (-B) were combined by voting. 
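The per-vector percentile normalization described in Section 2.3 might be sketched as follows. This is an illustration, not the NRC code: the paper does not say how ties are ranked, so this sketch simply ranks them in input order, and the function names are invented.

```python
# Sketch of percentile normalization (Section 2.3).  Each feature vector
# is normalized internally: its pre features are ranked against each
# other (and its fol features separately), then the avg features are
# means over the normalized values.

def to_percentiles(values):
    """Rank-based percentiles in [0, 100], computed within one vector."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    pct = [0.0] * len(values)
    for rank, i in enumerate(order):
        pct[i] = 100.0 * rank / (len(values) - 1) if len(values) > 1 else 50.0
    return pct

def avg_feature(normalized, model_indices):
    """Mean of the normalized pre (or fol) features whose models came
    from training windows labeled with one particular sense."""
    return sum(normalized[i] for i in model_indices) / len(model_indices)
```

Because `avg_feature` operates on already-normalized values, it matches the paper's observation that the avg position sense features need no further normalization.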
Multiple classes were handled by treating them as multiple two-class problems, using a 1-against-all strategy. Finally, the variance of the system was reduced with bagging.</Paragraph> <Paragraph position="1"> We designed the Weka configuration by evaluating many different Weka base classifiers on the Senseval-2 ELS data, until we had identified five good base classifiers. We then experimented with combining the base classifiers, using a variety of meta-learning algorithms. The resulting system is somewhat similar to the JHU system, which had the best ELS scores in Senseval-2 (Yarowsky et al., 2001). The JHU system combined four base classifiers using a form of voting, called Thresholded Model Voting (Yarowsky et al., 2001).</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.5 Postprocessing </SectionTitle> <Paragraph position="0"> The output of Weka includes an estimate of the probability for each prediction. When the head word is frequently labeled U (unassignable) in the training examples, we ignore U examples during training, and then, after running Weka, relabel the lowest probability testing examples as U.</Paragraph> </Section> </Section> </Paper>