<?xml version="1.0" standalone="yes"?> <Paper uid="H93-1051"> <Title>CORPUS-BASED STATISTICAL SENSE RESOLUTION</Title> <Section position="5" start_page="0" end_page="260" type="metho"> <SectionTitle> 3. METHODOLOGY </SectionTitle> <Paragraph position="0"> The training and testing contexts were taken from the</Paragraph> <Paragraph position="1"> 1987-89 Wall Street Journal corpus and from the APHB corpus. 2 Sentences containing '\[Ll\]ine(s)' were extracted and manually assigned a single sense from WordNet.</Paragraph> <Paragraph position="2"> Sentences containing proper names such as 'Japan Air Lines' were removed from the set of sentences. Sentences containing collocations that have a single sense in WordNet, such as product line and line of products, were also excluded, since these collocations are not ambiguous.</Paragraph> <Paragraph position="3"> Typically, experiments have used a fixed number of words or characters on either side of the target as the context. In this experiment, we used linguistic units - sentences - instead. Since the target word is often used anaphorically to refer back to the previous sentence, we chose to use two-sentence contexts: the sentence containing line and the preceding sentence. However, if the sentence containing line is the first sentence in the article, then the context consists of one sentence. If the preceding sentence also contains line in the same sense, then an additional preceding sentence is added to the context, creating contexts three or more sentences long.</Paragraph> <Paragraph position="4"> 2The 25 million word corpus, obtained from the American Printing House for the Blind, is archived at IBM's T.J.
Watson Research Center; it consists of stories and articles from books and general circulation magazines.</Paragraph> <Paragraph position="5"> The average size of the training and testing contexts is 44.5 words.</Paragraph> <Paragraph position="6"> The sense resolution task used the following six senses of the noun line:
1. a product: 'a new line of workstations'
2. a formation of people or things: 'stand in line'
3. spoken or written text: 'a line from Shakespeare'
4. a thin, flexible object; a cord: 'a nylon line'
5. an abstract division: 'a line between good and evil'
6. a telephone connection: 'the line went dead'
The classifiers were run three times each on randomly selected training sets. The set of contexts for each sense was randomly permuted, with each permutation corresponding to one trial. For each trial, the first 200 contexts of each sense were selected as training contexts. The next 149 contexts were selected as test contexts.</Paragraph> <Paragraph position="7"> The remaining contexts were not used in that trial. The 200 training contexts for each sense were combined to form a final training set (called the 200 training set) of size 1200. The final test set contained the 149 test contexts from each sense, for a total of 894 contexts.</Paragraph> <Paragraph position="8"> To test the effect that the number of training examples has on classifier performance, smaller training sets were extracted from the 200 training set. The first 50 and 100 contexts for each sense were used to build the new training sets. The same set of 894 test contexts was used with each of the training sets in a given trial. Each of the classifiers used the same training and test contexts within the same trial, but processed the text differently according to the needs of the method.</Paragraph> </Section> <Section position="6" start_page="260" end_page="262" type="metho"> <SectionTitle> 4.
THE CLASSIFIERS </SectionTitle> <Paragraph position="0"> The only information used by the three classifiers is the co-occurrence of character strings in the contexts. They use no other cues, such as syntactic tags or word order.</Paragraph> <Paragraph position="1"> Nor do they require any augmentation of the training contexts that is not fully automatic.</Paragraph> <Section position="1" start_page="260" end_page="260" type="sub_section"> <SectionTitle> 4.1. A Bayesian Approach </SectionTitle> <Paragraph position="0"> The Bayesian classifier, developed by Gale, Church and Yarowsky \[5\], uses Bayes' decision theory for weighting tokens that co-occur with each sense of a polysemous target. Their work is inspired by Mosteller and Wallace \[6\], who applied Bayes' theorem to the problem of author discrimination. The main component of the model, a token, was defined as any character string: a word, number, symbol, punctuation mark, or any combination of these. The entire token is significant, so inflected forms of a base word (wait vs. waiting) and mixed-case strings (Bush vs. bush) are distinct tokens. Associated with each token is a set of saliences, one for each sense, calculated from the training data. The salience of a token for a given sense is Pr(token|sense)/Pr(token). The weight of a token for a given sense is the log of its salience.</Paragraph> <Paragraph position="1"> To select the sense of the target word in a (test) context, the classifier computes, for each sense, the sum of the tokens' weights over all tokens in the context, and selects the sense with the largest sum. In the case of author identification, Mosteller and Wallace built their models using high frequency function words.</Paragraph> <Paragraph position="2"> With sense resolution, the salient tokens include content words, which have much lower frequencies of occurrence.</Paragraph> <Paragraph position="3"> Gale et al.
devised a method for estimating the required probabilities from sparse training data, since the maximum likelihood estimate (MLE) of a probability - the number of times a token appears in a set of contexts divided by the total number of tokens in the set of contexts - is a poor estimate of the true probability. In particular, many tokens in the test contexts do not appear in any training context, or appear only once or twice. In the former case, the MLE is zero, obviously smaller than the true probability; in the latter case, the MLE is much larger than the true probability. Gale et al. adjust their estimates for new or infrequent words by interpolating between local and global estimates of the probability.</Paragraph> <Paragraph position="4"> The Bayesian classifier experiments were performed by Kenneth Church of AT&T Bell Laboratories. In these experiments, two-sentence contexts are used in place of the fixed-size window of ±50 tokens surrounding the target word that Gale et al. find optimal,3 resulting in a smaller amount of context being used to estimate the probabilities.</Paragraph> </Section> <Section position="2" start_page="260" end_page="261" type="sub_section"> <SectionTitle> 4.2. Content Vectors </SectionTitle> <Paragraph position="0"> The content vector approach to sense resolution is motivated by the vector-space model of information retrieval systems \[8\], where each concept in a corpus defines an axis of the vector space, and a text in the corpus is represented as a point in this space. The concepts in a corpus are usually defined as the set of word stems that appear in the corpus (e.g., the strings computer(s), computing, computation(al), etc. are conflated to the concept comput) minus stopwords, a set of about 570 very high frequency words that includes function words (e.g., the, by, you, that, who, etc.) and content words (e.g., be, say, etc.).
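This conflation step might be sketched as follows; the suffix list and the tiny stoplist below are illustrative stand-ins, not the actual stemmer or the ~570-word stopword list used in the experiments:

```python
# Tiny stand-in for the ~570-word stoplist described in the text.
STOPWORDS = {"the", "by", "you", "that", "who", "be", "say"}

def stem(word):
    """Crude suffix stripping: conflates computer(s), computing,
    computation(al) to the common stem 'comput'."""
    for suffix in ("ational", "ation", "ing", "ers", "er", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def concepts(text):
    """Map a context to its concept list: lowercase, drop stopwords, stem."""
    words = [w.lower() for w in text.split()]
    return [stem(w) for w in words if w not in STOPWORDS]
```

A real system would use a proper stemmer and handle punctuation; this sketch only shows the shape of the preprocessing shared by the content vector and neural network classifiers.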
The similarity between two texts is computed as a function of the vectors representing the two texts.</Paragraph> <Paragraph position="1"> 3Whereas current research tends to confirm the hypothesis that humans need only a narrow window of ±2 words for sense resolution \[7\], Gale et al. have found that much larger window sizes are better for the Bayesian classifier, presumably because so much information (e.g., word order and syntax) is thrown away.</Paragraph> <Paragraph position="3"> For the sense resolution problem, each sense is represented by a single vector constructed from the training contexts for that sense. A vector in the space defined by the training contexts is also constructed for each test context. To select a sense for a test context, the inner product between its vector and each of the sense vectors is computed, and the sense whose inner product is the largest is chosen.</Paragraph> <Paragraph position="4"> The components of the vectors are weighted to reflect the relative importance of the concepts in the text. The weighting method was designed to favor concepts that occur frequently in exactly one sense. The weight of a concept c is computed as follows: let n_s be the number of times c occurs in the training contexts of sense s, let p_s = n_s / (the sum of n_s over all senses), and let d be the difference between the two largest n_s (if this difference is 0, d is set to 1); then w_s = p_s * min(n_s, d). For example, if a concept occurs 6 times in the training contexts of sense 1 and zero times in the other five sets of contexts, then its weights in the six vectors are (6, 0, 0, 0, 0, 0). However, a concept that appears 10, 4, 7, 0, 1, and 2 times in the respective senses has weights of (1.25, .5, .88, 0, .04, .17), reflecting the fact that it is not as good an indicator for any sense.
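The weighting formula and the two worked examples above can be checked with a short Python paraphrase (a sketch of the formula as stated, not the authors' code):

```python
def sense_weights(counts):
    """Weight a concept's per-sense counts.

    counts[s] = number of times the concept occurs in the
    training contexts of sense s.
    Returns w_s = p_s * min(n_s, d), where p_s = n_s / sum(n),
    and d is the difference between the two largest counts
    (set to 1 when that difference is 0).
    """
    total = sum(counts)
    top_two = sorted(counts, reverse=True)[:2]
    d = top_two[0] - top_two[1] or 1
    return [(n / total) * min(n, d) for n in counts]

# The examples from the text:
# sense_weights([6, 0, 0, 0, 0, 0])  -> (6, 0, 0, 0, 0, 0)
# sense_weights([10, 4, 7, 0, 1, 2]) -> (1.25, .5, .88, 0, .04, .17), up to rounding
```

For the second example, total = 24 and d = 10 - 7 = 3, so the weight for sense 1 is (10/24) * min(10, 3) = 1.25, matching the text.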
This weighting method was the most effective among several variants that were tried.</Paragraph> <Paragraph position="5"> We also experimented with keeping all words in the content vectors, but performance degraded, probably because the weighting function does not handle very high frequency words well. This is evident in Table 1, where 'mr' is highly weighted for three different senses.</Paragraph> </Section> <Section position="3" start_page="261" end_page="262" type="sub_section"> <SectionTitle> 4.3. Neural Network </SectionTitle> <Paragraph position="0"> The neural network approach \[9\] casts sense resolution as a supervised learning problem. Pairs of \[input features, desired response\] are presented to a learning program.</Paragraph> <Paragraph position="1"> The program's task is to devise some method for using the input features to partition the training contexts into non-overlapping sets corresponding to the desired responses. This is achieved by adjusting link weights so that the output unit representing the desired response has a larger activation than any other output unit.</Paragraph> <Paragraph position="2"> Each context is translated into a bit-vector. As with the content vector approach, suffixes are removed to conflate related word forms to a common stem, and stopwords and punctuation are removed. Each concept that appears at least twice in the entire training set is assigned a bit-vector position. The resulting vector has ones in positions corresponding to concepts in the context and zeros otherwise. This procedure creates vectors with more than 4000 positions. The vectors are, however, extremely sparse; on average they contain slightly more than 17 concepts.</Paragraph> <Paragraph position="3"> Networks are trained until the output of the unit corresponding to the desired response is greater than the output of any other unit for every training example.
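Since (as reported below) the best-performing networks had no hidden units, this training regime amounts to a multi-class perceptron over the sparse bit-vectors; a hypothetical sketch, not the authors' implementation:

```python
import numpy as np

def train_perceptron(X, y, n_senses, lr=0.1, max_epochs=100):
    """Train a no-hidden-layer network: one output unit per sense.

    X: (n_examples, n_concepts) 0/1 matrix; y: sense index per example.
    Link weights are nudged until the desired unit has the largest
    output on every training example (or the epoch limit is reached).
    """
    W = np.zeros((n_senses, X.shape[1]))
    for _ in range(max_epochs):
        errors = 0
        for x, target in zip(X, y):
            pred = int(np.argmax(W @ x))
            if pred != target:
                W[target] += lr * x  # raise the desired unit's output
                W[pred] -= lr * x    # lower the competing unit's output
                errors += 1
        if errors == 0:  # every training example now classified correctly
            break
    return W

def classify(W, x):
    """For testing: the sense is the unit with the largest output."""
    return int(np.argmax(W @ x))
```

Note that each weight in W may end up positive or negative, which is how the network accumulates evidence both for and against a sense.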
For testing, the classification determined by the network is given by the unit with the largest output. Weights in a neural network link vector may be either positive or negative, allowing the network to accumulate evidence both for and against a sense.</Paragraph> <Paragraph position="4"> One result of training a network until all examples are classified correctly is that infrequent tokens can acquire disproportionate importance. For example, the context 'Fine,' Henderson said, aimiably \[sic\]. 'Can you get him on the line?' clearly uses line in the phone sense. However, the only non-stopwords that are infrequent in other senses are 'henderson' and 'aimiably'; and, due to its misspelling, the latter is conflated to 'aim'. The network must raise the weight of 'henderson' enough to give phone the largest output. As a result, 'henderson' appears in Table 1, in spite of its infrequency in the training corpus.</Paragraph> <Paragraph position="5"> To determine a good topology for the network, various network topologies were explored: networks with 0 to 100 hidden units arranged in a single hidden layer; networks with multiple layers of hidden units; and networks with a single layer of hidden units in which the output units were connected to both the hidden and input units. In all cases, the network configuration with no hidden units was either superior to or statistically indistinguishable from the more complex networks. As no network topology was significantly better than one with no hidden units, all data reported here are derived from such networks.</Paragraph> </Section> </Section> </Paper>