<?xml version="1.0" standalone="yes"?>
<Paper uid="H94-1047">
  <Title>A New Approach to Word Sense Disambiguation</Title>
  <Section position="3" start_page="0" end_page="244" type="metho">
    <SectionTitle>
2. MODEL SELECTION
</SectionTitle>
    <Paragraph position="0"> In this section, we address the problem of finding the model that generates the best approximation to a given discrete probability distribution, as selected from among the class of decomposable models. Decomposable models are a subclass of log-linear models and can be used to characterize and study the structure of data. They are members of the class of generalized linear models and can be viewed as analogous to analysis of variance (ANOVA) models (\[1\]. The log-linear  model expresses the population mean as the sum of the contributions of the &amp;quot;effects&amp;quot; of the variables and the interactions between variables; it is the logarithm of the mean that is linear in these effects.</Paragraph>
    <Paragraph position="1"> Under certain sampling plans (see \[1\] for details), data consisting of the observed values of a number of contextual features and the corresponding sense tags of an ambiguous word can be described by a multinomial distribution in which each distinct combination of the values of the contextual features and the sense tag identifies a unique category in that distribution. The theory of log-linear models specifies the su.~cient statistics for estimating the effects of each variable and of each interaction among variables on the mean. The statistics are the highest-order sample marginal distributions conraining only inter-dependent variables. Within the class of decomposable models, the maximum likelihood estimate for the mean of a category reduces to the product of the sample relative frequencies (counts) defined in the sufficient statistics divided by the sample relative frequencies defined in the marginals composed of the common elements in the sufficient statistics. As such, decomposable models are models that can be expressed as a product of marginal distributions, where each marginal consists of certain inter-dependent variables.</Paragraph>
    <Paragraph position="2"> The degree to which the data is approximated by a model is called the fit of the model. In this work, the likelihood ratio statistic, G 2, is used as the measure of the goodness of fit of a model. It is distributed asymptotically as X 2 with degrees of freedom corresponding to the number of interactions (and/or variables) omitted from (unconstrained in) the model. Accessing the fit of a model in terms of the significance of its G 2 statistic gives preference to models with the fewest number of interdependencies, thereby assuring the selection of a model specifying only the most systematic variable interactions.</Paragraph>
    <Paragraph position="3"> Within the framework described above, the process of model selection becomes one of hypothesis testing, where each pattern of dependencies among variables expressible in terms of a decomposable model is postulated as a hypothetical model and its fit to the data is evaluated. The &amp;quot;best fitting&amp;quot; model, in the sense that the significance according to the reference X 2 value is largest, is then selected. The exhaustive search of decomposable models was conducted as described in \[9\].</Paragraph>
    <Paragraph position="4"> Approximating the joint distribution of all variables with a model containing only the most important systematic interactions among variables limits the number of parameters to be estimated, supports computational efficiency, and provides an understanding of the data. The biggest limitation associated with this method is the need for large amounts of sense-tagged data. Inconveniently, the validity of the results obtained using this approach are compromised when it is applied to sparse data.</Paragraph>
  </Section>
  <Section position="4" start_page="244" end_page="245" type="metho">
    <SectionTitle>
3. THE MODEL
</SectionTitle>
    <Paragraph position="0"> Using the method presented in the previous section, a probabilistic model was developed for disambiguating the noun senses of interest utilizing automatically identifiable contextual features that were considered to be intuitively applicable to all content words. The complete process of feature selection and model selection is described in \[3\]. Here, we describe the extension of that model to other content words.</Paragraph>
    <Paragraph position="1"> In essence, what we are describing is not a single model, but a model schema. The values of the variables included in the model change with the word being disambiguated as stated below.</Paragraph>
    <Paragraph position="2"> The model schema incorporates three different types of contextual features: morphological, collocation-specific, and class-based, with POS categories serving as the word classes.</Paragraph>
    <Paragraph position="3"> For all content words, the morphological feature describes only the suffix of the base lexeme: the presence or absence of the plural form, in the case of nouns, and the suffix indicating tense, in the case of verbs. Mass nouns as well as many adjectives and adverbs will have no morphological feature under this definition (note the lack of this feature in the models for common in table 2).</Paragraph>
    <Paragraph position="4"> The values of the class-based variables are a set of 25 POS tags derived from the first letter of the tags used in the Penn Treebank corpus. The model schema contains four variables representing class-based contextual features: the POS tags of the two words immediately preceding and the two words immediately succeeding the ambiguous word. All variables are confined to sentence boundaries; extension beyond the sentence boundary is indicated by a null POS tag (e.g., when the ambiguous word appears at the start of the sentence, the POS tags to the left have the value null).</Paragraph>
    <Paragraph position="5"> Two collocation-specific variables are included in the model schema, where the term collocation is used loosely to refer to a specific spelling form occurring in the same sentence as the ambiguous word. In the model schema, each collocation-specific variable indicates the presence or absence of a word that is one of the four most frequently-occurring content words in a data sample composed of sentences containing the word to be disambiguated. This strategy for selecting collocation-specific variables is simpler than that used by many other researchers (\[6\], \[15\], \[2\]). This simpler method was chosen to support work we plan to do in the future (eliminating the need for sense-tagged data; see section 6). In using this strategy, we do, however, run the risk of reducing the informativeness of the variables.</Paragraph>
    <Paragraph position="6"> With the variables as described above, the form of this model is (where rlpos is the POS tag one place to the right of the ambiguous word W; r~pos is the POS tag two places to the right of W; llpos is the POS tag one place to the left of W; l~pos is the POS tag two places to the left of W; endingis the suffix of the base lexeme; word1 is the presence or absence of one of the word-specific collocations and words is the presence or absence of the other one; and tag is the sense tag assigned</Paragraph>
    <Paragraph position="8"> This product form indicates certain conditional independences given the sense tag of the ambiguous word. In the remainder of this paper, the model for a particular word  matching the above schema will be referred to as model M. The sense for an ambiguous word is selected using M as follows: null</Paragraph>
    <Paragraph position="10"/>
  </Section>
  <Section position="5" start_page="245" end_page="245" type="metho">
    <SectionTitle>
4. THE EXPERIMENTS
</SectionTitle>
    <Paragraph position="0"> In this section, we first describe the data used in the experiments and then describe the experiments themselves.</Paragraph>
    <Paragraph position="1"> Due to availability, the Penn Treebank Wall Street Journal corpus was selected as the data set and the non-idiomatic senses defined in the electronic version of the Longman's Dictionary of Contemporary English LDOCE were chosen to form the tag set for each word to be disambiguated (three exceptions to this statement are noted in table 1). The only restriction limiting the choice of ambiguous words was the need for large amounts of sense-tagged data. As a result of that restriction, only the most frequently occurring content words could be considered. From that set, the following were chosen as test cases: the noun senses of bill and concern, the verb senses of close and help, and the adjective senses of common.</Paragraph>
    <Paragraph position="2"> The training and test sets for each word selected for disambiguation were generated in the same manner. First, all instances of the word with the specified POS tag in the Penn Treebank Wall Street Journal Corpus were identified and the sentences containing them were extracted to form a data sample. The data sample was then manually disambiguated and a test set comprising approximately one quarter of the total sample size was randomly selected. The size of the data sample, test set, and training set for each word, along with a description of the word senses identified and their distribution in the data are presented in table 1. Table 1 also includes entries for the earlier experiments involving the noun interest (\[3\]).</Paragraph>
    <Paragraph position="3"> In all of the experiments for a particular word, the estimates of the model parameters that were used were maximum likelihood estimates made from the training set for that word.</Paragraph>
    <Paragraph position="4"> In each experiment, a set of data was tagged in accordance with equation (2), and the results were summarized in terms of precision and recall. (In most of the experiments, the data set was the test set, as expected, but in the experiments designed to establish an upper bound for performance, it was the training set, as discussed below.) Recall is the percentage of test words that were assigned some tag; it corresponds to the portion of the test set covered by the estimate3 of the parameters made from the training set. Precision is the percentage of tagged words that were tagged correctly. A combined summary, the total percentage of the test set tagged correctly (the total percent correct) was also calculated.</Paragraph>
    <Paragraph position="5"> There were three experiments run for each word. In the first, the data set tagged was the test set and model M was used.</Paragraph>
    <Paragraph position="6"> In the second, the data set tagged was the test set, and the model was the one selected using the procedure described in section 2 for the word being disambiguated and the contextual features used throughout the experiments. We will refer to this as the &amp;quot;best approximation model&amp;quot;. In the third experiment, the data set tagged was the training set, and the model used was the one in which no assumptions are made about dependencies among variables (i.e., all variables are treated as inter-dependent). The purpose of experiment three was to establish upper bounds on the precision of the classifiers used in the first two experiments, as discussed in the following paragraphs.</Paragraph>
    <Paragraph position="7"> J If a classifier makes no assumptions regarding the dependencies among the'variables, and has available to it the actual paraaneter values (i.e., the true population characteristics), then the precision of that classifier would be the best that could be achieved with the specified set of features. The maximum likelihood estimates of the model parameters made from the training set are the population parameters for the training set; therefore, the precision of each third-experiment classifier is optimal for the training set. Because the true population will have more variation than the training set, the third experiment for each word establishes an upper bound for the precision of the classifiers tested in the first two experiments for that word (and in fact, for any classifier using the same set of variables).</Paragraph>
    <Paragraph position="8"> If we assume that the test and training sets have similar sense-tag distributions, establishing a lower bound is straightforward. &amp;quot;A probabilistic classifier should perform at least as well as one that always assigns the sense that most frequently occurs in the training set. Thus, a lower bound on the precision of a probabilistic classifier is the percentage of test-word instances with the sense tag that most frequently occurs.</Paragraph>
    <Paragraph position="9"> The results of all of the experiments, including the earlier experiments involving the noun senses of interest (\[3\]), are presented in table 2.</Paragraph>
  </Section>
  <Section position="6" start_page="245" end_page="246" type="metho">
    <SectionTitle>
5. DISCUSSION OF RESULTS
</SectionTitle>
    <Paragraph position="0"> In the following discussion, a classifier used in the first or second experiment for a word will be called an &amp;quot;experimental classifier&amp;quot;, while a classifier used in the third experiment for a word will be referred to as the &amp;quot;upper-bound classifier&amp;quot; for that word.</Paragraph>
    <Paragraph position="1"> Before discussing the results of the experiments, there are some comments to be made about the comparison of the performance of different classifiers. In comparing the performance of classifiers developed for the same word, it makes sense to compare the precision, recall, and total percent correct. Because the training set and the test set are the same, the differences we see are due strictly to the fact that they use different models. In comparing the performance of classifters developed for different words, on the other hand, only the precision measures are compared. There are two things that affect recall: the complexity of the model (i.e., the order of the highest-order marginal in the model) and the size of the training set. The size of the training set was not held constant for each word; therefore, comparison of the recall results for classifiers developed for different words would not be meaningful. Because total percent correct includes recall,  it should also not be used in the comparison of classifiers developed for different words.</Paragraph>
    <Paragraph position="2"> In comparing the precision of classifiers developed for different words, what is compared is the improvement that each classifier makes over the lower bound for the word for which that classifier was developed.</Paragraph>
    <Paragraph position="3"> We now turn to the specific results. Model M seems particularly well suited to the nouns (which is not surprising, given that it was developed for the noun-senses of the word interest). The precision of the noun experimental classifiers is superior to that of all of the experimental classifiers developed for words in other syntactic categories. Further, for one of the nouns (concern), M was the same as the one used in experiment 2, and, for the other two nouns, M and the model used in experiment 2 are very similar.</Paragraph>
    <Paragraph position="4"> Turning to the verbs, it is striking that, for both of the verbs, the models used in the second experiment (the best approximation models) identify an interdependency between tense markings (i.e., ending in the verb entries in table 2) and the POS tags (rlpose, r~pos, llpos, and 12pos), a dependency that is not in M. This seems to suggest that a model including this dependency should be used for verbs. However, the additional complexity of such a model in comparison with M may make it less effective. For each verb we tested, a comparison of the total-percent-correct measures for experiments 1 and 2 indicates that the classifier with Mis as good or better than the classifier using the best approximation model.</Paragraph>
    <Paragraph position="5"> The classifiers with the worst precision in comparison with the appropriate lower bound, as discussed above, are the experimental classifiers for the verb senses of help. The sense distinctions for help are based mainly on the semantic class of the syntactic object of the verb. Perhaps this approa~ch to sense disambiguation is not as effective for these kinds of sense distinctions.</Paragraph>
    <Paragraph position="6"> Although there is a large disparity in performance between the experimental and upper-bound classifiers for a word, two things should be noted. First, the upper bounds are overinflated due to the very small size of the training set relative to the true population (there would be much greater variation in the population). Second, such a model could never be used in practice, due to the huge number of parameters to be estimated.</Paragraph>
  </Section>
class="xml-element"></Paper>