<?xml version="1.0" standalone="yes"?>
<Paper uid="W97-0322">
  <Title>Distinguishing Word Senses in Untagged Text</Title>
  <Section position="4" start_page="197" end_page="197" type="metho">
    <SectionTitle>
VKL = ||XK - XL||^2 / (1/NK + 1/NL)   (1)
</SectionTitle>
    <Paragraph position="0"> where XK is the mean observation for cluster CK, NK is the number of observations in CK, and XL and NL are defined similarly for CL.</Paragraph>
    <Paragraph position="1"> Implicit in Ward's method is the assumption that the sample comes from a mixture of normal distributions. While NLP data is typically not well characterized by a normal distribution (see, e.g. (Zipf, 1935), (Pedersen, Kayaalp, and Bruce, 1996)), there is evidence that our data, when represented by a dissimilarity matrix, can be adequately characterized by a normal distribution. However, we will continue to investigate the appropriateness of this assumption.</Paragraph>
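    <Paragraph position="2"> The merge cost in equation (1) can be sketched in a few lines. This is an illustrative implementation, not the authors' code; the function name and argument order are our own.

```python
def ward_distance(mean_k, mean_l, n_k, n_l):
    """Ward's merge cost between clusters CK and CL (equation (1)):
    squared Euclidean distance between the cluster means, divided by
    (1/NK + 1/NL)."""
    sq = sum((xk - xl) ** 2 for xk, xl in zip(mean_k, mean_l))
    return sq / (1.0 / n_k + 1.0 / n_l)
```

At each agglomeration step, the pair of clusters with the smallest such cost is merged.</Paragraph>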
    <Section position="1" start_page="197" end_page="197" type="sub_section">
      <SectionTitle>
2.2 McQuitty's similarity analysis
</SectionTitle>
      <Paragraph position="0"> In McQuitty's method, clusters are based on a simple averaging of the feature mismatch counts found in the dissimilarity matrix.</Paragraph>
      <Paragraph position="1"> At each step in McQuitty's method, a new cluster, CKL, is formed by merging the clusters CK and CL that have the fewest dissimilar features between them. The clusters to be merged, CK and CL, are identified by finding the cell (l, k) (or (k, l)), where k != l, that has the minimum value in the dissimilarity matrix.</Paragraph>
      <Paragraph position="2"> Once the new cluster CKL is created, the dissimilarity matrix is updated to reflect the number of dissimilar features between CKL and all other existing clusters. The dissimilarity between any existing cluster Ci and CKL is computed as:</Paragraph>
      <Paragraph position="3"> D(KL)i = (DKi + DLi) / 2   (2)</Paragraph>
      <Paragraph position="4"> where DKi is the number of dissimilar features between clusters CK and Ci and DLi is similarly defined for clusters CL and Ci. This is simply the average number of mismatches between each component of the new cluster and the existing cluster.</Paragraph>
      <Paragraph position="5"> Unlike Ward's method, McQuitty's method makes no assumptions concerning the distribution of the data sample.</Paragraph>
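      <Paragraph position="6"> The merge loop described above can be sketched as follows. This is a minimal illustration under our own naming (mcquitty_cluster, target_clusters), not the authors' implementation; it stops when a requested number of clusters remains.

```python
def mcquitty_cluster(dissim, target_clusters=2):
    """Agglomerative clustering with McQuitty's update rule.

    dissim[i][j] holds the number of mismatched features between
    instances i and j.  At each step the two clusters with the minimum
    dissimilarity are merged, and the merged cluster's dissimilarity to
    every other cluster Ci is the simple average (DKi + DLi) / 2.
    """
    clusters = [[i] for i in range(len(dissim))]
    d = [list(row) for row in dissim]  # working copy of the matrix
    while len(clusters) > target_clusters:
        # find the cell (k, l), k != l, with the minimum dissimilarity
        k, l = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: d[ij[0]][ij[1]],
        )
        # McQuitty update: average the rows of the two merged clusters
        merged = [(d[k][i] + d[l][i]) / 2.0 for i in range(len(clusters))]
        clusters[k].extend(clusters[l])
        d[k] = merged
        for i in range(len(clusters)):
            d[i][k] = merged[i]
        d[k][k] = 0.0
        # drop the absorbed cluster l from the matrix
        del clusters[l], d[l]
        for row in d:
            del row[l]
    return clusters
```
</Paragraph>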
    </Section>
  </Section>
  <Section position="5" start_page="197" end_page="198" type="metho">
    <SectionTitle>
3 EM Algorithm
</SectionTitle>
    <Paragraph position="0"> The expectation maximization algorithm (Dempster, Laird, and Rubin, 1977), commonly known as the EM algorithm, is an iterative estimation procedure in which a problem with missing data is recast to make use of complete data estimation techniques.</Paragraph>
    <Paragraph position="1"> In our work, the sense of an ambiguous word is represented by a feature whose value is missing.</Paragraph>
      <Paragraph position="2"> In order to use the EM algorithm, the parametric form of the model representing the data must be known. In these experiments, we assume that the model form is the Naive Bayes (Duda and Hart, 1973). In this model, all features are conditionally independent given the value of the classification feature, i.e., the sense of the ambiguous word. This assumption is based on the success of the Naive Bayes model when applied to supervised word-sense disambiguation (e.g. (Gale, Church, and Yarowsky, 1992), (Leacock, Towell, and Voorhees, 1993), (Mooney, 1996), (Pedersen, Bruce, and Wiebe, 1997), (Pedersen and Bruce, 1997a)).</Paragraph>
      <Paragraph position="3"> There are two potential problems when using the EM algorithm. First, it is computationally expensive and convergence can be slow for problems with large numbers of model parameters. Unfortunately there is little to be done in this case other than reducing the dimensionality of the problem so that fewer parameters are estimated. Second, if the likelihood function is very irregular it may converge to a local maximum and never find the global maximum. In this case, an alternative is to use the more computationally expensive method of Gibbs Sampling (Geman and Geman, 1984).</Paragraph>
    <Section position="1" start_page="198" end_page="198" type="sub_section">
      <SectionTitle>
3.1 Description
</SectionTitle>
      <Paragraph position="0"> At the heart of the EM algorithm lies the Q-function. This is the expected value of the log-likelihood function for the complete data D = (Y, S), where Y is the observed data and S is the missing sense value:</Paragraph>
      <Paragraph position="1"> Q(θ^i|θ) = E[ln p(Y, S|θ^i) | Y, θ]</Paragraph>
      <Paragraph position="2"> Here, θ is the current value of the maximum likelihood estimates of the model parameters and θ^i is the improved estimate that we are seeking; p(Y, S|θ^i) is the likelihood of observing the complete data given the improved estimate of the model parameters.</Paragraph>
      <Paragraph position="3"> When approximating the maximum of the likelihood function, the EM algorithm starts from a randomly generated initial estimate of θ and then replaces θ by the θ^i which maximizes Q(θ^i|θ). This process is broken down into two steps: expectation (the E-step), and maximization (the M-step).</Paragraph>
      <Paragraph position="4"> The E-step finds the expected values of the sufficient statistics of the complete model using the current estimates of the model parameters. The M-step makes maximum likelihood estimates of the model parameters using the sufficient statistics from the E-step. These steps iterate until the parameter estimates θ and θ^i converge.</Paragraph>
      <Paragraph position="5"> The M-step is usually easy, assuming it is easy for the complete data problem; the E-step is not necessarily so. However, for decomposable models, such as the Naive Bayes, the E-step simplifies to the calculation of the expected counts in the marginal distributions of interdependent features, where the expectation is with respect to θ. The M-step simplifies to the calculation of new parameter estimates from these counts. Further, these expected counts can be calculated by multiplying the sample size N by the probability of the complete data within each marginal distribution given θ and the observed data within each marginal Ym. This simplifies to: count^i(Sm, Ym) = P(Sm|Ym) x count(Ym), where count^i is the current estimate of the expected count and P(Sm|Ym) is formulated using θ.</Paragraph>
    </Section>
    <Section position="2" start_page="198" end_page="198" type="sub_section">
      <SectionTitle>
3.2 Example
</SectionTitle>
      <Paragraph position="0"> For the Naive Bayes model with 3 observable features A, B, C and an unobservable classification feature S, where θ = {P(a, s), P(b, s), P(c, s), P(s)}, the E- and M-steps are: 1. E-step: The expected values of the sufficient statistics are computed as follows:</Paragraph>
      <Paragraph position="1"> count^i(s, a) = P(s|a) x count(a); count^i(s, b) = P(s|b) x count(b); count^i(s, c) = P(s|c) x count(c)</Paragraph>
      <Paragraph position="2"> 2. M-step: The sufficient statistics from the E-step are used to re-estimate the model parameters θ^i:</Paragraph>
      <Paragraph position="3"> P(a, s) = count^i(s, a) / N; P(b, s) = count^i(s, b) / N; P(c, s) = count^i(s, c) / N; P(s) = Σa count^i(s, a) / N</Paragraph>
      <Paragraph position="4"> where s, a, b, and c denote specific values of S, A, B, and C respectively, and P(s|b) and P(s|c) are defined analogously to P(s|a).</Paragraph>
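      <Paragraph position="5"> The E- and M-steps can be sketched as a small EM loop for a Naive Bayes model with a hidden sense. This is an illustrative implementation under our own naming; it parameterizes the model as P(s) and P(value|s), which is equivalent to the joint parameters P(a, s) etc. used above, and adds a small smoothing constant to avoid zero probabilities.

```python
import random
from collections import defaultdict

def em_naive_bayes(data, n_senses=2, n_iter=50, seed=0):
    """EM for a Naive Bayes model with observed feature tuples and a
    hidden sense S.

    M-step: maximum likelihood estimates of P(s) and P(value|s) from
    the expected counts (posterior-weighted observation counts).
    E-step: posterior P(s | observation) under the new parameters.
    """
    rng = random.Random(seed)
    n = len(data)
    n_feats = len(data[0])
    # random initial posterior over senses for each observation
    post = []
    for _ in data:
        w = [rng.random() for _ in range(n_senses)]
        z = sum(w)
        post.append([wi / z for wi in w])

    for _ in range(n_iter):
        # M-step: re-estimate parameters from expected counts
        p_s = [sum(post[i][s] for i in range(n)) / n for s in range(n_senses)]
        cond = [defaultdict(lambda: 1e-9) for _ in range(n_feats)]
        for i, obs in enumerate(data):
            for s in range(n_senses):
                for j, v in enumerate(obs):
                    cond[j][(s, v)] += post[i][s]
        for j in range(n_feats):
            totals = defaultdict(float)
            for (s, v), c in cond[j].items():
                totals[s] += c
            for key in list(cond[j]):
                cond[j][key] /= totals[key[0]]
        # E-step: posterior P(s | obs) under the new parameters
        for i, obs in enumerate(data):
            scores = []
            for s in range(n_senses):
                p = p_s[s]
                for j, v in enumerate(obs):
                    p *= cond[j][(s, v)]
                scores.append(p)
            z = sum(scores) or 1.0
            post[i] = [x / z for x in scores]
    return post
```
</Paragraph>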
    </Section>
  </Section>
  <Section position="6" start_page="198" end_page="199" type="metho">
    <SectionTitle>
4 Experimental Procedure
</SectionTitle>
    <Paragraph position="0"> Experiments were conducted to disambiguate 13 different words using 3 different feature sets. In these experiments, each of the 3 unsupervised disambiguation methods is applied to each of the 13 words using each of the 3 feature sets; this defines a total of 117 different experiments. In addition, each experiment was repeated 25 times in order to study the variance introduced by randomly selecting initial parameter estimates, in the case of the EM algorithm, and randomly selecting among equally distant groups when clustering using Ward's and McQuitty's methods.</Paragraph>
    <Paragraph position="1">  In order to evaluate the unsupervised learning algorithms we use sense-tagged text in these experiments. However, this text is only used to evaluate the accuracy of our methods. The classes discovered by the unsupervised learning algorithms are mapped to dictionary senses in a manner that maximizes their agreement with the sense-tagged text.</Paragraph>
    <Paragraph position="2"> If the sense-tagged text were not available, as would often be the case in an unsupervised experiment, this mapping would have to be performed manually.</Paragraph>
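    <Paragraph> The mapping from discovered classes to dictionary senses that maximizes agreement can be found by exhaustive search over one-to-one assignments; the sketch below uses our own naming and is practical only for the two or three senses considered here.

```python
from itertools import permutations

def best_sense_mapping(cluster_ids, sense_tags):
    """Find the one-to-one assignment of discovered cluster labels to
    sense tags that maximizes agreement with the sense-tagged text.
    Returns the best mapping and the resulting accuracy."""
    clusters = sorted(set(cluster_ids))
    senses = sorted(set(sense_tags))
    best_map, best_hits = None, -1
    for perm in permutations(senses, len(clusters)):
        mapping = dict(zip(clusters, perm))
        hits = sum(mapping[c] == s for c, s in zip(cluster_ids, sense_tags))
        if hits > best_hits:
            best_map, best_hits = mapping, hits
    return best_map, best_hits / len(cluster_ids)
```
</Paragraph>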
    <Paragraph position="3"> The words disambiguated and their sense distributions are shown in Figure 3. All data, with the exception of the data for line, come from the ACL/DCI Wall Street Journal corpus (Marcus, Santorini, and Marcinkiewicz, 1993). With the exception of line, each ambiguous word is tagged with a single sense defined in the Longman Dictionary of Contemporary English (LDOCE) (Procter, 1978). The data for the 12 words tagged using LDOCE senses are described in more detail in (Bruce, Wiebe, and Pedersen, 1996).</Paragraph>
    <Paragraph position="4"> The line data comes from both the ACL/DCI WSJ corpus and the American Printing House for the Blind corpus. Each occurrence of line is tagged with a single sense defined in WordNet (Miller, 1995). This data is described in more detail in (Leacock, Towell, and Voorhees, 1993).</Paragraph>
    <Paragraph position="5"> Every experiment utilizes all of the sentences available for each word. The number of sentences available per word is shown as "total count" in Figure 3. We have reduced the sense inventory of these words so that only the two or three most frequent senses are included in the text being disambiguated.</Paragraph>
    <Paragraph position="6"> For several of the words, there are minority senses that form a very small percentage (i.e., &lt; 5%) of the total sample. Such minority classes are not yet well handled by unsupervised techniques; therefore we do not consider them in this study.</Paragraph>
  </Section>
  <Section position="7" start_page="199" end_page="201" type="metho">
    <SectionTitle>
5 Feature Sets
</SectionTitle>
    <Paragraph position="0"> We define three different feature sets for use in these experiments. Our objective is to evaluate the effect that different types of features have on the accuracy of unsupervised learning algorithms such as those discussed here. We are particularly interested in the impact of the overall dimensionality of the feature space, and in determining how indicative different feature types are of word senses. Our feature sets are composed of various combinations of the following five types of features.</Paragraph>
    <Paragraph position="1"> Morphology The feature M represents the morphology of the ambiguous word. For nouns, M is binary indicating singular or plural. For verbs, the value of M indicates the tense of the verb and can have up to 7 possible values. This feature is not used for adjectives.</Paragraph>
    <Paragraph position="2"> Adjective Senses
chief: (total count: 1048)
  highest in rank: 86%
  most important; main: 14%
common: (total count: 1060)
  as in the phrase 'common stock': 84%
  belonging to or shared by 2 or more: 8%
  happening often; usual: 8%
last: (total count: 3004)
  on the occasion nearest in the past: 94%
  after all others: 6%
public: (total count: 715)
  concerning people in general: 68%
  concerning the government and people: 19%
  not secret or private: 13%</Paragraph>
    <Section position="1" start_page="199" end_page="199" type="sub_section">
      <SectionTitle>
Noun Senses
</SectionTitle>
      <Paragraph position="0"> bill: (total count: 1341)
  a proposed law under consideration: 68%
  a piece of paper money or treasury bill: 22%
  a list of things bought and their price: 10%
concern: (total count: 1235)
  a business; firm: 64%
  worry; anxiety: 36%
drug: (total count: 1127)
  a medicine; used to make medicine: 57%
  a habit-forming substance: 43%
interest: (total count: 2113)
  money paid for the use of money: 59%
  a share in a company or business: 24%
  readiness to give attention: 17%
line: (total count: 1149)
  a wire connecting telephones: 37%
  a cord; cable: 32%
  an orderly series: 30%</Paragraph>
    </Section>
    <Section position="2" start_page="199" end_page="199" type="sub_section">
      <SectionTitle>
Verb Senses
</SectionTitle>
      <Paragraph position="0"> agree: (total count: 1109)
  to concede after disagreement: 74%
  to share the same opinion: 26%
close: (total count: 1354)
  to (cause to) end: 77%
  to (cause to) stop operation: 23%
help: (total count: 1267)
  to enhance - inanimate object: 78%
  to assist - human object: 22%
include: (total count: 1526)
  to contain in addition to other parts: 91%
  to be a part of - human subject: 9%
Part-of-Speech Features of the form PLi represent the part-of-speech (POS) of the word i positions to the left of the ambiguous word. PRi represents the POS of the word i positions to the right. In these experiments, we used 4 POS features, PL1, PL2, PR1, and PR2, to record the POS of the words 1 and 2 positions to the left and right of the ambiguous word. Each POS feature can have one of 5 possible values: noun, verb, adjective, adverb or other.</Paragraph>
      <Paragraph position="1"> Co-occurrences Features of the form Ci are binary co-occurrence features. They indicate the presence or absence of a particular content word in the same sentence as the ambiguous word. We use 3 binary co-occurrence features, C1, C2, and C3, to represent the presence or absence of each of the three most frequent content words, C1 being the most frequent content word, C2 the second most frequent and C3 the third. Only sentences containing the ambiguous word were used to establish word frequencies. Frequency-based features like these contain little information about low-frequency classes. For words with skewed sense distributions, it is likely that the most frequent content words will be associated only with the dominant sense.</Paragraph>
      <Paragraph position="2"> As an example, consider the 3 most frequent content words occurring in the sentences that contain chief: officer, executive and president. Chief has a majority class distribution of 86% and, not surprisingly, these three content words are all indicative of the dominant sense, "highest in rank". The set of content words used in formulating the co-occurrence features is shown in Figure 4. Note that million and company occur frequently. These are not likely to be indicative of a particular sense but rather reflect the general nature of the Wall Street Journal corpus.</Paragraph>
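      <Paragraph position="3"> The co-occurrence features can be sketched as follows. The stopword list used to identify content words is an illustrative assumption (the text does not specify one), as are the function and variable names.

```python
from collections import Counter

# illustrative stopword list; the paper does not specify one
STOPWORDS = {"the", "a", "of", "to", "and", "in", "that", "is", "was"}

def cooccurrence_features(sentences, target, top_n=3):
    """Binary co-occurrence features C1..Cn: presence or absence of the
    n most frequent content words in sentences containing the target."""
    counts = Counter(
        w
        for sent in sentences
        for w in set(sent)          # count each word once per sentence
        if w != target and w not in STOPWORDS
    )
    top = [w for w, _ in counts.most_common(top_n)]
    return top, [[int(w in sent) for w in top] for sent in sentences]
```
</Paragraph>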
    </Section>
    <Section position="3" start_page="199" end_page="199" type="sub_section">
      <SectionTitle>
Unrestricted Collocations Features of the form
</SectionTitle>
      <Paragraph position="0"> ULi and URi indicate the word occurring in the position i places to the left or right, respectively, of the ambiguous word. All features of this form have 21 possible values. Nineteen correspond to the 19 most frequent words that occur in that fixed position in all of the sentences that contain the particular ambiguous word. There is also a value, (none), that indicates when the position i to the left or right is occupied by a word that is not among the 19 most frequent, and a value, (null), indicating that the position i to the left or right falls outside of the sentence boundary.</Paragraph>
      <Paragraph position="1"> In these experiments we use 4 unrestricted collocation features, UL2, UL1, UR1, and UR2. As an example, the values of these features for concern are as follows: * UL2: and, the, a, of, to, financial, have, because, an, 's, real, cause, calif., york, u.s., other, mass., german, (null), (none) * UL1: the, services, of, products, banking, 's, pharmaceutical, energy, their, expressed, electronics, some, biotechnology, aerospace, environmental, such, japanese, gas, investment, (null), (none) * UR1: about, said, that, over, 's, in, with, had, are, based, and, is, has, was, to, for, among, will, did, (null), (none) * UR2: the, said, a, it, in, that, to, n't, is, which, by, and, was, has, its, possible, net, but, annual, (null), (none)</Paragraph>
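      <Paragraph position="2"> The unrestricted collocation features can be sketched as below; the function name and input format (token list plus the index of the ambiguous word) are our own assumptions.

```python
from collections import Counter

def collocation_values(sentences, positions=(-2, -1, 1, 2), top_k=19):
    """For each offset i, feature values are the top_k most frequent
    words seen at that fixed position, plus (none) for any other word
    and (null) for positions outside the sentence boundary."""
    feats = {}
    for i in positions:
        words = []
        for tokens, t in sentences:  # t = index of the ambiguous word
            j = t + i
            words.append(tokens[j] if 0 <= j < len(tokens) else None)
        top = {w for w, _ in
               Counter(w for w in words if w is not None).most_common(top_k)}
        feats[i] = ["(null)" if w is None else (w if w in top else "(none)")
                    for w in words]
    return feats
```
</Paragraph>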
    </Section>
    <Section position="4" start_page="199" end_page="201" type="sub_section">
      <SectionTitle>
Content Collocations Features of the form CL1
</SectionTitle>
      <Paragraph position="0"> and CR1 indicate the content word occurring in the position 1 place to the left or right, respectively, of the ambiguous word. The values of these features are defined much like the unrestricted collocations above, except that these are restricted to the 19 most frequent content words that occur only one position to the left or right of the ambiguous word.</Paragraph>
      <Paragraph position="1"> To contrast this set of features with the unrestricted collocations, consider concern again. The values of the features representing the 19 most frequent content words 1 position to the left and right are as follows: * CL1: services, products, banking, pharmaceutical, energy, expressed, electronics, biotechnology, aerospace, environmental, japanese, gas, investment, food, chemical, broadcasting, u.s., industrial, growing, (null), (none) * CR1: said, had, are, based, has, was, did, owned, were, regarding, have, declined, expressed, currently, controlled, bought, announced, reported, posted, (null), (none)
Feature Sets A, B and C The 3 feature sets used in these experiments are designated A, B and C and are formulated as follows: The dimensionality is the number of possible combinations of feature values and thus the size of the feature space. These values vary since the number of possible values for M varies with the part-of-speech of the ambiguous word. The lower number is associated with adjectives and the higher with verbs. To get a feeling for the adequacy of these feature sets, we performed supervised learning experiments with the interest data using the Naive Bayes model. We disambiguated 3 senses using a 10:1 training-to-test ratio. The average accuracies for each feature set over 100 random trials were as follows: A 80.9%, B 87.7%, and C 82.7%.</Paragraph>
      <Paragraph position="2"> The window size, the number of values for the POS features, and the number of words considered in the collocation features are kept deliberately small in order to control the dimensionality of the problem. In future work, we will expand all of the above types of features and employ techniques to reduce dimensionality along the lines suggested in (Duda and Hart, 1973) and (Gale, Church, and Yarowsky, 1995).</Paragraph>
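      <Paragraph position="3"> The dimensionality of a feature set is simply the product of the per-feature value counts. The counts below follow the feature descriptions (M: up to 7 values, POS: 5 each, co-occurrence: 2 each), but the particular combination shown is a hypothetical example, not the actual definition of sets A, B or C.

```python
from math import prod  # Python 3.8+

def dimensionality(value_counts):
    """Feature-space size: the product of each feature's value count."""
    return prod(value_counts)

# hypothetical feature set: M for a verb (7 values), 4 POS features
# (5 values each), and 3 binary co-occurrence features
space = dimensionality([7] + [5] * 4 + [2] * 3)
```
</Paragraph>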
    </Section>
  </Section>
</Paper>