<?xml version="1.0" standalone="yes"?>
<Paper uid="W95-0104">
  <Title>A Bayesian hybrid method for context-sensitive spelling correction</Title>
  <Section position="3" start_page="39" end_page="40" type="metho">
    <SectionTitle>
2 Context-sensitive spelling correction
</SectionTitle>
    <Paragraph position="0"> Context-sensitive spelling correction is the problem of correcting spelling errors that result in valid words in the lexicon. Such errors can arise for a. variety of reasons, including typos (e.g., out for our), homonym confusions (there for their), and usage errors (between for among). These errors are not detected by collventional spell checkers, as they only notice errors resulting in non-words.</Paragraph>
    <Paragraph position="1"> We treat context-sensitive spelling correction as a task of word disambiguation. The ambiguity among words is modelled by eonfusio~ sets. A confilsion set C = {wl,..., Wn} means that each word wi in the set is ambiguous with each other word in the set. Thus if C = {desert, dessert}, then when the spelling-correction program sees an occurrence of either desert or dessert in the target document, it takes it to be a.mbiguous between desert and dessert, and tries to infer fi'om the context which of the two it should be.</Paragraph>
    <Paragraph position="2"> This treatment requires a collection of confusion sets to start with. There are several ways to obtain such a collection. One is based on finding words in the dictionary that are one typo away from each other \[Mays et al., 1991\]. 1 Another finds words that have the same or similar pronunciations. Since this was not the focus of the work reported here, we simply took (most of) our confusion sets fl'om the list of &amp;quot;Words Commonly Confused&amp;quot; in the back of the Random House unabridged dictionary \[Flexner, 1983\].</Paragraph>
    <Paragraph position="3"> A final point concerns the two types of errors a spelling-correction program can make: false negatives (complaining about a correct word), and false positives (failing to notice an error). We will make the simplifying assumption that both kinds of errors are equally bad. In practice, however, false negatives are much worse, as users get irritated by programs that badger them with bogus complaints. However, given the probabilistic nature of the methods that will be presented below, it would not be hard to modify them to take this into account. We would merely set a confidence threshold, and report a suggested correction only if the probability of the suggested word exceeds the probability of the user's original spelling by at least the threshold amount. The reason this was not done in the work reported here is that setting this confidence threshold involves a certain subjective factor (which depends on the user's &amp;quot;irritability threshold&amp;quot;). Our simpli~,ing assumption allows us to measure performance objectively, by the single parameter of prediction accuracy.</Paragraph>
    <Paragraph position="4"> 1Constructing confllsion sets in this way requires assigning each word in the lexicon its own confusion set. For instance, cat might have the confusion set {hat, car,...}, hat might have {cat, had .... }, and so on. We cannot use the symmetric conflmion sets that we have adopted -- where every word in the set is confusable with every other one -- because the &amp;quot;confusable&amp;quot; relation is no longer transitive.</Paragraph>
  </Section>
  <Section position="4" start_page="40" end_page="51" type="metho">
    <SectionTitle>
3 Five methods for spelling correction
</SectionTitle>
    <Paragraph position="0"> This section presents a progression of five methods for context-sensitive spelling correction: Baseline An indicator of &amp;quot;minimal competency&amp;quot; for comparison with the other methods Context words Tests for particular words within ::t=k words of the ambiguous target word Collocations Tests for syntactic patterns around the ambiguous target word Decision lists Combines context words and collocations via decision lists Bayesian classifiers Combines context words and collocations via Bayesian classifiers.</Paragraph>
    <Paragraph position="1"> Each method will be described in terms of its operation on a single confusion set C = (Wl,..., w~}; that is, we will say how the method disambiguates occurrences of words wl through wn from the context. The methods handle multiple confusion sets by applying the same technique to each confusion set independently.</Paragraph>
    <Paragraph position="2"> Each method involves a training phase and a test phase. The performance figures given below are based on training each method on the 1-million-word Brown corpus \[Ku~:era and Francis, 1967\] and testing it on a 3/4-million-word corpus of Wall Street Journal text \[Marcus et al., 1993\].</Paragraph>
    <Section position="1" start_page="40" end_page="40" type="sub_section">
      <SectionTitle>
3.1 Baseline method
</SectionTitle>
      <Paragraph position="0"> The baseline method disambiguates words wl through wn by simply ignoring the context, and always guessing that the word should be whichever wi occurred most often in the training corpus.</Paragraph>
      <Paragraph position="1"> For instance, if C -- (desert, dessert}, and desert occurred more often than dessert in the training corpus, then the method will predict that every occurrence of desert or dessert in the test corpus should be changed to (dr left as) desert.</Paragraph>
      <Paragraph position="2"> Table 1 shows the performance of the baseline method for 18 confusion sets. This collection of confusion sets will be used for evaluating the methods throughout the paper. Each line of the table gives the results for one confusion set: the words in the confusion set; the number of instances of any word in the confusion set in the training corpus and in the test corpus; the word in the confusion set that occurred most often in the training corpus; and the prediction accuracy of the baseline method for the test corpus. Prediction accuracy is the number of times the correct word was predicted, divided by the total number of test cases. For example, the members of the confusion set {I, me} occurred 840 times in the test corpus, the breakdown being 744 I and 96 me. The baseline method predicted I every time, a.nd thus was right 744 times, for a score of 744/840 = 0.886.</Paragraph>
      <Paragraph position="3"> Essentially the baseline method measures how accurately one can predict words using just their prior probabilities. This provides a lower bound on the performance we would expect from the other methods, which use more than just the priors.</Paragraph>
    </Section>
    <Section position="2" start_page="40" end_page="43" type="sub_section">
      <SectionTitle>
3.2 Component method 1: Context words
</SectionTitle>
      <Paragraph position="0"> One clue about the identity of an ambiguous target word comes from the words around it. For instance, if the target word is ambiguous between desert and dessert, and we see words like arid, sand, and sun nearby, this suggests that the target word should be desert. On the other hand, words such as chocolate and delicious ill the context imply dessert. This observation is the basis for the method of context words. The idea is that each word wi in the confusion set will have a characteristic distribution of words that occur in its context; thus to classify an ambiguous target word, we look at the set of words around it and see which wi's distribution they most closely follow.</Paragraph>
      <Paragraph position="1">  whether, weather 331 245 I, me 6125 840 its, it's 1951 3575 past, passed 385 397 than, then 2949 1659 being, begin 727 449 effect, affect 228 162 your, you're 1047 212 number, amount 588 429 council, counsel 82 83 rise, raise 139 301 between, among 1003 730 led, lead 226 219 except, accept 232 95 peace, piece 310 61 there, their, they're 5026 2187 principle, principal 184 69 sight, site, cite 149 44  column gives the word in the confusion set that occurred most frequently in the training corpus. (In subsequent tables, confusion sets will be referred to by their most frequent word.) The &amp;quot;Baseline&amp;quot; column gives the prediction a,ccuracy of the baseline system on the test corpus. Following previous work \[Gale et al., 1994\], we formulate tile method in a Bayesian framework. The task is to pick the word wi that is most probable, given the context words cj observed within a =t:k-word window of the target word. The probability for each wi is calculated using Bayes' rule: p(wilc-k,..., C-1, el,..., Ck) = P(C-k'''&amp;quot; C-l' Cl''&amp;quot;' CklWi) p(wi) p(C-k,...,C-l,Cl,. ..,Ck) As it stands, the likelihood term, p(c-k .... , C-l, Cl,... , CklWi) , is difficult to estimate from training data -- we would have to count situations in which the entire context was previously observed around word wi, which raises a severe sparse-data problem. Instead, therefore, we assume that the presence of one word in the context is independent of the presence of any other word. This lets us decompose the likelihood into a product:</Paragraph>
      <Paragraph position="3"> Gale et al. \[1994\] provide evidence that this is in fact a reasonable approximation.</Paragraph>
      <Paragraph position="4"> We still have the problem, however, of estimating the individual p(cjlwi) probabilities from our training corpus. The straightforward way would be to use a. maximum likelihood estimate -- we  would count Mi, the total number of occurrences of wi in the training corpus, and mi, the number of such occurrences for which cj occurred within +-k words, and we would then take the ratio mi/~4i.2 Unfortunately, we may not have enough training data to get an accurate estimate this way. Gale et al. \[1994\] address this problem by interpolating between two maximum-likefihood estimates: one of p(cjlwi), and one of p(cj). The former measures the desired quantity, but is subject to inaccuracy due to sparse data; the latter provides a robust estimate, but of a potentially irrelevant quantity. Gale et al. interpolate between the two so as to minimize the overall inaccuracy.</Paragraph>
      <Paragraph position="5"> We have pursued an alternative approach to the problem of estimating the likelihood terms.</Paragraph>
      <Paragraph position="6"> We start with the observation that there is no need to use every word in the +-k-word window to discriminate among the words in the confusion set. If we do not have enough training data for a given word c to accurately estimate p(clwi ) for all w/, then we simply disregard e, and base our discrimination on other, more reliable evidence. We implement this by introducing a &amp;quot;minimum occurrences&amp;quot; threshold, Train. It is currently set to 10. We then ignore a context word c if: l&lt;i&lt;n l&lt;i&lt;n where mi and Mi are defined as above. In other words, e is ignored if it practically never occurs within the context of any wi, or if it practically always occurs within the context of every wi. In the former case, we have insufficient data to measure its presence; in the latter, its absence.</Paragraph>
      <Paragraph position="7"> Besides the reason of insufficient data, a second reason to ignore a context word is if it does not help discriminate among the words in the confusion set. For instance, if we are trying to decide between I and me, then the presence of the in the context probably does not help. By ignoring such words, we eliminate a source of noise in our discrimination procedure, as well as reducing storage requirements and run time. To determine whether a context word e is a useful discriminator, we run a chi-squa.re test \[Fleiss, 1981\] to check for an association between the presence of c and the choice of word in the confusion set. If the observed association is not judged to be significant, a then c is discarded. The significance level is currently set to 0.05.</Paragraph>
      <Paragraph position="8"> Figure 1 pulls together the points of the preceding discussion into an outline of the method of context words. In the training phase, it identifies a list of context words that are useful for discriminating among the words in the confusion set. At run time, it estimates the probability of each word in the confusion set. It starts with the prior probabilities, and multiplies them by the likelihood of each context word fl'om its list that appears in the +-k-word window of the target word. Finally, it selects the word in the confusion set with the greatest probability.</Paragraph>
      <Paragraph position="9"> The main parameter to tune for the method of context words is k, the half-width of the context window. Previous work \[Yarowsky, 1994\] shows that smaller values of k (3 or 4) work well for resolving local syntactic ambiguities, while larger values (20 to 50) are suitable for resolving semantic ambiguities. We tried the values 3, 6, 12, and 24 on some practice confusion sets (not shown here), and found that k = 3 generally did best, indicating that most of the action, for our task and confusion sets, comes fl'om local syntax. In the rest of this paper, this value of k will be used.</Paragraph>
      <Paragraph position="10"> =We are interpreting the condition &amp;quot;cj occurs within a =l=k-word window of wi&amp;quot; as a binary feature -- either it happens, or it does not. This allows us to handle context words in the same Bayesian framework as will be used later for other binary features (see Section 3.3). A more conventional interpretation is to take into account the number of occurrences of each cj within the ::l=k-word window, and to estimate p(cjlwi ) accordingly. However, either interpretation is valid, as long as it is applied consistently -- that is, both when estimating the likelihoods from training data, and when classifying test. cases.</Paragraph>
      <Paragraph position="11"> 3An association is significant if the probability that it occurred by chance is low. This is not a statement about the strength of the association. Even a weak association may be judged significant if there are enough data to support it. Measures of the strength of association will be discussed in Section 3.4.</Paragraph>
      <Paragraph position="12">  (1) Propose all words as candidate context words.</Paragraph>
      <Paragraph position="13"> (2) Count occurrences of each candidate context word in the training corpus.</Paragraph>
      <Paragraph position="14"> (3) Prune context words that have insufficient data or are uninformative discriminators.</Paragraph>
      <Paragraph position="15"> (4) Store the remaining context words (and their associated statistics) for use at run time.</Paragraph>
      <Paragraph position="16"> Run time (1) Initialize the probability for each word in the confusion set to its prior probability.</Paragraph>
      <Paragraph position="17"> (2) Go through the list of context words that was saved during training. For each context word that appears in the context of the ambiguous target word, update the probabilities.</Paragraph>
      <Paragraph position="18"> (3) Choose the word in the confusion set with the highest probability.</Paragraph>
      <Paragraph position="19">  Table 2 shows the effect of varying k for our usual collection of confusion sets. It can be seen that performance generally degrades as k increases. The reason is that the method starts picking up spurious correlations in the training corpus. Table 4 gives some examples of the context words learned for the confusion set {peace, piece}, with k = 24. The context words coTTs, united, nations, etc., all imply peace, and appear to be plausible (although united and nations are a counterexample to our earlier assumption of independence). On the other hand, consider the context word how, which allegedly also implies peace. If we look back at the training corpus for the supporting data for this word, we find excerpts such as: But oh, how I do sometimes need just a moment of rest, and peace ...</Paragraph>
      <Paragraph position="20"> No matter how earnest is our quest for guaranteed peace ...</Paragraph>
      <Paragraph position="21"> How best to destroy your peace ? There does not seem to be a necessary connection here between how and peace; the correlation is probably spurious. Although we are using a chi-square test expressly to filter out such spurious correlations, we can only expect the test to catch 95% of them (given that the significance level was set to 0.05). As mentioned above, most of the legitimate context words show up for small k; thus as k gets large, the limited number of legitimate context words gets overwhelmed by the 5% of the spurious correlations that make it through our filter.</Paragraph>
    </Section>
    <Section position="3" start_page="43" end_page="45" type="sub_section">
      <SectionTitle>
3.3 Component method 2: Collocations
</SectionTitle>
      <Paragraph position="0"> The method of context words is good at capturing generalities that depend on the presence of nearby words, but not their order. When order matters, other more syntax-based methods, such as collocations and trigrams, are appropriate. In the work reported here, the method of collocations was used to capture order dependencies. A collocation expresses a pattern of syntactic elements around the target word. We allow two types of syntactic elements: words, and part-of-speech tags.</Paragraph>
      <Paragraph position="1"> Going back to the {desert, dessert} example, a collocation that would imply desert might be:  window. The bottom line of the table shows the number of context words learned, averaged over all confusion sets, also as a function of k.</Paragraph>
      <Paragraph position="2"> This collocation would match the sentences: Travelers entering from the desert were confounded...</Paragraph>
      <Paragraph position="3"> ... along with some guerrilla fighting in the desert.</Paragraph>
      <Paragraph position="4"> ...two ladies who lay pinkly nude beside him in the desert ...</Paragraph>
      <Paragraph position="5"> Matching part-of-speech tags (here, PREP) against the sentence is done by first tagging each word in the sentence with its set of possible part-of-speech tags, obtained from a dictionary. For instance, walk has the tag set {NS, V}, corresponding to its use as a singular noun and as a verb. 4 For a tag to match a word, the tag must be a member of the word's tag set. The reason we use tag sets, instead of running a tagger on the sentence to produce unique tags, is that taggers need to look at all words in the sentence, which is impossible when the target word is taken to be ambiguous (but see the trigram method in Section 4).</Paragraph>
      <Paragraph position="6"> The method of collocations was implemented in much the same way as the method of context words. The idea. is to discriminate among the words wi in the confusion set by identifying the collocations that tend to occur around each wi. An ambiguous target word is then classified by finding all collocations that match its context. Each collocation provides some degree of evidence 4Our tag inventory contains 40 tags, and includes the usual categories for determiners, nouns, verbs, modals, etc., a few specialized tags (for be, have, and do), and a dozen compound tags (such as V+PRO for let's).  for each word in the confusion set. This evidence is combined using Bayes' rule. In the end, the wi with the highest probability, given the evidence, is selected.</Paragraph>
      <Paragraph position="7"> A new complication arises for collocations, however, in that collocations, unlike context words, cannot be assumed independent. Consider, for example, the following collocations for desert: PREP the in the the __ These collocations are highly interdependent -- we will say they conflict. To deal with this problem, we invoke our earlier observation that there is no need to use all the evidence. If two pieces of evidence conflict, we simply eliminate one of them, and base our decision on the rest of the evidence. We identify conflicts by the heuristic that two collocations conflict iff they overlap. The overlapping portion is the factor they have in common, and thus represents their lack of independence. This is only a heuristic because we could imagine collocations that do not overlap, but still conflict. Note, incidentally, that there can be at most two non-conflicting collocations for any decision -one matching on the left-hand side of the target word, and one on the right.</Paragraph>
      <Paragraph position="8"> Having said that we resolve conflicts between two collocations by eliminating one of them, we still need to specify which one. Our approach is to assign each one a strength, just as Yarowsky \[1994\] does in his hybrid method, and to eliminate the one with the lower strength. This preserves the strongest non-conflicting evidence as the basis for our answer. The strength of a collocation reflects its reliability for decision-making; a further discussion of strength is deferred to Section 3.4. Figure 2 ties together the preceding discussion into an outline of the method of collocations. The method is described in terms of &amp;quot;features&amp;quot; rather than &amp;quot;collocations&amp;quot; to reflect its full generality; the features could be context words as well a.s collocations. In fact, the method subsumes the method of context words -- it does everything that method does, and resolves conflicts among its features as well. To facilitate the conflict resolution, it sorts the features by decreasing strength. Like the method of context words, the method of collocations has one main parameter to tune: e, the maximum number of syntactic elements in a collocation. Since the number of collocations grows exponentially with e, it was only practical to vary g from 1 to 3. We tried this on some practice confusion sets, and found that all values of g gave roughly comparable performance. We selected g = 2 to use from here on, as a compromise between reducing the expressive power of collocations (with g = 1) and incurring a high computational cost (with g = 3).</Paragraph>
      <Paragraph position="9"> Table 3 shows the results of varying f for the usual confusion sets. There is no clear winner; each value of g did best for certain confusion sets. Table 5 gives examples of the collocations learned for {peace, piece} with g = 2. A good deal of redundancy can be seen among the collocations. There is also some redundancy between the collocations and the context words of the previous section (e.g., for corps). Many of the collocations a.t the end of the list appear to be overgeneral and irrelevant.</Paragraph>
    </Section>
    <Section position="4" start_page="45" end_page="49" type="sub_section">
      <SectionTitle>
3.4 Hybrid method 1: Decision lists
</SectionTitle>
      <Paragraph position="0"> Yarowsky \[1994\] pointed out the complementarity between context words and collocations: context words pick up those generalities that are best expressed in an order-independent way, while collocations capture order-dependent generalities. Yarowsky proposed decision lists as a way to get the best of both methods. The idea is to make one big list of all features -- in this case, context words and collocations. The features are sorted in order of decreasing strength, where the strength of a feature reflects its reliability for decision-making. An ambiguous target word is then classified by running down the list and matching each feature against the target context. The first feature that  Initialize the probability for each word in the confusion set to its prior probability.</Paragraph>
      <Paragraph position="1"> Go through the sorted list of features that was saved during training. For each feature that matches the context of the ambiguous target word, and does not conflict with a feature accepted previously, update the probabilities.</Paragraph>
      <Paragraph position="2"> Choose the word in the confiision set with the highest probability.</Paragraph>
      <Paragraph position="3">  highlighted in boldface. The method is described in terms of &amp;quot;features&amp;quot; rather than &amp;quot;collocations&amp;quot; to reflect its full generality.</Paragraph>
      <Paragraph position="4"> matches is used to classify the target word. Yarowsky \[1994\] describes further refinements, such as detecting and pruning features that make a zero or negative contribution to overall performance. The method of decision lists, as just described, is almost the same as the method for collocations in Figure 2, where we take &amp;quot;features&amp;quot; in that figure to include both context words and collocations. The main difference is that during evidence gathering (step (2) at run time), decision lists terminate after matching the first feature. This obviates the need for resolving conflicts between features. Given that decision lists base their answer for a problem on the single strongest feature, their performance rests heavily on how the strength of a feature is defined. Yarowsky \[1994\] used the following metric to calculate the strength of a feature f:</Paragraph>
      <Paragraph position="6"> This is for the case of a confusion set of two words, wl and w2. It can be shown that this metric produces the identical ranking of features as the following somewhat simpler metric, provided p(wi\]f) &gt; 0 for all i: s</Paragraph>
      <Paragraph position="8"> As an example of using tile metric, suppose f is the context word arid, and suppose that arid co-occurs 10 times with desert and 1 time with dessert in the training corpus. Then reliability~(f) = max(10/11, 1/11) = 10/11 = 0.909. This value measures the extent to which the presence of the feature is unambiguously correlated with one particular wi. It can be thought of as the feature's reliability at picking out that wi fi'om the others in the confusion set.</Paragraph>
      <Paragraph position="9"> Sin fact, we guarantee that this inequality holds by performing smoothing before calculating strength. We smooth the data by adding 1 to the count of how many times each feature was observed for each wi.</Paragraph>
      <Paragraph position="10">  collocation. The bottom line of the table shows the number of collocations learned, averaged over all confusion sets, also as a function of e.</Paragraph>
      <Paragraph position="11"> One peculiar property of the reliability metric is that it ignores the prior probabilities of the words in the confusion set. For instance, in the arid example, it would award the same high score even if the total number of occurrences of desert and dessert in the training corpus were 50 and 5, respectively -- in which case arid's 1)erformance of 10/11 would be exactly what one would expect by chance, and therefore hardly impressive. Besides the reliability metric, therefore, we also considered an alternative metric: the uncertainty coefficient of x, denoted U(xIy ) \[Press et al., 1988, p.501\]. U(xly ) measures how much additional information we get about the presence of the feature by knowing the choice of word in the confusion set. 6 U(xly ) is calculated as follows:</Paragraph>
      <Paragraph position="13"> The probM)ilities are calculated for the population consisting of all occurrences in the training corpus of any wi. For instance, p(f) is the probability of feature f being present within this  words learned for {peace, piece} with k = 24.</Paragraph>
      <Paragraph position="14"> Each line gives a context word, and the number of peace and piece occurrences for which that context word occurred within +-k words.</Paragraph>
      <Paragraph position="15"> The last line of the table gives the total number of occurrences of peace and piece in the training corpus.</Paragraph>
      <Paragraph position="16"> Table 5: Excerpts from the sorted list of 98 collocations learned for {peace, piece} with = 2. Each line gives a collocation, and the number of peace and piece occurrences it matched. The last line of the table gives the total number of occurrences of peace and piece in the training corpus.</Paragraph>
      <Paragraph position="17">  population. Applying tim U(x\]y) metric to the arid example, the value returned now depends on the number of occurrences of desert and dessert in the training corpus. If these numbers are 50 and 5, then U(xly ) = 0.0, reflecting the mfinformativeness of the arid feature in this situation. If instead the numbers are 50 and 500, then U(xly ) = 0.402, indicating arid's better-than-chance ability to pick out desert (10 out of 50 occurrences) over dessert (1 out of 500 occurrences). To compare the two strength metrics, we tried both on some practice confusion sets. Sometimes one metric did sul)stantially better, sometimes the other. In the balance, the reliability metric seemed to give higher performance. This metric is therefore the one that will be used from here on. It was also used for all experiments involving the method of collocations. Table 6 shows the performance of decision lists with each metric for the usual confusion sets. As with the practice confusion sets, we see sometimes dramatic performance differences between the two metrics, and no clear winner. For instance, for {I, me}, the reliability metric did better than U(xly) (0.980 versus 0.808); whereas for {between, among}, it did worse (0.659 versus 0.800). Further research is needed to understand the circumstances under which each metric performs best. Focusing for now on the reliability metric, Table 6 shows that the method of decision lists does, by and large, accomplish what it set out to do -- namely, outperform either component method alone. There axe, however, a few cases where it falls short; for instance, for {between, among}, decision lists score only 0.659, compared with 0.759 for context words and 0.730 for collocations. 7 We believe that the problem lies in the strength metric: because decision lists make their judgements based on a single piece of evidence, their performance is very sensitive to the metric used to select that piece of evidence. But as the relial)ility and U(x\[y) metrics indicate, it is not completely clear how the metric should be defined. This problem is addressed in the next section.</Paragraph>
    </Section>
    <Section position="5" start_page="49" end_page="51" type="sub_section">
      <SectionTitle>
3.5 Hybrid method 2: Bayesian classifiers
</SectionTitle>
      <Paragraph position="0"> The previous section confirmed that decision lists are effective at combining two complementary methods -- context words and collocations. In doing the combination, however, decision lists look only at the single strongest piece of evidence for a given problem. We hypothesize that even better performance can be obtained by ta.king into account all available evidence. This section presents a method of doing this based on Bayesian classifiers.</Paragraph>
      <Paragraph position="1"> Like decision lists, the Bayesian method starts with a list of all features, sorted by decreasing strength. It classifies a.n ambiguous target word by matching each feature in the list in turn against the target context. Instead of stopping at the first matching feature, however, it traverses the entire list, combining evidence fi'om all matching features, and resolving conflicts where necessary.</Paragraph>
      <Paragraph position="2"> This method is essentially the same as the one for collocations (see Figure 2), except that it uses context words as well as collocations for the features. The only new wrinkle is in checking for conflicts between features (in step (2) a.t run tilne), as there are now two kinds of features to consider. If both features are context words, we say the features never conflict (as in the method of context words). If both features are collocations, we say they conflict iff they overlap (as in the method of collocations). The new case is if one feature is a context word, and the other is a collocation. Consider, for example, the context word walk, and the following collocations:  To some extent, all of these collocations conflict with walk. Collocation (1) is the most blatant case; if it matches the target context, this logically implies that the context word walk will match. If collocation (2) matches, this guarantees that one of the possible tags of walk will be present nearby the target word, thereby elevating the probability that walk will match within :5k words.</Paragraph>
      <Paragraph position="3"> If collocation (3) matches, this guarantees that there are two positions nearby the target word that are incompatible with walk, thereby reducing the probability that walk will match. If we were to treat all of these cases as conflicts, we would end up losing a great deal of (potentially useful) evidence. Instead, we adopt the more relaxed policy of only flagging the most egregious conflicts -- here, the one between collocation (1) and walk. In general, we will say that a collocation and a context word conflict iff the collocation contains an explicit test for the context word.</Paragraph>
      <Paragraph position="4"> Table 7 compares all methods covered so far -- baseline, two component methods, and two hybrid methods. (A sixth method, trigrams, is included as well -- it will be discussed in Section 4.) The table shows that the Bayesian hyt)rid method does at least as well as the previous four methods for almost every confusion set. Occasionally it scores slightly less than collocations; this appears to be due to some averaging effect where noisy context words are dragging it down. Occasionally too it scores less than decision lists, 1)ut never by much; on the whole, it yields a modest but consistent improvement, and in the case of {between, among}, a sizable improvement. We believe the improvement is due to considering all of the evidence, rather than just the single strongest piece, which makes the method more robust to inaccurate judgements about which piece of evidence is &amp;quot;strongest&amp;quot;.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>