<?xml version="1.0" standalone="yes"?>
<Paper uid="P98-2231">
  <Title>Structural Disambiguation Based on Reliable Estimation of Strength of Association Haodong Wu Eduardo de Paiva Alves</Title>
  <Section position="2" start_page="0" end_page="1416" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> The strength of association between words provides lexical preferences for ambiguity resolution. It is usually estimated from statistics on word co-occurrences in large corpora (Hindle and Rooth, 1993). A problem with this approach is how to estimate the probability of word co-occurrences that are not observed in the training corpus. There are two main approaches to estimating this probability: smoothing methods (e.g., Church and Gale, 1991; Jelinek and Mercer, 1985; Katz, 1987) and class-based methods (e.g., Brown et al., 1992; Pereira and Tishby, 1992; Resnik, 1992; Yarowsky, 1992).</Paragraph>
    <Paragraph position="1"> Smoothing methods estimate the probability of unobserved co-occurrences from the frequencies of the individual words. For example, when eat and bread do not co-occur, the probability of (eat, bread) is estimated from the frequencies of (eat) and (bread).</Paragraph>
    <Paragraph position="2"> A problem with this approach is that it pays no attention to the distributional characteristics of the individual words in question. Under this method, the probabilities of (eat, bread) and (eat, cars) are the same whenever bread and cars have the same frequency, which is unacceptable from a linguistic point of view.</Paragraph>
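The objection can be illustrated with a small sketch. It assumes the simplest possible backoff, in which an unseen pair is scored by the product of the relative frequencies of its two words. The counts and the function name `unigram_backoff` are hypothetical and are not taken from any of the cited smoothing methods.

```python
from collections import Counter

# Hypothetical word counts; bread and cars share the same frequency on purpose.
word_counts = Counter({"eat": 50, "bread": 20, "cars": 20, "drive": 30})
total = sum(word_counts.values())


def unigram_backoff(verb, noun):
    """Score an unseen pair from the individual word frequencies only."""
    return (word_counts[verb] / total) * (word_counts[noun] / total)


p_bread = unigram_backoff("eat", "bread")
p_cars = unigram_backoff("eat", "cars")

# Both pairs receive the same score because only the word frequencies matter.
print(p_bread == p_cars)
```

Under these assumptions nothing in the estimate can distinguish a plausible object of eat from an implausible one.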
    <Paragraph position="3"> Class-based methods, on the other hand, estimate the probabilities by associating a class with each word and collecting statistics on word class co-occurrences. For instance, instead of calculating the probability of (eat, bread) directly, these methods associate eat with the class [ingest] and bread with the class [food] and collect statistics on the classes [ingest] and [food]. The accuracy of the estimation depends on the choice of classes, however. Some class-based methods (e.g., Yarowsky, 1992) associate each word with a single class without considering the other words in the co-occurrence. However, a word may need to be replaced by a different class depending on the co-occurrence. Some classes may not have enough occurrences to allow a reliable estimation, while other classes may be too general and include too many words not relevant to the estimation. An alternative is to obtain the various classes associated in a taxonomy with the words in question and select among them according to a certain criterion.</Paragraph>
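A minimal sketch of the class-based idea follows, assuming one fixed class per word (the simplest variant described above). The word-to-class mapping, the observed pairs, and the function name `class_based_estimate` are hypothetical illustrations rather than data or code from the cited work.

```python
from collections import Counter

# Hypothetical word-to-class mapping (a single class per word, for simplicity).
word_class = {"eat": "ingest", "drink": "ingest", "bread": "food",
              "rice": "food", "drive": "operate", "cars": "vehicle"}

# Hypothetical observed pairs; the pair (eat, bread) itself never occurs.
observed_pairs = [("eat", "rice"), ("eat", "rice"), ("drink", "rice"),
                  ("drive", "cars"), ("drive", "cars")]

# Collect statistics on class pairs instead of word pairs.
class_pair_counts = Counter(
    (word_class[v], word_class[n]) for v, n in observed_pairs
)
total_pairs = len(observed_pairs)


def class_based_estimate(verb, noun):
    """Relative frequency of the class pair that the word pair maps to."""
    key = (word_class[verb], word_class[noun])
    return class_pair_counts[key] / total_pairs


p_bread = class_based_estimate("eat", "bread")  # backed by (ingest, food) counts
p_cars = class_based_estimate("eat", "cars")    # (ingest, vehicle) was never seen
print(p_bread, p_cars)
```

Unlike a frequency-only backoff, this estimate separates (eat, bread) from (eat, cars), but its quality depends entirely on the classes chosen.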
    <Paragraph position="4"> There are a number of ways to select the classes used in the estimation.</Paragraph>
    <Paragraph position="5"> Weischedel et al. (1993) chose the lowest classes in a taxonomy for which the association for the co-occurrence can be estimated. This approach may result in unreliable estimates, since some of the class co-occurrences used may be attributed to chance.</Paragraph>
    <Paragraph position="6"> Resnik (1993) selected all pairs of classes corresponding to the head of a prepositional phrase and weighted them to bias the computation of the association in favor of higher-frequency co-occurrences, which he considered "more reliable." Contrary to this assumption, high-frequency co-occurrences are unreliable when the probability that the co-occurrence may be attributed to chance is high.</Paragraph>
    <Paragraph position="7"> In this paper we propose a class-based method that selects the lowest classes in a taxonomy for which the co-occurrence confidence is above a threshold. We subsequently apply the method to solving structural ambiguities in Japanese dependency structures and English prepositional phrase attachments.</Paragraph>
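The selection rule stated above can be sketched as a walk up the taxonomy. This section does not define the confidence measure, so the sketch uses a raw co-occurrence count as a placeholder. The taxonomy, the counts, and the names `confidence` and `lowest_confident_class` are all hypothetical and should not be read as the proposed method itself.

```python
# Hypothetical taxonomy: each class points to its parent; None marks the root.
parent = {"bread": "baked_goods", "baked_goods": "food",
          "food": "entity", "entity": None}

# Hypothetical co-occurrence counts of a verb class with each noun class.
cooccurrence_count = {"bread": 0, "baked_goods": 2, "food": 40, "entity": 500}


def confidence(noun_class):
    """Placeholder confidence: the raw count (an assumption, not the paper's measure)."""
    return cooccurrence_count.get(noun_class, 0)


def lowest_confident_class(start, threshold):
    """Walk up from start and return the first class whose confidence exceeds threshold."""
    node = start
    while node is not None:
        if confidence(node) > threshold:
            return node
        node = parent[node]
    return None  # no class on the path is confident enough


selected = lowest_confident_class("bread", 10)
print(selected)
```

With these illustrative numbers the walk skips the two sparse classes and stops at food, the most specific class that clears the threshold.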
  </Section>
</Paper>