<?xml version="1.0" standalone="yes"?>
<Paper uid="C96-2100">
  <Title>Good Bigrams</Title>
  <Section position="3" start_page="0" end_page="592" type="intro">
    <SectionTitle>
2 Definitions
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="592" type="sub_section">
      <SectionTitle>
2.1 Mutual information
</SectionTitle>
      <Paragraph position="0"> In the following p(x) will denote the observed probability as defined by p(x)=F(x)/N where F(x) is the frequency of occurrence of x, and N is the number of observed cases. N is, in the calculations, equal to the corpus size in words. Given this, the mutual information ratio (Church &amp; Hanks, 1990; Church &amp; Mercer, 1993; Steier &amp; Belew, 1991) is expressed by Formula 1. (Church &amp; Hanks refer to this measure as the association ratio tbr technical reasons).</Paragraph>
      <Paragraph position="2"> The instability of statistical measures seems to be a problem in statistical bigralns. Especially low frequency counts cause instability. To avoid this use the rule of thumb that a bigram must occur more than four times (cf. Church &amp; Hanks, 1990:p.24) to be considered as a candidate/br an interesting bigram.</Paragraph>
    </Section>
    <Section position="2" start_page="592" end_page="592" type="sub_section">
      <SectionTitle>
2.2 The difference in mutual information: temporal co-occurrence
</SectionTitle>
      <Paragraph position="0"> A reasonable way of using the temporal ordering in word pairs is to consider the opposite ordering of the word pair as negative evidence against the present order. A reasonable measure would be to use the difference in mutual information between the two orderings, hereafter Δg. The size of the corpus cancels out, and Δg can be calculated as a ratio between frequencies.</Paragraph>
      <Paragraph position="1"> This is intuitively correct for a comparison between apples and pears, i.e. you can say that apples (w1 w2) occur twice as often as pears (w2 w1) in my fruit bowl (corpus). (Here, p is the probability in the fixed corpus (F/N), which is different from the probability in the language. It is impossible to have a fixed corpus that equals the language, since language does not have a fixed number of words or word patterns.)</Paragraph>
      <Paragraph position="2">  In the case that the reversed ordering of a word pair has not been observed in the corpus, the measure becomes undefined. To relieve this the frequency t is multiplied by a constant (10), and the frequency of the reversed ordering is set to 1. Subtracting 9 from that value does not add anything to the measure for a single occurrence (log(10-9)=0).</Paragraph>
      <Paragraph position="3"> Other ways of handling zero-frequencies are evaluated in (Gale &amp; Church, 1994), e.g.</Paragraph>
      <Paragraph position="4"> the Good-Turing method. Relative frequencies of non-observed word pairs are hard to estimate. For example, the frequencies of frequencies (X) and frequency (Y) used in the 1 1 will use 'frequency' as equivalent to 'occurrence' in the sample corpus.</Paragraph>
      <Paragraph position="5"> Good-Turing method are linearly dependent in a log-log scale, i.e., there is an infinite frequency of non-observed items (which is another way of saying that we cannot expect the unexpected).</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>