<?xml version="1.0" standalone="yes"?>
<Paper uid="C96-2100">
  <Title>Good Bigrams</Title>
  <Section position="4" start_page="592" end_page="593" type="metho">
    <SectionTitle>
3 Illustration
</SectionTitle>
    <Paragraph position="0"> The difference between the two measures are perhaps best illustrated with some concrete examples. In a previous paper (Johansson, 1.994) &amp;quot;Alice's adventures in Wonderhmd&amp;quot; (AIW) was used as an experimental corpus to compare phrase finding for ~t, and a new measure -- A~t. A critique against that corpus is that the corpus is very small. &amp;quot;Through the Looking Glass&amp;quot; and &amp;quot;The Hunting of the Snark&amp;quot; extend that corpus to about 63 000 words of which 26 831 occurred more than 4 times. With the criterion that an interesting bi-gram occurs more than 4 times 1970 bigram candidates were found in this larger corpus.</Paragraph>
    <Paragraph position="1">  \[n the previous table the effect is measured by the number of steps a bigram is moved up compared to a sorted frequency list. The effect of mutual information under these conditions is higher than the proposed measure for finding most characters in A1W, except for some names defined by definite article + noun, and common adjective + noun.</Paragraph>
  </Section>
  <Section position="5" start_page="593" end_page="593" type="metho">
    <SectionTitle>
4 Material
</SectionTitle>
    <Paragraph position="0"> In the rest of this paper, the corpus is the SUSANNE corpus (Sampson, 1994). This corpus consists of an extensively tagged and annotated subset from the Brown Corpus of American English. The corpus is fairly small, but provides information on grammatical roles on the word and phrase level. This makes the SUSANNE corpus suitable for further research. null The SUSANNE corpus is divided into 4 (approximately equally large) genre subcategones: null &amp;quot;A: press reportage G: belles lettres, biography, memoirs J: learned (mainly scientific and technical) writing N: adventure and Western fiction&amp;quot; (Sampson, 1994:p. 1.74) Each genre has approximately 20,000 unique word pairs 2. The four genres will be used as one factor in the comparison between different measures. The question is whether the genre interacts with the ability of the different measures to discover bigrams. In category A 439 unique bigrams (occurring more than 4 times) were found, in G 486, in J 598, N 620, and 2573 for the used corpus 3.</Paragraph>
  </Section>
  <Section position="6" start_page="593" end_page="595" type="metho">
    <SectionTitle>
5 Method
</SectionTitle>
    <Paragraph position="0"> The highest ranking bigralns according to the measure are sampled at 5 different levels: the 10, 50, 100, 200 and 400 top collocations.</Paragraph>
    <Paragraph position="1"> Samples are sorted and compared for overlap by the UNIX command 'comm -12 SAMPLE1 SAMPLE2 I wc -1', and the percentage of overlap was calculated from the size of the sample.</Paragraph>
    <Paragraph position="2"> Stability of bigrams was tested by three different overlaps. 1) The overlap between samples from genres, and samples for the entire corpus for the same measure. 2) The overlap between different measures at the five different levels for the different genres and the entire corpus. 3) The overlap between different genres.</Paragraph>
    <Section position="1" start_page="593" end_page="593" type="sub_section">
      <SectionTitle>
6.1 Mutual Information
</SectionTitle>
      <Paragraph position="0"> The average overlap between genres and the corpus showed that the J sample was much more stabile than the other genres 4. The J genre would be the genre that information retrieval applications would be most interested in. The ranking of the genres according to the stability of the overlap is: JANG. The highest collocations are most stabile for J, where the other genres show less specificity (i.e. equal or growing percentages as the overlap grows).</Paragraph>
    </Section>
    <Section position="2" start_page="593" end_page="593" type="sub_section">
      <SectionTitle>
6.2 Delta Mutual Information
</SectionTitle>
      <Paragraph position="0"> Delta mutual information shows little effect of genre, and sample size. Growing sample size predicts less overlap. The ranking of genres is: GANJ. Delta mutual information seems to rank the less specific genres high.</Paragraph>
      <Paragraph position="1">  A factorial ANOVA on measure and genre shows that there is a significant effect (p&lt;0.001) of measure (Ag or g), genre and interaction between measures. F(measure, 1df)=136.2, F(genre, 3df)=9.8, F(measure, genre, 1, 3)=15.4, p &lt;0.001. These two measures are significantly different.</Paragraph>
    </Section>
    <Section position="3" start_page="593" end_page="594" type="sub_section">
      <SectionTitle>
6.3 Occurrence
</SectionTitle>
      <Paragraph position="0"> The results for the samples are similar to a m The overlap is generally higher for occurrence than Ag, but the ranking of genres is the same: GANJ. An ANOVA on measure (Ag and occurrence) and genre show less significant effect on measure, and no significant effect of genre, or interaction (these measures behave in the same direction).</Paragraph>
      <Paragraph position="1"> 4In preliminary investigations the J genre was the least stabile genre for mutual information. This was 'corrected' by the demand that candidate bigrams should occur more than 4 times.</Paragraph>
      <Paragraph position="2">  more stabile than the other measure, but there is only a small difference of genres (occurrence and Ag react in a similar way to genre -- i.e. on high occurrence).</Paragraph>
    </Section>
    <Section position="4" start_page="594" end_page="594" type="sub_section">
      <SectionTitle>
6.4 Comparison between measures
</SectionTitle>
      <Paragraph position="0"> The overlap between measures is calculated for all combinations of measures. At the higher levels a high overlap can be expected since there is little possibility to fall out (e.g. in A 400 out of 439 is 91% of the sample).</Paragraph>
      <Paragraph position="1"> The results from this test indicate that the overlap between D (Ag) and F (occurrence) is significantly and consistently higher than between the other combinations (especially for the entire corpus).</Paragraph>
    </Section>
    <Section position="5" start_page="594" end_page="594" type="sub_section">
      <SectionTitle>
6.5 Overlap between genres
</SectionTitle>
      <Paragraph position="0"> To estimate the overlap of the genres the number of common bigrams between two genres were found and compared to the size of the smallest genre. The results indicate an average overlap between the genres of 10%.</Paragraph>
    </Section>
    <Section position="6" start_page="594" end_page="595" type="sub_section">
      <SectionTitle>
6.6 Reduction of the bigrams
</SectionTitle>
      <Paragraph position="0"> The bigrams that are rated high by the measures (especially mutual information) are mixed between two different types of bigrams: (1) bigrams with high internal cohesion between low frequency items that may be associated with a specific interpretation (e.g. &amp;quot;carbon tetrachloride&amp;quot; or &amp;quot;cheshire cat&amp;quot;), (2) bigrams with high internal cohesion with usually high frequency of both items that may be associated with a &amp;quot;syntactical&amp;quot; interpretation (e.g. &amp;quot;in the&amp;quot;).</Paragraph>
      <Paragraph position="1"> To separate type l from type 2 some information about the overlap of genres might be used. The type 2 bigrams are typically found in most genres, whereas type 1 bi-grams are specific to a text. The results above indicate that we can use the genres with least overlap to filter out common bigrams (i.e. A use J, G use J, J use N, N use J).</Paragraph>
      <Paragraph position="2"> In the following table the effect of the genre (column 2) is shown by the number of 'surviving' bigrams from the candidate bi-grams (column 1). The third column shows the effect of removing the bigrams that occur (more than 4 times) in both directions after common bigrams have been removed (first parenthesis shows actual removed, second shows those that would have been removed (i.e. those bigrams with both orderings in the candidate set). The fourth column shows the effect of removing bigrams that contains words that occur more than 4 times in the rest of the corpus (i.e. in A G N for J) after the bigrams have been formed. The reason for filtering after forming bigrams is that words that are filtered out later work as place holders, and prevent some bigrams to form. The reduction is most notable for removing bi-grams that contain common words between genres: genre G and N contain few good candidates of collocations type 1.</Paragraph>
      <Paragraph position="3"> Cand. Genre Word order filter Freq.</Paragraph>
      <Paragraph position="4">  The following bigrams survived the harshest condition of removing bigrams containing words of other genres. (Genre J, later ordered by mutual information). Some good candidates were (of course) removed, e.g. &amp;quot;black body&amp;quot;, &amp;quot;per cent&amp;quot;, &amp;quot;united states&amp;quot;.</Paragraph>
    </Section>
    <Section position="7" start_page="595" end_page="595" type="sub_section">
      <SectionTitle>
7.3 tax bill
</SectionTitle>
      <Paragraph position="0"> Genres G and N contain few candidates for collocations (among the 'best' ones in N were &amp;quot;gray eyes&amp;quot;, &amp;quot;picked up&amp;quot;, &amp;quot;help me&amp;quot; and &amp;quot;stared at&amp;quot; which are quite telling about the prototypical western story: &amp;quot;The gray eyes stared at the villain who picked up his knife, while the girl cried &amp;quot;help me&amp;quot;.&amp;quot;</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="595" end_page="596" type="metho">
    <SectionTitle>
7 Other approaches
</SectionTitle>
    <Paragraph position="0"> The temporal dependencies of an ordered collocation \[wordl, word2\] has been seen as a problem since the theory of mutual information assumes the frequencies of word pairs to be symmetric (i.e., f(\[wl, w2\]) and f(\[w2, w 1\]) to be equal). Delta mutual information relies on this difference in temporal ordering.</Paragraph>
    <Paragraph position="1"> &amp;quot;\[...\] f(x, y) encodes linear precedence. \[...\] Although we could fix this problem by redefining f(x, y) to be symmetric (by averaging the matrix with its transpose), we have decided not to do so, since order information appears to be very interesting.&amp;quot; (Church &amp; Hanks, 1990:p.24) Merkel, Nilsson, &amp; Ahrenberg (1994) have constructed a system that uses frequency of recurrent segments to determine long phrases.</Paragraph>
    <Paragraph position="2"> In their approach they have to chunk the text into contiguous segments. Significant frequency counts are achieved through the use of a very large corpus, and/or a corpus specialised for a specific task. They report that it was possible for them to divide a large corpus into smaller sub-sections with little loss.</Paragraph>
    <Paragraph position="3"> Smadja (1993)finds significant bigrams using an estimate of z-score (deviation from an expected mean). Smadja's method seems to require very large corpora, since the method needs to estimate a reliable measure of the variance of the frequencies with which words co-occur. This makes the method dependent on the corpus size. Smadja reports the use of a corpus of size 10 million words.</Paragraph>
    <Paragraph position="4"> &amp;quot;More precisely, the statistical methods we use do not seem to be effective on low frequency words (fewer than 100 occurrences).&amp;quot; (Smadja, 1993:p.168) Kita &amp; al. (1994) proposed another measure of collocational strength that was based on the notion of a reduction in 'processing cost' if a frequent chunk of text can be processed as one chunk. Cost reduction tended to extract conventional 'predicate phrase patterns', e.g., &amp;quot;is that so&amp;quot; and &amp;quot;thank you very much&amp;quot;. Steier &amp; Belew (1991) discuss the 'exporting' of phrases into a general vocabulary, where a word pair with high mutual information within a topic tends to have lower mutual information within the collection, and vice versa. They relate a higher mutual information within a topic than in the collection to a lower value of discrimination.</Paragraph>
    <Paragraph position="5"> Church &amp; Gale (1995) have found it useful to compare the distribution of terms across documents. They showed that a distribution different from what could be expected by a (random) Poisson process indicates interesting terms. This approach is similar to the use of one genre to find interesting items in  another. However, removal of the overlap needs some knowledge about the genres -apart from checking explicitly for a genre with least overlap. Cancelling overlap has the advantage that it can cancel out similar underlying causes, while it exaggerates the underlying causes that differ between genres. Some questions remain: at which level should overlap be formed? overlap in words or in bigrams; how many repetitions does it take for a word or bigram to 'belong' to a genre?</Paragraph>
  </Section>
class="xml-element"></Paper>