<?xml version="1.0" standalone="yes"?>
<Paper uid="P98-2182">
  <Title>Noun-phrase co-occurrence statistics for semi-automatic semantic lexicon construction</Title>
  <Section position="4" start_page="1110" end_page="1110" type="metho">
    <SectionTitle>
3 Statistics for selecting and ranking
</SectionTitle>
    <Paragraph position="0"> R&amp;S used the same figure of merit both for selecting new seed words and for ranking words in the final output. Their figure of merit was simply the ratio of the times the noun coocurs with a noun in the seed list to the total frequency of the noun in the corpus. This statistic favors low frequency nouns, and thus necessitates the inclusion of a minimum occurrence cutoff. They stipulated that no word occuring fewer than six times in the corpus would be considered by the algorithm. This cutoff has two effects: it reduces the noise associated with the multitude of low frequency words, and it removes from consideration a fairly large number of certainly valid category members. Ideally, one would like to reduce the noise without reducing the number of valid nouns. Our statistics allow for the inclusion of rare occcurances. Note that this is particularly important given our algorithm, since we have restricted the relevant occurrences to a specific type of structure; even relatively common nouns m~v not occur in the corpus more than a handful of times in such a context.</Paragraph>
    <Paragraph position="1"> The two figures of merit that we employ, one to select and one to produce a final rank, use the following two counts for each noun:  1. a noun's co-occurrences with seed words 2. a noun's co-occurrences with any word  To select new seed words, we take the ratio of count 1 to count 2 for the noun in question. This is similar to the figure of merit used in R&amp;:S, and also tends to promote low frequency nouns. For the final ranking, we chose the log likelihood statistic outlined in Dunning (1993), which is based upon the co-occurrence counts of all nouns (see Dunning for details). This statistic essentially measures how surprising the given pattern of co-occurrence would be if the distributions were completely random. For instance, suppose that two words occur forty times each, iiii and they co-occur twenty times in a million-word corpus. This would be more surprising for two completely random distributions than if they had each occurred twice and had always co-occurred. A simple probability does not capture this fact.</Paragraph>
    <Paragraph position="2"> The rationale for using two different statistics for this task is that each is well suited for its particular role, and not particularly well suited to the other. We have already mentioned that the simple ratio is ill suited to dealing with infrequent occurrences. It is thus a poor candidate for ranking the final output, if that list includes words of as few as one occurrence in the corpus. The log likelihood statistic, we found, is poorly suited to selecting new seed words in an iterative algorithm of this sort, because it promotes high frequency nouns, which can then overly influence selections in future iterations, if they are selected as seed words. We termed this phenomenon infection, and found that it can be so strong as to kill the further progress of a category. For example, if we are processing the category vehicle and the word artillery is selected as a seed word, a whole set of weapons that co-occur with artillery can now be selected in future iterations. If one of those weapons occurs frequently enough, the scores for the words that it co-occurs with may exceed those of any vehicles, and this effect may be strong enough that no vehicles are selected in any future iteration.</Paragraph>
    <Paragraph position="3"> In addition, because it promotes high frequency terms, such a statistic tends to have the same effect as a minimum occurrence cutoff, i.e. few if any low frequency words get added. A simple probability is a much more conservative statistic, insofar as it selects far fewer words with the potential for infection, it limits the extent of any infection that does occur, and it includes rare words. Our motto in using this statistic for selection is, &amp;quot;First do no harm.&amp;quot;</Paragraph>
  </Section>
  <Section position="5" start_page="1110" end_page="1110" type="metho">
    <SectionTitle>
4 Seed word selection
</SectionTitle>
    <Paragraph position="0"> The simple ratio used to select new seed words will tend not to select higher frequency words in the category. The solution to this problem is to make the initial seed word selection from among the most frequent head nouns in the corpus. This is a sensible approach in any case, since it provides the broadest coverage of category occurrences, from which to select additional likely category members. In a task that can suffer from sparse data, this is quite important. We printed a list of the most common nouns in the corpus (the top 200 to 500), and selected category members by scanning through this list. Another option would be to use head nouns identified in Wordnet, which, as a set, should include the most common members of the category in question. In general, however, the strength of an algorithm of this sort is in identifying infrequent or specialized terms. Table 1 shows the seed words that were used for some of the categories tested.</Paragraph>
  </Section>
  <Section position="6" start_page="1110" end_page="1112" type="metho">
    <SectionTitle>
5 Compound Nouns
</SectionTitle>
    <Paragraph position="0"> The relationship between the nouns in a compound noun is very different from that in the other constructions we are considering. The non-head nouns in a compound noun may or may not be legitimate members of the category.</Paragraph>
    <Paragraph position="1"> For instance, either pickup truck or pickup is a legitimate vehicle, whereas cargo plane is legitimate, but cargo is not. For this reason, co-occurrence within noun compounds is not considered in the iterative portions of our algorithm. Instead, all noun compounds with a head that is included in our final ranked list, are evaluated for inclusion in a second list.</Paragraph>
    <Paragraph position="2"> The method for evaluating whether or not to include a noun compound in the second list is intended to exclude constructions such as government plane and include constructions such as fighter plane. Simply put, the former does not correspond to a type of vehicle in the same way that the latter does. We made the simplifying assumption that the higher the probability of the head given the non-head noun, the better the construction for our purposes. For instance, if the noun government is found in a noun compound, how likely is the head of that compound to be plane? How does this compare to the noun fighter? For this purpose, we take two counts for each noun in the compound: 1. The number of times the noun occurs in a noun compound with each of the nouns to its right in the compound 2. The number of times the noun occurs in a noun compound For each non-head noun in the compound, we  evaluate whether or not to omit it in the output. If all of them are omitted, or if the resulting compound has already been output, the entry is skipped. Each noun is evaluated as follows: First, the head of that noun is determined.</Paragraph>
    <Paragraph position="3"> To get a sense of what is meant here, consider the following compound: nuclear-powered aircraft carrier. In evaluating the word nuclearpowered, it is unclear if this word is attached to aircraft or to carrier. While we know that the head of the entire compound is carrier, in order to properly evaluate the word in question, we must determine which of the words following it is its head. This is done, in the spirit of the Dependency Model of Lauer (1995), by selecting the noun to its right in the compound with the highest probability of occuring with the word in question when occurring in a noun compound. (In the case that two nouns have the same probability, the rightmost noun is chosen.) Once the head of the word is determined, the ratio of count 1 (with the head noun chosen) to count 2 is compared to an empirically set cutoff. If it falls below that cutoff, it is omitted. If it does not fall below the cutoff, then it is kept (provided its head noun is not later omitted).</Paragraph>
  </Section>
  <Section position="7" start_page="1112" end_page="1112" type="metho">
    <SectionTitle>
6 Outline of the algorithm
</SectionTitle>
    <Paragraph position="0"> The input to the algorithm is a parsed corpus and a set of initial seed words for the desired category. Nouns are matched with their plurals in the corpus, and a single representation is settled upon for both, e.g. car(s). Co-Occurrence bigrams are collected for head nouns according to the notion of co-occurrence outlined above.</Paragraph>
    <Paragraph position="1"> The algorithm then proceeds as follows:  1. Each noun is scored with the selecting statistic discussed above.</Paragraph>
    <Paragraph position="2"> 2. The highest score of all non-seed words is determined, and all nouns with that score are added to the seed word list. Then return to step one and repeat. This iteration continues many times, in our case fifty.</Paragraph>
    <Paragraph position="3"> 3. After the number of iterations in (2) are completed, any nouns that were not selected as seed words are discarded. The seed word set is then returned to its original members.</Paragraph>
    <Paragraph position="4"> 4. Each remaining noun is given a score based upon the log likelihood statistic discussed above.</Paragraph>
    <Paragraph position="5"> 5. The highest score of all non-seed words is determined, and all nouns with that score are added to the seed word list. We then return to step (5) and repeat the same number of times as the iteration in step (2). 6. Two lists are output, one with head nouns,  ranked by when they were added to the seed word list in step (6), the other consisting of noun compounds meeting the outlined criterion, ordered by when their heads were added to the list.</Paragraph>
  </Section>
  <Section position="8" start_page="1112" end_page="1113" type="metho">
    <SectionTitle>
7 Empirical Results and Discussion
</SectionTitle>
    <Paragraph position="0"> We ran our algorithm against both the MUC-4 corpus and the Wall Street Journal (WSJ) corpus for a variety of categories, beginning with the categories of vehicle and weapon, both included in the five categories that R~S investigated in their paper. Other categories that we investigated were crimes, people, comm.ercial sites, states (as in static states of affairs), and machines. This last category was run because of the sparse data for the category weapon in the Wall Street Journal. It represents roughly the same kind of category as weapon, namely technological artifacts. It, in turn, produced sparse results with the MUC-4 corpus. Tables 3 and 4 show the top results on both the head noun and the compound noun lists generated for the categories we tested.</Paragraph>
    <Paragraph position="1"> R~S evaluated terms for the degree to which they are related to the category. In contrast, we counted valid only those entries that are clear members of the category. Related words (e.g.</Paragraph>
    <Paragraph position="2">  crash for the category vehicle) did not count.</Paragraph>
    <Paragraph position="3"> A valid instance was: (1) novel (i.e. not in the original seed set); (2) unique (i.e. not a spelling variation or pluralization of a previously encountered entry); and (3) a proper class within the category (i.e. not an individual instance or a class based upon an incidental feature). As an illustration of this last condition, neither Galileo Probe nor gray plane is a valid entry, the former because it denotes an individual and the latter because it is a class of planes based upon an incidental feature (color).</Paragraph>
    <Paragraph position="4"> In the interests of generating as many valid entries as possible, we allowed for the inclusion in noun compounds of words tagged as adjectives or cardinality words. In certain occasions (e.g. four-wheel drive truck or nuclear bomb) this is necessary to avoid losing key parts of the compound. Most common adjectives are dropped in our compound noun analysis, since they occur with a wide variety of heads.</Paragraph>
    <Paragraph position="5"> We determined three ways to evaluate the output of the algorithm for usefulness. The first is the ratio of valid entries to total entries produced. R&amp;S reported a ratio of .17 valid to total entries for both the vehicle and weapon categories (see table 2). Oil the same corpus, our algorithm yielded a ratio of .329 valid to total entries for the category vehicle, and .36 for the category weapon. This can be seen in the slope of the graphs in figure 1. Tables 2 and 5 give the relevant data for the categories that we investigated. In general, the ratio of valid to total entries fell between .2 and .4, even in the cases that the output was relatively small.</Paragraph>
    <Paragraph position="6"> A second way to evaluate the algorithm is by the total number of valid entries produced. As can be seen from the numbers reported in table 2, our algorithm generated from 2.4 to nearly 3 times as many valid terms for the two contrasting categories from the MUC corpus than the algorithm of RPS:S. Even more valid terms were generated for appropriate categories using the Wall Street Journal.</Paragraph>
    <Paragraph position="7"> Another way to evaluate the algorithm is with the number of valid entries produced that are not in Wordnet. Table 2 presents these numbers for the categories vehicle and weapon. Whereas the R&amp;S algorithm produced just 11 terms not already present in Wordnet for the two categories combined, our algorithm produced 106,</Paragraph>
    <Paragraph position="9"> Weapon or over 3 for every 5 valid terms produced. It is for this reason that we are billing our algorithm as something that could enhance existing broad-coverage resources with domain-specific lexical information.</Paragraph>
  </Section>
class="xml-element"></Paper>