<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-0415">
  <Title>Using LSA and Noun Coordination Information to Improve the Precision and Recall of Automatic Hyponymy Extraction</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
2 Pattern-Based Hyponymy Extraction
</SectionTitle>
    <Paragraph position="0"> The first major attempt to extract hyponyms from text was that of Hearst (1992), described in more detail in (Hearst, 1998), who extracted relationships from the text of Grolier's Encyclopedia. The method is illustrated by the following example. The sentence excerpt &quot;Even then, we would trail behind other European Community members, such as Germany, France and Italy . . .&quot; (BNC) indicates that Germany, France, and Italy are all European Community members. More generally, phrases of the form x such as y1 (y2, . . . , and/or yn) frequently indicate that the yi are all hyponyms of the hypernym x. Hearst identifies several other constructions that have a tendency to indicate hyponymy, calling these constructions lexicosyntactic patterns, and analyses the results. She reports that 52% of the relations extracted by the &quot;or other&quot; pattern (see Table 1) were judged to be &quot;pretty good relations&quot;. A more recent variant of this technique was implemented by Alfonseca and Manandhar (2001), who compare the collocational patterns of words from The Lord of the Rings with those of words in the WordNet taxonomy, adding new nouns to WordNet with an accuracy of 28%. Using a much more knowledge-intensive approach, Hahn and Schnattinger (1998) improve &quot;learning accuracy&quot; from around 50% to over 80% by forming a number of hypotheses and accepting only those which are most consistent with their current ontology. Their methods are like ours in that the &quot;concept learning&quot; combines information from several occurrences, but differ in that they rely on a detailed existing ontology into which to fit the new relationships between concepts.</Paragraph>
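The "x such as y1, ..., yn" pattern described above can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: it assumes that noun groups have already been marked with square brackets (a made-up stand-in for chunker output), and the names NG, SEP, SUCH_AS and extract_such_as are hypothetical.

```python
import re

# Assumption: a chunker has already wrapped every noun group in [square brackets].
NG = r"\[([^\]]+)\]"            # one bracketed noun group, capturing its text
SEP = r"(?:,? (?:and|or) |, )"  # separators allowed between list items
SUCH_AS = re.compile(NG + r",? such as ((?:\[[^\]]+\]" + SEP + r"?)+)")


def extract_such_as(chunked: str) -> list[tuple[str, str]]:
    """Return (hyponym, hypernym) pairs matched by 'x such as y1, ..., yn'."""
    pairs = []
    for match in SUCH_AS.finditer(chunked):
        hypernym = match.group(1)
        # Every noun group in the list after 'such as' is a candidate hyponym.
        for hyponym in re.findall(NG, match.group(2)):
            pairs.append((hyponym, hypernym))
    return pairs


sentence = ("Even then, we would trail behind [other European Community members], "
            "such as [Germany], [France] and [Italy]")
for hyponym, hypernym in extract_such_as(sentence):
    print(hyponym, "is a kind of", hypernym)
```

Note that the extracted hypernym still carries the word "other" and a plural ending, so even this clean example would need light post-processing before it could be used as a taxonomy entry.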
    <Paragraph position="1"> Our initial experiment was to construct a hyponymy extraction system based on the six lexicosyntactic patterns identified in (Hearst, 1998), which are listed in Table 1. We first used a chunker to mark noun groups, and then recognized and extracted noun groups occurring as part of one of the extraction patterns.</Paragraph>
    <Paragraph position="3"> Table 1: . . . Hearst (1998), which we used in the work described in this paper. Each of these patterns is taken to indicate the hyponymy relation(s) yi &lt; x.</Paragraph>
    <Paragraph position="4"> We applied these extraction patterns to an approximately 430,000-word extract from the beginning of the British National Corpus (BNC). The patterns extracted 513 relations. We selected 100 of the extracted relations at random and each author evaluated them by hand, scoring each relation on a scale from 4 (correct) to 0 (incorrect), defined as follows: 4. Extracted hypernym and hyponym exactly correct as extracted.</Paragraph>
    <Paragraph position="5"> 3. Extracted hypernym and hyponym are correct after a slight modification, such as depluralization or the removal of an article (e.g. a, the) or other preceding word.</Paragraph>
    <Paragraph position="6"> 2. Extracted hypernym and hyponym have something correct, e.g. a correct noun without a necessary prepositional phrase, a correct noun with a superfluous prepositional phrase, or a noun + prepositional phrase where the object of the preposition is correct but the preposition itself and the noun to which it attaches are superfluous. Thus these hyponymy relations are potentially correct, but difficult processing would be required to extract an exactly correct relation. Some of the errors which would need to be corrected were in preprocessing (e.g. on the part of the noun-group chunker) and others were errors caused by our hyponymy extractor (e.g. tacking on too many or too few prepositional phrases).</Paragraph>
    <Paragraph position="7"> 1. The relation extracted is correct in some sense, but is too general or too context-specific to be useful. This category includes relations that could be made useful by anaphora resolution (e.g. replacing &quot;this&quot; with its referent).</Paragraph>
    <Paragraph position="8"> 0. The relation extracted is incorrect. This results when the constructions we recognize are used for a purpose other than indicating the hyponymy relation. The results of each of the authors' evaluations of the . . . relations (of 513 extracted) to which each of the authors assigned the five available scores.</Paragraph>
    <Paragraph position="9"> For purposes of calculating precision, we consider those relations with a score of 4 or 3 to be correct and those with a lower score to be incorrect. After discussion between the authors on disputed annotations to create &quot;gold standard&quot; annotations, we found that 40 of the 100 relations in our random sample were correct according to this criterion. In other words, 40% of the relations extracted were exactly correct or would be correct with the use of minor post-processing consisting of lemmatization and removal of common types of qualifying words. (We describe our application of such post-processing in Section 6.) Thus our initial implementation of Hearst-style hyponymy extraction achieved 40% precision. This is less than the 52% precision reported in (Hearst, 1998). We believe this discrepancy to be mainly due to the difference between working with the BNC and Grolier's Encyclopedia: as noted by Hearst, the encyclopedia is designed to be especially rich in conceptual relationships presented in an accessible format.</Paragraph>
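The precision criterion above (scores of 4 or 3 count as correct, lower scores as incorrect) and the coarse correct/incorrect agreement measure the authors report for their annotations can be sketched as follows. The function names and the five toy scores are made up for illustration and are not data from the paper.

```python
def is_correct(score: int) -> bool:
    """Scores of 4 or 3 count as correct; 2, 1 and 0 count as incorrect."""
    return score >= 3


def precision(gold_scores: list[int]) -> float:
    """Fraction of extracted relations whose gold-standard score is correct."""
    return sum(is_correct(s) for s in gold_scores) / len(gold_scores)


def binary_agreement(scores_a: list[int], scores_b: list[int]) -> float:
    """Fraction of relations on which two annotators agree on correct vs. incorrect."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both annotators must score the same relations")
    agree = sum(is_correct(a) == is_correct(b) for a, b in zip(scores_a, scores_b))
    return agree / len(scores_a)


# Toy scores for five relations (illustrative only, not the paper's data).
annotator_a = [4, 3, 2, 0, 1]
annotator_b = [4, 2, 2, 0, 3]
gold = [4, 3, 2, 0, 1]

print(precision(gold))                             # 2 of 5 correct: 0.4
print(binary_agreement(annotator_a, annotator_b))  # agree on 3 of 5: 0.6
```

Applied to the paper's sample, the same rule gives 40 correct relations out of 100, that is, the 40% precision reported above.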
    <Paragraph position="10"> Agreement regarding the assignment of scores of 4, 3, and 2 is quite high. Indeed, considering the rougher distinction we use for reporting precision, in which scores of 4 and 3 are deemed correct and scores of 2, 1, and 0 are deemed incorrect, we found that inter-annotator agreement across all relations annotated (including those from this random sample and those from the sample described in Section 3) was 86%. We discussed each of the relations in the 14% of cases where we disagreed until we reached agreement; this produced the &quot;gold standard&quot; annotations to which we refer.</Paragraph>
    <Paragraph position="11"> Various problems with the pattern-based extraction method explain the 60% of extracted relations that were incorrect and/or useless. One problem is that the constructions that we assume to indicate hyponymy are often used for other purposes. For instance, the pattern x including y1, y2, . . . , and yn, which indicates hyponymy in sentences such as &quot;Illnesses, including chronic muscle debility, herpes, tremors and eye infections, have come and gone.&quot; (BNC) and is a quite productive source of hyponymy relations, can be used instead to indicate group membership: &quot;Often entire families including young children need practical home care . . .&quot; (BNC) While all children are members of families, the hyponymy relationship child &lt; family does not hold, since it is not true that all children are families.</Paragraph>
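The failure mode in the "families including young children" example can be reproduced with a purely surface-level matcher. The sketch below again assumes bracketed noun groups as a made-up stand-in for chunker output, and INCLUDING and extract_including are hypothetical names; it shows that the pattern fires in the same way on the hyponymy sentence and on the group-membership sentence.

```python
import re

# Assumption: noun groups are pre-marked with [square brackets].
NG = r"\[([^\]]+)\]"
INCLUDING = re.compile(
    NG + r",? including ((?:\[[^\]]+\](?:,? (?:and|or) |, )?)+)"
)


def extract_including(chunked: str) -> list[tuple[str, str]]:
    """Return (hyponym, hypernym) candidates matched by 'x including y1, ..., yn'."""
    pairs = []
    for match in INCLUDING.finditer(chunked):
        for hyponym in re.findall(NG, match.group(2)):
            pairs.append((hyponym, match.group(1)))
    return pairs


illnesses = ("[Illnesses], including [chronic muscle debility], [herpes], "
             "[tremors] and [eye infections], have come and gone.")
families = "Often [entire families] including [young children] need [practical home care]"

print(extract_including(illnesses))  # four plausible hyponyms of 'Illnesses'
print(extract_including(families))   # a spurious pair: children are not families
```

Nothing in the surface form separates the two cases, which is why a pattern matcher alone cannot filter out the second kind of relation.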
    <Paragraph position="12"> Another source of errors in lexicosyntactic hyponymy extraction is illustrated by the sentence &quot;A kit such as Edme Best Bitter, Tom Caxton Best Bitter, or John Bull Best Bitter will be a good starting kit.&quot; (BNC) which indicates the (potentially useful) relations Edme Best Bitter &lt; beer-brewing kit, Tom Caxton Best Bitter &lt; beer-brewing kit, and John Bull Best Bitter &lt; beer-brewing kit, but only when we use the context to infer that the type of &quot;kit&quot; referred to is a beer-brewing kit, a process that is difficult by automatic means. Without this inference, the extracted relations Edme Best Bitter &lt; kit, etc., while correct in a certain sense, are not helpful. One frequent source of such problems is anaphora that require resolution.</Paragraph>
    <Paragraph position="13"> There are also problems related to prepositional phrase attachment.</Paragraph>
  </Section>
</Paper>