<?xml version="1.0" standalone="yes"?> <Paper uid="C92-2082"> <Title>Automatic Acquisition of Hyponyms from Large Text Corpora</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Lexico-Syntactic Patterns for Hyponymy </SectionTitle> <Paragraph position="0"> Since only a subset of the possible instances of the hyponymy relation will appear in a particular form, we need to make use of as many patterns as possible. [ACTES DE COLING-92, NANTES, 23-28 AOÛT 1992 540 PROC. OF COLING-92, NANTES, AUG. 23-28, 1992] Below is a list of lexico-syntactic patterns that indicate the hyponymy relation, followed by illustrative sentence fragments and the predicates that can be derived from them (detail about the environment surrounding the patterns is omitted for simplicity): (2) such NP as {NP ,}* {(or | and)} NP ... works by such authors as Herrick, Goldsmith, and Shakespeare.</Paragraph> <Paragraph position="1"> ⇒ hyponym(&quot;author&quot;, &quot;Herrick&quot;), hyponym(&quot;author&quot;, &quot;Goldsmith&quot;), hyponym(&quot;author&quot;, &quot;Shakespeare&quot;) (3) NP {, NP}* {,} or other NP Bruises, wounds, broken bones or other injuries ...</Paragraph> <Paragraph position="2"> ⇒ hyponym(&quot;bruise&quot;, &quot;injury&quot;), hyponym(&quot;wound&quot;, &quot;injury&quot;), hyponym(&quot;broken bone&quot;, &quot;injury&quot;) (4) NP {, NP}* {,} and other NP ... temples, treasuries, and other important civic buildings.</Paragraph> <Paragraph position="3"> ⇒ hyponym(&quot;temple&quot;, &quot;civic building&quot;), hyponym(&quot;treasury&quot;, &quot;civic building&quot;) (5) NP {,} including {NP ,}* {or | and} NP All common-law countries, including Canada and England ...</Paragraph> <Paragraph position="4"> ⇒ hyponym(&quot;Canada&quot;, &quot;common-law country&quot;), hyponym(&quot;England&quot;, &quot;common-law country&quot;) (6) NP {,} especially {NP ,}* {or | and} NP ... 
most European countries, especially France, England, and Spain.</Paragraph> <Paragraph position="5"> ⇒ hyponym(&quot;France&quot;, &quot;European country&quot;), hyponym(&quot;England&quot;, &quot;European country&quot;), hyponym(&quot;Spain&quot;, &quot;European country&quot;) When a relation hyponym(NP0, NP1) is discovered, aside from some lemmatizing and removal of unwanted modifiers, the noun phrase is left as an atomic unit, not broken down and analyzed. If a more detailed interpretation is desired, the results can be passed on to a more intelligent or specialized language analysis component. And, as mentioned above, this kind of discovery procedure can be a partial solution for a problem like noun phrase interpretation because at least part of the meaning of the phrase is indicated by the hyponymy relation.</Paragraph> <Paragraph position="6"> and we usually want them to be singular. Adjectival quantifiers such as &quot;other&quot; and &quot;some&quot; are usually undesirable and can be eliminated in most cases without making the statement of the hyponym relation erroneous. Comparatives such as &quot;important&quot; and &quot;smaller&quot; are usually best removed, since their meaning is relative and dependent on the context in which they appear.</Paragraph> <Paragraph position="7"> How much modification is desirable depends on the application to which the lexical relations will be put. For building up a basic, general-domain thesaurus, single-word nouns and very common compounds are most appropriate. For a more specialized domain, more modified terms have their place. 
For example, noun phrases in the medical domain often have several layers of modification which should be preserved in a taxonomy of medical terms.</Paragraph> <Paragraph position="8"> Other difficulties and concerns are discussed in Section 3.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Discovery of New Patterns </SectionTitle> <Paragraph position="0"> How can these patterns be found? Initially we discovered patterns (1)-(3) by observation, looking through text and noticing the patterns and the relationships they indicate. In order to find new patterns automatically, we sketch the following procedure: 1. Decide on a lexical relation, R, that is of interest, e.g., &quot;group/member&quot; (in our formulation this is a subset of the hyponymy relation).</Paragraph> <Paragraph position="1"> 2. Gather a list of terms for which this relation is known to hold, e.g., &quot;England-country&quot;. This list can be found automatically using the method described here, bootstrapping from patterns found by hand, or by bootstrapping from an existing lexicon or knowledge base.</Paragraph> <Paragraph position="2"> 3. Find places in the corpus where these expressions occur syntactically near one another and record the environment.</Paragraph> <Paragraph position="3"> 4. Find the commonalities among these environments and hypothesize that common ones yield patterns that indicate the relation of interest. 5. Once a new pattern has been positively identified, use it to gather more instances of the target relation and go to Step 2.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Some Considerations </SectionTitle> <Paragraph position="0"> In example (4) above, the full noun phrase corresponding to the hypernym is &quot;other important civic buildings&quot;. 
This illustrates a difficulty that arises from using free text as the data source, as opposed to a dictionary: often the form that a noun phrase occurs in is not what we would like to record. For example, nouns frequently occur in their plural form We tried this procedure by hand using just one pair of terms at a time. In the first case we tried the &quot;England-country&quot; example, and with just this pair we found new patterns (4) and (5), as well as (1)-(3), which were already known. Next we tried &quot;tank-vehicle&quot; and discovered a very productive pattern, pattern (6). (Note that for this pattern, even though it has an emphatic element, this does not affect the fact that the relation indicated is hyponymic.) We have tried applying this technique to meronymy (i.e., the part/whole relation), but without great success. The patterns found for this relation do not tend to uniquely identify it, but can be used to express other relations as well. It may be the case that in English the hyponymy relation is especially amenable to this kind of analysis, perhaps due to its &quot;naming&quot; nature. However, we have had some success at identification of more specific relations, such as patterns that indicate certain types of proper nouns.</Paragraph> <Paragraph position="1"> We have not implemented an automatic version of this algorithm, primarily because Step 4 is underdetermined.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.3 Related Work </SectionTitle> <Paragraph position="0"> This section discusses work in acquisition of lexical information from text corpora, although as mentioned earlier, significant work has been done in acquiring lexical information from MRDs.</Paragraph> <Paragraph position="1"> (Coates-Stephens 1991) acquires semantic descriptions of proper nouns in a system called FUNES. 
FUNES attempts to fill in frame roles (e.g., name, age, origin, position, and works-for, for a person frame) by processing newswire text. This system is similar to the work described here in that it recognizes some features of the context in which the proper noun occurs in order to identify some relevant semantic attributes. For instance, Coates-Stephens mentions that &quot;known as&quot; can explicitly introduce meanings for terms, as can appositives. We also have considered these markers, but the former often does not cleanly indicate &quot;another name for&quot; and the latter is difficult to recognize accurately. FUNES differs quite strongly from our approach in that, because it is able to fill in many kinds of frame roles, it requires a parser that produces a detailed structure, and it requires a domain-dependent knowledge base/lexicon.</Paragraph> <Paragraph position="2"> (Velardi &amp; Pazienza 1989) makes use of hand-coded selection restriction and conceptual relation rules in order to assign case roles to lexical items, and (Jacobs &amp; Zernik 1988) uses extensive domain knowledge to fill in missing category information for unknown words.</Paragraph> <Paragraph position="3"> Work on acquisition of syntactic information from text corpora includes Brent's (Brent 1991) verb subcategorization frame recognition technique and</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Smadja's (Smadja &amp; McKeown 1990) collocation acquisition </SectionTitle> <Paragraph position="0"> algorithm. (Calzolari &amp; Bindi 1990) use corpus-based statistical association ratios to determine lexical information such as prepositional complementation relations, modification relations, and significant compounds.</Paragraph> <Paragraph position="1"> Our methodology is similar to Brent's in its effort to distinguish clear pieces of evidence from ambiguous ones. 
The assumption is that, given a large enough corpus, the algorithm can afford to wait until it encounters clear examples. Brent's algorithm relies on a clever trick: in the configuration of interest (in this case, verb valence descriptions), where noun phrases are the source of ambiguity, it uses only sentences which have pronouns in the crucial position, since pronouns do not allow this ambiguity. This approach is quite effective, but the disadvantage is that it isn't clear that it is applicable to any other tasks. The approach presented in this paper, using the algorithm sketched in the previous subsection, is potentially extensible.</Paragraph> </Section> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Incorporating Results into </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> WordNet </SectionTitle> <Paragraph position="0"> To validate this acquisition method, we compared the results of a restricted version of the algorithm with information found in WordNet. WordNet (Miller et al. 1990) is a hand-built online thesaurus whose organization is modeled after the results of psycholinguistic research. To use the authors' words, WordNet &quot;... is an attempt to organize lexical information in terms of word meanings, rather than word forms. In that respect, WordNet resembles a thesaurus more than a dictionary ...&quot; To this end, word forms with synonymous meanings are grouped into sets, called synsets. This allows a distinction to be made between senses of homographs. For example, the noun &quot;board&quot; appears in the synsets {board, plank} and {board, committee}, and this grouping serves for the most part as the word's definition. In version 1.1, WordNet contains about 34,000 noun word forms, including some compounds and proper nouns, organized into about 26,000 synsets. 
Noun synsets are organized hierarchically according to the hyponymy relation with implied inheritance and are further distinguished by values of features such as meronymy.</Paragraph> <Paragraph position="1"> WordNet's coverage and structure are impressive and provide a good basis for an automatic acquisition algorithm to build on.</Paragraph> <Paragraph position="2"> When comparing a result hyponym(N0, N1) to the contents of WordNet's noun hierarchy, three kinds of outcomes are possible: Verify. If both N0 and N1 are in WordNet, and if the relation hyponym(N0, N1) is in the hierarchy (possibly through transitive closure) then the thesaurus is verified.</Paragraph> <Paragraph position="3"> Critique. If both N0 and N1 are in WordNet, and if the relation hyponym(N0, N1) is not in the hierarchy (even through transitive closure) then the thesaurus is critiqued, i.e., a new set of hyponym connections is suggested.</Paragraph> <Paragraph position="4"> Augment. If one or both of N0 and N1 are not present then these noun phrases and their relation are suggested as entries.</Paragraph> <Paragraph position="5"> As an example of critiquing, consider the following The text indicates that a printer is a kind of input-output device. Figure 1 indicates the portion of the hyponymy relation in WordNet's noun hierarchy that has to do with printers and devices. Note that although the terms device and printer are present, they are not linked in such a way as to allow the easy insertion of I/O device under the more general device and over the more specific printer. 
Although it is not obvious what to suggest to fix this portion of the hierarchy from this one relation alone, it is clear that its discovery ... [Figure 2, format hypernym: hyponyms. liqueurs: anisette*, absinthe*; rocks: granite*; substances: phosphorus*, nitrogen*; species: steatornis, oilbirds; bivalves: scallop*; fungi: smuts*, rusts*; fabrics: acrylics*, nylon*, silk*; antibiotics: ampicillin, erythromycin*; institutions: temples, king; seabirds: penguins, albatross*; flatworms: tapeworms, planaria; amphibians: frogs*; waterfowl: ducks; legumes: lentils*, beans*, nuts; organisms: horsetails, ferns, mosses; rivers: Sevier, Carson, Humboldt; fruit: olives*, grapes*; hydrocarbons: benzene, gasoline; ideologies: liberalism, conservatism; industries: steel, iron, shoes; minerals: pyrite*, galena; phenomena: lightning*; infection: meningitis; dyes: quercitron] Most of the terms in WordNet's noun hierarchy are unmodified nouns or nouns with a single modifier. For this reason, in this experiment we only extracted relations consisting of unmodified nouns in both the hypernym and hyponym roles (although determiners and a very small set of quantifier adjectives are allowed: &quot;some&quot;, &quot;many&quot;, &quot;certain&quot;, and &quot;other&quot;). Making this restriction is also useful because of the difficulties with determining which modifiers are significant, as touched on above, and because it seems easier to make a judgement call about the correctness of the classification of unmodified nouns for evaluation purposes.</Paragraph> <Paragraph position="6"> Since we are trying to acquire lexical information our parsing mechanism should not be one that requires extensive lexical information. In order to detect the lexico-syntactic patterns, we use a unification-based constituent analyzer (taken from (Batali 1991)), which builds on the output of a part-of-speech tagger (Cutting et al. 1991). (All code described in this report is written in Common Lisp and run on Sun SparcStations.) 
We wrote grammar rules for the constituent analyzer to recognize the pattern in (1a). As mentioned above, in this experiment we are detecting only unmodified nouns. Therefore, when a noun is found in the hypernym position, that is, before the lexemes &quot;such as&quot;, we check for the noun's inclusion in a relative clause, or as part of a larger noun phrase that includes an appositive or a parenthetical. Using the constituent analyzer, it is not necessary to parse the entire sentence; instead we look at just enough local context around the lexical items in the pattern to ensure that the nouns in the pattern are isolated.</Paragraph> <Paragraph position="7"> After the hypernym is detected the hyponyms are identified. Often they occur in a list and each element in the list holds a hyponym relation with the hypernym. The main difficulty here lies in determining the extent of the last term in the list.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Results and Evaluation </SectionTitle> <Paragraph position="0"> Figure 2 illustrates some of the results of a run of the acquisition algorithm on Grolier's American Academic Encyclopedia (Grolier 1990), where a restricted version of pattern (1a) is the target (space constraints do not allow a full listing of the results). After the relations are found they are looked up in WordNet. We placed the WordNet noun hierarchy into a b-tree data structure for efficient retrieval and update and used a breadth-first search to search through the transitive closure.</Paragraph> <Paragraph position="1"> Out of 8.6M words of encyclopedia text, there are 7067 sentences that contain the lexemes &quot;such as&quot; contiguously. 
Out of these, 152 relations fit the restrictions of the experiment, namely that both the hyponyms and the hypernyms are unmodified (with the exceptions mentioned above). When the restrictions were eased slightly, so that NPs consisting of two nouns or a present/past participle plus a noun were allowed, 330 relations were found. When the latter experiment was run on about 20M words of New York Times text, 3178 sentences contained &quot;such as&quot; contiguously, and 46 relations were found using the strict no-modifiers criterion.</Paragraph> <Paragraph position="2"> When the set of 152 Grolier's relations was looked up in WordNet, 180 out of the 226 unique words involved in the relations actually existed in the hierarchy, and 61 out of the 106 feasible relations (i.e., relations in which both terms were already registered in WordNet) were found.</Paragraph> <Paragraph position="3"> The quality of the relations found seems high overall, although there are difficulties. As is to be expected, metonymy occurs, as seen in hyponym(&quot;king&quot;, &quot;institution&quot;). A more common problem is underspecification. For example, one relation is hyponym(&quot;steatornis&quot;, &quot;species&quot;), which is problematic because what kind of species needs to be known and most likely this information was mentioned in the previous sentence. Similarly, relations were found between &quot;device&quot; and &quot;plot&quot;, &quot;metaphor&quot;, and &quot;character&quot;, underspecifying the fact that literary devices of some sort are under discussion.</Paragraph> <Paragraph position="4"> Sometimes the relationship expressed is slightly askance of the norm. For example, the algorithm finds hyponym(&quot;Washington&quot;, &quot;nationalist&quot;) and hyponym(&quot;aircraft&quot;, &quot;target&quot;), which are somewhat context and point-of-view dependent. This is not necessarily a problem; as mentioned above, finding alternative ways of stating similar notions is one of our goals. 
However, it is important to try to distinguish the more canonical and context-independent relations for entry in a thesaurus.</Paragraph> <Paragraph position="5"> There are a few relations whose hypernyms are very high-level terms, e.g., &quot;substance&quot; and &quot;form&quot;. These are not incorrect; they just may not be as useful as more specific relations.</Paragraph> <Paragraph position="6"> Overall, the results are encouraging. Although the number of relations found is small compared to the size of the text used, this situation can be greatly improved in several ways. Less stringent restrictions will increase the numbers, as the slight loosening shown in the Grolier's experiment indicates. A more savvy grammar for the constituent analyzer should also increase the results.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Automatic Updating </SectionTitle> <Paragraph position="0"> The question arises as to how to automatically insert relations between terms into the hierarchy. This involves two main difficulties. First, if both lexical expressions are present in the noun hierarchy but one or both have more than one sense, the algorithm must decide which senses to link together. We have preliminary ideas as to how to work around this problem.</Paragraph> <Paragraph position="1"> Say the hyponym in question has only one sense, but the hypernym has several. Then the task is simplified to determining which sense of the hypernym to link the hyponym to. We can then make use of a lexical disambiguation algorithm, e.g., (Hearst 1991), to determine which sense of the hypernym is being used in the sample sentence.</Paragraph> <Paragraph position="2"> Furthermore, since we've assumed the hyponym has only one main sense we could do the following: Look through a corpus for occurrences of the hyponym and see if its environment tends to be similar to one of the senses of its hypernym. 
For example, if the hypernym is &quot;bank&quot; and the hyponym is &quot;First National&quot;, every time, within a sample of text, the term &quot;First National&quot; occurs, replace it with &quot;bank&quot;, and then run the disambiguation algorithm as usual. If this term can be positively classified as having one sense of bank over the others, then this would provide strong evidence as to which sense of the hypernym to link the hyponym to. This idea is purely speculative; we have not yet tested it.</Paragraph> <Paragraph position="3"> The second main problem with inserting new relations arises when one or both terms do not occur in the hierarchy at all. In this case, we have to determine which, if any, existing synset the term belongs in and then do the sense determination mentioned above.</Paragraph> </Section> </Section> </Paper>