<?xml version="1.0" standalone="yes"?> <Paper uid="W02-1017"> <Title>Exploiting Strong Syntactic Heuristics and Co-Training to Learn Semantic Lexicons</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Semantic Lexicon Learning </SectionTitle> <Paragraph position="0"> The goal of our research is to automatically generate a semantic lexicon. For our purposes, we define a semantic lexicon to be a list of words with semantic category labels. For example, the word &quot;bird&quot; might be labeled as an ANIMAL and the word &quot;car&quot; might be labeled as a VEHICLE.</Paragraph> <Paragraph position="1"> Semantic lexicons have proven to be useful for many language processing tasks, including anaphora resolution (Aone and Bennett, 1996; McCarthy and Lehnert, 1995), prepositional phrase attachment (Brill and Resnik, 1994), information extraction (Soderland et al., 1995; Riloff and Schmelzenbach, 1998), and question answering (Harabagiu et al., 2000; Hirschman et al., 1999).</Paragraph> <Paragraph position="2"> Some general-purpose semantic dictionaries already exist, such as WordNet (Miller, 1990). WordNet has been used for many applications, but it may not contain the vocabulary and jargon needed for specialized domains. For example, WordNet does not contain much of the vocabulary found in medical texts. In previous research on semantic lexicon induction, Roark and Charniak (Roark and Charniak, 1998) showed that 3 of every 5 words learned by their system were not present in WordNet. Furthermore, they used relatively unspecialized text corpora: Wall Street Journal articles and terrorism news stories. Our goal is to develop techniques for semantic lexicon induction that could be used to enhance existing resources such as WordNet, or to create dictionaries for specialized domains.</Paragraph> <Paragraph position="3"> Several techniques have been developed to generate semantic knowledge using weakly supervised learning. Hearst (Hearst, 1992) extracted information from lexico-syntactic expressions that explicitly indicate hyponymic relationships. Hearst's work is similar in spirit to ours in that her system identified reliable syntactic structures that explicitly reveal semantic associations. Meta-bootstrapping (Riloff and Jones, 1999) is a semantic lexicon learning technique, very different from ours, that uses information extraction patterns to identify semantically related contexts.</Paragraph> <Paragraph position="4"> Named entity recognizers (e.g., (Bikel et al., 1997; Collins and Singer, 1999; Cucerzan and Yarowsky, 1999)) can be trained to recognize proper names associated with semantic categories such as PERSON or ORGANIZATION, but they typically are not aimed at learning common nouns such as &quot;surgeon&quot; or &quot;drugmaker&quot;.</Paragraph> <Paragraph position="5"> Several researchers have used some of the same syntactic structures that we exploit in our research, namely appositives and compound nouns. For example, Riloff and Shepherd (Riloff and Shepherd, 1997) developed a statistical co-occurrence model for semantic lexicon induction that was designed with these structures in mind. Roark and Charniak (Roark and Charniak, 1998) followed up on this work by using a parser to explicitly capture these structures.
Caraballo (Caraballo, 1999) also exploited these syntactic structures and applied a cosine vector model to produce semantic groupings. In our view, these previous systems used &quot;weak&quot; syntactic models because the syntactic structures sometimes identified desirable semantic associations and sometimes did not. To compensate, statistical models were used to separate the meaningful semantic associations from the spurious ones. In contrast, our work aims to identify &quot;strong&quot; syntactic heuristics that can isolate instances of general structures that reliably identify the desired semantic relations.</Paragraph> </Section> <Section position="4" start_page="0" end_page="5" type="metho"> <SectionTitle> 3 A Bootstrapping Model that Exploits Strong Syntactic Heuristics </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <Paragraph position="0"> For the purposes of this research, we will define two distinct types of lexicons. One lexicon will consist of proper noun phrases, such as &quot;Federal Aviation Administration&quot;. We will call this the PNP (proper noun phrase) lexicon. The second lexicon will consist of common (non-proper) nouns, such as &quot;airplane&quot;. We will call this the GN (general noun) lexicon. The reason for creating these distinct lexicons is that our algorithm takes advantage of syntactic relationships between proper nouns and general nouns.</Paragraph> </Section> <Section position="2" start_page="0" end_page="1" type="sub_section"> <SectionTitle> 3.1 Syntactic Heuristics </SectionTitle> <Paragraph position="0"> Our goal is to build a semantic lexicon of words that belong to the same semantic class. More specifically, we aim to find words that have the same hypernym; for example, &quot;dog&quot; and &quot;frog&quot; would both have the hypernym ANIMAL.</Paragraph> <Paragraph position="1"> We will refer to words that have the same immediate hypernym as semantic siblings. We hypothesize that some syntactic structures can be used to reliably identify semantic siblings. We have identified three candidates: appositives, compound nouns, and identity clauses whose main verb is a form of &quot;to be&quot; (we will call these ISA clauses). The appropriate granularity of a set of semantic classes, or the organization of a semantic hierarchy, is always open to debate. We chose categories that seem to represent important and relatively general semantic distinctions.</Paragraph> <Paragraph position="2"> While these structures often do capture semantic siblings, they frequently capture other types of semantic relationships as well. Therefore we use heuristics to isolate subsets of these syntactic structures that consistently contain semantic siblings. Our heuristics are based on the observation that many of these structures contain both a proper noun phrase and a general noun phrase which are co-referent and usually belong to the same semantic class. In the following sections, we explain the heuristics that we use for each syntactic structure, and how those structures are used to learn new lexicon entries.</Paragraph> <Paragraph position="3"> Appositives are commonly occurring syntactic structures that contain pairs of semantically related noun phrases. A simple appositive structure consists of a noun phrase (NP), followed by a comma, followed by another NP, where the two NPs are coreferent.
However, appositives often signify hypernym relationships (e.g., &quot;the dog, a carnivorous animal&quot;). To identify semantic siblings, we only use appositives that contain one proper noun phrase and one general noun phrase, for example, &quot;George Bush, the president&quot; or &quot;the president, George Bush&quot;. Theoretically, such appositives could also indicate a hypernym relationship (e.g., &quot;George Bush, a mammal&quot;), but we have found that this rarely happens in practice.</Paragraph> <Paragraph position="4"> Compound nouns are extremely common, but they can represent a staggering variety of semantic relationships. We have found one type of compound noun that can be reliably used to harvest semantic siblings. We loosely define these compounds as &quot;GN + PNP&quot; noun phrases, where the compound noun ends with a proper name but is modified with one or more general nouns. Examples of such compounds are &quot;violinist James Braum&quot; or &quot;software maker Microsoft&quot;. One of the difficulties with recognizing these constructs, however, is resolving the ambiguity between adjectives and nouns among the modifiers (e.g., &quot;violinist&quot; is a noun). We only use constructs in which the GN modifier is unambiguously a noun.</Paragraph> <Paragraph position="5"> Certain &quot;to be&quot; clauses can also be harvested to extract semantic siblings. We define an ISA clause as an NP followed by a VP that is a form of &quot;to be&quot;, followed by another NP. These identity clauses also exhibit a wide range of semantic relationships, but harvesting clauses that contain one proper NP and one general NP can reliably identify noun phrases of the same semantic class. We found that this structure yields semantic siblings when the subject NP is constrained to be a proper NP and the object NP is constrained to be a general NP (e.g., &quot;Jing Lee is the president of the company&quot;).</Paragraph> </Section>
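To make these three heuristics concrete, here is a minimal Python sketch of the pair-harvesting step. It is an illustration rather than the authors' implementation: it assumes noun phrases have already been chunked and flagged as proper or general (the paper obtains such information from a parser), and the helper names are ours.

def head_noun(np_text):
    # Simplification: treat the last token of a base NP as its head noun.
    return np_text.split()[-1]

def from_appositive(np1, np2):
    # e.g. "George Bush, the president": require one proper NP and one general NP.
    if np1[1] != np2[1]:                      # an NP is (text, is_proper)
        proper, general = (np1, np2) if np1[1] else (np2, np1)
        return proper[0], head_noun(general[0])
    return None

def from_compound(modifier_nouns, proper_head):
    # e.g. "software maker Microsoft": general-noun modifiers + proper-name head.
    # Assumes the modifiers were already verified to be unambiguous nouns.
    if modifier_nouns:
        return proper_head, modifier_nouns[-1]
    return None

def from_isa_clause(subject_np, object_np):
    # e.g. "Jing Lee is the president of the company": proper subject, general object.
    if subject_np[1] and not object_np[1]:
        return subject_np[0], head_noun(object_np[0])
    return None

Each function returns a (proper noun phrase, general head noun) candidate pair, or None when the heuristic does not apply.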
<Section position="3" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 3.2 The Bootstrapping Model </SectionTitle> <Paragraph position="0"> Figure 1 illustrates the bootstrapping model for each of the three syntactic structures. Initially, the lexicons contain only a few manually defined seed words: some proper noun phrases and some general nouns. The syntactic heuristics are then applied to the text corpus to collect potentially &quot;harvestable&quot; structures. Each heuristic identifies structures with one proper NP and one general NP, where one of them is already present in the lexicon as a member of a desired semantic class. The other NP is then assumed to belong to the same semantic class and is added to a prospective word list. Finally, statistical filtering is used to divide the prospective word lists into exclusive and non-exclusive subsets. We will describe the motivation for this in Section 3.2.2. The exclusive words are added to the lexicon, and the bootstrapping process repeats. In the remainder of this section, we explain how the bootstrapping process works in more detail.</Paragraph> <Paragraph position="1"> The input to our system is a small set of seed words for the semantic categories of interest. To identify good seed words, we sorted all nouns in the corpus by frequency and manually identified the most frequent nouns that belong to each targeted semantic category.</Paragraph> <Paragraph position="2"> Each bootstrapping iteration alternates between using either the PNP lexicon or the GN lexicon to grow the lexicons. As a motivating example, assume that (1) appositives are the targeted syntactic structure, (2) bootstrapping begins by using the PNP lexicon, and (3) PEOPLE is the semantic category of interest. The system will then collect all appositives that contain a proper noun phrase known to be a person. So if &quot;Mary Smith&quot; belongs to the PNP lexicon and the appositive &quot;Mary Smith, the analyst&quot; is encountered, the head noun &quot;analyst&quot; will be learned as a person.</Paragraph> <Paragraph position="3"> The next bootstrapping iteration uses the GN lexicon, so the system will collect all appositives that contain a general noun phrase known to be a person. If the appositive &quot;John Seng, the financial analyst&quot; is encountered, then &quot;John Seng&quot; will be learned as a person because the word &quot;analyst&quot; is known to be a person from the previous iteration. The bootstrapping process will continue, alternately using the PNP lexicon and the GN lexicon, until no new words can be learned.</Paragraph> <Paragraph position="4"> We treat proper noun phrases and general noun phrases differently during learning. When a proper noun phrase is learned, the full noun phrase is added to the lexicon. But when a general noun phrase is learned, only the head noun is added to the lexicon. This approach gives us generality because head nouns are usually (though not always) sufficient to associate a common noun phrase with a semantic class. Proper names, however, often do not exhibit this generality (e.g., &quot;Saint Louis&quot; is a location but &quot;Louis&quot; is not).</Paragraph> <Paragraph position="5"> However, using full proper noun phrases can limit the ability of the bootstrapping process to acquire new terms because exact matches are relatively rare. To compensate, head nouns and modifying nouns of proper NPs are used as predictor terms to recognize new proper NPs that belong to the same semantic class. We identify reliable predictor terms using the evidence and exclusivity measures that we will define in the next section. For example, the word &quot;Mr.&quot; is learned as a good predictor term for the person category. These predictor terms are only used to classify noun phrases during bootstrapping and are not themselves added to the lexicon.</Paragraph> <Paragraph position="6"> Our syntactic heuristics were designed to reliably identify words belonging to the same semantic class, but some erroneous terms still slip through for various reasons, such as parser errors and idiomatic expressions. Perhaps the biggest problem comes from ambiguous terms that can belong to several semantic classes. For instance, in the financial domain &quot;leader&quot; can refer to both people and corporations. If &quot;leader&quot; is added to the person lexicon, then it will pull corporation terms into the lexicon during subsequent bootstrapping iterations and the person lexicon will be compromised.</Paragraph> <Paragraph position="7"> To address this problem, we classify all candidate words as being exclusive to the semantic category or non-exclusive. For example, the word &quot;president&quot; nearly always refers to a person, so it is exclusive to the person category, but the word &quot;leader&quot; is non-exclusive. Only the exclusive terms are added to the semantic lexicon during bootstrapping to keep the lexicon as pure (unambiguous) as possible. The non-exclusive terms can be added to the final lexicon when bootstrapping is finished, if polysemous terms are acceptable in the dictionary.</Paragraph>
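As a rough illustration of this alternating loop, the sketch below is a hedged Python rendering, not the authors' code: the harvested pairs and the is_exclusive filter are placeholders (the filter is defined in the next paragraphs), and the stopping condition is simplified.

def bootstrap(pairs, seed_pnp, seed_gn, is_exclusive):
    # pairs: (proper noun phrase, general head noun) candidates from one heuristic.
    pnp_lex, gn_lex = set(seed_pnp), set(seed_gn)
    use_pnp = True                      # start by matching against the PNP lexicon
    while True:
        learned = set()
        for proper_np, head in pairs:
            if use_pnp and proper_np in pnp_lex and head not in gn_lex:
                learned.add(head)       # e.g. "Mary Smith, the analyst" -> "analyst"
            elif not use_pnp and head in gn_lex and proper_np not in pnp_lex:
                learned.add(proper_np)  # e.g. "John Seng, the financial analyst"
        learned = {w for w in learned if is_exclusive(w)}
        if not learned:
            break                       # stop when no new words can be learned
        (gn_lex if use_pnp else pnp_lex).update(learned)
        use_pnp = not use_pnp           # alternate between the two lexicons
    return pnp_lex, gn_lex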
<Paragraph position="8"> Exclusivity filtering is the only step that uses statistics. Two measures determine whether a word is exclusive to a semantic category. First, we use an evidence measure, defined as evidence(w, c) = S_w,c / S_w, where S_w is the number of times word w was found in the syntactic structure, and S_w,c is the number of times word w was found in the syntactic structure collocated with a member of category c. The evidence measure is the maximum likelihood estimate that a word belongs to a semantic category given that it appears in the targeted syntactic structure (a word is assumed to belong to the category if it is collocated with another category member). Since few words are known category members initially, we use a low threshold value (.25) which simply ensures that a non-trivial proportion of instances are collocated with category members.</Paragraph> <Paragraph position="9"> The second measure that we use, exclusivity, is the number of occurrences found in the given category's prospective list divided by the number of occurrences found in all other categories' prospective lists: exclusivity(w, c) = S_w,c / S_w,¬c, where S_w,c is the number of times word w was found in the syntactic structure collocated with a member of category c, and S_w,¬c is the number of times word w was found in the syntactic structure collocated with a member of a different semantic class. We apply a threshold to this ratio to ensure that the term is exclusive to the targeted semantic category.</Paragraph> </Section>
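The two filters can be written compactly. The sketch below is illustrative rather than the authors' code: the count dictionaries, the comparison operators, and the function names are our assumptions, and the threshold values (.25 for evidence, 5 for exclusivity) are the ones reported in Section 3.3.

def evidence(w, c, total, by_cat):
    # evidence(w, c) = S_w,c / S_w, with total[w] = S_w and by_cat[c][w] = S_w,c
    return by_cat[c].get(w, 0) / total[w] if total.get(w) else 0.0

def exclusivity(w, c, by_cat):
    # exclusivity(w, c) = S_w,c / S_w,notc
    in_cat = by_cat[c].get(w, 0)
    out_cat = sum(counts.get(w, 0) for cat, counts in by_cat.items() if cat != c)
    return in_cat / out_cat if out_cat else float("inf")

def is_exclusive(w, c, total, by_cat, ev_min=0.25, ex_min=5.0):
    # A candidate word is kept only if both measures clear their thresholds.
    return evidence(w, c, total, by_cat) >= ev_min and exclusivity(w, c, by_cat) >= ex_min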
<Section position="4" start_page="1" end_page="2" type="sub_section"> <SectionTitle> 3.3 Experimental Results </SectionTitle> <Paragraph position="0"> We evaluated our system on several semantic categories in two domains. In one set of experiments, we generated lexicons for PEOPLE and ORGANIZATIONS using 2500 Wall Street Journal articles from the Penn Treebank (Marcus et al., 1993). In the second set of experiments, we generated lexicons for PEOPLE, ORGANIZATIONS, and PRODUCTS using approximately 1350 press releases from pharmaceutical companies.</Paragraph> <Paragraph position="1"> Our seeding consisted of 5 proper nouns and 5 general nouns for each semantic category. We used a threshold of 25% for the evidence measure and 5 for the exclusivity ratio. We ran the bootstrapping process until no new words were learned, which ranged from 6 to 14 iterations depending on the category and syntactic structure.</Paragraph> <Paragraph position="2"> Table 1 shows 10 examples of words learned for each semantic category in each domain. The people and organization lists illustrate (1) how dramatically the vocabulary can differ across domains, and (2) that the lexicons may include domain-specific word meanings that are not the most common meaning of a word in general. For example, the word &quot;parent&quot; generally refers to a person, but in a financial domain it nearly always refers to an organization. The pharmaceutical product category contains many nouns (e.g., drug names) that may not be in a general-purpose lexicon such as WordNet.</Paragraph> <Paragraph position="3"> [Table 1 (partial): People (WSJ): ..., Koop; Organization (WSJ): parent, subsidiary, distiller, arm, suitor, AMR Corp., ABB ASEA Brown Boveri, J.P. Morgan, James River, Federal Reserve Board; People (Pharm): surgeon, executive, recipient, co-author, pioneer, Amgen Chief Executive Officer, Barbara Ryan, Chief Scientific Officer Norbert Riedel, Dr. Cole, Analyst Mark Augustine; Organization (Pharm): device-maker, drugmaker, licensee, organization, venture, ALR ...]</Paragraph> <Paragraph position="4"> Tables 2 and 3 show the results of our evaluation. We ran the bootstrapping algorithm on each type of syntactic structure independently. The Total column shows the total number of lexicon entries generated by each syntactic structure. The Correct column contains two accuracy numbers: X/Y. The first value (X) is the percentage of entries that were judged to be correct, and the second value (Y) is the accuracy after removing entries resulting from parser errors. (For example, our parser frequently mistags adjectives as nouns, so many adjectives were hypothesized to be people. If the parser had tagged them correctly, they would not have been allowed in the lexicon.)</Paragraph> <Paragraph position="5"> The PNP lexicons were substantially larger than the GN lexicons, in part because we saved full noun phrases in the PNP lexicon but only head nouns in the GN lexicon. Probably the main reason, however, is that there are many more proper names associated with most semantic categories than there are general nouns. Consequently, we evaluated the PNP and GN lexicons differently. For the GN lexicons, a volunteer (not one of the authors) labeled every word as correct or incorrect. Due to the large size of the PNP lexicons, we randomly sampled 100 words for each syntactic structure and semantic category and asked volunteers to label these samples. Consequently, the PNP evaluation numbers are estimates of the true accuracy.</Paragraph> <Paragraph position="6"> The Union column tabulates the results obtained from unioning the lexicons produced by the three syntactic structures independently. Although there is some overlap in their lexicons, we found that many different words are being learned. This indicates that the three syntactic structures are tapping into different parts of the search space, which suggests that combining them in a co-training model could be beneficial.</Paragraph> <Paragraph position="7"> Since the number of words contributed by each syntactic structure varied greatly, we evaluated the Union results for the PNP lexicon by randomly sampling 100 words from the unioned lexicons regardless of which structure generated them. This maintained the same distribution in our evaluation set as exists in the lexicon as a whole. However, this sampling strategy means that the evaluation results in the Union column are not simply the sum of the results in the preceding columns.</Paragraph> </Section>
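A small Python sketch of the Union-column sampling just described; the function name and the uniform-sampling details are our assumptions, not the authors' code. Drawing uniformly from the pooled lexicon keeps each syntactic structure represented roughly in proportion to the number of words it contributed.

import random

def sample_union(lexicons, k=100, seed=0):
    # lexicons: dict mapping syntactic structure -> set of learned PNP entries.
    pooled = sorted(set().union(*lexicons.values()))   # union across structures
    random.seed(seed)
    return random.sample(pooled, min(k, len(pooled)))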
<Section position="5" start_page="4" end_page="5" type="sub_section"> <SectionTitle> 3.4 Co-Training </SectionTitle> <Paragraph position="0"> Co-training (Blum and Mitchell, 1998) is a learning technique that combines classifiers supporting different views of the data in a single learning mechanism. The co-training model allows examples learned by one classifier to be used by the other classifiers, producing a synergistic effect. The three syntactic structures that we have discussed provide three different ways to harvest semantically related noun phrases.</Paragraph> <Paragraph position="1"> Figure 2 shows our co-training model, with each syntactic structure serving as an independent classifier. The words hypothesized by each classifier are put into a single PNP lexicon and a single GN lexicon, which are shared by all three classifiers. We used an aggressive form of co-training, where all terms hypothesized by a syntactic structure with frequency at or above a threshold are added to the shared lexicon. The threshold ensures some confidence in a term before it is allowed to be used by the other learners. We used a threshold of 3 for the WSJ corpus and 2 for the pharmaceutical corpus since it is substantially smaller. We ran the bootstrapping process until no new words were learned, which was 12 iterations for the WSJ corpus and 10 iterations for the pharmaceutical corpus.</Paragraph> <Paragraph position="2"> classifiers separately). In almost all cases, many additional words were learned using the co-training model. Tables 5 and 6 show the evaluation results for the lexicons produced by co-training. The co-training model produced substantially better coverage, while achieving nearly the same accuracy. One exception was organizations in the pharmaceutical domain, which suffered a sizeable loss in precision.</Paragraph> <Paragraph position="3"> This is most likely due to the co-training loop being too aggressive. If one classifier produces a lot of mistakes (in this case, the compound noun classifier), then those mistakes can drag down the overall accuracy of the lexicon.</Paragraph> </Section> </Section> </Paper>