<?xml version="1.0" standalone="yes"?> <Paper uid="N04-1041"> <Title>Automatically Labeling Semantic Classes</Title>
<Section position="4" start_page="0" end_page="11" type="metho"> <SectionTitle> 3 Labeling Classes </SectionTitle>
<Paragraph position="0"> The research on discovering hyponym relationships discussed above takes a bottom-up approach: patterns are used to independently discover semantic relationships between words. For infrequent words, however, these patterns either fail to match or, worse yet, generate incorrect relationships.</Paragraph>
<Paragraph position="1"> Ours is a top-down approach. We make use of co-occurrence statistics of semantic classes discovered by algorithms like CBC to label their concepts. Hyponym relationships may then be extracted easily: one hyponym per instance/concept-label pair. For example, if we labeled concept (A) from Section 1 with disease, then we could extract is-a relationships such as: diabetes is a disease, cancer is a disease, and lupus is a disease. A concept instance such as lupus is assigned the hypernym disease not because it necessarily occurs in any particular syntactic relationship with disease, but because it belongs to the class of instances that does.</Paragraph>
<Paragraph position="2"> The input to our labeling algorithm is a list of semantic classes, in the form of clusters of words, which may be generated from any source. In our experiments, we used the clustering output of CBC (Pantel and Lin 2002). The output of the system is a ranked list of concept names for each semantic class.</Paragraph>
<Paragraph position="3"> In the first phase of the algorithm, we extract feature vectors for each word that occurs in a semantic class. Phase II then uses these features to compute grammatical signatures of concepts using the CBC algorithm. Finally, we use simple syntactic patterns to discover class names from each class signature. Below, we describe these phases in detail.</Paragraph>
<Section position="1" start_page="0" end_page="11" type="sub_section"> <SectionTitle> 3.1 Phase I </SectionTitle>
<Paragraph position="0"> We represent each word (concept instance) by a feature vector. Each feature corresponds to a context in which the word occurs. For example, catch __ is a verb-object context. If the word wave occurred in this context, then the context is a feature of wave.</Paragraph>
<Paragraph position="1"> We first construct a frequency count vector C(e) = (c_{e1}, c_{e2}, ..., c_{em}), where m is the total number of features and c_{ef} is the frequency count of feature f occurring with word e, i.e. the number of times word e occurred in grammatical context f. For example, if the word wave occurred 217 times as the object of the verb catch, then the feature vector for wave will have value 217 for its object-of catch feature. In Section 4.1, we describe how we obtain these features.</Paragraph>
<Paragraph position="2"> We then construct a mutual information vector MI(e) = (mi_{e1}, mi_{e2}, ..., mi_{em}) for each word e, where mi_{ef} is the pointwise mutual information between word e and feature f, defined as:

  mi_{ef} = \log \frac{c_{ef}/N}{\left(\sum_{i} c_{if}/N\right) \times \left(\sum_{j} c_{ej}/N\right)}   (1)

where N = \sum_{i} \sum_{j} c_{ij} is the total frequency count of all features of all words. Mutual information is commonly used to measure the association strength between two words (Church and Hanks 1989). A well-known problem is that mutual information is biased towards infrequent elements/features. We therefore multiply mi_{ef} by the following discounting factor:

  \frac{c_{ef}}{c_{ef}+1} \times \frac{\min\left(\sum_{i} c_{if},\, \sum_{j} c_{ej}\right)}{\min\left(\sum_{i} c_{if},\, \sum_{j} c_{ej}\right)+1}   (2)
</Paragraph>
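As a concrete illustration of Phase I, the sketch below computes discounted pointwise mutual information vectors from grammatical-context counts. It is only a minimal sketch, not the authors' implementation: the function name `pmi_vectors` and the dictionary-based representation of (word, feature) counts are assumptions made for this example.

```python
from collections import defaultdict
from math import log

def pmi_vectors(counts):
    """Build discounted PMI vectors, following equations (1) and (2).

    counts: dict mapping (word, feature) -> frequency count c_ef,
            e.g. ("wave", "object-of:catch") -> 217.
    Returns a dict mapping each word to {feature: discounted PMI}.
    """
    word_totals = defaultdict(float)     # sum_j c_ej: all features of word e
    feature_totals = defaultdict(float)  # sum_i c_if: all words with feature f
    N = 0.0                              # sum_i sum_j c_ij: grand total
    for (word, feat), c in counts.items():
        word_totals[word] += c
        feature_totals[feat] += c
        N += c

    vectors = defaultdict(dict)
    for (word, feat), c in counts.items():
        pmi = log((c / N) / ((feature_totals[feat] / N) * (word_totals[word] / N)))
        # Discounting factor of equation (2): penalizes rare words and features.
        m = min(feature_totals[feat], word_totals[word])
        vectors[word][feat] = pmi * (c / (c + 1)) * (m / (m + 1))
    return vectors
```

For instance, `pmi_vectors({("wave", "object-of:catch"): 217, ...})` would assign the object-of catch feature of wave a PMI weight scaled by its discount.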
</Section>
<Section position="2" start_page="11" end_page="11" type="sub_section"> <SectionTitle> 3.2 Phase II </SectionTitle>
<Paragraph position="0"> Following (Pantel and Lin 2002), we construct a committee for each semantic class. A committee is a set of representative elements that unambiguously describe the members of a possible class.</Paragraph>
<Paragraph position="1"> For each class c, we construct a matrix containing the similarity between each pair of words e_i and e_j, using the cosine coefficient of their mutual information vectors (Salton and McGill 1983):

  sim(e_i, e_j) = \frac{\sum_{f} mi_{e_i f} \times mi_{e_j f}}{\sqrt{\sum_{f} mi_{e_i f}^{2} \times \sum_{f} mi_{e_j f}^{2}}}
</Paragraph>
<Paragraph position="2"> For each word e, we then cluster its most similar instances using group-average clustering (Han and Kamber 2001) and store as a candidate committee the highest scoring cluster c' according to the metric |c'| \times avgsim(c'), where |c'| is the number of elements in c' and avgsim(c') is the average pairwise similarity between words in c'.</Paragraph>
<Paragraph position="3"> The assumption is that the best representative for a concept is a large set of very similar instances. The committee for class c is then the highest scoring candidate committee containing only words from c. For example, below are the committee members discovered for the semantic classes (A), (B), and (C) from Section 1:</Paragraph>
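The candidate-committee scoring can be illustrated with a short sketch that reuses the MI vectors from Phase I. This is a hedged sketch: the group-average clustering step that proposes candidate clusters (Han and Kamber 2001) is omitted, and the helper names `cosine_sim` and `committee_score` are illustrative rather than part of the CBC implementation.

```python
from itertools import combinations
from math import sqrt

def cosine_sim(mi_a, mi_b):
    """Cosine coefficient between two sparse MI vectors (dicts of feature -> score)."""
    num = sum(mi_a[f] * mi_b[f] for f in mi_a.keys() & mi_b.keys())
    den = sqrt(sum(v * v for v in mi_a.values()) * sum(v * v for v in mi_b.values()))
    return num / den if den else 0.0

def committee_score(candidate, vectors):
    """Score a candidate committee c' (list of words) by |c'| * avgsim(c')."""
    pairs = list(combinations(candidate, 2))
    if not pairs:
        return 0.0
    avgsim = sum(cosine_sim(vectors[a], vectors[b]) for a, b in pairs) / len(pairs)
    return len(candidate) * avgsim
```

The candidate with the highest `committee_score` that contains only words from the class would then serve as its committee.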
</Section>
<Section position="3" start_page="11" end_page="11" type="sub_section"> <SectionTitle> 3.3 Phase III </SectionTitle>
<Paragraph position="0"> By averaging the feature vectors of the committee members of a particular semantic class, we obtain a grammatical template, or signature, for that class. For example, Figure 1 shows an excerpt of the grammatical signature for concept (B) in Section 1. The vector is obtained by averaging the feature vectors for the words Curtis Joseph, John Vanbiesbrouck, Mike Richter, and Tommy Salo (the committee of this concept). The -V:subj:N:sprawl feature indicates a subject-verb relationship between the concept and the verb sprawl, while N:appo:N:goaltender indicates an apposition relationship between the concept and the noun goaltender. The (-) in a relationship means that the right-hand side of the relationship is the head (e.g. sprawl is the head of the subject-verb relationship). The two columns of numbers indicate the frequency and the mutual information score of each feature, respectively.</Paragraph>
<Paragraph position="1"> In order to discover the characteristics of human naming conventions, we manually named 50 concepts discovered by CBC. For each concept, we extracted the relationships between the concept committee and the assigned label. We then added up the mutual information scores of each extracted relationship across the 50 concepts. The four highest scoring relationships are:
* Apposition (N:appo:N) e.g. ... Oracle, a company known for its progressive employment policies, ...
* Nominal subject (-N:subj:N) e.g. ... Apple was a hot young company, with Steve Jobs in charge.
* Such as (-N:such as:N) e.g. ... companies such as IBM must be weary ...
* Like (-N:like:N) e.g. ... companies like Sun Microsystems do not shy away from such challenges, ...</Paragraph>
<Paragraph position="2"> To name a class, we simply search for these syntactic relationships in the signature of a concept. We sum up the mutual information scores of each term that occurs in these relationships with a committee of a class. The highest scoring term is the name of the class. For example, the top-5 scoring terms that occurred in these relationships with the signature of the concept represented by the committee {Curtis Joseph, John Vanbiesbrouck, Mike Richter, Tommy Salo} are listed in the accompanying figure; the numbers are the total mutual information scores of each name in the four syntactic relationships.</Paragraph>
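The naming step reduces to an accumulation over signature features. Below is a minimal sketch, assuming the signature is available as a map from (relation, term) pairs to mutual information scores; the relation strings mirror the paper's notation, but the data layout, function name, and example value are illustrative assumptions.

```python
# The four naming relationships identified above.
NAMING_RELATIONS = ("N:appo:N", "-N:subj:N", "-N:such as:N", "-N:like:N")

def name_class(signature, top_k=5):
    """Rank candidate class names for a concept signature.

    signature: dict mapping (relation, term) -> mutual information score,
               e.g. ("N:appo:N", "goaltender") -> 45.8 (hypothetical value).
    Returns the top_k candidate names with their summed MI scores.
    """
    scores = {}
    for (relation, term), mi in signature.items():
        if relation in NAMING_RELATIONS:
            scores[term] = scores.get(term, 0.0) + mi
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
```

Applied to a signature like the one in Figure 1, a term such as goaltender would accumulate its score from the apposition relationship.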
</Section> </Section>
<Section position="5" start_page="11" end_page="11" type="metho"> <SectionTitle> 4 Evaluation </SectionTitle>
<Paragraph position="0"> In this section, we present an evaluation of the class labeling algorithm and of the hyponym relationships discovered by our system.</Paragraph>
<Section position="1" start_page="11" end_page="11" type="sub_section"> <SectionTitle> 4.1 Experimental Setup </SectionTitle>
<Paragraph position="0"> We used Minipar (Lin 1994), a broad-coverage parser, to parse 3GB of newspaper text from the Aquaint (TREC-9) collection. We collected the frequency counts of the grammatical relationships (contexts) output by Minipar and used them to compute the pointwise mutual information vectors described in Section 3.1.</Paragraph>
<Paragraph position="1"> We used the 1432 noun clusters extracted by CBC as the list of concepts to name, and applied the algorithm described in Section 3 to extract the top-20 names for each concept.</Paragraph>
</Section>
<Section position="2" start_page="11" end_page="11" type="sub_section"> <SectionTitle> 4.2 Labeling Precision </SectionTitle>
<Paragraph position="0"> Out of the 1432 noun concepts, we were unable to name 21 (1.5%) of them. This occurs when a concept's committee members do not occur in any of the four syntactic relationships described in Section 3.3. We performed a manual evaluation of the remaining 1411 concepts.</Paragraph>
<Paragraph position="1"> We randomly selected 125 concepts and their top-5 highest ranking names according to our algorithm. Table 1 shows the first 10 randomly selected concepts (each concept is represented by three of its committee members).</Paragraph>
<Paragraph position="2"> For each concept, we added to the list of names a human-generated name (obtained from an annotator looking only at the concept instances). We also appended concept names extracted from WordNet. For each concept that contains at least five instances in the WordNet hierarchy, we named the concept with the most frequent common ancestor of each pair of instances. Up to five names were generated by WordNet for each concept. Because of the low coverage of proper nouns in WordNet, only 33 of the 125 concepts we evaluated had WordNet-generated labels.</Paragraph>
<Paragraph position="3"> We presented to three human judges the 125 randomly selected concepts together with the system, human, and WordNet generated names, randomly ordered. That way, a judge could know neither the source of a label nor the system's ranking of the labels. For each name, we asked the judges to assign a score of correct, partially correct, or incorrect. We then computed the mean reciprocal rank (MRR) of the system, human, and WordNet labels. For each concept, a naming scheme receives a score of 1/M, where M is the rank of the first name judged correct. Table 2 shows the results. Table 3 shows similar results for a more lenient evaluation where M is the rank of the first name judged correct or partially correct.</Paragraph>
<Paragraph position="4"> Our system achieved an overall MRR score of 77.1%. We performed much better than the WordNet baseline (19.9%) because of the hierarchy's lack of coverage (mostly of proper nouns). For the 33 concepts that WordNet named, it achieved a score of 75.3% and a lenient score of 82.7%, which is high considering the simple algorithm we used to extract labels from WordNet.</Paragraph>
<Paragraph position="5"> The Kappa statistic (Siegel and Castellan Jr. 1988) measures the agreement between a set of judges' assessments, correcting for chance agreement:

  K = \frac{P(A) - P(E)}{1 - P(E)}

where P(A) is the probability of agreement between the judges and P(E) is the probability that the judges agree by chance on an assessment. An experiment with K >= 0.8 is generally viewed as reliable, and 0.67 < K < 0.8 allows tentative conclusions. The Kappa statistic for our experiment is K = 0.72.</Paragraph>
<Paragraph position="6"> The human labeling is at a disadvantage since only one label was generated per concept; the human therefore scores either 1 or 0 for each concept. Our system's highest-ranking name was correct 72% of the time. Table 4 shows the percentage of semantic classes with a correct label in the top 1-5 ranks returned by our system.</Paragraph>
<Paragraph position="7"> Overall, 41.8% of the top-5 names extracted by our system were judged correct. The overall accuracy for the top-4, top-3, top-2, and top-1 names is 44.4%, 48.8%, 58.5%, and 72%, respectively. Hence, the name ranking of our algorithm is effective.</Paragraph>
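For reference, a minimal sketch of the MRR computation used above, assuming each concept's ranked labels carry one of the three judgments; concepts with no accepted label are assumed to contribute zero.

```python
def mean_reciprocal_rank(assessments, accept=("correct",)):
    """Mean reciprocal rank over a collection of concepts.

    assessments: one list of judgments per concept, ordered by the system's
                 ranking, e.g. ["incorrect", "correct", "partially correct"].
    accept: judgments counted as a hit; use ("correct",) for the strict score
            and ("correct", "partially correct") for the lenient score.
    """
    total = 0.0
    for judgments in assessments:
        for rank, judgment in enumerate(judgments, start=1):
            if judgment in accept:
                total += 1.0 / rank  # score 1/M for the first accepted label
                break
    return total / len(assessments)
```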
</Section>
<Section position="3" start_page="11" end_page="11" type="sub_section"> <SectionTitle> 4.3 Hyponym Precision </SectionTitle>
<Paragraph position="0"> The 1432 CBC concepts contain 18,000 unique words. For each concept to which a word belongs, we extracted up to 3 hyponyms, one for each of the top-3 labels of the concept. The result was 159,000 hyponym relationships; 24 of them are shown in the Appendix.</Paragraph>
<Paragraph position="1"> Two judges annotated two random samples of 100 relationships: one from all 159,000 hyponyms and one from the subset of 65,000 proper nouns. For each instance, the judges were asked to decide whether the hyponym relationship was correct, partially correct, or incorrect. Table 5 shows the results. The strict measure counts a score of 1 for each correctly judged instance and 0 otherwise. The lenient measure also gives a score of 0.5 for each instance judged partially correct.</Paragraph>
<Paragraph position="2"> Many of the CBC concepts contain noise. For example, the wine cluster: Zinfandel, merlot, Pinot noir, Chardonnay, Cabernet Sauvignon, cabernet, riesling, Sauvignon blanc, Chenin blanc, sangiovese, syrah, Grape, Chianti ... contains some incorrect instances such as grape, appellation, and milk chocolate. Each of these instances generates incorrect hyponyms such as grape is a wine and milk chocolate is a wine. This hyponym extraction task would likely serve well for evaluating the accuracy of lists of semantic classes.</Paragraph>
<Paragraph position="3"> Table 5 shows that hyponyms involving proper nouns are much more reliable than those involving common nouns. Since WordNet's coverage of proper nouns is poor, these relationships could be useful for enriching it.</Paragraph>
</Section>
<Section position="4" start_page="11" end_page="11" type="sub_section"> <SectionTitle> 4.4 Recall </SectionTitle>
<Paragraph position="0"> Semantic extraction tasks are notoriously difficult to evaluate for recall. To approximate recall, we conducted two question answering (QA) tasks: answering definition questions and performing QA information retrieval.</Paragraph>
<Paragraph position="1"> Definition Questions. We chose the 50 definition questions that appeared in the QA track of TREC-2003 (Voorhees 2003), for example: Who is Aaron Copland? and What is the Kama Sutra? For each question we looked for at most five corresponding concepts in our hyponym list. For example, for Aaron Copland, we found the following hypernyms: composer, music, and gift. We compared our system with the concepts in WordNet and with Fleischman et al.'s instance/concept relations (Fleischman et al. 2003). Table 6 shows the percentage of correct answers in the top-1 and top-5 returned answers from each system. All systems have similar performance on the top-1 answers, but our system returns many more correct answers in the top-5. This shows that our system has comparatively higher recall for this task.</Paragraph>
<Paragraph position="2"> Information (Passage) Retrieval. Passage retrieval is used in QA to supply relevant information to an answer pinpointing module. The higher the performance of the passage retrieval module, the higher the performance of the answer pinpointing module will be.</Paragraph>
<Paragraph position="3"> The passage retrieval module can make use of the hyponym relationships discovered by our system. Given a question such as What color ..., the likelihood of a correct answer being present in a retrieved passage is greatly increased if we know the set of all possible colors and index them appropriately in the document collection.</Paragraph>
<Paragraph position="4"> We used the hyponym relations learned by our system to perform semantic indexing on a QA passage retrieval task. We selected the 179 questions from the QA track of TREC-2003 that had an explicit semantic answer type (e.g. What band was Jerry Garcia with? and What color is the top stripe on the U.S. flag?). For each expected semantic answer type corresponding to a given question (e.g. band and color), we indexed the entire TREC-2002 IR collection with our system's hyponyms.</Paragraph>
<Paragraph position="5"> We compared the passages returned by the passage retrieval module with and without the semantic indexing, counting how many of the 179 questions had a correct answer returned in the top-1 and top-100 passages. Table 7 shows the results. Our system shows small gains in the performance of the IR output; in the top-1 category, performance improved by 20%. This may lead to better answer selection.</Paragraph>
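As an illustration of the semantic indexing idea, the sketch below inverts the hyponym list into a map from class labels to known instances and uses it to mark candidate answers in a passage. The paper does not describe its indexing mechanism, so the function names and data layout here are purely illustrative assumptions.

```python
from collections import defaultdict

def build_type_index(hyponyms):
    """Invert (instance, class label) pairs, e.g. ("azure", "color"),
    into a map from label to the set of its known instances."""
    index = defaultdict(set)
    for instance, label in hyponyms:
        index[label].add(instance.lower())
    return index

def annotate_passage(tokens, answer_type, type_index):
    """Flag tokens that are known instances of the expected answer type
    (e.g. color, band), so retrieval can favor passages containing at
    least one candidate answer."""
    instances = type_index.get(answer_type, set())
    return [(token, token.lower() in instances) for token in tokens]
```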
</Section> </Section>
<Section position="6" start_page="11" end_page="11" type="metho"> <SectionTitle> 5 Conclusions and Future Work </SectionTitle>
<Paragraph position="0"> Current state-of-the-art concept discovery algorithms generate lists of instances of semantic classes but stop short of labeling the classes with concept names. Class labels would be useful in applications such as question answering, where a question concept can be mapped into a semantic class and answers then sought within that class.</Paragraph>
<Paragraph position="1"> We propose here an algorithm for automatically labeling concepts that searches for syntactic patterns within a grammatical template for a class. Of the 1432 noun concepts discovered by CBC, our system labeled 98.5%, with an MRR score of 77.1% in a human evaluation.</Paragraph>
<Paragraph position="2"> Hyponym relationships were then easily extracted, one for each instance/concept-label pair. We extracted 159,000 hyponyms and achieved a precision of 68%. On a subset of 65,000 proper names, our precision was 81.5%.</Paragraph>
<Paragraph position="3"> This work is an important step toward building large-scale semantic knowledge bases. Without the ability to automatically name a cluster and extract hyponym/hypernym relationships, the utility of automatically generated clusters or manually compiled lists of terms is limited. Of course, it remains a serious open question how many names each cluster (concept) should have, and how good each name is. Our method begins to address this thorny issue by scoring the names assigned to a class and by simultaneously assigning each element a number that can be interpreted as the strength of its membership in the class. This is potentially a significant step away from traditional all-or-nothing semantic/ontology representations toward a concept representation scheme that is more nuanced and admits multiple names and graded set memberships.</Paragraph>
</Section>
<Section position="7" start_page="11" end_page="11" type="metho"> <SectionTitle> Acknowledgements </SectionTitle>
<Paragraph position="0"> The authors wish to thank the reviewers for their helpful comments. This research was partly supported by NSF grant #EIA-0205111.</Paragraph>
</Section> </Paper>