<?xml version="1.0" standalone="yes"?> <Paper uid="P95-1038"> <Title>Evaluation of Semantic Clusters</Title> <Section position="3" start_page="0" end_page="284" type="metho"> <SectionTitle> 2 The Need </SectionTitle> <Paragraph position="0"> Although much work has been done on extracting the semantic classes of a given domain, relatively little attention has been paid to the task of evaluating the generated classes. In the absence of an evaluation scheme, the only way to decide whether the semantic classes produced by a system are &quot;reasonable&quot; is to have an expert analyze them by inspection. Such informal evaluations make it very difficult to compare one set of classes against another and are not very reliable estimates of the quality of a set of classes. It is clear that a formal evaluation scheme would be of great help.</Paragraph> <Paragraph position="1"> Hatzivassiloglou and McKeown (1993) cluster adjectives into partitions and present an interesting evaluation to compare the generated adjective classes against those provided by an expert. Their evaluation scheme bases the comparison between two classes on the presence or absence of pairs of words in them. Their approach involves filling in a YES-NO contingency table based on whether a pair of words (adjectives, in their case) is classified in the same class by the human expert and by the system.</Paragraph> <Paragraph position="2"> This method works very well for partitions. However, if it is used to evaluate sets of classes where the classes may be potentially overlapping, their technique yields a weaker measure, since the same word pair could be present in more than one class.</Paragraph> <Paragraph position="3"> An ideal scheme for evaluating semantic classes should be able to handle overlapping classes (as opposed to partitions) as well as hierarchies.</Paragraph> <Paragraph position="4">
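The pair-based scheme described above can be sketched as follows. This is a minimal illustration, not the authors' implementation; the function name and the example classes are hypothetical. Each cell of the YES-NO table counts word pairs according to whether the system and the expert place the pair in a common class.

```python
from itertools import combinations

def pair_contingency(system_classes, expert_classes):
    """Fill the pair-based YES-NO contingency table for two clusterings.

    Each clustering is a list of sets of words. A pair of words counts
    as YES for a clustering if some class contains both words.
    """
    def together(classes):
        pairs = set()
        for cls in classes:
            for a, b in combinations(sorted(cls), 2):
                pairs.add((a, b))
        return pairs

    words = sorted(set().union(*system_classes, *expert_classes))
    all_pairs = set(combinations(words, 2))
    sys_pairs = together(system_classes)
    exp_pairs = together(expert_classes)
    return {
        "yes_yes": len(sys_pairs.intersection(exp_pairs)),
        "yes_no": len(sys_pairs.difference(exp_pairs)),
        "no_yes": len(exp_pairs.difference(sys_pairs)),
        "no_no": len(all_pairs.difference(sys_pairs.union(exp_pairs))),
    }
```

With overlapping classes, the same pair can occur in several classes yet is counted only once here, which is exactly why the measure weakens for non-partitions.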
The technique proposed by Hatzivassiloglou and McKeown does not do a good job of evaluating either of these.</Paragraph> <Paragraph position="5"> In this paper, we present an evaluation methodology which makes it possible to properly evaluate overlapping classes. Our scheme is also capable of incorporating hierarchies provided by an expert into the evaluation, but it still lacks the ability to compare hierarchies against hierarchies.</Paragraph> <Paragraph position="6"> In the discussion that follows, the word &quot;clustering&quot; is used to refer to the set of classes that may be either provided by an expert or generated by the system, and the word &quot;class&quot; is used to refer to a single class in the clustering.</Paragraph> </Section> <Section position="4" start_page="284" end_page="285" type="metho"> <SectionTitle> 3 Evaluation Approach </SectionTitle> <Paragraph position="0"> As mentioned above, we intend to compare a clustering generated by a system against one provided by an expert. Since a word can occur in more than one class, it is important to find some kind of mapping between the classes generated by the system and the classes given by the expert. Such a mapping tells us which class in the system's clustering maps to which one in the expert's clustering, and an overall comparison of the clusterings is based on the comparison of the mutually mapped classes.</Paragraph> <Paragraph position="1"> Before we delve deeper into the evaluation process, we must decide on some measure of &quot;closeness&quot; between a pair of classes. We have adopted the F-measure (Hatzivassiloglou and McKeown, 1993; Chinchor, 1992). In our computation of the F-measure, we construct a contingency table based on the presence or absence of individual elements in the two classes being compared, as opposed to basing it on pairs of words. For example, suppose that Class A is generated by the system and Class B is provided by an expert (as shown in Table 1). 
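The element-based computation can be sketched as follows; this is a minimal sketch, not the paper's code, and the example class contents are hypothetical (Table 1 is not reproduced here). The YES-YES cell counts elements present in both classes, YES-NO those only in the system class, and NO-YES those only in the expert class.

```python
def class_f_measure(system_class, expert_class):
    """Element-based F-measure between one system class and one expert class.

    The contingency table is built from individual elements, not word
    pairs, so overlapping classes pose no double-counting problem.
    """
    a, b = set(system_class), set(expert_class)
    yes_yes = len(a.intersection(b))   # elements the two classes share
    if yes_yes == 0:
        return 0.0
    precision = yes_yes / len(a)       # fraction of system elements confirmed
    recall = yes_yes / len(b)          # fraction of expert elements recovered
    return 2 * precision * recall / (precision + recall)
```

For instance, comparing a hypothetical system class {car, truck, bus, bicycle} against an expert class {car, truck, train} gives precision 1/2, recall 2/3, and F-measure 4/7.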
The contingency table obtained for this pair of classes is shown in Table 2.</Paragraph> <Paragraph position="2"> The three main steps in the evaluation process are the acquisition of &quot;correct&quot; classes from domain experts, mapping the experts' clustering to that generated by the system, and generating an overall measure that represents the system's performance when compared against the expert.</Paragraph> <Section position="1" start_page="284" end_page="284" type="sub_section"> <SectionTitle> 3.1 Knowledge Acquisition from Experts </SectionTitle> <Paragraph position="0"> The objective of this step is to get human experts to undertake the same task that the system performs, i.e., classifying a set of words into several potentially overlapping classes. The classes produced by a system are later compared to these &quot;correct&quot; classifications provided by the expert.</Paragraph> </Section> <Section position="2" start_page="284" end_page="285" type="sub_section"> <SectionTitle> 3.2 Mapping Algorithm </SectionTitle> <Paragraph position="0"> In order to determine pairwise mappings between the clustering generated by the system and one provided by an expert, a table of F-measures is constructed, with a row for each class generated by the system, and a column for every class provided by the expert. Note that since the expert actually provides a hierarchy, there is one column corresponding to every individual class and subclass provided by the expert. This allows the system's classes to map to a class at any level in the expert's hierarchy. This table gives an estimate of how well each class generated by the system maps to the ones provided by the expert.</Paragraph> <Paragraph position="1"> The algorithm used to compute the actual mappings from the F-measure table is briefly described here. In each row of the table, mark the cell with the highest F-measure as a potential mapping. 
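One possible realization of the mapping algorithm just described is sketched below. It assumes the F-measure table has already been computed (here a nested dict, a representation of our own choosing); the order in which conflicted columns are visited is also our own choice, as the text does not specify it. Cells at or below the threshold of 0.20 are discarded up front, so a class with no candidate above the threshold is left unmapped.

```python
def compute_mappings(f_table, threshold=0.20):
    """Map system classes (rows) to expert classes (columns).

    f_table: {sys_id: {exp_id: F-measure}}. Returns {sys_id: exp_id}.
    Each system class is tentatively mapped to its highest-F expert
    class; when several system classes claim the same expert class, the
    one whose move to its next-best candidate loses the least F-measure
    is re-mapped, iterating until no conflicts remain.
    """
    # Ranked candidate lists per system class, best first, above threshold.
    prefs = {}
    for s, row in f_table.items():
        ranked = sorted(row.items(), key=lambda kv: kv[1], reverse=True)
        prefs[s] = [(e, f) for e, f in ranked if f > threshold]

    # pos[s] indexes into prefs[s]; 0 means the best candidate.
    pos = {s: 0 for s in prefs if prefs[s]}
    while True:
        claims = {}
        for s, i in pos.items():
            claims.setdefault(prefs[s][i][0], []).append(s)
        conflicts = [ss for ss in claims.values() if len(ss) > 1]
        if not conflicts:
            break

        def loss(s):
            # F-measure sacrificed by moving s to its next-best candidate.
            i = pos[s]
            nxt = prefs[s][i + 1][1] if i + 1 != len(prefs[s]) else 0.0
            return prefs[s][i][1] - nxt

        victim = min(conflicts[0], key=loss)
        if pos[victim] + 1 == len(prefs[victim]):
            del pos[victim]            # no candidate left; leave unmapped
        else:
            pos[victim] += 1
    return {s: prefs[s][i][0] for s, i in pos.items()}
```

Each iteration either advances a class down its preference list or removes it, so the loop terminates.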
In general, conflicts arise when more than one class generated by the system maps to a given class provided by the expert. In other words, whenever a column in the table has more than one cell marked as a potential mapping, a conflict is said to exist. To resolve a conflict, one of the conflicting system classes must be re-mapped. The heuristic used here is that the class for which such a re-mapping results in the minimal loss of F-measure is the one that is re-mapped.</Paragraph> <Paragraph position="2"> Several such conflicts may exist, and re-mapping may lead to further conflicts. The mapping algorithm iteratively searches for conflicts and resolves them until no more conflicts exist. Note also that a system class may map to an expert class only if the F-measure between them exceeds a certain threshold value. This ensures that a certain degree of similarity must exist between two classes for them to map to each other. We have used a threshold value of 0.20, chosen empirically from observations of the F-measures between pairs of classes with varying degrees of similarity.</Paragraph> </Section> <Section position="3" start_page="285" end_page="285" type="sub_section"> <SectionTitle> 3.3 Computation of the Overall F-measure </SectionTitle> <Paragraph position="0"> Once the mappings have been determined between the clusterings of the system and the expert, the next step is to compute the F-measure between the two clusterings. Rather than populating separate contingency tables for every pair of classes, construct a single contingency table. 
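The pooled computation can be sketched as follows; this is a hedged sketch with data structures of our own choosing, not the paper's implementation. Mapped class pairs contribute to all three populated cells, while elements of unmapped system classes fall into YES-NO and elements of unmapped expert classes into NO-YES.

```python
def overall_f_measure(system, expert, mappings):
    """Overall F-measure between two clusterings via one pooled table.

    system, expert: {class_id: set of words}; mappings: {sys_id: exp_id}
    as produced by the mapping step.
    """
    yes_yes = yes_no = no_yes = 0
    for s, e in mappings.items():
        common = system[s].intersection(expert[e])
        yes_yes += len(common)
        yes_no += len(system[s]) - len(common)   # system-only elements
        no_yes += len(expert[e]) - len(common)   # expert-only elements
    # Unmapped classes contribute all of their elements.
    mapped_exp = set(mappings.values())
    for s in system:
        if s not in mappings:
            yes_no += len(system[s])
    for e in expert:
        if e not in mapped_exp:
            no_yes += len(expert[e])
    if yes_yes == 0:
        return 0.0
    precision = yes_yes / (yes_yes + yes_no)
    recall = yes_yes / (yes_yes + no_yes)
    return 2 * precision * recall / (precision + recall)
```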
For every pairwise mapping found for the classes in these two clusterings, populate the YES-YES, YES-NO, and NO-YES cells of the contingency table appropriately (see Table 2).</Paragraph> <Paragraph position="1"> Once all the mapped classes have been incorporated into this contingency table, add every element of all unmapped classes generated by the system to the YES-NO cell and every element of all unmapped classes provided by the expert to the NO-YES cell of this table. Once all classes in the two clusterings have been accounted for, calculate the precision, recall, and F-measure as explained in (Hatzivassiloglou and McKeown, 1993).</Paragraph> </Section> </Section> </Paper>