<?xml version="1.0" standalone="yes"?> <Paper uid="P95-1038"> <Title>Evaluation of Semantic Clusters</Title> <Section position="5" start_page="285" end_page="285" type="evalu"> <SectionTitle> 4 Results and Discussion </SectionTitle> <Paragraph position="0"> In one of our experiments, the 400 most frequent nouns in the Merck Veterinary Manual were clustered, and three experts evaluated the generated noun clusters. Examples of the classes that the system generated for the veterinary medicine domain include PROBLEM, TREATMENT, ORGAN, DIET, ANIMAL, MEASUREMENT, and PROCESS. The results obtained by comparing these noun classes to the clusterings provided by the three experts are shown in Table 3. We have also experimented with the use of WordNet to improve the classes obtained by a distributional technique. Initial experiments have shown that WordNet consistently improves the F-measures for these noun classes, by about 0.05 on average. Details of these experiments can be found in (Agarwal, 1995).</Paragraph> <Paragraph position="1"> We believe that the evaluation scheme presented in this paper is useful for comparing different clusterings produced by the same system, or those produced by different systems, against one provided by an expert. The resulting precision, recall, and F-measure should not be treated as a kind of &quot;gold standard&quot; that represents the quality of these classes in some absolute sense. 
In our experience, because semantic clustering is a highly subjective task, evaluating a given clustering against different experts may yield numbers that vary considerably.</Paragraph> <Paragraph position="2"> However, when different clusterings generated by a system are compared against the same expert (or the same set of experts), such relative comparisons are useful.</Paragraph> <Paragraph position="3"> The evaluation scheme presented here still suffers from one major limitation: it cannot evaluate a hierarchy generated by a system against one provided by an expert. Such evaluations are complicated by the restriction to a one-to-one mapping. More work is needed in this area.</Paragraph> </Section> </Paper>