<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-1003">
  <Title>Language and Computation</Title>
  <Section position="7" start_page="22" end_page="24" type="evalu">
    <SectionTitle>
6 Evaluation
</SectionTitle>
    <Paragraph position="0"> For a qualitative idea of the behavior of our classifier, the best attributes for some concepts are listed in Appendix A. We concentrate here on quantitative analyses.</Paragraph>
    <Section position="1" start_page="22" end_page="23" type="sub_section">
      <SectionTitle>
6.1 Classifier Evaluation 1: Cross-Validation
</SectionTitle>
      <Paragraph position="0"> Our two classifiers were evaluated, first of all, using 10-fold cross-validation. The 2-way classifier correctly classified 81.82% of the candidate attributes (the baseline accuracy is 80.61%). The 5-way classifier correctly classified 80.35% of the attributes (the baseline accuracy is 23.55%). The precision / recall results are shown in Table 5.</Paragraph>
      <Paragraph position="1">  As it can be seen from Table 5, both classifiers achieve good F values for all classes except for the non-attribute class: F-measures range from 81% to 95%. With the 2-way classifier, the valid attribute class has an F-measure of 89.2%. With the 5-way classifier, related-agent is the most accurate class (F = 95%) followed by part &amp; related-object, activity, and quality (86.2%, 84.9%, and 81.0%,  respectively). With non-attribute, however, we find an F of 41.7% in the 2-way classification, and 53.8% in the 5-way classification. This suggests that the best strategy for lexicon building would be to use these classifiers to 'find' attributes rather than 'filter' non-attributes.</Paragraph>
    </Section>
    <Section position="2" start_page="23" end_page="23" type="sub_section">
      <SectionTitle>
6.2 Classifier Evaluation 2: Human Judges
</SectionTitle>
      <Paragraph position="0"> Next, we evaluated the accuracy of the attribute classifiers against two human judges (the authors).</Paragraph>
      <Paragraph position="1"> We randomly selected a concept from each of the 21 classes in the balanced dataset. Next, we used the classifiers to classify the 20 best candidate attributes of each concept, as determined by their t-test scores. Then, the judges decided if the assigned classes are correct or not. For the 5-way classifier, the judges also assigned the correct class if the automatic assigned class is incorrect.</Paragraph>
      <Paragraph position="2"> After a preliminary examination we decided not to consider two troublesome concepts: constructor and future. The reason for eliminating constructor is that we discovered it is ambiguous: in addition to the sense of 'a person who builds things', we discovered that constructor is used widely in the Web as a name for a fundamental method in object oriented programming languages such as Java.</Paragraph>
      <Paragraph position="3"> Most of the best candidate attributes (e.g., call, arguments, code, and version) related to the latter sense, that doesn't exist in WordNet. Our system is currently not able to do word sense discrimination, but we are currently working on this issue. The reason for ignoring the concept future was that this word is most commonly used as a modifier in phrases such as: &amp;quot;the car of the future&amp;quot;, and &amp;quot;the office of the future&amp;quot;, and that all of the best candidate attributes occurred in this type of construction. This reduced the number of evaluated concepts to 19.</Paragraph>
      <Paragraph position="4"> According to the judges, the 2-way classifier was on average able to correctly assign attribute classes for 82.57% of the candidate attributes. This is very close to its performance in evaluation 1.</Paragraph>
      <Paragraph position="5"> The results using the F-measure reveal similar results too. Table 6 shows the results of the two classifiers based on the precision and recall measures. According to the judges, the 5-way classifier correctly classified 68.72% on average. This performance is good but not as good as its performance in evaluation 1 (80.35%). The decrease in the performance was also shown in the F-measure.</Paragraph>
      <Paragraph position="6"> The F-measure ranges from 0.712 to 0.839 excluding the non-attribute class.</Paragraph>
      <Paragraph position="7">  for the two classifiers An important question when using human judges is the degree of agreement among them.</Paragraph>
      <Paragraph position="8"> The K-statistic was used to measure this agreement. The values of K are shown in Table 7. In the 2-way classification, the judges agreed on 89.84% of the cases. On the other hand, the K-statistic for this classification task is 0.452. This indicates that part of this strong agreement is because that the majority of the candidate attributes are valid attributes. It also shows the difficulty of identifying non-attributes even for human judges. In the 5-way classification, the two judges have a high level of agreement; Kappa statistic is 0.749. The judges and the 5-way classifier agreed on 63.71% of the cases.</Paragraph>
    </Section>
    <Section position="3" start_page="23" end_page="24" type="sub_section">
      <SectionTitle>
6.3 Re-Clustering the Balanced Dataset
</SectionTitle>
      <Paragraph position="0"> Finally, we looked at whether using the classifiers results in a better lexical description for the purposes of clustering (Almuhareb and Poesio, 2004).</Paragraph>
      <Paragraph position="1"> In Table 8 we show the results obtained using the output of the 2-way classifier to re-cluster the 402 concepts of our balanced dataset, comparing these results with those obtained using all attributes (first column) and all attributes that remain after frequency cutoff and POS filtering (column 2). The results are based on the CLUTO evaluation meas- null ures: Purity (which measures the degree of cohesion of the clusters obtained) and Entropy. The purity and entropy formulas are shown in Table 9.</Paragraph>
      <Paragraph position="2">  different sets of attributes Clustering the concepts using only filtered candidate attributes improved the clustering purity from 0.657 to 0.672. This improvement in purity is not significant. However, clustering using only the attributes sanctioned by the 2-way classifier improved the purity further to 0.693, and this improvement in purity from the initial purity was  is the number of concepts from the ith class that were assigned to the rth cluster, n is the number of concepts, and k is the number of clusters.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML