<?xml version="1.0" standalone="yes"?>
<Paper uid="N03-1013">
  <Title>A Categorial Variation Database for English</Title>
  <Section position="6" start_page="0" end_page="0" type="evalu">
    <SectionTitle>
5 Evaluation
</SectionTitle>
    <Paragraph position="0"> This section includes two evaluations concerned with different aspects of the CatVar database. The first evaluation calculates the recall and precision of CatVar's clustering and the second determines the contribution of CatVar over Porter stemmer.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.1 CatVar Clustering Evaluation: Recall and
Precision
</SectionTitle>
      <Paragraph position="0"> To determine the recall and precision of CatVar given the lack of a gold standard, we asked 8 native speakers to evaluate 400 randomly-selected clusters. Each annotator was given a set of 100 clusters (with two annotators per set). Figure 3 shows a segment of the evaluation interface  The annotators were given detailed instructions and many examples to help them with the task. They were asked to classify each word in every cluster as belonging to one of the following categories:  and couldn't be found it in a dictionary.</Paragraph>
      <Paragraph position="1"> The interface also provided an input text box to add missing words to a cluster.</Paragraph>
      <Paragraph position="2"> In calculating the inter-annotator agreement, we did not consider mismatches in word additions as disagreement since some annotators could not think up as many possible variations as others. After all, this was not an evaluation of their ability to think up variations, but rather of the coverage of the CatVar database. The inter-annotator agreement was calculated as the percentage of words where both annotators agreed out of all words. Even though there were six fine-grained classifications, the average inter-annotator agreement was high (80.75%). Many of the disagreements, however, resulted from the fine-grainedness of the options available to the annotators.</Paragraph>
      <Paragraph position="3"> In a second calculation of inter-annotator agreement, we simplified the annotators' choices by placing them into three groups corresponding to Perfect (Perfect and Perfect-but), Not-sure (Not-sure and May-not-be-a-realword) and Wrong (Does-not-belong). This annotationgrouping approach is comparable to the clustering techniques used by (Veronis, 1998) to &amp;quot;super-tag&amp;quot; fine grained annotations. After grouping the annotations, average inter-annotator agreement rose up to 98.35%.</Paragraph>
      <Paragraph position="4"> The cluster modifications produced by each pair of annotators assigned to the same cluster were then combined automatically in an approximation to post-annotation inter-annotator discussion, which traditionally results in agreement: (1) If both annotators agreed on a category, then it stands; (2) One annotator overrides another in cases where one is more sure than the other (i.e., Perfect overrides Perfect-but-with-error/Not-sureand Wrong overrides Not-sure); (3) In cases where one annotator considers a word Perfect while the other annotator considered it Wrong, we compromise at Not-sure. The union of all added words was included in the combined cluster.</Paragraph>
      <Paragraph position="5"> The 400 combined clusters covered 808 words. 68% of the words were ranked as Perfect. None had spelling errors and only one word had a part-of-speech issue. 23 words (less than 3%) were marked as Not-sures. And only 6 words (less than 1%) were marked as Wrong.</Paragraph>
      <Paragraph position="6"> There were 209 added words (about 26%). However 128 words (or 61% of missing words) were not actually missing, but rather not linked into the set of clusters evaluated by a particular annotator. Some of these words were clustered separately in the database.8 The rest of the missing words (81 words or 10% of all words) were not present in the database, but 50 of them (or 62%) were linkable to existing words in the CatVar using simple stemming (e.g., 8The 128 words that were &amp;quot;not really missing&amp;quot; were clustered in 89 other clusters not included in the evaluation sample. the Porter stemmer, whose relevance is described next).</Paragraph>
      <Paragraph position="7"> The precision was calculated as the ratio of perfect words to all original (i.e. not added) words: 91.82%. The recall was calculated as the ratio of perfect words divided by all perfect plus all added words: 72.46%. However, if we exclude the not-really missing words, the adjusted recall value becomes 87.16%. The harmonic mean or Fscore9 of the precision and recall is 81.00% (or 89.43% for adjusted recall).</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.2 Linkability Evaluation: Comparison to Porter
Stemmer
</SectionTitle>
      <Paragraph position="0"> To measure the contribution of CatVar with respect to the &amp;quot;linking together&amp;quot; of related words, it is important to define the concept of linkability as the percentage of word-to-word links in the database resulting from a specific source. For example, Natural linkability refers to pairs of words whose form doesn't change across categories such as zipa0 and zipa1 or afghana1 and afghana2a4a3 . Porter linkability refers to words linkable by reduction to a common Porter stem. CatVar linkability is the linkability of two words appearing in the same CatVar cluster.</Paragraph>
      <Paragraph position="1"> Figure 4 shows an example of all three types of links in the hunger cluster. Here, hungera1 and hungera0 are linked in three ways, Naturally (N), by the Porter stemmer (P), and in CatVar (C). Porter links hungrya2a4a3 and hungrinessa1 via the common stem hungri but Porter doesn't link either of these to hungera1 or hungera0 (stem hunger). The total number of links in this cluster is six, two of which are Porter-determinable and only one of which is naturally-determinable.</Paragraph>
      <Paragraph position="2">  The calculation of linkability applies only to the portion of the database containing multi-word clusters (about half of the database) since single-word clusters have zero links. The 48,867 linked words are distributed over 14,731 clusters with 89,638 total number of links. About 12% of these links are naturally-determinable and 70% are Porter-linkable. The last 30% of the links is a significant contribution of the CatVar database, compared to the Porter stemmer, particularly since this stemmer is an industry standard in the IR community.10  It is important to point out that, for CatVar to be used in IR, it must be accompanied by an inflectional analyzer that reduces words to their lexeme form (removing plural endings from nouns or gerund ending from verbs).11 The contribution of CatVar is in its linking of words related derivationally not inflectionally. Work by (Krovetz, 1993) demonstrates an improved performance with derivational stemming over the Porter stemmer most of the time.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>