<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-1909">
  <Title>Mining Linguistically Interpreted Texts</Title>
  <Section position="6" start_page="0" end_page="0" type="evalu">
    <SectionTitle>
5 Results
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.1 Text Categorization
</SectionTitle>
      <Paragraph position="0"> Table 1 shows the results for text categorization of PD1, given as average error rates over the three versions of the corpus (V1, V2 and V3). The error rate for the categorization task was around 20%. The results vary only slightly with the size of the vectors; the best results were obtained with 150 terms.</Paragraph>
      <Paragraph position="1">  grammatical combinations in PD2, while Figure 1 summarizes the lowest error rates found for PD1 and all groups of PD2. The group nouns and adjectives presents the lowest error rate of all experiments (18.01%). However, due to the small size of the corpus, the improvement from the usual methods (20.47%) to nouns-adjectives (18.01%), when considering the same number of terms (90), is significant only at the 75-80% confidence level (t-test).</Paragraph>
      <Paragraph position="2"> In general, the results show that the presence of nouns is crucial: the worst classification errors come from groups that do not contain the category nouns, and here the confidence level for the differences reported reaches 95%. The groups containing nouns give results comparable to those of the experiments based on the usual pre-processing methods. The use of verbs, either alone or with other grammatical groups, is not an interesting option.</Paragraph>
      <Paragraph position="3">  The best results are usually obtained when the documents are represented by a larger number of terms (90, 120 and 150). For the group nouns, however, the best results were obtained with vectors containing just 60 terms.</Paragraph>
      <Paragraph position="4">  We looked at the terms resulting from the different selection methods and categories to check the overlap among the groups. Between PD1 and PD2 based on nouns and adjectives (the group with the best results), around 50% of the terms differ. That means that 50% of the terms in PD1 belong to the categories nouns and adjectives, while the other 50% of the terms selected on the basis of stop-words and stemming come from other grammatical categories. As adding adjectives to nouns improved the results, we examined the adjectives to assess their significance. We found terms such as Brazilian, electoral, multimedia and political. Intuitively, these terms seem relevant to the classes we had. Analysing the groups containing verbs, we observed that the verbs are usually very common or auxiliary verbs (such as to be, to have, to say), and therefore not relevant for classification.</Paragraph>
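      <Paragraph> The overlap check described above can be sketched as a simple set comparison. This is only an illustration: the function name and the two term lists are hypothetical, and the sketch assumes that the terms of both representations have already been brought to a comparable form (the PD1 terms are stems, so some mapping would be needed in practice).

```python
def share_of_different_terms(base_terms, other_terms):
    """Fraction of base_terms that do not appear in other_terms."""
    base = set(base_terms)
    if not base:
        return 0.0
    return len(base - set(other_terms)) / len(base)


# Hypothetical term lists, not taken from the corpus.
pd1_terms = ["brazilian", "electoral", "political", "say", "have", "be"]
pd2_terms = ["brazilian", "electoral", "political", "multimedia", "team", "market"]

# Three of the six PD1 terms are missing from the PD2 list.
print(share_of_different_terms(pd1_terms, pd2_terms))  # prints 0.5
```

A value near 0.5 would correspond to the situation reported here, where about half of the PD1 terms fall outside the nouns and adjectives group.</Paragraph>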
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.2 Text Clustering
</SectionTitle>
      <Paragraph position="0"> We tested our hypothesis through clustering experiments for PD1 and variations of PD2. For the experiments on clustering we used vectors containing 150 features from V2 and we set k to 5 groups. The resulting confusion matrix for PD1 is presented in Table 3.</Paragraph>
      <Paragraph position="1">  Considering the largest group in each row and column (highlighted in the table) as the intended cluster for each class, the corresponding precision is 50.52%.</Paragraph>
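      <Paragraph> One way to read this precision figure is as the share of documents that fall in the matched cells of the confusion matrix. The sketch below illustrates that reading under stated assumptions: the matching is done greedily, largest cell first, with each class and each cluster used at most once, and the matrix values are hypothetical rather than those of Table 3. The function name is also an illustrative choice.

```python
def matched_precision(confusion):
    """Share of documents in the cells chosen as class/cluster matches.

    Cells are picked greedily, largest first, so that every class (row)
    and every cluster (column) is used at most once. This matching rule
    is an assumption made for the sketch.
    """
    cells = sorted(
        ((count, r, c)
         for r, row in enumerate(confusion)
         for c, count in enumerate(row)),
        reverse=True,
    )
    used_rows, used_cols, matched = set(), set(), 0
    for count, r, c in cells:
        if r not in used_rows and c not in used_cols:
            used_rows.add(r)
            used_cols.add(c)
            matched += count
    total = sum(sum(row) for row in confusion)
    return matched / total


# Hypothetical 3x3 confusion matrix (rows: classes, columns: clusters).
toy = [
    [6, 2, 2],
    [1, 7, 2],
    [3, 3, 4],
]

# Matched cells are 7, 6 and 4, out of 30 documents in total.
print(f"{100 * matched_precision(toy):.2f}%")  # prints 56.67%
```

With five classes and k set to 5, the same computation would apply to a 5x5 matrix.</Paragraph>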
      <Paragraph position="2"> We repeated the same set of experiments for PD2. We tested several grammatical groups; the best result was obtained with nouns and proper names. The results are shown in Table 4. The corresponding precision is 63.15%.</Paragraph>
    </Section>
  </Section>
</Paper>