File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/evalu/04/p04-3034_evalu.xml

Size: 4,333 bytes

Last Modified: 2025-10-06 13:59:15

<?xml version="1.0" standalone="yes"?>
<Paper uid="P04-3034">
  <Title>Fragments and Text Categorization</Title>
  <Section position="6" start_page="0" end_page="91" type="evalu">
    <SectionTitle>
5 Results
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.1 General
</SectionTitle>
      <Paragraph position="0"> We observed that for both skip-tail and fragments there is always a fragment size for which accuracy consistently increased. This is the most important result. More details can be found in the next two paragraphs.</Paragraph>
      <Paragraph position="1"> Among the learning algorithms, Naïve Bayes achieved the highest accuracy for all three languages. This is surprising because for full versions of documents it was the SMO algorithm that was slightly better than Naïve Bayes in terms of accuracy. On the other hand, the highest impact was observed for J48. For instance, for Czech with fragments, accuracy was higher in 14 out of 15 tasks when J48 was used, and in 12 out of 15 in the case of Naïve Bayes and the Support Vector Machines. However, the performance of J48 was far inferior to that of the other algorithms. In only three tasks J48</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.2 skip-tail
</SectionTitle>
      <Paragraph position="0"> The skip-tail method was successful for all three languages (see Table 2). It increases accuracy even for a very small initial fragment. Figure 1 shows results for skip-tail with initial fragments of length from 40% up to 100% of the average length of documents in the learning set.</Paragraph>
      <Paragraph position="1"> n NB stail lngth incr</Paragraph>
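      <Paragraph position="2"> The skip-tail truncation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names (split_sentences, average_length, skip_tail), the naive punctuation-based sentence splitter, and the rounding of the fragment length are all assumptions made for this example. The sketch measures document length in sentences and keeps only the initial fragment whose length is a given ratio of the average document length in the learning set.

```python
import re


def split_sentences(text):
    """Naive sentence splitter (assumption): a sentence ends at '.', '!' or '?'."""
    return [s.strip() for s in re.findall(r"[^.!?]+[.!?]*", text) if s.strip()]


def average_length(training_docs):
    """Average number of sentences per document in the learning set."""
    counts = [len(split_sentences(doc)) for doc in training_docs]
    return sum(counts) / len(counts)


def skip_tail(text, avg_len, ratio):
    """Keep the initial fragment of about ratio * avg_len sentences, drop the tail."""
    keep = max(1, round(ratio * avg_len))  # rounding rule is an assumption
    return " ".join(split_sentences(text)[:keep])


# Hypothetical toy learning set: documents of 4 and 6 sentences (average 5).
training = ["A. B. C. D.", "A. B. C. D. E. F."]
avg = average_length(training)
short = skip_tail("One. Two. Three. Four.", avg, 0.4)
full = skip_tail("One. Two. Three. Four.", avg, 1.0)
```

      In this toy setting the learning set averages five sentences per document, so a ratio of 0.4 keeps the first two sentences of a document, while a ratio of 1.0 keeps up to five.</Paragraph>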
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Naïve Bayes (n=number of classification tasks,
</SectionTitle>
      <Paragraph position="0"> NB=average of error rates for full documents, stail=average of error rates for skip-tail, lngth=optimal length of the fragment, incr=number of tasks with an increase in accuracy; + and ++ denote significance at the 95% and 99% levels, respectively, by the sign test.) For example, for English, taking only the first 40% of sentences in a document results in slightly increased accuracy. Figure 2 displays the relative increase in accuracy for fragments of length up to 40 sentences for different learning algorithms for English. It is important to stress that even for an initial fragment of 5 sentences, the accuracy is the same as for full documents. When the initial fragment is longer, the classification accuracy increases further up to a length of 12 sentences.</Paragraph>
      <Paragraph position="1"> We observed similar behaviour for skip-tail when applied to the other languages, and also for the fragments method.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="91" type="sub_section">
      <SectionTitle>
5.3 fragments
</SectionTitle>
      <Paragraph position="0"> This method was successful for classifying English and Czech documents (significant at the 99% level for English and at the 95% level for Czech). In the case of French cooking recipes, a small but not significant impact was also observed. This may have been caused by the special format of recipes.</Paragraph>
      <Paragraph position="1"> n NB frag lngth incr</Paragraph>
    </Section>
    <Section position="5" start_page="91" end_page="91" type="sub_section">
      <SectionTitle>
5.4 Optimal length of fragments
</SectionTitle>
      <Paragraph position="0"> We also looked for the optimal length of fragments.</Paragraph>
      <Paragraph position="1"> We found that for fragment lengths in a range around the average document length (in the learning set), the accuracy increased for a significant number of the data sets (sign test, 95%). This holds for skip-tail for all languages, and for English and Czech in the case of fragments.</Paragraph>
      <Paragraph position="2"> However, an increase in accuracy is observed even for 60% of the average length (see Fig. 1). Moreover, for the average length this increase is significant for Czech at the 95% level (t-test).</Paragraph>
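      <Paragraph position="3"> The sign test behind such significance statements can be sketched as follows. This is an illustration only: the function name sign_test_p_value is hypothetical, and the sketch assumes a one-sided test with no tied tasks, which the text does not specify. As sample input it uses the counts reported in Section 5.1 for Czech with fragments, where accuracy was higher in 14 out of 15 tasks for J48 and in 12 out of 15 for the other algorithms.

```python
from math import comb


def sign_test_p_value(wins, n):
    """One-sided sign test: probability of at least `wins` improvements
    among n tasks if improvement and decline were equally likely."""
    tail = sum(comb(n, k) for k in range(wins, n + 1))
    return tail / 2 ** n


# Counts taken from Section 5.1 (Czech, fragments); ties assumed absent.
p_j48 = sign_test_p_value(14, 15)    # 16 / 32768
p_other = sign_test_p_value(12, 15)  # 576 / 32768
```

      Under these assumptions, 14 improvements out of 15 give a p-value of about 0.0005 and 12 out of 15 give about 0.018, so both fall below 0.05 and the first also falls below 0.01.</Paragraph>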
    </Section>
  </Section>
</Paper>