<?xml version="1.0" standalone="yes"?>
<Paper uid="P04-3034">
  <Title>Fragments and Text Categorization</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Classification by means of fragments of documents
</SectionTitle>
    <Paragraph position="0"> The class of the whole document is determined as follows. Let us take a document d which consists of fragments f_1, ..., f_n, such that d = f_1 U ... U f_n; the number of fragments n depends on the length of the document d and on the number of sentences in the fragments. Let F = {f_1, ..., f_n}, and let C denote the set of possible classes. We then use the learned model to assign a class c(f) in C to each of the fragments f in F. Let conf(f, c(f)) be the confidence of the classification of fragment f into the class c(f). This confidence measure is computed as an estimated probability of the predicted class. Then for each fragment f in F classified into the class c in C we define F(c) = {f in F | c(f) = c}. The confidence of the classification of the whole document d into c is computed as follows:</Paragraph>
    <Paragraph position="2"> Finally, the class c(d) which is assigned to a document d is computed according to the following definition:</Paragraph>
    <Paragraph position="4"> In other words, a document d is classified into the class c in C that was assigned to the most fragments from F (the most frequent class). If there are two classes with the same cardinality, the confidence measure conf(F, c) is employed. We also tested another method that exploited the confidence of the classification, but the results were not satisfactory.</Paragraph>
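The voting scheme above (majority class over fragments, with summed confidence as a tie-breaker) can be sketched as follows. This is a minimal illustrative sketch: `classify` and `confidence` are hypothetical stand-ins for the learned per-fragment model and its probability estimate, not interfaces from the paper.

```python
from collections import Counter

def classify_document(fragments, classify, confidence):
    """Assign a class to a whole document from its fragments.

    classify(f) -> predicted class c(f) of fragment f
    confidence(f, c) -> estimated probability that f belongs to c
    (both are hypothetical stand-ins for the learned model).
    """
    predictions = [classify(f) for f in fragments]
    votes = Counter(predictions)                 # |F(c)| for each class c
    top = max(votes.values())
    candidates = [c for c, n in votes.items() if n == top]
    if len(candidates) == 1:
        return candidates[0]                     # unambiguous majority class

    # Tie: fall back on the summed confidence conf(F, c) of each tied class.
    def total_conf(c):
        return sum(confidence(f, c)
                   for f, p in zip(fragments, predictions) if p == c)
    return max(candidates, key=total_conf)
```

A document with fragment votes {X: 2, Y: 1} is labelled X outright; a 1-1 tie is resolved by whichever class has the larger summed fragment confidence.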
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Experiments
</SectionTitle>
    <Paragraph position="0"> For feature (i.e. significant word) selection, we tested four methods (Forman, 2002; Yang and Liu, 1999): Chi-Squared (chi), Information Gain (ig), F1-measure (f1), and Probability Ratio (pr). Eventually, we chose ig because it yielded the best results. We utilized three learning algorithms from the Weka system: the decision tree learner J48, Naive Bayes, and the SVM learner Sequential Minimal Optimization (SMO). All the algorithms were used with their default settings. The entire documents were split into fragments containing 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 20, 25, 30, and 40 sentences. For the skip-tail classification, which uses only the beginnings of documents, we also employed these values.</Paragraph>
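The splitting step can be sketched as follows. This is an illustrative sketch assuming sentence segmentation has already been done; the paper does not specify how a trailing fragment shorter than k sentences is handled, so here it is simply kept.

```python
def split_into_fragments(sentences, k):
    """Split a document (a list of sentences) into consecutive
    fragments of k sentences each; the last fragment may be shorter
    (an assumption -- the paper does not state how the tail is handled)."""
    return [sentences[i:i + k] for i in range(0, len(sentences), k)]

def skip_tail(sentences, k):
    """Keep only the first k sentences (the document beginning),
    as in the skip-tail classification variant."""
    return sentences[:k]
```

For example, a 7-sentence document split with k = 3 yields two full fragments and one 1-sentence tail, while skip-tail keeps only the first 3 sentences.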
    <Paragraph position="1"> As the evaluation criterion we used accuracy, defined as the percentage of correctly classified documents in the test set. All results were obtained by 10-fold cross-validation.</Paragraph>
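The evaluation protocol can be sketched as follows. This is a minimal sketch: `train` and `predict` are hypothetical placeholders for any of the learners above, and the round-robin fold assignment is an assumption, not a detail stated in the paper.

```python
def cross_validated_accuracy(examples, labels, train, predict, folds=10):
    """Estimate accuracy (percentage of correctly classified test
    documents) by k-fold cross-validation.

    train(xs, ys) -> model; predict(model, x) -> label
    (hypothetical placeholders for any learner).
    Folds are assigned round-robin -- an assumption for illustration."""
    n = len(examples)
    correct = 0
    for i in range(folds):
        test_idx = set(range(i, n, folds))        # every folds-th example
        train_x = [x for j, x in enumerate(examples) if j not in test_idx]
        train_y = [y for j, y in enumerate(labels) if j not in test_idx]
        model = train(train_x, train_y)
        correct += sum(predict(model, examples[j]) == labels[j]
                       for j in test_idx)
    return 100.0 * correct / n
```

Each example is held out exactly once, so the returned value is the percentage of correct classifications over the whole data set.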
  </Section>
</Paper>