<?xml version="1.0" standalone="yes"?> <Paper uid="P04-3034"> <Title>Fragments and Text Categorization</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 2 Data </SectionTitle> <Paragraph position="0"> We conducted experiments using English, French and Czech documents. In all cases, the task was binary document classification. The main characteristics of the data are given in Table 1. Three kinds of English documents were used. 20 Newsgroups: 202 randomly chosen documents from each class were used; the mail header was removed so that the text contained only the body of the message and, in some cases, replies. Reuters-21578, Distribution 1.0: only documents from money-fx, money-supply and trade that were classified into a single class were chosen, and all documents marked as BRIEF or UNPROC were removed. The classification tasks involved money-fx+money-supply vs. trade, money-fx vs. money-supply, money-fx vs. trade and money-supply vs. trade.</Paragraph> <Paragraph position="1"> MEDLINE data: 235 abstracts of medical papers that concerned gynecology and assisted reproduction. [Table 1 legend: docs = number of documents, ave = average number of sentences per document, sdev = standard deviation.] The French documents contained French recipes. Examples of the classification tasks are Accompagnements vs. Cremes, Cremes vs. Pates-Pains-Crepes, Desserts vs. Douceurs, Entrees vs. Plats-Chauds and Pates-Pains-Crepes vs. Sauces, among others.</Paragraph> <Paragraph position="2"> We also used both methods for classifying Czech documents. The data involved fifteen classification tasks. The articles had been taken from Czech newspapers. Six tasks concerned authorship recognition; another seven aimed to identify a document's source, either a newspaper or a particular page (or column). Topic recognition was the goal of the remaining two tasks.</Paragraph> <Paragraph position="3"> The structure of the rest of this paper is as follows. 
The method for computing the classification of a whole document from the classifications of its fragments (the fragments method) is described in Section 3.</Paragraph> <Paragraph position="4"> Experimental settings are introduced in Section 4. Section 5 presents the main results. We conclude with an overview of related work and directions for potential future research in Sections 6 and 7.</Paragraph> </Section> </Paper>