File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/04/w04-3239_intro.xml
Size: 4,465 bytes
Last Modified: 2025-10-06 14:02:51
<?xml version="1.0" standalone="yes"?> <Paper uid="W04-3239"> <Title>A Boosting Algorithm for Classification of Semi-Structured Text</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Text classification plays an important role in organizing the online texts available on the World Wide Web, Internet news, and E-mails. Until recently, a number of machine learning algorithms have been applied to this problem and have been proven successful in many domains (Sebastiani, 2002).</Paragraph> <Paragraph position="1"> In the traditional text classification tasks, one has to identify predefined text &quot;topics&quot;, such as politics, finance, sports or entertainment. For learning algorithms to identify these topics, a text is usually represented as a bag-of-words, where a text is regarded as a multi-set (i.e., a bag) of words and the word order or syntactic relations appearing in the original text is ignored. Even though the bag-of-words representation is naive and does not convey the meaning of the original text, reasonable accuracy can be obtained. This is because each word occurring in the text is highly relevant to the predefined &quot;topics&quot; to be identified.</Paragraph> <Paragraph position="2"> At present, NTT Communication Science Laboratories, 2-4, Hikaridai, Seika-cho, Soraku, Kyoto, 619-0237 Japan taku@cslab.kecl.ntt.co.jp Given that a number of successes have been reported in the field of traditional text classification, the focus of recent research has expanded from simple topic identification to more challenging tasks such as opinion/modality identification. Example includes categorization of customer E-mails and reviews by types of claims, modalities or subjectivities (Turney, 2002; Wiebe, 2000). For the latter, the traditional bag-of-words representation is not sufficient, and a richer, structural representation is required. A straightforward way to extend the traditional bag-of-words representation is to heuristically add new types of features to the original bag-of-words features, such as fixed-length n-grams (e.g., word bi-gram or tri-gram) or fixed-length syntactic relations (e.g., modifier-head relations). These ad-hoc solutions might give us reasonable performance, however, they are highly task-dependent and require careful design to create the &quot;optimal&quot; feature set for each task.</Paragraph> <Paragraph position="3"> Generally speaking, by using text processing systems, a text can be converted into a semi-structured text annotated with parts-of-speech, base-phrase information or syntactic relations. This information is useful in identifying opinions or modalities contained in the text. We think that it is more useful to propose a learning algorithm that can automatically capture relevant structural information observed in text, rather than to heuristically add this information as new features. From these points of view, this paper proposes a classification algorithm that captures sub-structures embedded in text. To simplify the problem, we first assume that a text to be classified is represented as a labeled ordered tree, which is a general data structure and a simple abstraction of text. Note that word sequence, base-phrase annotation, dependency tree and an XML document can be modeled as a labeled ordered tree.</Paragraph> <Paragraph position="4"> The algorithm proposed here has the following characteristics: i) It performs learning and classification using structural information of text. ii) It uses a set of all subtrees (bag-of-subtrees) for the feature set without any constraints. iii) Even though the size of the candidate feature set becomes quite large, it automatically selects a compact and relevant feature set based on Boosting.</Paragraph> <Paragraph position="5"> This paper is organized as follows. First, we describe the details of our Boosting algorithm in which the subtree-based decision stumps are applied as weak learners. Second, we show an implementation issue related to constructing an efficient learning algorithm. We also discuss the relation between our algorithm and SVMs (Boser et al., 1992) with tree kernel (Collins and Duffy, 2002; Kashima and Koyanagi, 2002). Two experiments on the opinion and modality classification tasks are employed to confirm that subtree features are important.</Paragraph> </Section> class="xml-element"></Paper>