<?xml version="1.0" standalone="yes"?> <Paper uid="N03-1025"> <Title>Language and Task Independent Text Categorization with Simple Language Models</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle>
<Paragraph position="0"> Text categorization concerns the problem of automatically assigning given text passages (paragraphs or documents) to predefined categories. With the rapid explosion of texts in digital form, text categorization has become an important area of research, owing to the need to automatically organize and index large text collections in various ways. Such techniques are currently being applied in many areas, including language identification, authorship attribution (Stamatatos et al., 2000), text genre classification (Kessler et al., 1997; Stamatatos et al., 2000), topic identification (Dumais et al., 1998; Lewis, 1992; McCallum, 1998; Yang, 1999), and subjective sentiment classification (Turney, 2002).</Paragraph>
<Paragraph position="1"> Many standard machine learning techniques have been applied to automated text categorization problems, such as naive-Bayes classifiers, support vector machines, linear least squares models, neural networks, and K-nearest neighbor classifiers (Yang, 1999; Sebastiani, 2002). A common aspect of these approaches is that they treat text categorization as a standard classification problem, thereby reducing the learning process to two simple steps: feature engineering, and classification learning over the feature space. Of these two steps, feature engineering is critical to achieving good performance in text categorization. Once good features are identified, almost any reasonable technique for learning a classifier seems to perform well (Scott, 1999).</Paragraph>
<Paragraph position="2"> Unfortunately, the standard classification learning methodology has several drawbacks for text categorization. First, feature construction is usually language dependent: techniques such as stop-word removal or stemming require language-specific knowledge to design adequately. Moreover, whether one can use a purely word-level approach at all is itself a language-dependent issue.</Paragraph>
<Paragraph position="3"> In many Asian languages, such as Chinese or Japanese, identifying words in character sequences is hard, and any word-based approach must suffer added complexity in coping with segmentation errors. Second, feature selection is task dependent. For example, tasks like authorship attribution or genre classification require attention to linguistic style markers (Stamatatos et al., 2000), whereas topic detection systems rely more heavily on bag-of-words features. Third, there is an enormous number of possible features to consider in text categorization problems, and standard feature selection approaches do not always cope well in such circumstances. For example, given an enormous number of features, the cumulative effect of uncommon features can still be important for classification accuracy, even though infrequent features individually contribute less information than common features. Consequently, throwing away uncommon features is usually not an appropriate strategy in this domain (Aizawa, 2001). Another problem is that feature selection normally relies on indirect tests, such as χ² or mutual information, which involve setting arbitrary thresholds and conducting a heuristic greedy search to find good feature sets. Finally, by treating text categorization as a classical classification problem, standard approaches can ignore the fact that texts are written in natural language, and therefore exhibit many implicit regularities that can be well modeled with specific tools from natural language processing.</Paragraph>
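To make the objection about indirect tests concrete, the sketch below (not from the paper; the corpus, the tokenization, and the 0.1 threshold are illustrative assumptions) scores word features by pointwise mutual information with the class label and keeps those above a hand-set cut-off:

    import math
    from collections import Counter

    def select_features_by_mi(docs, labels, threshold=0.1):
        """Keep words whose max pointwise MI with any class exceeds a
        hand-set threshold. `docs` is a list of token lists, `labels` a
        parallel list of category names; the threshold is arbitrary."""
        n = len(docs)
        df = Counter()                       # document frequency of each word
        joint = Counter()                    # (word, class) document frequency
        class_df = Counter(labels)
        for tokens, c in zip(docs, labels):
            for w in set(tokens):
                df[w] += 1
                joint[(w, c)] += 1
        selected = []
        for w in df:
            # PMI(w, c) = log( P(w, c) / (P(w) P(c)) ), estimated from counts
            score = max(
                math.log(joint[(w, c)] * n / (df[w] * class_df[c]))
                for c in class_df if joint[(w, c)] > 0
            )
            if score >= threshold:           # arbitrary cut-off
                selected.append(w)
        return selected

    # e.g. select_features_by_mi([["cheap", "pills"], ["staff", "meeting"]],
    #                            ["spam", "ham"])

Both the threshold and the max-over-classes scoring rule are exactly the kind of arbitrary decisions the passage criticizes.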
<Paragraph position="4"> In this paper, we propose a straightforward text categorization learning method based on learning category-specific, character-level, n-gram language models. Although this is a very simple approach, it has not yet been systematically investigated in the literature. We find that, surprisingly, we obtain results competitive with (and often superior to) those of more sophisticated learning and feature construction techniques, while requiring almost no feature engineering or pre-processing. In fact, the overall approach requires almost no language-specific or task-specific pre-processing to achieve effective performance.</Paragraph>
<Paragraph position="5"> The success of this simple method, we think, is due to the effectiveness of well-known statistical language modeling techniques, which have had surprisingly little impact on the learning algorithms normally applied to text categorization. Yet statistical language modeling is precisely concerned with modeling the semantic, syntactic, lexicographical, and phonological regularities of natural language, and would therefore seem to provide a natural foundation for text categorization problems. One interesting difference, however, is that instead of explicitly pre-computing features and selecting a subset based on arbitrary decisions, the language modeling approach simply considers all character (or word) subsequences occurring in the text as candidate features, and implicitly considers the contribution of every feature in the final model. Thus, the language modeling approach completely avoids a potentially error-prone feature selection process. Moreover, by applying character-level language models, one avoids the word segmentation problems that arise in many Asian languages, and thereby achieves a language independent method for constructing accurate text categorizers.</Paragraph> </Section> </Paper>
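A minimal sketch of the proposed classifier follows, under simplifying assumptions that are ours rather than the paper's: trigram character models with add-one smoothing (the paper's models would use more careful smoothing), one model trained per category, with a document assigned to the category whose model gives it the highest log-likelihood.

    import math
    from collections import Counter

    class CharNgramLM:
        """Character-level n-gram model with add-one smoothing; a deliberate
        simplification of the smoothed models the paper describes."""

        def __init__(self, n=3):
            self.n = n
            self.context_counts = Counter()
            self.ngram_counts = Counter()
            self.vocab = set()

        def train(self, text):
            padded = " " * (self.n - 1) + text
            for i in range(len(text)):
                context = padded[i:i + self.n - 1]
                char = padded[i + self.n - 1]
                self.ngram_counts[(context, char)] += 1
                self.context_counts[context] += 1
                self.vocab.add(char)

        def log_prob(self, text):
            padded = " " * (self.n - 1) + text
            v = len(self.vocab) + 1          # +1 leaves mass for unseen chars
            total = 0.0
            for i in range(len(text)):
                context = padded[i:i + self.n - 1]
                char = padded[i + self.n - 1]
                num = self.ngram_counts[(context, char)] + 1
                den = self.context_counts[context] + v
                total += math.log(num / den)
            return total

    def classify(text, models):
        # Choose the category whose model assigns the highest likelihood,
        # i.e. the lowest cross-entropy on the document.
        return max(models, key=lambda c: models[c].log_prob(text))

    # Toy usage with made-up training data:
    models = {"spam": CharNgramLM(), "ham": CharNgramLM()}
    models["spam"].train("buy cheap pills now buy now")
    models["ham"].train("the staff meeting is at noon today")
    print(classify("cheap pills", models))   # expected: spam

Because the model conditions on raw characters, the same code applies unchanged to unsegmented text such as Chinese, which is the language-independence argument made above.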