<?xml version="1.0" standalone="yes"?> <Paper uid="W03-1106"> <Title>Text Classification in Asian Languages without Word Segmentation</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Text classification addresses the problem of assigning a given passage of text (or a document) to one or more predefined classes. This is an important area of information retrieval research that has been heavily investigated, although most of the research activity has concentrated on English text (Dumais, 1998; Yang, 1999). Text classification in Asian languages such as Chinese and Japanese, however, is also an important (and more recent) area of research that introduces a number of additional difficulties. One difficulty with Chinese and Japanese text classification is that, unlike English, Chinese and Japanese texts do not mark word boundaries with explicit whitespace. This means that some form of word segmentation is normally required before further processing. However, word segmentation is itself a difficult problem in these languages. A second difficulty is the lack of standard benchmark data sets for these languages. Nevertheless, there has recently been notable progress on Chinese and Japanese text classification (Aizawa, 2001; He et al., 2001).</Paragraph> <Paragraph position="1"> Many standard machine learning techniques have been applied to text categorization problems, such as naive Bayes classifiers, support vector machines, linear least squares models, neural networks, and k-nearest neighbor classifiers (Sebastiani, 2002; Yang, 1999). Unfortunately, most current text classifiers work with word-level features. However, word identification in Asian languages, such as Chinese and Japanese, is itself a hard problem. To avoid the word segmentation problem, character-level n-gram models have been proposed (Cavnar and Trenkle, 1994; Damashek, 1995). 
There, n-grams are used as features in a traditional feature selection process, and classifiers based on feature-vector similarities are then deployed. This approach has several shortcomings. First, there is an enormous number of possible features to consider in text categorization, and standard feature selection approaches do not always cope well in such circumstances. For example, although each infrequent feature individually contributes less information than a common feature, given a sufficiently large number of features the cumulative effect of uncommon features can still have an important influence on classification accuracy. Therefore, discarding uncommon features is usually not an appropriate strategy in this domain (Aizawa, 2001). Another problem is that feature selection normally relies on indirect tests, such as χ² or mutual information, which involve setting arbitrary thresholds and conducting a heuristic greedy search to find a good subset of features. Moreover, by treating text categorization as a classical classification problem, standard approaches can ignore the fact that texts are written in natural language, which means they exhibit many implicit regularities that can be well modeled by specific tools from natural language processing.</Paragraph> <Paragraph position="2"> In this paper, we present a simple text categorization approach based on statistical n-gram language modeling that overcomes the above shortcomings in a principled fashion. One advantage we exploit is that the language modeling approach does not discard low-frequency features during classification, as is commonly done in traditional classification learning approaches. The language modeling approach also uses n-gram models to capture more contextual information than standard bag-of-words approaches, and employs better smoothing techniques than standard classification learning. 
These advantages are supported by our empirical results on Chinese and Japanese data.</Paragraph> </Section> </Paper>