<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-1106">
  <Title>Text Classification in Asian Languages without Word Segmentation</Title>
  <Section position="1" start_page="0" end_page="0" type="abstr">
    <SectionTitle>
Abstract
</SectionTitle>
    <Paragraph position="0"> We present a simple approach for Asian language text classification without word segmentation, based on statistical n-gram language modeling. In particular, we examine Chinese and Japanese text classification. With character n-gram models, our approach avoids word segmentation.</Paragraph>
    <Paragraph position="1"> However, unlike traditional ad hoc n-gram models, the statistical language modeling based approach has a strong information-theoretic basis and avoids an explicit feature selection procedure, which potentially loses a significant amount of useful information. We systematically study the key factors in language modeling and their influence on classification. Experiments on Chinese TREC and Japanese NTCIR topic detection show that the simple approach can achieve better performance compared to traditional approaches while avoiding word segmentation, which demonstrates its superiority in Asian language text classification.</Paragraph>
  </Section>
</Paper>