<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-0136">
  <Title>Sydney, July 2006. (c) 2006 Association for Computational Linguistics. N-gram Based Two-Step Algorithm for Word Segmentation</Title>
  <Section position="3" start_page="0" end_page="1" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Word segmentation has long been one of the most important problems in Chinese language processing. It is also necessary in information retrieval systems for the Korean language (Kang and Woo, 2001; Lee et al., 2002). Although Korean words are separated by white space, many web users omit the spaces when they type a query into a search engine. Automatic word segmentation is also needed to extract index terms from sentences that contain word spacing errors.</Paragraph>
    <Paragraph position="1"> The motivation of this research is to investigate a practical word segmentation system for the Korean language. While developing the system, we found that the n-gram-based algorithm was directly applicable to Chinese word segmentation, and we participated in the bakeoff (Kang and Lim, 2005). The bakeoff result is not satisfactory, but it is acceptable because our method is language independent and does not consider the characteristics of the Chinese language. We do not use any language-dependent features except the average length of Chinese words.</Paragraph>
    <Paragraph position="2"> Another advantage of our approach is that it can express ambiguous word boundaries, which are error-prone. There is therefore a good possibility of improving performance if language-dependent functionality is added, such as proper name and numeric expression recognizers and the postprocessing of single-character words.</Paragraph>
  </Section>
</Paper>