<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-1118">
  <Title>Do We Need Chinese Word Segmentation for Statistical Machine Translation?</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> In Chinese text, words consist of one or more characters and are not separated by white space, unlike in most Western languages. This is problematic for many natural language processing tasks. The usual approach is therefore to segment a Chinese character sequence into Chinese &quot;words&quot;.</Paragraph>
    <Paragraph position="1"> Many investigations have been performed concerning Chinese word segmentation. For example, (Palmer, 1997) developed a Chinese word segmenter using a manually segmented corpus. The segmentation rules were learned automatically from this corpus. (Sproat and Shih, 1990) and (Sun et al., 1998) used a method that does not rely on a dictionary or a manually segmented corpus. The characters of the unsegmented Chinese text are grouped into pairs with the highest value of mutual information. This mutual information can be learned from an unsegmented Chinese corpus.</Paragraph>
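The mutual-information grouping described above can be illustrated with a minimal Python sketch. It is not the exact procedure of the cited works: the use of pointwise mutual information over adjacent characters, the greedy merge order, the threshold parameter, and the function names train_mi and segment_by_mi are all assumptions made for illustration.

```python
import math
from collections import Counter

def train_mi(corpus):
    """Estimate pointwise mutual information for adjacent character pairs
    from an unsegmented corpus (a list of strings)."""
    unigrams, bigrams = Counter(), Counter()
    for line in corpus:
        unigrams.update(line)
        bigrams.update(zip(line, line[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    mi = {}
    for (a, b), count in bigrams.items():
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        mi[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return mi

def segment_by_mi(text, mi, threshold=0.0):
    """Greedily pair adjacent characters, highest MI first (assumed strategy).
    Characters left unpaired become single-character words."""
    scored = [
        (mi.get((text[i], text[i + 1]), float("-inf")), i)
        for i in range(len(text) - 1)
    ]
    scored.sort(key=lambda item: (-item[0], item[1]))
    used = [False] * len(text)
    pair_starts = set()
    for score, i in scored:
        if score > threshold and not used[i] and not used[i + 1]:
            used[i] = used[i + 1] = True
            pair_starts.add(i)
    words, i = [], 0
    while len(text) > i:
        step = 2 if i in pair_starts else 1
        words.append(text[i:i + step])
        i += step
    return words

mi = train_mi(["北京北京", "上海上海"])
print(segment_by_mi("北京上海", mi))
```

On this toy corpus the pairs 北京 and 上海 receive the highest scores, so they are merged, while character pairs never seen together stay separate.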
    <Paragraph position="2"> We present a new method for segmenting Chinese text without using a manually segmented corpus or a predefined dictionary. Statistical machine translation already provides a bilingual corpus, which we use to obtain a segmentation of the Chinese text in the following way. First, we train the statistical translation models on the unsegmented bilingual corpus. As a result, we obtain a mapping of Chinese characters to the corresponding English words for each sentence pair. By using this mapping, we can extract a dictionary automatically. With this self-learned dictionary, we use a segmentation tool to obtain a segmented Chinese text. Finally, we retrain our translation system on the segmented corpus.</Paragraph>
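The dictionary-extraction and segmentation steps above can be sketched in Python as follows. The text does not specify the extraction rule or the segmentation tool, so everything here is an assumption for illustration: the alignment is represented as one English word index per Chinese character, adjacent characters linked to the same English word are taken as one dictionary entry, and forward maximum matching stands in for the segmentation tool. The names extract_dictionary, segment_with_dictionary, and min_count are hypothetical.

```python
from collections import Counter
from itertools import groupby

def extract_dictionary(aligned_corpus, min_count=1):
    """Collect runs of adjacent Chinese characters aligned to the same
    English word. Each corpus item is (chinese_text, alignment), where
    alignment[i] is the English word index for character i, or None."""
    counts = Counter()
    for chars, alignment in aligned_corpus:
        pos = 0
        for target, run in groupby(alignment):
            length = len(list(run))
            if target is not None:
                counts[chars[pos:pos + length]] += 1
            pos += length
    return {word for word, count in counts.items() if count >= min_count}

def segment_with_dictionary(text, dictionary):
    """Forward maximum matching (an assumed stand-in for the segmentation
    tool): take the longest dictionary entry at each position, falling
    back to a single character."""
    max_len = max(map(len, dictionary), default=1)
    words, i = [], 0
    while len(text) > i:
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words

# Toy alignment for "我爱北京" / "I love Beijing": both 北 and 京 link to word 2.
corpus = [("我爱北京", [0, 1, 2, 2])]
dictionary = extract_dictionary(corpus)
print(segment_with_dictionary("北京爱我", dictionary))
```

The first and last steps of the procedure, training the translation models on the unsegmented corpus and retraining on the segmented one, are outside the scope of this sketch.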
    <Paragraph position="3"> Additionally, we have performed experiments without explicit word segmentation. In this case, each Chinese character is interpreted as one &quot;word&quot;. Because our machine translation system is based on word groups, it can work without word segmentation, with only a minor relative loss in translation quality of less than 5%.</Paragraph>
  </Section>
</Paper>