File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/abstr/04/w04-1118_abstr.xml

Size: 1,372 bytes

Last Modified: 2025-10-06 13:43:54

<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-1118">
  <Title>Do We Need Chinese Word Segmentation for Statistical Machine Translation?</Title>
  <Section position="1" start_page="0" end_page="0" type="abstr">
    <SectionTitle>
Abstract
</SectionTitle>
    <Paragraph position="0"> In Chinese texts, words are not separated by whitespace. This is problematic for many natural language processing tasks. The standard approach is to segment the Chinese character sequence into words. Here, we investigate Chinese word segmentation for statistical machine translation. We pursue two goals: the first is to maximize the final translation quality; the second is to minimize the manual effort required to build a translation system.</Paragraph>
    <Paragraph position="1"> The common method for obtaining word boundaries relies on a word segmentation tool and a predefined monolingual dictionary. To avoid making the translation system dependent on an external dictionary, we have developed a system that learns a domain-specific dictionary from the parallel training corpus. This method produces results comparable to those obtained with the predefined dictionary.</Paragraph>
    <Paragraph position="2"> Furthermore, our translation system can work without word segmentation, with only a minor loss in translation quality.</Paragraph>
  </Section>
</Paper>