<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-1204">
  <Title>Training Data Modification for SMT Considering Groups of Synonymous Sentences</Title>
  <Section position="2" start_page="0" end_page="19" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Recently, many researchers have focused on statistical machine translation (SMT) systems, with particular attention given to models and decoding algorithms. The quantity of training data has received less attention, although earlier reports do address the issue. In most cases, the larger the training corpus, the higher the accuracy achieved. The quantity problem is usually discussed in terms of the relationship between training corpus size and system performance; researchers therefore study line graphs that plot accuracy against training corpus size.</Paragraph>
    <Paragraph position="1"> On the other hand, a single sentence in the source language can be translated into several different sentences in the target language. This variety of possible translations makes MT system development and evaluation very difficult.</Paragraph>
    <Paragraph position="2"> Consequently, multiple references are employed when evaluating MT systems with metrics such as BLEU (Papineni et al., 2002) and NIST (Doddington, 2002). Moreover, such variations in translation have a negative effect on training in SMT: when several input-side sentences are translated into exactly equivalent output-side sentences, the probability of the correct translation decreases because of the large number of possible pairs of expressions.</Paragraph>
    <Paragraph position="3"> Therefore, if we can restrain or modify the training corpus, the SMT system might achieve higher accuracy. As an example of modification, different output-side sentences paired with exactly equivalent input-side sentences are replaced with one target sentence. These sentence replacements are required for synonymous sentence sets. Kashioka (2004) discussed synonymous sets of sentences.</Paragraph>
    <Paragraph position="4"> Here, we employ a method to group them as a way of modifying the training corpus for use with SMT.</Paragraph>
    <Paragraph position="5"> This paper focuses on how to control the corpus while giving consideration to synonymous sentence groups.</Paragraph>
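As a rough illustration of the replacement described above, the following Python sketch collapses the different output-side sentences paired with one identical input-side sentence into a single target sentence. The function name unify_targets, the toy sentence pairs, and the rule for choosing the representative (the most frequent target, ties broken by first occurrence) are all assumptions made for this example; the introduction does not state how the single target sentence is selected.

```python
from collections import Counter, defaultdict


def unify_targets(pairs):
    """Replace differing targets of identical sources with one target.

    Assumption: the representative is the most frequent target in the
    group, with ties broken by first occurrence. This selection rule is
    hypothetical and is not taken from the paper.
    """
    # Collect every target sentence seen for each source sentence.
    groups = defaultdict(list)
    for source, target in pairs:
        groups[source].append(target)

    # Pick one representative target per source sentence.
    representative = {}
    for source, targets in groups.items():
        # most_common keeps first-seen order among equal counts.
        representative[source] = Counter(targets).most_common(1)[0][0]

    # Rebuild the corpus in its original order and size.
    return [(source, representative[source]) for source, _ in pairs]


# Toy data, invented for illustration only.
corpus = [
    ("arigatou", "thank you"),
    ("arigatou", "thanks"),
    ("arigatou", "thank you"),
    ("sumimasen", "excuse me"),
]
modified = unify_targets(corpus)
```

The sketch keeps the number of training pairs unchanged and only rewrites the target side, so any frequency information carried by repeated source sentences is preserved. Dropping duplicate pairs instead would be a different design choice.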
  </Section>
</Paper>