<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1017"> <Title>Splitting Input Sentence for Machine Translation Using Language Model with Sentence Similarity</Title> <Section position="2" start_page="0" end_page="1" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> We are exploring methods to boost the translation quality of corpus-based Machine Translation (MT) systems for speech translation. Among them, the technique of splitting an input sentence and translating the split sentences appears promising (Doi and Sumita, 2003).</Paragraph> <Paragraph position="1"> An MT system sometimes fails to translate an input correctly. Such a failure occurs particularly when an input is long. In such a case, by splitting the input, translation may be successfully performed for each portion. Particularly in a dialogue, sentences tend not to have complicated nested structures, and many long sentences can be split into mutually independent portions. Therefore, if the splitting positions and the translations of the split portions are adequate, the possibility that the arrangement of the translations can provide an adequate translation of the complete input is relatively high. For example, the input sentence, &quot;This is a medium size jacket I think it's a good size for you try it on please&quot; can be split into three portions, &quot;This is a medium size jacket&quot;, &quot;I think it's a good size for you&quot; and &quot;try it on please&quot;. In this case, translating the three portions and arranging the results in the same order give us the translation of the input sentence.</Paragraph> <Paragraph position="2"> In previous research on splitting sentences, many methods have been based on word-sequence characteristics like N-gram (Lavie et al., 1996; Berger et al., 1996; Nakajima and Yamamoto, 2001; Gupta et al., 2002). Some research efforts have achieved high performance in recall and precision against correct splitting positions. 
Despite such high performance, from the viewpoint of translation, MT systems are not always able to translate the split sentences well.</Paragraph> <Paragraph position="3"> In order to supplement sentence splitting based on word-sequence characteristics, this paper introduces an additional measure: sentence similarity. In our splitting method, we generate candidates for splitting positions based on N-grams, and select the best combination of positions by measuring sentence similarity. This selection is based on the assumption that a corpus-based MT system can correctly translate a sentence that is similar to a sentence in its training corpus.</Paragraph> <Paragraph position="4"> The following sections describe the proposed splitting method, present experiments using two Example-Based Machine Translation (EBMT) systems, and evaluate the effect of introducing the similarity measure on translation quality.</Paragraph> </Section> </Paper>