<?xml version="1.0" standalone="yes"?>
<Paper uid="P98-1070">
  <Title>Splitting Long or Ill-formed Input for Robust Spoken-language Translation</Title>
  <Section position="3" start_page="0" end_page="421" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> A spoken-language translation system must be able to handle long or ill-formed input. An utterance given as input to a spoken-language translation system is not always a single well-formed sentence. Moreover, in speech translation, the speech recognition result, which is the input to the translation component, may be corrupted even when the input utterance is well-formed. Such a misrecognized result can cause a parsing failure, and consequently no translation output is produced. Furthermore, we cannot expect a speech recognition result to include punctuation marks, such as a comma or a period between words, which are useful information for parsing. 1</Paragraph>
    <Paragraph position="1"> † Current affiliation is NTT Communication Science Laboratories. 1 Punctuation marks are not used in translation input in this paper.</Paragraph>
    <Paragraph position="2"> As a solution for treating long input, long-sentence splitting techniques, such as that of Kim (1994), have been proposed. These techniques, however, rely on many manually written splitting rules and do not treat ill-formed input. Wakita (1997) proposed a robust translation method that locally extracts only reliable parts, i.e., those within the semantic distance threshold and over some word length. This technique, however, does not split input into units globally, and sometimes does not output any translation result.</Paragraph>
    <Paragraph position="3"> This paper proposes an input-splitting method for robust spoken-language translation.</Paragraph>
    <Paragraph position="4"> The proposed method splits input into well-balanced translation units based on a semantic distance calculation. The complete translation result is formed by concatenating the partial translation results of each split unit.</Paragraph>
    <Paragraph position="5"> The proposed method can be incorporated into frameworks that utilize left-to-right parsing and a score for a substructure. In fact, it has been added to Transfer-Driven Machine Translation (TDMT), which was proposed for efficient and robust spoken-language translation (Furuse, 1994; Furuse, 1996). The splitting is performed during TDMT's left-to-right chart parsing, and does not degrade translation efficiency. The proposed method gives TDMT the following advantages: (1) elimination of null outputs, (2) splitting of utterances into sentences, and (3) robust translation of erroneous speech recognition results.</Paragraph>
    <Paragraph position="6"> In the subsequent sections, we will first outline the translation strategy of TDMT. Then, we will explain the framework of our splitting method in Japanese-to-English (JE) and English-to-Japanese (EJ) translation. Next, by comparing the TDMT system's performance on two sets of translations, produced with and without the proposed method, we will demonstrate the usefulness of our method.</Paragraph>
  </Section>
</Paper>