<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-0506">
  <Title>Pre-processing Closed Captions for Machine Translation</Title>
  <Section position="3" start_page="0" end_page="38" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Machine Translation (MT) technology can be embedded in a device to perform real time translation of closed captions included in TV signals. While speed is one factor associated with the construction of such a device, another factor is the language type and format. The challenges posed by closed captions to MT can be attributed to three distinct characteristics: Firstly, closed captions are transcribed speech. Although closed captions are not a completely faithful transcription of TV programs, they render spoken language and therefore the language used is typically colloquial (Nyberg and Mitamura, 1997). They contain many of the phenomena which characterize spoken language: interjections, repetitions, stuttering, ellipsis, interruptions, hesitations. Linguistically and stylistically they differ from written language: sentences are shorter and poorly structured, and contain idiomatic expressions, ungrammaticality, etc. The associated difficulties stem from the inherently colloquial nature of closed captions, and, to different degrees, of all forms of transcribed speech (Hindle, 1983).</Paragraph>
    <Paragraph position="1"> Such difficulties require a different approach than is taken for written documents.</Paragraph>
    <Paragraph position="2"> Secondly, closed captions come in a specific format, which poses problems for their optimal processing. Closed-captioners often split a single utterance between two screens when the character limit for a screen has been exceeded.</Paragraph>
    <Paragraph position="3"> The split is based on considerations of string length rather than on linguistic considerations; hence it can happen at non-constituent boundaries (see Table 1), which makes the real time processing of the separate segments problematic. Another problem is that captions have no upper/lower case distinction. This poses challenges for proper name recognition, since names cannot be identified by an initial capital. Additionally, we cannot rely on an initial uppercase letter to identify a sentence-initial word. This problematic aspect sets the domain of closed captions apart from most text-to-text MT domains, making it more akin, in this respect, to speech translation systems. Although, from a technical point of view, such input format characteristics could be amended, they are most likely not under a developer's control, hence they have to be taken as given.</Paragraph>
    <Paragraph position="4"> Thirdly, closed captions are used under operational constraints. The user has no control over the speed of the image or caption flow, so (s)he must comprehend the caption in the limited time that it appears on the screen.</Paragraph>
    <Paragraph position="5"> Accordingly, the translation of closed captions is a &quot;time-constrained&quot; application, where the user has limited time to comprehend the system output. Hence, an MT system should produce translations comprehensible within the limited time available to the viewer.</Paragraph>
    <Paragraph position="6"> In this paper we focus on the first two factors, as the third has been discussed in (Toole et al., 1998). We discuss how such domain-dependent, problematic factors are dealt with in a pre-processing pipeline that prepares the input for processing by a core MT system. The described methods have been implemented for an MT system that translates English closed captions to Spanish and Portuguese. All the examples here refer to the Spanish module.</Paragraph>
    <Paragraph position="7"> good evening, i'm jim lehrer.</Paragraph>
    <Paragraph position="8"> on the &quot;newshour&quot; tonight, four members of congress debate the u.n. deal with iraq; paul solman tells the troubled story of indonesia's currency; mark shields and paul gigot analyze the political week; and elizabeth farnsworth explains how the universe is getting larger.</Paragraph>
  </Section>
</Paper>