<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-0310">
  <Title>Bootstrapping Parallel Corpora</Title>
  <Section position="2" start_page="0" end_page="100000" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Statistical translation models (such as those formulated in Brown et al. (1993)) are trained from bilingual sentence-aligned texts. The bilingual data used for constructing translation models is often gathered from government documents produced in multiple languages. For example, the Candide system (Berger et al., 1994) was trained on ten years' worth of Canadian Parliament proceedings, which consists of 2.87 million parallel sentences in French and English. While the Candide system was widely regarded as successful, its success is not indicative of the potential for statistical translation between arbitrary language pairs. The reason for this is that collections of parallel texts as large as the Canadian Hansards are rare.</Paragraph>
    <Paragraph position="1"> Al-Onaizan et al. (2000) explains in simple terms the reasons that using large amounts of training data ensures translation quality: if a program sees a particular word or phrase one thousand times during training, it is more likely to learn a correct translation than if sees it ten times, or once, or never. Increasing the amount of training material therefore leads to improved quality. This is illustrated in Figure 1, which plots translation accuracy (measured as 100 minus word error rate) for French=English, German=English, and Spanish=English translation models trained on incrementally larger parallel corpora. The quality of the translations produced by each system increases over the 100,000 training items, and the graph suggests the the trend would continue if more data were added. Notice that the rate of improvement is slow: after 90,000 manually provided training sentences pairs, we only see a 4-6% change in performance. Sufficient performance for statistical models may therefore only come when we have access to many millions of aligned sentences.</Paragraph>
    <Paragraph position="2"> One approach that has been proposed to address the problem of limited training data is to harvest the web for bilingual texts (Resnik, 1998). The STRAND method automatically gathers web pages that are potential translations of each other by looking for documents in one language which have links whose text contains the name of another language. For example, if an English web page had a link with the text &amp;quot;Espa~nol&amp;quot; or &amp;quot;en Espa~nol&amp;quot; then the page linked to is treated as a candidate translation of the English page. Further checks verify the plausibility of its being a translation (Smith, 2002).</Paragraph>
    <Paragraph position="3"> Instead of attempting to gather new translations from the web, we describe an alternate method for automatically creating parallel corpora. Specifically, we examine the use of existing translations as a resource to bootstrap more training data, and to create data for new language pairs. We generate translation models from existing data and use them to produce translations of new sen- null tences. Incorporating this machine-created parallel data to the original set, and retraining the translation models improves the translation accuracy. To perform the retraining we use co-training (Blum and Mitchell, 1998; Abney, 2002) which is a weakly supervised learning technique that relies on having distinct views of the items being classified. The views that we employ for co-training are multiple source documents.</Paragraph>
    <Paragraph position="4"> Section 2 motivates the use of weakly supervised learning, and introduces co-training for machine translation. Section 3 reports our experimental results. One experiment shows that co-training can modestly benefit translation systems trained from similarly sized corpora. A second experiment shows that co-training can have a dramatic benefit when the size of initial training corpora are mismatched. This suggests that co-training for statistical machine translation is especially useful for languages with impoverished training corpora. Section 4 discusses the implications of our experiments, and discusses ways which our methods might be used more practically.</Paragraph>
  </Section>
class="xml-element"></Paper>