<?xml version="1.0" standalone="yes"?> <Paper uid="W03-0310"> <Title>Bootstrapping Parallel Corpora</Title> <Section position="4" start_page="100000" end_page="400000" type="evalu"> <SectionTitle> 3 Experimental Results </SectionTitle> <Paragraph position="0"> In order to conduct co-training experiments we first needed to assemble appropriate corpora. The corpus used in our experiments was assembled from the data used in the multiple source translation paper of Och and Ney (2001). The data was gathered from the Bulletin of the European Union, which is published on the Internet in the eleven official languages of the European Union. We used a subset of the data to create a multi-lingual corpus, aligning sentences between French, Spanish, German, Italian and Portuguese (Simard, 1999). Additionally, we created bilingual corpora between English and each of the five languages using sentences that were not included in the multi-lingual corpus.</Paragraph> <Paragraph position="1"> Och and Ney (2001) used the data to find the translation that was most probable given multiple source strings. They found that multi-source translations using two source languages reduced word error rate when compared to using source strings from a single language.</Paragraph> <Paragraph position="2"> For multi-source translations using source strings in six languages, a greater reduction in word error rate was achieved. Our work is similar in spirit, although instead of applying multi-source translation at translation time, we integrate it into the training stage. Whereas Och and Ney use multiple source strings to improve the quality of a single translation, our co-training method attempts to improve the accuracy of all translation models by bootstrapping more training data from multiple source documents.</Paragraph> <Section position="1" start_page="100000" end_page="100000" type="sub_section"> <SectionTitle> 3.1 Software </SectionTitle> <Paragraph position="0"> The software that we used to train the statistical models and to produce the translations was GIZA++ (Och and Ney, 2000), the CMU-Cambridge Language Modeling Toolkit (Clarkson and Rosenfeld, 1997), and the ISI ReWrite Decoder. The sizes of the language models were fixed throughout the experiments, in order to ensure that any gains were not due to the trivial reason of the language model improving (which could be achieved simply by building a larger monolingual corpus of the target language).</Paragraph> <Paragraph position="1"> The experiments that we conducted used GIZA++ to produce IBM Model 4 translation models. It should be observed, however, that our co-training algorithm is entirely general and may be applied to any formulation of statistical machine translation which relies on parallel corpora for its training data.</Paragraph> </Section> <Section position="2" start_page="100000" end_page="100000" type="sub_section"> <SectionTitle> 3.2 Evaluation </SectionTitle> <Paragraph position="0"> The performance of translation models was evaluated using a held-out set of 1,000 sentences in each language, with reference translations into English. Each translation model was used to produce translations of these sentences, and the machine translations were compared to the reference human translations using word error rate (WER).</Paragraph> <Paragraph position="1"> The results are reported in terms of increasing accuracy, rather than decreasing error. We define accuracy as 100 minus WER.</Paragraph> <Paragraph position="2"> Other evaluation metrics such as position-independent WER or the Bleu method (Papineni et al., 2001) could have been used. While WER may not be the best measure of translation quality, it is sufficient to track performance improvements in the following experiments.</Paragraph>
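<Paragraph> To make the metric concrete, the following is a minimal sketch of sentence-level WER and the derived accuracy score. The edit distance computation is the standard word-level Levenshtein distance; the function names are illustrative, not taken from our implementation.

def word_error_rate(hypothesis, reference):
    """WER: word-level edit distance, normalized by reference length."""
    hyp, ref = hypothesis.split(), reference.split()
    # Dynamic-programming table for Levenshtein distance over words.
    d = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        d[i][0] = i
    for j in range(len(ref) + 1):
        d[0][j] = j
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            substitution_cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,                      # deletion
                          d[i][j - 1] + 1,                      # insertion
                          d[i - 1][j - 1] + substitution_cost)  # substitution
    return 100.0 * d[len(hyp)][len(ref)] / len(ref)

def accuracy(hypothesis, reference):
    # Accuracy as defined above: 100 minus WER.
    return 100.0 - word_error_rate(hypothesis, reference)

For example, accuracy("the cat sat", "the cat sat down") gives 75.0, since one word must be inserted to recover the four-word reference. </Paragraph>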
</Section> <Section position="3" start_page="100000" end_page="400000" type="sub_section"> <SectionTitle> 3.3 Co-training </SectionTitle> <Paragraph position="0"> Table 1 gives the results of co-training using the most accurate translation from the candidate translations produced by five translation models. Each translation model was initially trained on bilingual corpora consisting of around 20,000 human-translated sentences. These translation models were used to translate 63,000 sentences, of which the top 10,000 were selected for the first round.</Paragraph> <Paragraph position="1"> In the next round 53,000 sentences were translated and the top 10,000 sentences were selected for the second round. The final candidate pool contained 43,000 translations, and again the top 10,000 were selected. The table indicates that gains may be had from co-training. Each of the translation models improves over its initial accuracy at some point in the co-training. The German to English translation model improves the most, exhibiting a 2.5% improvement in accuracy.</Paragraph> <Paragraph position="2"> The table further indicates that co-training for machine translation suffers the same problem reported in Pierce and Cardie (2001): gains above the accuracy of the initial corpus are achieved, but decline after a certain number of machine translations are added to the training set. This could be due in part to the manner in which items are selected for each round. Because the best translations are transferred from the candidate pool to the training pool at each round, the number of &quot;easy&quot; translations diminishes over time. As a result, the average accuracy of the training corpora decreased with each round, and the amount of noise being introduced increased. The accuracy gains from co-training might extend for additional rounds if the size of the candidate pool were increased, or if some method were employed to reduce the amount of noise being introduced.</Paragraph>
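<Paragraph> A minimal sketch of one such round follows. The model interface (translate, add_pairs, retrain) and the confidence-based selection are illustrative assumptions rather than our actual implementation; the sketch only shows how the top translations migrate from the candidate pool to every model's training pool.

def cotraining_round(models, candidate_pool, top_k=10000):
    # candidate_pool holds tuples of source sentences, one per language,
    # that are mutual translations of each other. translate is assumed
    # to return a (confidence, english_translation) pair.
    scored = []
    for sources in candidate_pool:
        # Each model translates its own source sentence; keep the
        # candidate translation with the highest model confidence.
        candidates = [m.translate(s) for m, s in zip(models, sources)]
        confidence, translation = max(candidates)
        scored.append((confidence, sources, translation))
    # Select the top_k most confident machine translations ...
    scored.sort(reverse=True)
    selected = scored[:top_k]
    # ... and add them to the training data of every view, pairing the
    # selected English translation with each model's source sentence.
    for i, model in enumerate(models):
        model.add_pairs([(sources[i], translation)
                         for _, sources, translation in selected])
        model.retrain()
    # Shrink the pool, so each successive round draws on 10,000 fewer
    # sentences, as in the 63,000 / 53,000 / 43,000 progression above.
    return [sources for _, sources, _ in scored[top_k:]]
</Paragraph>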
</Section> <Section position="4" start_page="400000" end_page="400000" type="sub_section"> <SectionTitle> 3.4 Coaching </SectionTitle> <Paragraph position="0"> In order to simulate using co-training for language pairs without extensive parallel corpora, we experimented with a variation on co-training for machine translation that we call &quot;coaching&quot;. It employs two translation models of vastly different size. In this case we used a French to English translation model built from 60,000 human-translated sentences and a German to English translation model that contained no human-translated sentences. The German-English translation model was meant to represent a language pair with an extremely impoverished parallel corpus. Coaching is therefore a special case of co-training in that one view (the superior one) never retrains upon material provided by the other (inferior) view.</Paragraph> <Paragraph position="1"> A German-English parallel corpus was created by taking a French-German parallel corpus, translating the French sentences into English, and then aligning the translations with the German sentences. In this experiment the machine translations produced by the French to English translation model were always selected. Figure 3 shows the performance of the resulting German to English translation model for machine-produced parallel corpora of various sizes.</Paragraph> <Paragraph position="2"> We explored this method further by translating 100,000 sentences with each of the non-German translation models from the co-training experiment in Section 3.3. The result was a German-English corpus containing 400,000 sentence pairs. The performance of the resulting model matched the initial accuracy of the model trained on human translations. Thus machine-translated corpora achieved equivalent quality to human-translated corpora once two orders of magnitude more data was added.</Paragraph> <Paragraph position="3"> The graphs illustrate that increasing the performance of translation models may be achievable using machine translations alone. Rather than the 2.5% improvement gained in the co-training experiments, wherein models of similar sizes were used, coaching achieves an improvement of more than 18% by pairing translation models of radically different sizes.</Paragraph>
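<Paragraph> A minimal sketch of the coaching procedure follows, under the same illustrative assumptions as before (the strong_model object and its translate method are hypothetical names, with translate again returning a confidence and an English string).

def coach(strong_model, french_german_pairs):
    # Manufacture a German-English parallel corpus by translating the
    # French side of a French-German corpus with the large French to
    # English model, then aligning the output with the German side.
    synthetic_corpus = []
    for french, german in french_german_pairs:
        confidence, english = strong_model.translate(french)
        # Unlike full co-training, every machine translation is kept:
        # the strong view always supplies material to the weak view.
        synthetic_corpus.append((german, english))
    return synthetic_corpus  # training data for a German-English model

The weak German to English model is then trained from scratch on the synthetic corpus; the strong model itself is never retrained, which is what distinguishes coaching from symmetric co-training. </Paragraph> </Section> </Section> </Paper>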