<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-0828">
  <Title>First Steps towards Multi-Engine Machine Translation</Title>
  <Section position="3" start_page="155" end_page="156" type="metho">
    <SectionTitle>
2 Collecting Translation Candidates
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="155" end_page="155" type="sub_section">
      <SectionTitle>
2.1 Setting up Statistical MT
</SectionTitle>
      <Paragraph position="0"> In the general picture laid out in the preceding section, statistical MT plays an important role for several reasons. On one hand, the construction of a relatively well-performing phrase-based SMT system from a given set of parallel corpora is no more overly difficult, especially if -- as in the case in this shared task -- word alignments and a decoder are provided.</Paragraph>
      <Paragraph position="1"> Furthermore, once the second task in our chain will have been surmounted, it will be relatively easy to feed back building blocks of improved translations into the phrase table, which constitutes the central resource of the SMT system Therefore, SMT facilitates experiments aiming at dynamic and interactive adaptation, the results of which should then also be applicable to MT engines that represent knowledge in a more condensed form.</Paragraph>
      <Paragraph position="2"> In order to collect material for testing these ideas, we constructed phrase tables for all four languages, following roughly the procedure given in (Koehn, 2004) but deviating in one detail related to the treatment of unaligned words at the beginning or end of the phrases1. We used the Pharaoh decoder as described on http://www.statmt.org/wpt05/mt-sharedtask/ after normalization of all tables to lower case.</Paragraph>
    </Section>
    <Section position="2" start_page="155" end_page="156" type="sub_section">
      <SectionTitle>
2.2 Using Commercial Engines
</SectionTitle>
      <Paragraph position="0"> As our main interest is in the integration of statistical and rule-based MT, we tried to collect results from &amp;quot;conventional&amp;quot; MT systems that had more or less uniform characteristics across the languages involved. We could not find MT engines supporting all four source languages, and therefore decided to drop Finnish for this part of the experiment. We sent the texts of the other three languages through several incarnations of Systran-based MT Web-services2 and through an installation of Lernout &amp; Hauspie Power  close to each other so that it did not seem worthwhile to explore the differences systematically. Instead we ranked the services according to errors in an informal comparison and took for each sentence the first available translation in this order.</Paragraph>
      <Paragraph position="1"> 3After having collected or computed all translations, we observed that in the case of French, both systems were quite sensitive to the fact that the apostrophes were formatted as separate tokens in the source texts (l ' homme instead of l'homme). We therefore modified and retranslated the French texts, but did not explore possible effects of similar transformations in the other languages.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="156" end_page="156" type="metho">
    <SectionTitle>
3 Heuristic Selection
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="156" end_page="156" type="sub_section">
      <SectionTitle>
3.1 Approach
</SectionTitle>
      <Paragraph position="0"> We implemented two different ways to select, out of a set of alternative translations of a given sentence, one that looks most promising. The first approach is purely heuristic and is limited to the case where more than two candidates are given. For each candidate, we collect a set of features, consisting of words and word n-grams (n [?]{2,3,4}). Each of these features is weighted by the number of candidates it appears in, and the candidate with the largest feature weight per word is taken. This can be seen as the similarity of each of the candidate to a prototypical version composed as a weighted mixture of the collection, or as being remotely related to a sentence-specific language model derived from the candidates. The heuristic measure was used to select &amp;quot;favorite&amp;quot; from each group of competing translations obtained from the same source sentence, yielding a fourth set of translations for the sentences given in DE, FR, and ES.</Paragraph>
      <Paragraph position="1"> A particularity of the shared task is the fact that the source sentences of the development and test sets form a parallel corpus. Therefore, we can not only integrate multiple translations of the same source sentence into a hopefully better version, but we can merge the translations of corresponding parts from different source languages into a target form that combines their advantages. This approach, called triangulation in (Kay, 1997), can be motivated by the fact that most cases of translation for dissemination involve multiple target languages; hence one can assume that, except for the very first of them, renderings in multiple languages exist and can be used as input to the next step4. See also (Och and Ney, 2001) for some related empirical evidence. In order to obtain a first impression of the potential of triangulation in the domain of parliament debates, we applied the selection heuristics to a set of four translations, one from Finnish, the other three the result of the selections mentioned above.</Paragraph>
    </Section>
    <Section position="2" start_page="156" end_page="156" type="sub_section">
      <SectionTitle>
3.2 Results and Discussion
</SectionTitle>
      <Paragraph position="0"> The BLEU scores (Papineni et al., 2002) for 10 direct translations and 4 sets of heuristic selections 4Admittedly, in typical instances of such chains, English would appear earlier.</Paragraph>
      <Paragraph position="1">  thereof are given in Table 1. These results show that in each group of translations for a given source language, the statistical engine came out best. Furthermore, our heuristic approach for the selection of the best among a small set of candidate translations did not result in an increase of the measured BLEU score, but typically gave a score that was only slightly better than the second best of the ingredients. This somewhat disappointing result can be explained in two ways. Apparently, the selection heuristic does not give effective estimates of translation quality for the candidates. Furthermore, the granularity on which the choices have to bee made is too coarse, i.e. the pieces for which the symbolic engines do produce better translations than the SMT engine are accompanied by too many bad choices so that the net effect is negative.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="156" end_page="157" type="metho">
    <SectionTitle>
4 Statistical Selection
</SectionTitle>
    <Paragraph position="0"> The other score we used was based on probabilities as computed by the trigram language model for English provided by the organizers of the task, in a representation compatible with the SRI LM toolkit  (Stolcke, 2002). However, a correct implementation for obtaining these estimates was not available in time, so the selections generated from the statistical language model could not be used for official submissions, but were generated and evaluated after the closing date. The results, also displayed in Table 1, show that this approach can lead to slight improvements of the BLEU score, which however turn out not to be statistically sigificant in then sense of (Zhang et al., 2004).</Paragraph>
  </Section>
  <Section position="6" start_page="157" end_page="157" type="metho">
    <SectionTitle>
5 Next Steps
</SectionTitle>
    <Paragraph position="0"> When we started the experiments reported here, the hope was to find relatively simple methods to select the best among a small set of candidate translations and to achieve significant improvements of a hybrid architecture over a purely statistical approach. Although we could indeed measure certain improvements, these are not yet big enough for a conclusive &amp;quot;proof of concept&amp;quot;. We have started a refinement of our approach that can not only pick the best among translations of complete sentences, but also judge the quality of the building blocks from which the translations are composed. First informal results look very promising. Once we can replace single phrases that appear in one translation by better alternatives taken from a competing candidate, chances are good that a significant increase of the overall translation quality can be achieved.</Paragraph>
  </Section>
class="xml-element"></Paper>