<?xml version="1.0" standalone="yes"?>
<Paper uid="P05-3026">
  <Title>Multi-Engine Machine Translation Guided by Explicit Word Matching</Title>
  <Section position="4" start_page="101" end_page="102" type="metho">
    <SectionTitle>
2 The MEMT Algorithm
</SectionTitle>
    <Paragraph position="0"> Our Multi-Engine Machine Translation (MEMT) system operates on the single &quot;top-best&quot; translation output produced by each of several MT systems operating on a common input sentence.</Paragraph>
    <Paragraph position="1"> MEMT first aligns the words of the different systems' translations using a word alignment matcher.</Paragraph>
    <Paragraph position="2"> Then, using the alignments provided by the matcher, the system generates a set of synthetic sentence hypothesis translations. Each hypothesis translation is assigned a score based on the alignment information, the confidence of the individual systems, and a language model. The hypothesis translation with the best score is selected as the final output of the MEMT combination.</Paragraph>
    <Section position="1" start_page="101" end_page="101" type="sub_section">
      <SectionTitle>
2.1 The Word Alignment Matcher
</SectionTitle>
      <Paragraph position="0"> The task of the matcher is to produce a word-to-word alignment between the words of two given input strings. Identical words that appear in both input sentences are potential matches. Since the same word may appear multiple times in a sentence, there are multiple ways to produce an alignment between the two input strings. The goal is to find the alignment that represents the best correspondence between the strings. This is defined as the alignment that has the smallest number of &quot;crossing edges&quot;. The matcher can also consider morphological variants of the same word as potential matches. To simultaneously align more than two sentences, the matcher simply produces alignments for all pair-wise combinations of the set of sentences.</Paragraph>
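A brute-force sketch of the crossing-edge criterion, for illustration only. It assumes exact string identity as the only match type (no morphological variants) and assumes that each shared word is matched as many times as both sentences allow. The names `count_crossings` and `best_alignment` are hypothetical, and a practical matcher would need a far more efficient search.

```python
from itertools import combinations, permutations, product


def count_crossings(alignment):
    """Count pairs of edges (i, j) and (k, l) whose order differs on the two sides."""
    return sum(
        1
        for (i, j), (k, l) in combinations(alignment, 2)
        if 0 > (i - k) * (j - l)
    )


def candidate_pairings(pos_a, pos_b):
    """Yield every way to pair up as many occurrences of one word as possible."""
    n = min(len(pos_a), len(pos_b))
    for chosen_a in combinations(pos_a, n):
        for chosen_b in permutations(pos_b, n):
            yield list(zip(chosen_a, chosen_b))


def best_alignment(a, b):
    """Exhaustively search for the alignment with the fewest crossing edges."""
    shared = sorted(set(a).intersection(b))
    options = [
        list(candidate_pairings(
            [i for i, w in enumerate(a) if w == word],
            [j for j, w in enumerate(b) if w == word],
        ))
        for word in shared
    ]
    best = None
    for choice in product(*options):
        edges = sorted(edge for pairing in choice for edge in pairing)
        if best is None or count_crossings(best) > count_crossings(edges):
            best = edges
    return best if best is not None else []
```

For the token lists `["the", "cat", "the"]` and `["the", "the", "cat"]`, the search prefers pairing the two occurrences of "the" in order, which leaves a single crossing edge instead of two.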
      <Paragraph position="1"> In the context of its use within our MEMT approach, the word-alignment matcher provides three main benefits. First, it explicitly identifies translated words that appear in multiple MT translations, allowing the MEMT algorithm to reinforce words that are common among the systems. Second, the alignment information allows the algorithm to ensure that aligned words are not included in a synthetic combination more than once. Third, by allowing long range matches, the synthetic combination generation algorithm can consider different plausible orderings of the matched words, based on their location in the original translations.</Paragraph>
    </Section>
    <Section position="2" start_page="101" end_page="102" type="sub_section">
      <SectionTitle>
2.2 Basic Hypothesis Generation
</SectionTitle>
      <Paragraph position="0"> After the matcher has word-aligned the original system translations, the decoder goes to work. The hypothesis generator produces synthetic combinations of words and phrases from the original translations that satisfy a set of adequacy constraints.</Paragraph>
      <Paragraph position="1"> The generation algorithm is an iterative process and produces these translation hypotheses incrementally. In each iteration, the set of existing partial hypotheses is extended by incorporating an additional word from one of the original translations. For each partial hypothesis, a data structure keeps track of the words from the original translations which are accounted for by this partial hypothesis. One underlying constraint observed by the generator is that the original translations are considered in principle to be word-synchronous, in the sense that selecting a word from one original translation normally implies &quot;marking&quot; a corresponding word in each of the other original translations as &quot;used&quot;. How the corresponding word is determined is explained below. Two partial hypotheses that have the same partial translation but account for different sets of words are considered different. A hypothesis is considered &quot;complete&quot; if the next word chosen to extend the hypothesis is the explicit end-of-sentence marker from one of the original translation strings. At the start of hypothesis generation, there is a single hypothesis, which has the empty string as its partial translation and where none of the words in any of the original translations are marked as used.</Paragraph>
      <Paragraph position="2"> In each iteration, the decoder extends a hypothesis by choosing the next unused word from one of the original translations. When the decoder chooses to extend a hypothesis by selecting word w from original system A, the decoder marks w as used. The decoder then proceeds to identify and mark as used a word in each of the other original systems. If w is aligned to words in any of the other original translation systems, then the words that are aligned with w are also marked as used.</Paragraph>
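One possible representation of a partial hypothesis is sketched below. The class name `Hypothesis`, its field names, and the helper `initial_hypothesis` are hypothetical; the only properties taken from the description above are that a hypothesis pairs a partial translation with a record of used words, that equality depends on both, and that generation starts from a single empty hypothesis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    """A partial translation plus the words it accounts for in each original."""
    words: tuple            # partial translation built so far
    used: tuple             # one frozenset of used word positions per original translation
    complete: bool = False  # set once an end-of-sentence marker has been chosen


def initial_hypothesis(num_systems):
    """The single starting hypothesis: empty string, nothing marked as used."""
    return Hypothesis(words=(), used=tuple(frozenset() for _ in range(num_systems)))
```

Because the dataclass is frozen and all fields are hashable, two hypotheses with the same `words` but different `used` sets compare as unequal and can coexist in a set or as dictionary keys.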
      <Paragraph position="3"> For each system that does not have a word that aligns with w, the decoder establishes an artificial alignment between w and a word in this system.</Paragraph>
      <Paragraph position="4"> The intuition here is that this artificial alignment corresponds to a different translation of the same source-language word that corresponds to w. The choice of an artificial alignment cannot violate constraints that are imposed by alignments that were found by the matcher. If no artificial alignment can be established, then no word from this system will be marked as used. The decoder repeats this process for each of the original translations. Since the order in which the systems are processed matters, the decoder produces a separate hypothesis for each order.</Paragraph>
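The bookkeeping for one extension step might look like the following sketch. The data layout is an assumption: `used` maps each system name to a set of used positions, and `alignments` maps a `(system, position)` pair to the list of `(system, position)` pairs the matcher aligned it with. The function names are hypothetical, and the search for an artificial alignment itself is left out.

```python
from itertools import permutations


def mark_used(used, chosen, alignments):
    """Return a copy of `used` with the chosen word and its matcher-aligned words marked."""
    system, position = chosen
    new_used = {name: set(positions) for name, positions in used.items()}
    new_used[system].add(position)
    for other_system, other_position in alignments.get(chosen, []):
        new_used[other_system].add(other_position)
    return new_used


def unaligned_system_orders(systems, chosen, alignments):
    """List every processing order of the systems that still need an artificial alignment."""
    aligned = {name for name, _ in alignments.get(chosen, [])}
    remaining = [s for s in systems if s != chosen[0] and s not in aligned]
    return list(permutations(remaining))
```

Each order returned by `unaligned_system_orders` would give rise to its own extended hypothesis, since the artificial alignment chosen for one system can constrain the choices left for the next.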
      <Paragraph position="5"> Each iteration expands the previous set of partial hypotheses, resulting in a large space of complete synthetic hypotheses. Since this space can grow exponentially, pruning based on scoring of the partial hypotheses is applied when necessary.</Paragraph>
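The text does not specify the pruning strategy. As one common possibility, the sketch below keeps a fixed-size beam of the highest-scoring partial hypotheses; the function name `prune`, the `score` callback, and the default `beam_size` are all assumptions.

```python
import heapq


def prune(hypotheses, score, beam_size=100):
    """Keep at most `beam_size` hypotheses, preferring higher scores."""
    if beam_size >= len(hypotheses):
        return list(hypotheses)  # nothing to prune
    return heapq.nlargest(beam_size, hypotheses, key=score)
```

Calling `prune` after each iteration bounds the number of partial hypotheses carried forward, at the cost of possibly discarding a hypothesis that would have scored best once completed.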
    </Section>
    <Section position="3" start_page="102" end_page="102" type="sub_section">
      <SectionTitle>
2.3 Confidence Scores
</SectionTitle>
      <Paragraph position="0"> A major component in the scoring of hypothesis translations is a confidence score that is assigned to each of the original translations, which reflects the translation adequacy of the system that produced it. We associate a confidence score with each word in a synthetic translation based on the confidence of the system from which it originated.</Paragraph>
      <Paragraph position="1"> If the word was contributed by several different original translations, we sum the confidences of the contributing systems. This word confidence score is combined multiplicatively with a score assigned to the word by a trigram language model. The score assigned to a complete hypothesis is its geometric average word score. This removes the inherent bias for shorter hypotheses that is present in multiplicative cumulative scores.</Paragraph>
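The scoring scheme can be sketched as follows. The function names are hypothetical, `lm_prob` is assumed to be a probability supplied by an external trigram language model, and all word scores are assumed to be strictly positive so that the logarithm is defined.

```python
import math


def word_score(contributors, system_confidence, lm_prob):
    """Sum the confidences of the contributing systems, then multiply by the LM score."""
    confidence = sum(system_confidence[name] for name in contributors)
    return confidence * lm_prob


def hypothesis_score(word_scores):
    """Geometric mean of the word scores, computed in log space for stability."""
    if not word_scores:
        return 0.0
    log_total = sum(math.log(score) for score in word_scores)
    return math.exp(log_total / len(word_scores))
```

A one-word hypothesis with word score 0.5 and a three-word hypothesis whose words each score 0.5 both receive a geometric mean of 0.5, whereas the plain product would give 0.5 and 0.125 and so favor the shorter hypothesis.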
    </Section>
    <Section position="4" start_page="102" end_page="102" type="sub_section">
      <SectionTitle>
2.4 Restrictions on Artificial Alignments
</SectionTitle>
      <Paragraph position="0"> The basic algorithm works well as long as the original translations are reasonably word-synchronous. Because this is rarely the case, several additional constraints are applied during hypothesis generation.</Paragraph>
      <Paragraph position="1"> First, the decoder discards unused words in original systems that &quot;linger&quot; around too long. Second, the decoder limits how far ahead it looks for an artificial alignment, to prevent incorrect long-range artificial alignments. Finally, the decoder does not allow an artificial match between words that do not share the same part-of-speech.</Paragraph>
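The second and third restrictions can be illustrated with the sketch below. Everything about the interface is an assumption: the function name, the `start` position (taken here to be where the search in the other system begins), the default `max_lookahead`, and the decision to skip words that the matcher has already aligned elsewhere as a simple way of respecting matcher alignments.

```python
def find_artificial_match(word_tag, target_tags, used, matcher_aligned, start, max_lookahead=3):
    """Return the position of an artificial match in the other system, or None.

    word_tag        -- part-of-speech tag of the word just chosen
    target_tags     -- part-of-speech tags of the other system's words, in order
    used            -- positions in the other system already marked as used
    matcher_aligned -- positions in the other system reserved by matcher alignments
    """
    end = min(len(target_tags), start + max_lookahead)
    for position in range(start, end):
        if position in used or position in matcher_aligned:
            continue
        if target_tags[position] == word_tag:
            return position
    return None
```

With tags `["DT", "NN", "VB", "NN"]`, position 1 already used, and a lookahead of 3 from position 0, a noun finds no artificial match; widening the lookahead to 4 lets it reach the noun at position 3.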
    </Section>
  </Section>
  <Section position="5" start_page="102" end_page="103" type="metho">
    <SectionTitle>
3 Experimental Setup
</SectionTitle>
    <Paragraph position="0"> We combined outputs of three Arabic-to-English machine translation systems on the 2003 TIDES Arabic test set. The systems were AppTek's rule-based system, CMU's EBMT system, and Systran's web-based translation system.</Paragraph>
    <Paragraph position="1"> We compare the results of MEMT to the individual online machine translation systems. We also compare the performance of MEMT to the score of an &quot;oracle system&quot; that chooses the best-scoring of the individual systems for each sentence. Note that this oracle is not a realistic system, since a real system cannot determine at run-time which of the original systems is best on a sentence-by-sentence basis. One goal of the evaluation was to see how rich the space of synthetic translations produced by our hypothesis generator is. To this end, we also compare the output selected by our current MEMT system to an &quot;oracle system&quot; that chooses the best synthetic translation that was generated by the decoder for each sentence. This too is not a realistic system, but it allows us to see how well our hypothesis scoring currently performs. This also provides a way of estimating a performance ceiling of the MEMT approach, since our MEMT can only produce words that are provided by the original systems (Hogan and Frederking, 1998).</Paragraph>
    <Paragraph position="2"> Due to the computational complexity of running the oracle system, several practical restrictions were imposed. First, the oracle system only had access to the top 1000 translation hypotheses produced by MEMT for each sentence. While this does not guarantee finding the best translation that the decoder can produce, this method provides a good approximation. We also ran the oracle experiment only on the first 140 sentences of the test set due to time constraints.</Paragraph>
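An oracle of this kind reduces to picking, for each sentence, the candidate with the highest sentence-level score against the reference. In the sketch below the function names are hypothetical, averaging sentence scores is an assumed way of aggregating, the candidate list is assumed to be ranked by the combiner's own score before truncation, and `toy_overlap` is a stand-in metric that is not METEOR.

```python
def oracle_select(candidates, reference, metric, n_best=1000):
    """Pick the candidate that scores highest against the reference.

    Only the first `n_best` candidates are examined.
    """
    return max(candidates[:n_best], key=lambda c: metric(c, reference))


def oracle_corpus_score(candidate_lists, references, metric, n_best=1000):
    """Average the metric over the oracle's per-sentence picks."""
    scores = [
        metric(oracle_select(cands, ref, metric, n_best), ref)
        for cands, ref in zip(candidate_lists, references)
    ]
    return sum(scores) / len(scores)


def toy_overlap(candidate, reference):
    """Stand-in metric: fraction of reference word types found in the candidate."""
    ref_types = set(reference.split())
    return len(ref_types.intersection(candidate.split())) / len(ref_types)
```

The same two functions cover both oracles described above: passing the three original system outputs as `candidates` gives the system-selection oracle, while passing the decoder's top hypotheses gives the synthetic-translation oracle.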
    <Paragraph position="3"> All the system performances are measured using the METEOR evaluation metric (Lavie, Sagae et al., 2004). METEOR was chosen since, unlike the more commonly used BLEU metric (Papineni et al., 2002), it provides reasonably reliable scores for individual sentences. This property is essential in order to run our oracle experiments. METEOR produces scores in the range of [0, 1], based on a combination of unigram precision, unigram recall and an explicit penalty related to the average length of matched segments between the evaluated translation and its reference.</Paragraph>
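The first two ingredients, unigram precision and unigram recall, can be computed as in the sketch below. This is not an implementation of METEOR: it assumes exact token matching, clips repeated words to their count in the reference, and omits both the way the two quantities are combined and the segment-length penalty. The function name is hypothetical.

```python
from collections import Counter


def unigram_precision_recall(candidate, reference):
    """Return (precision, recall) over token lists, with clipped match counts."""
    cand_counts = Counter(candidate)
    ref_counts = Counter(reference)
    matches = sum(min(count, ref_counts[word]) for word, count in cand_counts.items())
    precision = matches / len(candidate) if candidate else 0.0
    recall = matches / len(reference) if reference else 0.0
    return precision, recall
```

Both values lie between 0 and 1 for any single sentence pair, which is what makes sentence-level comparison between candidates possible.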
  </Section>
</Paper>