<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-2107">
  <Title>Statistical Phrase-Based Models for Interactive Computer-Assisted Translation</Title>
  <Section position="3" start_page="0" end_page="835" type="metho">
    <SectionTitle>
2 Statistical machine translation
</SectionTitle>
    <Paragraph position="0"> The goal of SMT is to translate a given source language sentence sJ1 = s1...sJ to a target sentence tI1 = t1...tI. The methodology used (Brown et al., 1993) is based on the definition of a function Pr(tI1|sJ1) that returns the probability that tI1 is a  interaction-0, the system suggests a translation. In interaction-1, the user accepts the first five characters &amp;quot;Move &amp;quot; and presses the key s , then the system suggests completing the sentence with &amp;quot;canned documents to other directory&amp;quot;. Interactions 2 and 3 are similar. In the final interaction, the user completely accepts the present suggestion.</Paragraph>
    <Paragraph position="1"> translation of a given sJ1. Once this function is estimated, the problem can be reduced to search a sentence ^t^I1 that maximizes this probability for a given sJ1.</Paragraph>
    <Paragraph position="3"> Equation 1 summarizes the following three matters to be solved: First, an output language model is needed to distinguish valid sentences from invalid sentences in the target language, Pr(tI1).</Paragraph>
    <Paragraph position="4"> Second, a translation model, Pr(sJ1|tI1). Finally, the design of an algorithm to search for the sentence ^tI1 that maximizes this product.</Paragraph>
  </Section>
  <Section position="4" start_page="835" end_page="835" type="metho">
    <SectionTitle>
3 Statistical computer-assisted translation
</SectionTitle>
    <Paragraph position="0"> translation In a CAT scenario, the source sentence sJ1 and a given prefix of the target sentence ti1 are given. This prefix has been validated by the user (using a previous suggestion by the system plus some corrected words). Now, we are looking for the most probable words that complete this prefix.</Paragraph>
    <Paragraph position="2"> This formulation is very similar to the previous case, but in this one, the search is constrained to the set of possible suffixes tIi+1 instead of the whole target sentences tI1. Therefore, the same techniques (translation models, decoder algorithm, etc.) which have been developed for SMT can be used in CAT.</Paragraph>
    <Paragraph position="3"> Note that the statistical models are defined at word level. However, the CAT interface described in the first section works at character level. This is not a problem: the transformation can be performed in an easy way.</Paragraph>
    <Paragraph position="4"> Another important issue is the computational time required by the system to produce a new suggestion. In the CAT framework, real-time is required. null</Paragraph>
  </Section>
  <Section position="5" start_page="835" end_page="837" type="metho">
    <SectionTitle>
4 Phrase-based models
</SectionTitle>
    <Paragraph position="0"> The usual statistical translation models can be classified as single-word based alignment models.</Paragraph>
    <Paragraph position="1"> Models of this kind assume that an input word is generated by only one output word (Brown et al., 1993). This assumption does not correspond to the characteristics of natural language; in some cases, we need to know a word group in order to obtain a correct translation.</Paragraph>
    <Paragraph position="2"> One initiative for overcoming the above-mentioned restriction of single-word models is known as the template-based approach (Och, 2002). In this approach, an entire group of adjacent words in the source sentence may be aligned with an entire group of adjacent target words. As a result, the context of words has a greater influence and the changes in word order from source to target language can be learned explicitly. A template establishes the reordering between two sequences of word classes. However, the lexical model continues to be based on word-to-word correspondence. null A simple alternative to these models has been proposed, the phrase-based (PB) approach (Tom'as and Casacuberta, 2001; Marcu and Wong, 2002; Zens et al., 2002). The principal innovation of the phrase-based alignment model is that it attempts to calculate the translation probabilities of word sequences (phrases) rather than of only single words. These methods explicitly learn the probability of a  sequence of words in a source sentence (~s) being translated as another sequence of words in the target sentence (~t).</Paragraph>
    <Paragraph position="3"> To define the PB model, we segment the source sentence sJ1 into K phrases (~sK1 ) and the target sentence tI1 into K phrases (~tK1 ). A uniform probability distribution over all possible segmentations is assumed. If we assume a monotone alignment, that is, the target phrase in position k is produced only by the source phrase in the same position</Paragraph>
    <Paragraph position="5"> where the parameter p(~s|~t) estimates the probability of translating the phrase ~t into the phrase ~s.</Paragraph>
    <Paragraph position="6"> A phrase can be comprised of a single word (but empty phrases are not allowed). Thus, the conventional word to word statistical dictionary is included. null If we permit the reordering of the target phrases, a hidden phrase level alignment variable, aK1 , is introduced. In this case, we assume that the target phrase in position k is produced only by the source phrase in position ak.</Paragraph>
    <Paragraph position="8"> where the distortion model p(ak|ak[?]1) (the probability of aligning the target segment k with the source segment ak) depends only on the previous alignment ak[?]1 (first order model). For the distortion model, it is also assumed that an alignment depends only on the distance of the two phrases (Och and Ney, 2000):</Paragraph>
    <Paragraph position="10"> There are different approaches to the parameter estimation. The first one corresponds to a direct learning of the parameters of equations 3 or 4 from a sentence-aligned corpus using a maximum likelihood approach (Tom'as and Casacuberta, 2001; Marcu and Wong, 2002). The second one is heuristic and tries to use a word-aligned corpus (Zens et al., 2002; Koehn et al., 2003). These alignments can be obtained from single-word models (Brown et al., 1993) using the available public software GIZA++ (Och and Ney, 2003). The latter approach is used in this research.</Paragraph>
    <Paragraph position="11"> 5 Decoding in interactive machine translation The search algorithm is a crucial part of a CAT system. Its performance directly affects the quality and efficiency of translation. For CAT search we propose using the same algorithm as in MT.</Paragraph>
    <Paragraph position="12"> Thus, we first describe the search in MT.</Paragraph>
    <Section position="1" start_page="836" end_page="837" type="sub_section">
      <SectionTitle>
5.1 Search for MT
</SectionTitle>
      <Paragraph position="0"> The aim of the search in MT is to look for a target sentence tI1 that maximizes the product</Paragraph>
      <Paragraph position="2"> formed to maximise a log-linear model of Pr(tI1) and Pr(tI1|sJ1)l that allows a simplification of the search process and better empirical results in many translation tasks (Tom'as et al., 2005). Parameter l is introduced in order to adjust the importance of both models. In this section, we describe two search algorithms which are based on multi-stackdecoding (Berger et al., 1996) for the monotone and for the non-monotone model.</Paragraph>
      <Paragraph position="3"> The most common statistical decoder algorithms use the concept of partial translation hypothesis to perform the search (Berger et al., 1996). In a partial hypothesis, some of the source words have been used to generate a target prefix.</Paragraph>
      <Paragraph position="4"> Each hypothesis is scored according to the translation and language model. In our implementation for the monotone model, we define a hypothesis search as the triple (Jprime,tIprime1 ,g), where Jprime is the length of the source prefix we are translating (i.e.</Paragraph>
      <Paragraph position="5"> sJprime1 ); the sequence of Iprime words, tIprime1 , is the target prefix that has been generated and g is the score of the hypothesis (g = Pr(tIprime1 )*Pr(tIprime1 |sJprime1 )l). The translation procedure can be described as follows. The system maintains a large set of hypotheses, each of which has a corresponding translation score. This set starts with an initial empty hypothesis. Each hypothesis is stored in a different stack, according to the source words that have been considered in the hypothesis (Jprime). The algorithm consists of an iterative process. In each iteration, the system selects the best scored partial hypothesis to extend in each stack. The extension consists in selecting one (or more) untranslated word(s) in the source and selecting one (or more) target word(s) that are attached to the existing output prefix. The process continues several times or until there are no more hypotheses to extend. The final hypothesis with the highest score and with no untranslated source words is the out- null put of the search.</Paragraph>
      <Paragraph position="6"> The search can be extended to allow for non-monotone translation. In this extension, several reorderings in the target sequence of phrases are scored with a corresponding probability. We define a hypothesis search as the triple (w,tIprime1 ,g), where w = {1..J} is the coverage set that defines which positions of source words have been translated. For a better comparison of hypotheses, the store of each hypothesis in different stacks according to their value of w is proposed in (Berger et al., 1996). The number of possible stacks can be very high (2J); thus, the stacks are created on demand.</Paragraph>
      <Paragraph position="7"> The translation procedure is similar to the previous one: In each iteration, the system selects the best scored partial hypothesis to extend in each created stack and extends it.</Paragraph>
    </Section>
    <Section position="2" start_page="837" end_page="837" type="sub_section">
      <SectionTitle>
5.2 Search algorithms for iterative MT
</SectionTitle>
      <Paragraph position="0"> The above search algorithm can be adapted to the iterative MT introduced in the first section, i.e.</Paragraph>
      <Paragraph position="1"> given a source sentence sJ1 and a prefix of the target sentence ti1, the aim of the search in iterative MT is to look for a suffix of the target sentence ^t^Ii+1 that maximises the product Pr(tI1)*Pr(sJ1|tI1) (or the log-linear model: Pr(tIprime1 )*Pr(tIprime1 |sJprime1 )l). A simple modification of the search algorithm is necessary. When a hypothesis is extended, if the new hypothesis is not compatible with the fixed target prefix, ti1, then this hypothesis is not considered.</Paragraph>
      <Paragraph position="2"> Note that this prefix is a character sequence and a hypothesis is a word sequence. Thus, the hypothesis is converted to a character sequence before the comparison.</Paragraph>
      <Paragraph position="3"> In the CAT scenario, speed is a critical aspect.</Paragraph>
      <Paragraph position="4"> In the PB approach monotone search is more efficient than non-monotone search and obtains similar translation results for the tasks described in this article (Tom'as and Casacuberta, 2004). However, the use of monotone search in the CAT scenario presents a problem: If a user introduces a prefix that cannot be obtained in a monotone way from the source, the search algorithm is not able to complete this prefix. In order to solve this problem, but without losing too much efficiency, we use the following approach: Non-monotone search is used while the target prefix is generated by the algorithm. Monotone search is used while new words are generated.</Paragraph>
      <Paragraph position="5"> Note that searching for a prefix that we already know may seem useless. The real utility of this phase is marking the words in the target sentence that have been used in the translation of the given prefix.</Paragraph>
      <Paragraph position="6"> A desirable feature of the iterative machine translation system is the possibility of producing a list of target suffixes, instead of only one (Civera et al., 2004). This feature can be easily obtained by keeping the N-best hypotheses in the last stack.</Paragraph>
      <Paragraph position="7"> In practice these N-best hypotheses are too similar. They differ only in one or two words at the end of the sentence. In order to solve this problem, the following procedure is performed: First, generate a hypotheses list using the N-best hypotheses of a regular search. Second, add to this list, new hypotheses formed by a single translation-word from a non-translated source word. Third, add to this list, new hypotheses formed by a single word with a high probability according to the target language model. Finally, sort the list maximising the diversity at the beginning of the suffixes and select the first N hypotheses.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="837" end_page="838" type="metho">
    <SectionTitle>
6 Experimental results
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="837" end_page="838" type="sub_section">
      <SectionTitle>
6.1 Evaluation criteria
</SectionTitle>
      <Paragraph position="0"> Four different measures have been used in the experiments reported in this paper. These measures are based on the comparison of the system output with a single reference.</Paragraph>
      <Paragraph position="1">  * Word Error Rate (WER): Edit distance in terms of words between the target sentence provided by the system and the reference translation (Och and Ney, 2003).</Paragraph>
      <Paragraph position="2"> * Character Error Rate (CER): Edit distance in terms of characters between the target sen- null tence provided by the system and the reference translation (Civera et al., 2004). * Word-Stroke Ratio (WSR): Percentage of words which, in the CAT scenario, must be changed in order to achieve the reference. * Key-Stroke Ratio (KSR): Number of keystrokes that are necessary to achieve the reference translation divided by the number of running characters (Och et al., 2003) 1.</Paragraph>
      <Paragraph position="3"> 1In others works, an extra keystroke is added in the last iteration when the user accepts the sentence. We do not add this extra keystroke. Thus, the KSR obtained in the interaction example of Figure 1, is 3/40.</Paragraph>
      <Paragraph position="4">  eral average response time in the Spanish/English &amp;quot;XRCE&amp;quot; task.</Paragraph>
      <Paragraph position="5"> WER and CER measure the post-editing effort to achieve the reference in an MT scenario. On the other hand, WSR and KSR measure the interactive-editing effort to achieve the reference in a CAT scenario. WER and CER measures have been obtained using the first suggestion of the CAT system, when the validated prefix is void.</Paragraph>
    </Section>
    <Section position="2" start_page="838" end_page="838" type="sub_section">
      <SectionTitle>
6.2 Task description
</SectionTitle>
      <Paragraph position="0"> In order to validate the approach described in this paper a series of experiments were carried out using the XRCE corpus. They involve the translation of technical Xerox manuals from English to Spanish, French and German and from Spanish, French and German to English. In this research, we use the raw version of the corpus. Table 1 shows some statistics of training and test corpus.</Paragraph>
    </Section>
    <Section position="3" start_page="838" end_page="838" type="sub_section">
      <SectionTitle>
6.3 Results
</SectionTitle>
      <Paragraph position="0"> Table 2 shows the WSR and KSR obtained for several average response times, for Spanish/English translations. We can control the response time changing the number of iterations in the search algorithm. Note that real-time restrictions cause a significant degradation of the performance. However, in a real CAT scenario long iteration times can render the system useless. In order to guarantee a fast human interaction, in the remaining experiments of the paper, the mean iteration time is constrained to about 80 ms.</Paragraph>
      <Paragraph position="1"> Table 3 shows the results using monotone search and combining monotone and non-monotone search. Using non-monotone search while the given prefix is translated improves the results significantly.</Paragraph>
      <Paragraph position="2"> Table 4 compares the results when the system proposes only one translation (1-best) and when the system proposes five alternative translations (5-best). Results are better for 5-best. However, in this configuration the user must read five different  best hypothesis and 5-best hypothesis.</Paragraph>
      <Paragraph position="3"> alternatives before choosing. It is still to be shown if this extra time is compensated by the fewer key strokes needed.</Paragraph>
      <Paragraph position="4"> Finally, in table 5 we compare the post-editing effort in an MT scenario (WER and CER) and the interactive-editing effort in a CAT scenario (WSR and KSR). These results show how the number of characters to be changed, needed to achieve the reference, is reduced by more than 50%. The reduction at word level is slight or none. Note that results from English/Spanish are much better than from English/French and English/German. This is because a large part of the English/Spanish test corpus has been obtained from the index of the technical manual, and this kind of text is easier to translate.</Paragraph>
      <Paragraph position="5"> It is not clear how these theoretical gains translate to practical gains, when the system is used by real translators (Macklovitch, 2004).</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="838" end_page="839" type="metho">
    <SectionTitle>
7 Related work
</SectionTitle>
    <Paragraph position="0"> Several CAT systems have been proposed in the TransType projects (SchlumbergerSema S.A. et al., 2001): In (Foster et al., 2002) a maximum entropy version of IBM2 model is used as translation model. It is a very simple model in order to achieve rea- null MT scenario (WER/CER) and the interactive-editing effort in CAT scenario (WSR/KSR). Non-monotone search and 1-best hypothesis is used. sonable interaction times. In this approach, the length of the proposed extension is variable in function of the expected benefit of the human translator.</Paragraph>
    <Paragraph position="1"> In (Och et al., 2003) the Alignment-Templates translation model is used. To achieve fast response time, it proposes to use a word hypothesis graph as an efficient search space representation. This word graph is precalculated before the user interactions. In (Civera et al., 2004) finite state transducers are presented as a candidate technology in the CAT paradigm. These transducers are inferred using the GIATI technique (Casacuberta and Vidal, 2004). To solve the real-time constraints a word hypothesis graph is used. The N-best configuration is proposed.</Paragraph>
    <Paragraph position="2"> In (Bender et al., 2005) the use of a word hypothesis graph is compared with the direct use of the translation model. The combination of two strategies is also proposed.</Paragraph>
  </Section>
class="xml-element"></Paper>