<?xml version="1.0" standalone="yes"?>
<Paper uid="H94-1028">
  <Title>The Candide System for Machine Translation</Title>
  <Section position="3" start_page="0" end_page="157" type="intro">
    <SectionTitle>
2. Statistical Translation
</SectionTitle>
    <Paragraph position="0"> Consider the problem of translating French text to English text. Given a French sentence f, we imagine that it was originally rendered as an equivalent Enghsh sentence e. To obtain the French, the Enghsh was transmitted over a noisy communication channel, which has the curious property that English sentences sent into it emerge as their French translations. The central assumption of Candide's design is that the characteristics of this channel can be determined experimentally, and expressed mathematically.</Paragraph>
    <Paragraph position="1"> *Current address: Renaissance Technologies, Stony Brook, NY  Here f is the French text to be translated, e is the putative original English rendering, and 6 is the English translation. This formalism can be exploited to yield French-to-English translations as follows. Let us write Pr(e I f) for the probability that e was the original English rendering of the French f. Given a French sentence f, the problem of automatic translation reduces to finding the English sentence that maximizes P.r(e I f). That is, we seek 6 = argmsx e Pr(e I f).</Paragraph>
    <Paragraph position="2"> By virtue of Bayes' Theorem, we have</Paragraph>
    <Paragraph position="4"> from the channel when e is its input. We call this function the translation model; its domain is all pairs (f, e) of French and English word-strings. The term Pr(e) models the a priori probability that e was suppled as the channel input. We call this function the language model. Each of these factors--the translation model and the language model--independently produces a score for a candidate English translation e. The translation model ensures that the words of e express the ideas of f, and the language model ensures that e is a grammatical sentence. Candide sehcts as its translation the e that maximizes their product.</Paragraph>
    <Paragraph position="5"> This discussion begs two important questions. First, where do the models Pr(f\[ e) and Pr(e) come from? Second, even if we can get our hands on them, how can we search the set of all English strings to find 6? These questions are addressed in the next two sections.</Paragraph>
    <Section position="1" start_page="0" end_page="157" type="sub_section">
      <SectionTitle>
2.1. Probability Models
</SectionTitle>
      <Paragraph position="0"> We begin with a brief detour into probability theory. A probability model is a mathematical formula that purports to express the chance of some observation. A parametric model is a probability model with adjustable parameters, which can be changed to make the model better match some body of data.</Paragraph>
      <Paragraph position="1"> Let us write c for a body of data to be modeled, and 0 for a vector of parameters. The quantity Prs(c), computed according to some formula involving c and 0, is called the hkelihood  of c. It is the model's assignment of probability to the observation sequence c, according to the current parameter values 0. Typically the formula for the hkehhood includes some conattaints on the dements of 0 to ensure that Pr0(c) reaUy is a probability distribution--that is, it is always a real vahe in \[0, 1\], and for fixed 0 the sum ~c Pr0(c) over all possible c vectors is 1.</Paragraph>
      <Paragraph position="2"> Consider the problem of training this parametric model to the data c; that is, adjusting the 0 to maximize Pr0(c). Finding the maximizing 0 is an exercise in constrained optimization. If the expression for Pr0(c) is of a suitable (simple) form, the maximizing parameter vector 0 can be solved for directly.</Paragraph>
      <Paragraph position="3"> The key elements of this problem are * a vector 0 of adjustable parameters, * constraints on these parameters to ensure that we have a model, * a vector c of observations, and * the adjustment of 0, subject to constraints, to maximize the likelihood Pr0(c).</Paragraph>
      <Paragraph position="4"> We often seek more than a probability model of some observed data c. There may be some hidden statistics h, which are related to c, but which are never directly revealed; in general h itself is restricted to some set 7f of admissible values. For instance, c may be a large corpus of grammatical text, and h an assignment of parts-of-speech to each of its words. model Pr(e). Consider the translation model. As any first-year language student knows, word-for-word translation of English to French does not work. The dictionary equivalents of the Enghsh words can move forward or backward in the sentence, they may disappear completely, and new French words may appear to arise spontaneously.</Paragraph>
      <Paragraph position="5"> Guided by this observation, our approach has been to write down an enormous parametric expression, Pr0(f I e), for the translation model. To give the reader some idea of the scale of the computation, there is a parameter, ~(/\[e), for the probability that any given English word e will translate as any given French word f. There are parameters for the probability that any f may arise spontaneously, and that any e may simply disappear. There are parameters that words may move forward or backward 1, 2, 3, ... positions. And so on. We use a similar approach to write an expression for Pr0(e). In this case the parameters express things like the probability that a word e/may appear in a sentence after some word sequence eta2.., e~-t. In general, the parameters are of the form Pr(e/Iv), where the vector v is a combination of observable statistics like the identities of nearby words, and hidden statistics like the grammatical structure of the sentence. We refer to v as a historyd, from which we predict eC/.</Paragraph>
      <Paragraph position="6"> The parameter values of both models are determined by EM training. For the translation model, the training data consists of English-French sentence pairs (e, f), where e and f are translations of one another. For the language model, it consists exclusively of Enghsh text.</Paragraph>
      <Paragraph position="7"> In such cases, we proceed as follows. First we write down a parametric model Pr0(c, h). Then we attempt to adjust the parameter vector 0 to maximize the likelihood Pr0(c), where this latter is obtained as the sum ~he~ Pr0(c, h).</Paragraph>
      <Paragraph position="8"> Unfortunately, when we attempt to solve this more complicated problem, we often discover that we cannot find a closed-form solution for 0. Instead we obtain formulae that express each of the desired parameters in terms of all the others, and also in terms of the observation vector c.</Paragraph>
      <Paragraph position="9"> Nevertheless, we can frequently apply an iterative technique called the Ezpectation-Mazimization or EM Algorithm; this is a recipe for computing a sequence 0z, 02, .. * of parameter vectors. It can be shown \[2\] that under suitable conditions, each iteration of the algorithm is guaranteed to produce a better model of the training vector c; that is,</Paragraph>
      <Paragraph position="11"> with strict inequality everywhere except at stationary points of Pr0(c). When we adjust the model's parameters this way, we say it has been EM-trained.</Paragraph>
      <Paragraph position="12"> Training a model with hidden statistics is just like training one that lacks them, except that it is not possible to find a maximizing t~ in just one go. Training is now an iteratire process, involving repeated passes over the observation vector. Each pass yields an improved model of that data.</Paragraph>
      <Paragraph position="13"> Now we relate these methods to the problem at hand, which is to develop a translation model Pr(f \] e), and a language</Paragraph>
    </Section>
    <Section position="2" start_page="157" end_page="157" type="sub_section">
      <SectionTitle>
2.2. Decoding
</SectionTitle>
      <Paragraph position="0"> We do not actually search the infinite set of all English word strings to find the 6 that maximizes equation (1). Even if we restricted ourselves to word strings of length h or less, for any realistic length and English vocabulary C, this is far too large a set to search exhaustively. Instead we adapt the well-known stack decoding algorithm \[5\] of speech recognition. Though we will say more about decoding in Section 6 below, most of our research effort has been devoted to the two modeling problems.</Paragraph>
      <Paragraph position="1"> This is not without reason. The translation scheme we have just described can fail in only two ways. The first way is a search error, which means that our decoding procedure did not yield the fi that maximizes Pr(f I e)Pr(e ). The second way is a modeling error, which means that the best English translation, as supplied by a competent human, did not maximize this same product. Our tests show that only 5% of our system's errors are search errors.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>