<?xml version="1.0" standalone="yes"?>
<Paper uid="H94-1028">
  <Title>The Candide System for Machine Translation</Title>
  <Section position="4" start_page="157" end_page="158" type="metho">
    <SectionTitle>
3. Language Modeling
</SectionTitle>
    <Paragraph position="0"> Let e be a string of English words el ... eL. A language model Pr(e) gives the probability that e would appear in grammatical English text.</Paragraph>
    <Paragraph position="1"> By the laws of conditional probability we may write</Paragraph>
    <Paragraph position="3"> Given this decomposition the language modeler's job is to estimate each of the f distributions on the right hand side.</Paragraph>
    <Paragraph position="4"> If IEI is the size of the English vocabulary, then the number of different histories et... eh-t in the kth conditional grows as IEI h-t. This presents problems both in practice and in principle--the former because we don't have enough storage to write down all the different histories, the latter because even if we could, any one history would be exceedingly rare, making it impossible to estimate probabilities accurately.</Paragraph>
    <Paragraph position="5"> For these reasons, Candide has used the so-called trigram model as its workhorse. In this model, we use the approximation null Pr(ek I et... e~_t) ~ Pr(e~ I eh-2ek-~) for each term on the right hand side above. That is, we limit the history to two words. Each triple (ek-2ek-lek) is called a trigram.</Paragraph>
    <Paragraph position="6"> It remains to estimate the Pr(e~leh_2eh_t ). One solution is to use maximum-likelihood trigram probabilities, T(eklek_2e~-t). These are obtained by scanning the training corpus c, counting the incidence of each trigram, and using these counts to form the appropriate conditional estimates. But even for this modest history size, we frequently encounter trigrams during translation that do not appear during training. This is not surprising, since there are IC\[ s = 1.773 x l0 ts possible different trigrams, yet we can encounter no more than Icl of them during training. There are 75,349,888 distinct trigrams in our training corpus, of which 53,737,350 occur exactly once.</Paragraph>
    <Paragraph position="7"> For this reason, we employ the technique of deleted interpolation \[6\]: we express Pr(ek\[ek-2e~-t) as a linear combination of the trigram probability T(ek l ek-2ek-t), the bigram probability B(ekleh_t), the unigram probability U(ek), and the uniform probability 1/IEI. The distributions B and U are obtained by counting the incidence of bigrams and unigrams in the same training corpus c. But there are fewer distinct bigrams, so we have a higher chance of seeing any given one in our training data, and a still higher chance of seeing any given unigram. The resulting formula for Pr(eklek_2ek_t) is called the smoothed trigrarn model.</Paragraph>
    <Paragraph position="8"> Even the smoothed trigram model leaves much to be desired, since it does not account for semantic and syntactic dependencies among words that do not lie within the same trigram. This has led us to use a link grammar model. This is a trainable, probabilistic grammar that attempts to capture all the information present in the trigram model, and also to make the long-range connections among words needed to advance beyond it. Link grammars are discussed in detail in \[7\].</Paragraph>
  </Section>
  <Section position="5" start_page="158" end_page="158" type="metho">
    <SectionTitle>
4. Translation Modeling
</SectionTitle>
    <Paragraph position="0"> This section describes the dements of our translation model, Pr(f \[ e). We have two distinct translation models, both described here: an EM-trained model, and a maximum-entropy model.</Paragraph>
    <Paragraph position="1"> As we explain in Section 4.2 below, the EM-trained model is developed through a succession of five provisional models. Before we describe them, we introduce the notion of alignment. null</Paragraph>
    <Section position="1" start_page="158" end_page="158" type="sub_section">
      <SectionTitle>
4.1. Alignment
</SectionTitle>
      <Paragraph position="0"> Consider a pair of French and English sentences (e, f) that are translations of one another. Although we argued above that word-for-word translation will not work to develop f from e, it is clear that there is some relation between the individual words of the two sentences. A typical assignment of relations  subscripts give the position of each word in the sentence. We call such a set of connections between sentences an alignment. Formally we express it as a set a of pairs (j, i), where each pair stands for a connection between the jth word of f and the ith word of e. Our intention is to connect f~ and ei when ei was one of the words expressing in English the concept that fj (possibly along with other words of f) expresses in French. In its most general form, an alignment may consist of any set a of (j, i) pairs. But for shnplicity, we restrict ourselves to alignments in which each French word is connected to a unique English word.</Paragraph>
      <Paragraph position="1"> We cannot hope to discover alignments with certainty. Our strategy is to train a parametric model for the joint distribution Pr(f, a \[ e), where the alignment a is hidden. In principle, the desired conditional Pr(f I e) may then be obtained as ~aPr(f, a le), where the sum is taken over all possible alignments of e and f. In practice this is possible only for our first two models. For the remaining models, we approximate Pr(f I e) as follows. During training, we find the single most probable alignment &amp;, and sum Pr(f, a I e) over a small neighborhood of &amp;. During decoding, we simply use Pr(f, ale). 4.2. EM-Trained Models We now sketch the structure of five models of increasing complexity, the last of which is our EM-trained translation model. For an in-depth treatment, the reader is referred to \[3\].</Paragraph>
      <Paragraph position="2">  1. Word Translation This is our simplest model, intended  to discover probable individual-word translations. The free parameters of this model are word translation probabilities t(fj I ei). Each of these parameters is initialized to 1/I.FI, where Y is our French vocabulary. Thus we make no initial assumptions about appropriate French-English word pairings. The iterative training procedure automatically finds appropriate translations, and assigns them high probability.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="158" end_page="159" type="metho">
    <SectionTitle>
2. Local Alignment
</SectionTitle>
    <Paragraph position="0"> To make our model more realistic, we introduce an alignment variable a_j for each position j of f; a_j is the position in e to which the jth word of f is aligned. (French words that appear to arise spontaneously are said to align to the null word, in position 0 of e.) Formally, we insert a parameter Pr(a_j | j, m, l) into our expression for Pr(f, a | e). This expresses the probability that position j in an arbitrary French sentence of length m is aligned with position a_j in any English sentence of length l that is its translation. The identities of the words in these positions do not influence the alignment probabilities.</Paragraph>
    <Paragraph position="1"> 3. Fertilities As we observed earlier, a single English word may yield 0, I or more French words, for instance as when not translates to ne...pus. This idea is implicit in our notion of alignment, but not explicitly related to word identities. To capture this phenomenon explicitly, this model introduces the notion of fertility. The fertility ~(el) is the number of French words in f that ei generates in translation. Fertility is incorporated into this model through the parameters ~b(nlel), the probability that ~b(ei) equals n.</Paragraph>
  </Section>
  <Section position="7" start_page="159" end_page="159" type="metho">
    <SectionTitle>
4. Class-Based Alignment
</SectionTitle>
    <Paragraph position="0"> In the preceding model, though the fertilities are conditioned upon word identities, the alignment parameters are not. We have already pointed out how unrealistic this is, since it aligns positions in the (e, f) pair with no regard for the words found there. This model remedies the problem by expressing alignments in terms of parameters that depend upon the classes of words that lie at the aligned positions. Each word f in our French vocabulary F is placed in one of approximately 50 classes; likewise for each e in the English vocabulary E. The assignment of words to classes is made automatically through another statistical training procedure [3].</Paragraph>
    <Paragraph position="1"> 5. Non-Deficient Alignment The preceding two models suffer from a problem we call deficiency: they assign non-zero probability to &amp;quot;alignments&amp;quot; that do not correspond to strings of French words at all. For instance, two French words may be assigned to lie at the same position in the sentence. Words may be placed before the start of the sentence, or after its end. This model eliminates such spurious alignments.</Paragraph>
    <Paragraph position="2"> These five models are trained in succession on the same data, with the final parameter values of one model serving as the starting point for the next. For the current version of Candide, we used a corpus of 2,205,733 English-French sentence pairs, drawn mostly from the Hansards, which are the proceedings of the Canadian Parliament. The entire computation took a total of approximately 3600 processor-hours distributed over fifteen IBM Model 530H POWERstations.</Paragraph>
    <Paragraph position="3"> The reader may be wondering why we have five translation models instead of one. This is because the EM algorithm, though guaranteed to converge to a local maximum, need not converge to a global one. A weakness of the algorithm is that it may yield a parameter vector 8 that is indeed a local maximum, but which does not model the data well.</Paragraph>
    <Paragraph position="4"> It so happens though that model 1 has a special form that ensures that EM training is guaranteed to converge to a global maximum. By using model l's final parameter vector as the initial vector for model 2, we are assured that we are at a reasonably good starting point for training the latter. By extension of this argument, we proceed through the training of each model in succession, with some confidence that each model's starting point is a good one.</Paragraph>
    <Section position="1" start_page="159" end_page="159" type="sub_section">
      <SectionTitle>
4.3. Context Sensitive Models
</SectionTitle>
      <Paragraph position="0"> All of the preceding translation models make one important simplification: each English word acts independently of all the others in the sentence to generate the French words to which it is aligned. But it is easy to convince oneself that this approach is inadequate; clearly run will translate differently in Let's run the program! and Let's run the race!. Intuitively, we would like to make the translation of a word depend upon context in which it appears.</Paragraph>
      <Paragraph position="1"> For this reason, we have constructed translation models that take context into account. Our instinct is to make the translation of a word depend upon its neighbors, say writing t(fj \[ ei ei:~l ...) for the word-translation probabilities. But this is impractical, because of the same difficulties that confront language models with long histories.</Paragraph>
      <Paragraph position="2"> To overcome this, we employ a technique--maximum-entropy modeling--that deals with small chosen subsets of a potentially large number of conditioning variables. We begin with a large set Q = {bl(f,e,e) b2(f,e,e) bs(f,e,e)...) of binary-valued functions. Each such function asks some yes/no question about the French word f, the English word e, and the context e in which e appears.</Paragraph>
      <Paragraph position="3"> The training procedure works iteratively to find a small sub-set Q' = {bhl(fj,el,e) bh2(fj,ei,e)...b~,~(fj,el,e)) that disambiguates the senses of the English word in context. Formally, it develops a distribution t(fj I el Q') that tells us if fj is a good translation of e~ in the context e. Since this procedure is costly in computer time, we develop such models only for the 2,000 most common English words. For more information about maximum-entropy modeling, the reader is referred to \[4\].</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="159" end_page="160" type="metho">
    <SectionTitle>
5. Analysis-Transfer-Synthesis
</SectionTitle>
    <Paragraph position="0"> Although we try to obtain accurate estimates of the parameters of our translation models by training on a large amount of text, this data is not used as effectively as it might be. For instance, the word-translation probabilities ~(parle I speaks) and t(parlent I speak) must be learned separately, though they express the underlying equivalence of the infinitives parler and to speak.</Paragraph>
    <Paragraph position="1"> For this reason, we have adopted for Candide a variation of the analysis-transfer-synthesis paradigm. In this paradigm, translation takes place not between raw French and English texts, but between intermediate forms of the two languages.</Paragraph>
    <Paragraph position="2"> Note that because translation is effected between intermediate French and intermediate English, all our models are trained upon intermediate text as well. For training, each (e, f) pair of our data is subjected to an analysis step: the French is rendered into an intermediate French f', the English into intermediate English e'. The English transformation is constructed to ensure that it is invertible; its inverse, from intermediate English to standard English, is usually called synthesis.</Paragraph>
    <Paragraph position="3"> The aim of these transformations is three-fold: to suppress lexicai variations that conceal regularities between the two languages, to reduce the size of both vocabularies, and to reduce the burden on the alignment model by making coor- null dinating phrases resemble each other as closely as possible with respect to length and word order.</Paragraph>
    <Paragraph position="4"> Both the English and the French analysis steps consist of five classes of operations: segmentation, name and number detection, case and spelling correction, morphological analysis, and linguistic normalization. During segmentation, the French is divided (if possible) into shorter phrases that represent distinct concepts. This does not modify the text, but the translation model, used later, respects this division by ignoring alignments that cross segment boundaries.</Paragraph>
    <Paragraph position="5"> During name and number detection, numbers and proper names--word strings such as Ethiopie, Grande Bretagne and $.85 era--are removed from the French text and replaced by generic name and number markers. Removing names and numbers greatly reduces the size of PS and .T. The excised texts are translated by rule and kept in a table, to be substituted back into the English sentence during synthesis.</Paragraph>
    <Paragraph position="6"> During case and spelling correction, we correct any obvious spelling errors, and suppress the case variations in word spellings that arise from the conventions of English and French typography.</Paragraph>
    <Paragraph position="7"> During morphological analysis, we first use a hidden Maxkov model \[8\] to assign part-of-speech labels to the French, then use these labels to replace inflected verb forms with their infiuitives, preceded by an appropriate tense marker. We also put nouns into singular form and precede them by number markers, and perform a variety of other morphological transformations. null Finally, during linguistic normalization we perform a series of word reorderings, insertions and rewritings intended to regularize each language, and to make the two languages more closely resemble each other. For example, the contractions au and du are rewritten as d le and de le. Constructions such as il y a and he...pus are replaced with one-word tokens. The English possessive construction is made to resemble French by removing the's or 'sutfix, reordering noun phrases, and inserting an additional token. Thus my aunt's pen becomes intermediate English dummy-article pen's my aunt; note the similarity to the French le stylo de ma tante.</Paragraph>
  </Section>
  <Section position="9" start_page="160" end_page="160" type="metho">
    <SectionTitle>
6. Operation of Candide
</SectionTitle>
    <Paragraph position="0"> In previous sections we have indicated how the parameters of Candide's various models are determined via the EM algorithm and ma~c_imum-entropy methods. We now outline the steps involved in the execution of Candide as it translates a French passage into English. The process of translation, divided into analysis, transfer, and synthesis stages, is depicted in Figure 3.</Paragraph>
    <Paragraph position="1"> In the analysis stage, the French input string f is converted into f~, as discussed above. The output of this stage is denoted in Figure 3 as Intermediate French.</Paragraph>
    <Paragraph position="2"> The transfer stage constitutes the decoding process sketched in Section 2.2 above. Decoding consists of two steps. In the first step, Candide develops a set H* of candidate decodings, using coarse versions of our translation and language models to select its elements. In the second step, the system expands H* and rescores the enlarged set using more sophisticated models. We now describe both steps in greater detail.</Paragraph>
    <Paragraph position="3"> In the first step, Candide applies a variation of the stack decoding algorithm to generate candidate decodings. Decoding proceeds left-to-right, one intermediate English word at a time. At each stage we maintain a ranked set H (~) of partial hypotheses for the intermediate English ~.</Paragraph>
    <Paragraph position="4"> In general, the elements of H (~) are partial decodings of f~; that is, only the leading i words of ~t have been filled in, and these account for only some of the words of f~. To advance the decoding, some elements of H (i) are selected to be extended by one word. The translation and language models work together to generate the i + 1st word; the resulting partial decodings are ranked; this ranked set is H (~+l). An hypothesis is complete when all words of f~ have been accounted for. Note that while the intermediate English is generated left-to-right, the treatment of intermediate French words does not necessarily proceed left-to-right, due to the word-reordering property of the channel. This is one of the key ways that translation differs from speech--a difference that greatly complicates the decoding process.</Paragraph>
    <Paragraph position="5"> The ranking of hypotheses is according to the product Pr(f~ I e~)Pr(e~). In the interest of speed, and because we must deal with partial rather than complete sentences, we employ the EM-tralned translation model and the smoothed trigram language model. The output of this step is a ranked set H* of the 140 best intermediate English sentences.</Paragraph>
    <Paragraph position="6"> During the second step, called perturbation search, we enlarge H* by considering sequences of single-word deletions, insertions or replacements to its elements. Then we rerank the enlarged set using the link grammar language model and the maximum-entropy translation model. The highest-scoring intermediate English sentence that we encounter during perturbation search is the output ~ of the transfer stage.</Paragraph>
    <Paragraph position="7"> The final stage, synthesis, converts the intermediate English ~ into a plain English sentence ~.</Paragraph>
  </Section>
class="xml-element"></Paper>