<?xml version="1.0" standalone="yes"?>
<Paper uid="P98-2162">
  <Title>Improving Statistical Natural Language Translation with Categories and Rules</Title>
  <Section position="4" start_page="0" end_page="985" type="metho">
    <SectionTitle>
2 Learning of the Translation Lexicon
</SectionTitle>
    <Paragraph position="0"> In order to determine the STL, we use a statistical model for translation and the EM algorithm to adjust its model parameters. The simple model 1 (Brown et al., 1993) for the translation of a SL sentence d = dl...dt in a TL sentence e = el... em assumes that every TL word is generated independently as a mixture of the SL words:</Paragraph>
    <Paragraph position="2"> In the equation above t(ej\[di) stands for the probability that ej is generated by di.</Paragraph>
    <Paragraph position="3"> The assumption that each SL word influences every TL word with the same strength appears to be too simple. In the refined model 2 (Brown et al., 1993) alignment probabilities a(ilj , l, m) are included to model the effect that the position of a word influences the position of its translation.</Paragraph>
    <Paragraph position="4"> The phrasal organization of natural languages is well known and has been described by (Jackendorff, 1977) among many others. The tra- null ditional alignment probabilities depend on absolute positions and do not take that into account, as has already been noted by (Vogel et al., 1996). Therefore we developed a kind of relative weighting probability. The following model -- which we will call the model 2 ~ -makes the weight between the words di and ej dependent on the relative distances between the words dk which generated the previous word</Paragraph>
    <Paragraph position="6"> Here d(i - kll ) is the probability that word di influences a word ej if the previous word ej-1 is influenced by dk. As an effect of such a weight a (phrase-)cluster of words being moved over a long distance receives additional 'cost' only at the ends of the cluster. So we have the final translation probability for model 2~:</Paragraph>
    <Paragraph position="8"> The parameters involved can be determined using the EM algorithm (Baum, 1972). The application of this algorithm to the basic problem using a parallel bilingual corpus aligned on the sentence level is described in (Brown et al., 1993).</Paragraph>
  </Section>
  <Section position="5" start_page="985" end_page="985" type="metho">
    <SectionTitle>
3 Determining a Word Alignment
</SectionTitle>
    <Paragraph position="0"> The kind of WA we use is more general than the often used WA through a vector, where every TL word is generated by exactly one SL word. We use a matrix Z for every sentence pair, whose fields describe whether or not two words are aligned. In this approach, multiple words can be aligned to one TL word, which is motivated by collocation phenomena as for instance German compound nouns. Alignments may look like the one in figure 1 according to our method. The matrix Z contains i + 1 lines and j rows with binary values. The value zij = 1 (zij = 0) means that the word i influences (not) the word j. In figure 1 every link stands for zij = l.</Paragraph>
    <Paragraph position="1"> The models 1, 2 and 2 ~ and some similar mod-</Paragraph>
  </Section>
  <Section position="6" start_page="985" end_page="985" type="metho">
    <SectionTitle>
[Figure 1: example of a word alignment between an SL and a TL sentence]
</SectionTitle>
    <Paragraph position="0"/>
    <Paragraph position="2"> where the value xij is the strength of the influence of word di to word ej. We use a threshold 0 &lt; 1 in such a way that while the sum ~=o xi~j of the first s values is smaller than O. ~tk= o Xkj we set zi~j = O. The other values are set to 1. The permutation i0,..., il sorts the xij so that Xioj &lt; ... &lt; Xilj.</Paragraph>
    <Paragraph position="3"> Interestingly using such a WA technique does not in general lead to the same results when applied from TL to SL and vice versa. If we use P(e\[d) or P(dle ) we receive different WAs z~ d and z d-e. Intuitively the relation between the words of the sentences should be symmetric and there should be the same WA. It is possible to enforce the symmetry with zij = zed. zdeij, in order to make a link between two words only if there is a link in both WAs.</Paragraph>
    <Paragraph position="4"> It is possible to include the WA into the EM algorithm for the estimation of the model probabilities. This can be done by replacing t(ej Idi) by t(ejldi).zi j. The resulting STL becomes much cleaner in the sense that it does not contain so many wrong entries (see section 7).</Paragraph>
  </Section>
  <Section position="7" start_page="985" end_page="986" type="metho">
    <SectionTitle>
4 Learning of Translation Rules
</SectionTitle>
    <Paragraph position="0"> The incorporation of TRs adds an &amp;quot;examplebased&amp;quot; touch to the statistical approach. In a very naive approach a TR could be represented by a translation example. The obvious advantage is an expectable good quality of the translated sentences. The disadvantage is the fact that almost no sentence can be translated because every corpus would have too few examples -- the generalization capability of the naive approach is very limited.</Paragraph>
    <Paragraph position="1"> We desired a general kind of TR which does not use explicit linguistic properties of the used languages. In addition the rules should generalize from very sparse data. Therefore it seemed  natural to use WCs and shorter sequences to end up with a set of rather general rules. In order to achieve a good learning performance, all the WCs of a language are pairwise disjoint (see section 5). The function C(.) gives the class of a word or the sequence of WCs of a sequence of words.</Paragraph>
    <Paragraph position="2"> Our TRs axe triples (D, E, Z) where D is a sequence of SL WCs, E is a sequence of TL WCs and Z is a WA matrix between D and E. For using one rule in the translation process we first rewrite the probability P(eld):</Paragraph>
    <Paragraph position="4"> In order to simplify the maximization (equation 1) we use only the TR which gives the maximum probability.</Paragraph>
    <Paragraph position="5"> During the learning of those TRs we count all extractable rules occurring in the aligned corpus and define the probability p(E, ZlC(d)) P(E, Zld ) in terms of the relative frequency. We approximate P(elE, Z,d ) by simpler probabilities, so that we finally need a language model p(ejle~-l), a translation model p(ej Id, Z) and a probability p(ejlEj). For p(ejle~ -1) we use a class-based polygram language model (Schukat-Talamazzini, 1994). For the translation probability p(ej Id, Z) we use model 1 and include the information of the WA:</Paragraph>
    <Paragraph position="7"> Figure 2 shows how the application of those rules works in principle. We arrive at a list of word hypotheses with probabilities for each position. Neglecting the language model, the best decision would be to independently choose the most probable word for every position.</Paragraph>
    <Paragraph position="8"> In general the translation of a sentence involves more than one rule and usually there are many rules applicable. An applicable rule is one where the sequence of SL WCs matches a sequence of WCs in the sentence. So in the general case we have to decide for a set of rules we want to apply. This set of rules has to cover the sentence, this means that every word is used in a rule and that no word is used twice or more times. The next step is to decide how to arrange the generated units to get the translated sentence. Finally we have to decide for every position which word to use. We want all those decisions to be optimal in the sense that the following product is maximized:</Paragraph>
    <Paragraph position="10"> is a permutation of the numbers 1,..., L.</Paragraph>
  </Section>
  <Section position="8" start_page="986" end_page="987" type="metho">
    <SectionTitle>
5 Learning of Category Systems
</SectionTitle>
    <Paragraph position="0"> During the last decade some publications have discussed the problem of learning WCs using clustering techniques based on maximum likelihood criteria applied to single language corpora. The question which we pose in addition is: Which WCs are suitable for translation? It seems to make sense to require that the used WCs in the two languages are correlated, so that the information about the class of a SL word gives much information about the class of the generated TL word. Therefore it has been argued in (Fung and Wu, 1995) that independently generated WCs are not good for the use in translation.</Paragraph>
    <Paragraph position="1"> For the automatic generation of class systems exists a well known procedure (see (Kneser and Ney, 1993), (Och, 1995)) which maximizes the perplexity of the language model for a training corpus by moving one word from a class to another in an iterative procedure. The function ML(CINw_~w, ) which has to be optimized depends only on the count function Nw~w, which counts the frequency that the word w' comes after the word w.</Paragraph>
    <Paragraph position="2"> Using two sets of WCs for the TL and SL which are independent (method INDEP) does not guarantee that those WCs are much correlated. The resulting WCs have only the prop-erty that the information about the class of a word w has much information about the class of the following word w'. We want for the WCs used for translation that the information about the WC of a word has much information about the WC of the translation. For the use of the standard method for optimizing WCs we need only define a count function Nd-+e, which we do by Nd-.e(d,e) := t(eld)&amp;quot; n(e). In the  source text translation rule \[2 word hypotheses  mined and we get the new optimization criterion M L ( Cd t~Ce I Nd--+e-J- Need). The resulting classes are strongly correlated, but rarely contain words with similar syntactic/semantic properties. To arrive at WCs having both (method COMB), we determine TL WCs with the first method and afterwards we determine SL WCs with the second method.</Paragraph>
    <Paragraph position="3"> So we can use the well known iterative method to end up with WCs in different languages which are correlated. From those WCs we expect that they are more suitable for building the TRs from section 4 and finally result in a better overall translation performance.</Paragraph>
    <Paragraph position="4"> 6 Translation as a Search Problem The problem of finding the translation of a sentence can be viewed as a search problem for a path with minimal cost in a tree. If we apply the negative logarithm to the product of probabilities in equation 8 we arrive at a sum of costs which has to be minimized. The costs stem from the language model, the rule probabilities and the translation probabilities. In the search tree every node represents a partial translation for the first words or a full translation. The leaves of the tree are the nodes where the applied rules define a complete cover of the SL sentence. To reduce the search space we use additional costs for changing the order of the fragments.</Paragraph>
    <Paragraph position="5"> We use a beam search strategy (Greer et al., 1982) to find a good path in this tree. To make the search feasible we had to implement some problem specific heuristics.</Paragraph>
  </Section>
class="xml-element"></Paper>