<?xml version="1.0" standalone="yes"?>
<Paper uid="P02-1038">
  <Title>Discriminative Training and Maximum Entropy Models for Statistical Machine Translation</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Alignment Templates
</SectionTitle>
    <Paragraph position="0"> As specific MT method, we use the alignment template approach (Och et al., 1999). The key elements of this approach are the alignment templates, which are pairs of source and target language phrases together with an alignment between the words within the phrases. The advantage of the alignment template approach compared to single word-based statistical translation models is that word context and local changes in word order are explicitly considered. null The alignment template model refines the translation probability Pr(fJ1 jeI1) by introducing two hidden variables zK1 and aK1 for the K alignment templates and the alignment of the alignment templates:</Paragraph>
    <Paragraph position="2"> Hence, we obtain three different probability distributions: Pr(aK1 jeI1), Pr(zK1 jaK1 ;eI1) and Pr(fJ1 jzK1 ;aK1 ;eI1). Here, we omit a detailed description of modeling, training and search, as this is not relevant for the subsequent exposition. For further details, see (Och et al., 1999).</Paragraph>
    <Paragraph position="3"> To use these three component models in a direct maximum entropy approach, we define three different feature functions for each component of the translation model instead of one feature function for the whole translation model p(fJ1 jeI1). The feature functions have then not only a dependence on fJ1 and eI1 but also on zK1 , aK1 .</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Feature functions
</SectionTitle>
    <Paragraph position="0"> So far, we use the logarithm of the components of a translation model as feature functions. This is a very convenient approach to improve the quality of a baseline system. Yet, we are not limited to train only model scaling factors, but we have many possibilities: null + We could add a sentence length feature:</Paragraph>
    <Paragraph position="2"> This corresponds to a word penalty for each produced target word.</Paragraph>
    <Paragraph position="3"> + We could use additional language models by using features of the following form:</Paragraph>
    <Paragraph position="5"> + We could use a feature that counts how many entries of a conventional lexicon co-occur in the given sentence pair. Therefore, the weight for the provided conventional dictionary can be learned. The intuition is that the conventional dictionary is expected to be more reliable than the automatically trained lexicon and therefore should get a larger weight.</Paragraph>
    <Paragraph position="6"> + We could use lexical features, which fire if a certain lexical relationship (f;e) occurs:</Paragraph>
    <Paragraph position="8"> + We could use grammatical features that relate certain grammatical dependencies of source and target language. For example, using a function k(C/) that counts how many verb groups exist in the source or the target sentence, we can define the following feature, which is 1 if each of the two sentences contains the same number of verb groups:</Paragraph>
    <Paragraph position="10"> In the same way, we can introduce semantic features or pragmatic features such as the dialogue act classification.</Paragraph>
    <Paragraph position="11"> We can use numerous additional features that deal with specific problems of the baseline statistical MT system. In this paper, we shall use the first three of these features. As additional language model, we use a class-based five-gram language model. This feature and the word penalty feature allow a straight-forward integration into the used dynamic programming search algorithm (Och et al., 1999). As this is not possible for the conventional dictionary feature, we use n-best rescoring for this feature.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Training
</SectionTitle>
    <Paragraph position="0"> To train the model parameters ,M1 of the direct translation model according to Eq. 11, we use the GIS (Generalized Iterative Scaling) algorithm (Darroch and Ratcliff, 1972). It should be noted that, as was already shown by (Darroch and Ratcliff, 1972), by applying suitable transformations, the GIS algorithm is able to handle any type of real-valued features. To apply this algorithm, we have to solve various practical problems.</Paragraph>
    <Paragraph position="1"> The renormalization needed in Eq. 8 requires a sum over a large number of possible sentences, for which we do not know an efficient algorithm.</Paragraph>
    <Paragraph position="2"> Hence, we approximate this sum by sampling the space of all possible sentences by a large set of highly probable sentences. The set of considered sentences is computed by an appropriately extended version of the used search algorithm (Och et al., 1999) computing an approximate n-best list of translations. null Unlike automatic speech recognition, we do not have one reference sentence, but there exists a number of reference sentences. Yet, the criterion as it is described in Eq. 11 allows for only one reference translation. Hence, we change the criterion to allow Rs reference translations es;1;::: ;es;Rs for the sentence es:</Paragraph>
    <Paragraph position="4"> We use this optimization criterion instead of the optimization criterion shown in Eq. 11.</Paragraph>
    <Paragraph position="5"> In addition, we might have the problem that no single of the reference translations is part of the n-best list because the search algorithm performs pruning, which in principle limits the possible translations that can be produced given a certain input sentence. To solve this problem, we define for maximum entropy training each sentence as reference translation that has the minimal number of word errors with respect to any of the reference translations.</Paragraph>
  </Section>
class="xml-element"></Paper>