<?xml version="1.0" standalone="yes"?>
<Paper uid="W01-1408">
  <Title>An Efficient A* Search Algorithm for Statistical Machine Translation</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Model 4 from (Brown et al., 1993)
</SectionTitle>
    <Paragraph position="0"> In Model 4 the statistical alignment model is decomposed into five sub-models: the lexicon model p(fje) for the probability that the source word f is a translation of the target word e, the distortion model p=1(j j0jC(fj);E) for the probability that the translations of two consecutive target words have the position difference j j0 where C(fj) is the word class of fj and E is the word class of the first of the two consecutive target words, the distortion model p&gt;1(j j0jC(fj)) for the probability that the words aligned to one target words have the position difference j j0, the fertility model p( je) for the probability that a target language word e is aligned to source language words, the empty word fertility model p( 0je0) for the probability that exactly 0 words remain unaligned to.</Paragraph>
    <Paragraph position="1"> The final probability p(fJ1 ;aJ1jeI1) for Model 4 is obtained by multiplying the probabilities of the sub-models for all words. For a detailed description for Model 4 the reader is referred to (Brown et al., 1993).</Paragraph>
    <Paragraph position="2"> We use Model 4 in this paper for two reasons. First, it has been shown that Model 4 produces a very good alignment quality in comparison to various other alignment models (Och and Ney, 2000b). Second, the dependences in the distortion model along the target language words make it quite easy to integrate standard n-gram language models in the search process. This would be more difficult in the HMM alignment model (Vogel et al., 1996). Yet, many of the results presented in the following are also applicable to other alignment models.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Search problem
</SectionTitle>
    <Paragraph position="0"> The following tasks have to be performed both using A* and beam search (BS): The search space has to be structured into a search graph. This search graph typically includes an initial node, intermediary nodes (partial hypotheses), and goal nodes (com null pleted hypotheses). A node contains the following information: - the predecessor words u;v in the target language, - the score of the hypothesis, - a backpointer to the preceding partial hypothesis, - the model specific information de- null scribed at the end of this subsection. A scoring function Q(n) + h(n) has to be defined which assigns a score to every node n. For beam search, this is the score Q(n) of a best path to this node. In the A* algorithm, an estimation h(n) of the score of a best path from node n to a goal node is added.</Paragraph>
    <Paragraph position="1"> (Berger et al., 1996) presented a method to structure the search space. Our search algorithm for Model 4 uses a similar structuring of the search space. We will shortly review the basic concepts of this search space structure: Every partial hypothesis consists of a prefix of the target sentence and a corresponding alignment. A partial hypothesis is extended by accounting for exactly one additional word of the source sentence. Every extension yields an extension score which is computed by taking into account the lexicon, distortion, and fertility probabilities involved with this extension. A partial hypothesis is called open if more source words are to be aligned to the current target word in the following extensions. A hypothesis that is not open is said to be closed. Every extension of an open hypothesis will extend the fertility of the previously produced target word and an extension of a closed hypothesis will produce a new word. Therefore, the language model score is added as well if a closed hypothesis is extended.</Paragraph>
    <Paragraph position="2"> It is prohibitive to consider all possible translations of all words. Instead, we restrict the search to the most promising candidates by calculating &amp;quot;inverse translations&amp;quot; (Al-Onaizan et al., 1999). The inverse translation probability p(e j f) of a source word f is calculated as</Paragraph>
    <Paragraph position="4"> where we use a unigram model p(e) to estimate the prior probability of a target word being used. Like (Al-Onaizan et al., 1999), we use only the top 12 translations of a given source language word. In addition, we remove from this list all words whose inverse translation probability is lower than 0.01 times the best inverse translation probability. This observation pruning is the only pruning involved in our A* search algorithm. Experiments showed this does not impair translation quality, but the search becomes much more efficient. null In order to keep the search space as small as possible it is crucial to perform a recombination of search hypotheses. Every two hypotheses which can be distinguished by neither the language model state nor the translation model state can be recombined, only the hypothesis with a better score of the two needs to be considered in the subsequent search process. We use a standard trigram language model, so the relevant language model state of node n consists of the current word w(n) and the previous word v(n) (later on we will describe an improvement to this). The translation model state depends on the specific model dependencies of Model 4: a coverage set C(n) containing the already translated source language positions, the position j(n) of the previously translated source word, a flag indicating whether the hypothesis is open or closed, the number of source language words which are aligned to the empty word, a flag showing whether the hypothesis is a complete hypothesis or not.</Paragraph>
    <Paragraph position="5"> Efficient language model recombination The recombination procedure which is described above can be improved by taking into account the backing-off structure of the language model. The trigram language model we use has the property that if the count of the bigram N(u;v) = 0, then the probability P(wju;v) depends only on v. In this case the recombination can be significantly improved by recombining all nodes whose language model state has the property N(u;v) = 0 only with respect to v. Obviously, this could be generalized to other types of language models as well.</Paragraph>
    <Paragraph position="6"> Experiments have shown that by using this efficient recombination, the number of needed hypotheses can be reduced by about a factor of 4. Search algorithms We evaluate the following two search algorithms: beam search algorithm (BS): (Tillmann, 2001; Tillmann and Ney, 2000) In this algorithm the search space is explored in a breadth-first manner. The search algorithm is based on a dynamic programming approach and applies various pruning techniques in order to restrict the number of considered hypotheses. For more details see (Tillmann, 2001).</Paragraph>
    <Paragraph position="7"> A* search algorithm: In A*, all search hypotheses are managed in a priority queue. The basic A* search (Nilsson, 1971) can be described as follows:  1. initialize priority queue with an empty hypothesis 2. remove the hypothesis with the highest score from the priority queue 3. if this hypothesis is a goal hypothesis: output this hypothesis and terminate 4. produce all extensions of this hypothesis and put the extensions to the queue 5. goto 2  The so-called heuristic function estimates the probability of a completion of a partial hypothesis. This function is called admissible if it never underestimates this probability. Thus, admissible heuristic functions are always optimistic. The A* search algorithm corresponds to the Dijkstra algorithm if the heuristic function is equal to zero.</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Admissible heuristic function
</SectionTitle>
    <Paragraph position="0"> In order to perform an efficient search with the A* search algorithm it is crucial to use a good heuristic function. We only know of the work by (Wang and Waibel, 1997) dealing with heuristic functions for search in statistical machine translation. They developed a simple heuristic function for Model 2 from (Brown et al., 1993) which was non admissible. In the following we develop a guaranteed admissible heuristic function for Model 4 taking into account distortion probabilities and the coupling of lexicon, fertility, and language model probabilities.</Paragraph>
    <Paragraph position="1"> The basic idea for developing a heuristic function for the alignment models is the fact that all source sentence positions which have not been covered so far still have to be translated in order to complete the sentence. Therefore, the value of the heuristic function HX(n) for a node n can be deduced if we have an estimation hX(j) of the optimal score of translating position j (here X denotes different possibilities to choose the heuristic function):</Paragraph>
    <Paragraph position="3"> where C(n) is the coverage set.</Paragraph>
    <Paragraph position="4"> The simplest realization of a heuristic function, denoted as hT(j), takes into account only the translation probability p(fje): hT(j) = maxe p(fjje) This heuristic function can be refined by introducing also the fertility probabilities (symbol F) of a target word e:</Paragraph>
    <Paragraph position="6"> Thereby, a coupling between the translation and fertility probabilities is achieved. We have to take the -th root in order to avoid that the fertility probability of a target word whose fertility is higher than one is taken into account for every source word aligned to it. For words which are translated by the empty word e0, no fertility probability is used.</Paragraph>
    <Paragraph position="7"> The language model can be incorporated by considering that for every target word there exists an optimal language model probability: pL(e) = maxu;v p(eju;v) Here, we assume a trigram language model.</Paragraph>
    <Paragraph position="8"> Thus, a heuristic function including a coupling between translation, fertility, and language model probabilities (TFL) is given by:</Paragraph>
    <Paragraph position="10"> This value can be precomputed efficiently before the search process itself starts.</Paragraph>
    <Paragraph position="11"> The heuristic function for the distortion probabilities depends on the used model. For Model 4,</Paragraph>
    <Paragraph position="13"> Here, E refers to the class of the previously aligned target word.</Paragraph>
    <Paragraph position="14"> The heuristic functions hD(j) involve maximizations over the source positions j0. The domain of this variable shrinks during search as more and more words get translated. Therefore, it is possible to improve this heuristic function during search to perform a maximization only over the free source language positions j0. For Model 4 we compute the following heuristic function with two arguments:</Paragraph>
    <Paragraph position="16"> Thus, we obtain as an estimation of the distortion</Paragraph>
    <Paragraph position="18"> This yields the following heuristic functions taking into account translation, fertility, language, and distortion model probabilities:</Paragraph>
    <Paragraph position="20"> Using these heuristic functions we have the overhead of performing this rest cost estimation for every coverage set in search. The experiments will show that these additional costs are overcompensated by the gain in reducing the search space that has to be expanded during the A* search.</Paragraph>
    <Paragraph position="21"> To assess the predictive power of the various components in the heuristic, we compare the value of the heuristic function of the empty hypothesis with the score of the optimal translation. A heuristic function is better if the difference between these two values is small. Table 1 contains a comparison of various heuristic functions. We compare the average costs (negative logarithm of the probabilities) of the optimal translation and the average of the estimated costs of the empty hypothesis. Typically, the estimated costs of TFLD and the real costs differ by factor  We will see later in Section 6 that the guaranteed admissible heuristic functions described above result in dramatically more efficient search.</Paragraph>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Empirical heuristic functions
</SectionTitle>
    <Paragraph position="0"> In this section we describe a new method to obtain an almost admissible heuristic function by a multi pass search. This yields a significantly more efficient search than using the admissible heuristic functions. Thus, we lose the strict guarantee to avoid search errors, but obtain a significant time gain.</Paragraph>
    <Paragraph position="1"> The idea of an empirical heuristic function is to perform a multi-pass search. In the first pass a good admissible heuristic function (here: HTFLD) is used. If this search does not need too much memory the search process is finished. If the search failed, it is restarted using an improved heuristic function which had been obtained during the initial search process. This heuristic function is computed such that it has the property that it is admissible with respect to the explored search space. That means, the heuristic function is optimistic with respect to every node in the search space explored in the first pass.</Paragraph>
    <Paragraph position="2"> Specifically, during the first pass, we maintain a two-dimensional matrix hE(j;j0) with (J +2) (J + 2) entries which are all initialized with 1.</Paragraph>
    <Paragraph position="3"> The entry hE(j;j0) is the best score that was computed for translating the source language word in position j0 if the previously covered source sentence position is j. The matrix entry is updated for every extension of a node n ! n0:</Paragraph>
    <Paragraph position="5"> Here, p(n;n0) is the probability of the extension n ! n0. hE(0;j) is the empirical score of starting a sentence by covering the j-th source sentence position first. Likewise, hE(j;J +1) is the empirical score of finishing a sentence with j as the last source sentence position that was covered.</Paragraph>
    <Paragraph position="7"> In this calculation of hE(j), we maximize over the columns of a matrix. The translation of the source sentence can be viewed as a Traveling Salesman Problem where the source sentence positions are the cities that have to be visited. Thus, the maximization over the columns is equivalent to assuring that the position j will be left after the visit. We design an improved heuristic function using the following principle (Aigner, 1993): Each city has to be both reached and left. Therefore, in order to take an upper bound of reaching a city into account, we divide each column of the matrix by its maximum and maximize over the rows of the matrix (Aigner, 1993):</Paragraph>
    <Paragraph position="9"> We obtain the following empirical heuristic functions: null</Paragraph>
    <Paragraph position="11"> If the search fails in the first pass due to the restriction of the number of hypotheses - which was 1 million in all experiments - the search can be started again using HE+(n) as a heuristic. To avoid an overestimation of the actual costs, we multiply the empirical costs by a factor lower than  1. We found in our experiments that a factor of 0.7 is sufficient. The search was restarted up to 4 times if it failed. Using this method, it is possi- null ble to translate sentences that are longer than 10 words with a restriction to 1 million hypotheses. Table 1 shows the value of the empirical heuristic function of the empty node compared to the score of the optimal goal node. The estimated costs and the real costs now differ only by a factor of 1.5 instead of a factor of 3 for the TFLD heuristic function before.</Paragraph>
  </Section>
class="xml-element"></Paper>