<?xml version="1.0" standalone="yes"?> <Paper uid="W05-0834"> <Title>Word Graphs for Statistical Machine Translation</Title>
<Section position="3" start_page="191" end_page="191" type="metho"> <SectionTitle> + Interactive Machine Translation. </SectionTitle>
<Paragraph position="0"> Some interactive machine translation systems make use of word graphs, e.g. (Och et al., 2003).</Paragraph>
<Paragraph position="1"> State Of The Art. Despite these many applications, there are only a few publications directly devoted to word graphs. The only publication we are aware of is (Ueffing et al., 2002). Its shortcomings are: + They use single-word-based models only. Current state-of-the-art statistical machine translation systems are phrase-based.</Paragraph>
<Paragraph position="2"> + Their graph pruning method is suboptimal, as it considers only partial scores and not full path scores.</Paragraph>
<Paragraph position="3"> + The N-best list extraction does not eliminate duplicates, i.e. different paths that represent the same translation candidate.</Paragraph>
<Paragraph position="4"> + The rest cost estimation is not efficient: it has an exponential worst-case time complexity. We will describe an algorithm with linear worst-case complexity.</Paragraph>
<Paragraph position="5"> Apart from (Ueffing et al., 2002), publications on weighted finite-state transducer approaches to machine translation, e.g. (Bangalore and Riccardi, 2001; Kumar and Byrne, 2003), also deal with word graphs. But to our knowledge, there are no publications that give a detailed analysis and evaluation of the quality of word graphs for machine translation. We will fill this gap and give a systematic description and an assessment of the quality of word graphs for phrase-based machine translation. We will show that even for hard tasks with a very large vocabulary and long sentences, the graph error rate drops significantly.
The remaining part is structured as follows: first, we will give a brief description of the translation system in Section 2. In Section 3, we will give a definition of word graphs and describe their generation. We will also present efficient pruning and N-best list extraction techniques. In Section 4, we will describe evaluation criteria for word graphs. We will use the graph word error rate, which is well known from speech recognition. Additionally, we introduce the novel position-independent word graph error rate and the graph BLEU score. These are generalizations of the commonly used string-to-string evaluation criteria in machine translation. We will present experimental results in Section 5 for two Chinese-English tasks: the first one, the IWSLT task, is in the domain of basic travel expressions found in phrasebooks. The vocabulary is limited and the sentences are short. The second task is the NIST Chinese-English large data track task. Here, the domain is news; therefore the vocabulary is very large and, with an average of 30 words, the sentences are quite long.</Paragraph> </Section>
<Section position="4" start_page="191" end_page="191" type="metho"> <SectionTitle> 2 Translation System </SectionTitle>
<Paragraph position="0"> In this section, we give a brief description of the translation system. We use a phrase-based translation approach as described in (Zens and Ney, 2004).</Paragraph>
<Paragraph position="1"> The posterior probability Pr(e_1^I | f_1^J) is modeled directly using a weighted log-linear combination of a trigram language model and various translation models: a phrase translation model and a word-based lexicon model. These translation models are used in both directions: p(f|e) and p(e|f). Additionally, we use a word penalty and a phrase penalty. With the exception of the language model, all models can be considered as within-phrase models, as they depend only on a single phrase pair and not on the context outside of the phrase. The model scaling factors are optimized with respect to some evaluation criterion (Och, 2003).</Paragraph>
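To make the model combination concrete, here is a minimal sketch of how such a weighted log-linear score can be computed. The feature names, feature values and scaling factors below are invented for illustration only and do not correspond to the system's actual configuration.

```python
def log_linear_score(features, weights):
    """Combine model scores log-linearly: score = sum_m lambda_m * h_m(e, f).

    `features` maps model names to feature values h_m(e, f) (e.g. log-probabilities);
    `weights` maps the same names to scaling factors lambda_m.
    """
    return sum(weights[name] * h for name, h in features.items())

# Illustrative values for one sentence pair (e, f); the names mirror the models
# listed above (LM, phrase and lexicon models in both directions, penalties).
features = {
    "lm": -42.1,            # trigram language model log-probability
    "phrase_f2e": -35.7,    # phrase translation model p(f|e)
    "phrase_e2f": -33.9,    # phrase translation model p(e|f)
    "lex_f2e": -30.2,       # word-based lexicon model p(f|e)
    "lex_e2f": -29.8,       # word-based lexicon model p(e|f)
    "word_penalty": 12.0,   # number of target words produced
    "phrase_penalty": 5.0,  # number of phrases used
}
weights = {"lm": 1.0, "phrase_f2e": 0.2, "phrase_e2f": 0.2,
           "lex_f2e": 0.1, "lex_e2f": 0.1,
           "word_penalty": -0.3, "phrase_penalty": -0.1}

print(log_linear_score(features, weights))
```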
<Paragraph position="2"> We extended the monotone search algorithm from (Zens and Ney, 2004) such that reorderings are possible. In our case, we assume that local reorderings are sufficient. Within a certain window, all possible permutations of the source positions are allowed. These permutations are represented as a reordering graph, similar to (Zens et al., 2002). Once we have this reordering graph, we perform a monotone phrase-based translation of this graph. More details of this reordering approach are described in (Kanthak et al., 2005).</Paragraph> </Section>
<Section position="5" start_page="191" end_page="193" type="metho"> <SectionTitle> 3 Word Graphs </SectionTitle> <Paragraph position="0"/>
<Section position="1" start_page="191" end_page="192" type="sub_section"> <SectionTitle> 3.1 Definition </SectionTitle>
<Paragraph position="0"> A word graph is a directed acyclic graph G = (V, E) with one designated root node n_0 ∈ V. The edges are labeled with words and optionally with scores.</Paragraph>
<Paragraph position="1"> We will use (n, n', w) to denote an edge from node n to node n' with word label w. Each path through the word graph represents a translation candidate. If the word graph contains scores, we accumulate the edge scores along a path to get the sentence or string score.</Paragraph>
<Paragraph position="2"> The score information the word graph has to contain depends on the application.</Paragraph>
<Paragraph position="3"> If we want to use the word graph as a word filter, we do not need any score information at all. If we want to extract the single-best or N-best hypotheses, we have to retain the string or sentence score information. The information about the hidden variables of the search, e.g. the phrase segmentation, is not needed for this purpose. For discriminative training of the phrase translation probabilities, we need all the information, even about the hidden variables.</Paragraph> </Section>
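A minimal data structure following this definition might look as follows. The class and method names are our own, and scores are assumed to be negative log-probabilities, so that path scores are accumulated by addition.

```python
from dataclasses import dataclass, field

@dataclass
class WordGraph:
    """Directed acyclic graph; edges carry a word label and an optional score."""
    root: int = 0
    # edges[n] is the list of outgoing edges (n_prime, word, score) of node n
    edges: dict = field(default_factory=dict)
    final_nodes: set = field(default_factory=set)

    def add_edge(self, n, n_prime, word, score=0.0):
        self.edges.setdefault(n, []).append((n_prime, word, score))

    def path_score(self, path):
        """Accumulate edge scores (negative log-probabilities) along a path."""
        return sum(score for _, _, score in path)

# A toy graph with two translation candidates sharing the first word:
g = WordGraph()
g.add_edge(0, 1, "we", 0.2)
g.add_edge(1, 2, "accept", 1.1)
g.add_edge(1, 2, "agree", 0.9)
g.final_nodes.add(2)
```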
<Section position="2" start_page="192" end_page="192" type="sub_section"> <SectionTitle> 3.2 Generation </SectionTitle>
<Paragraph position="0"> In this section, we analyze the search process in detail. Later, in Section 5, we will show the (experimental) complexity of each step. We start with the source language sentence, which is represented as a linear graph. Then, we introduce reorderings into this graph as described in (Kanthak et al., 2005). The type of reordering should depend on the language pair. In our case, we assume that only local reorderings are required. Within a certain window, all possible reorderings of the source positions are allowed.</Paragraph>
<Paragraph position="1"> These permutations are represented as a reordering graph, similar to (Knight and Al-Onaizan, 1998) and (Zens et al., 2002).</Paragraph>
<Paragraph position="2"> Once we have this reordering graph, we perform a monotone phrase-based translation of this graph.</Paragraph>
<Paragraph position="3"> This translation process consists of the following steps, which will be described afterward: 1. segment into phrases, 2. translate the individual phrases, 3. split the phrases into words, 4. apply the language model. Now, we will describe each step. The first step is the segmentation into phrases. This can be imagined as introducing short-cuts into the graph. The phrase segmentation does not affect the number of nodes, because only additional edges are added to the graph.</Paragraph>
<Paragraph position="4"> In the segmented graph, each edge represents a source phrase. Now, we replace each edge with one edge for each possible phrase translation. The edge scores are the combination of the different translation probabilities, namely the within-phrase models mentioned in Section 2. Again, this step does not increase the number of nodes, but only the number of edges.</Paragraph>
<Paragraph position="5"> So far, the edge labels of our graph are phrases. In the final word graph, we want to have words as edge labels. Therefore, we replace each edge representing a multi-word target phrase with a sequence of edges that represents the target word sequence. Obviously, edges representing a single-word phrase do not have to be changed.</Paragraph>
<Paragraph position="6"> As we will show in the results section, the word graphs up to this point are rather compact. The score information in the word graph so far consists of the reordering model scores and the phrase translation model scores. To obtain the sentence posterior probability p(e_1^I | f_1^J), we apply the target language model. To do this, we have to separate paths according to the language model history. This increases the word graph size by an order of magnitude.</Paragraph>
<Paragraph position="7"> Finally, we have generated a word graph with full sentence scores. Note that the word graph may contain a word sequence multiple times with different hidden variables. For instance, two different segmentations into source phrases may result in the same target sentence translation.</Paragraph>
<Paragraph position="8"> The described steps can be implemented using weighted finite-state transducers, similar to (Kumar and Byrne, 2003).</Paragraph> </Section>
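To illustrate steps 1-3 (segmentation, phrase translation and splitting phrases into words), the following sketch builds such a word graph for the simplified monotone case. The phrase table format and the example entries are invented, scores are placed on the first word edge of each phrase, and the language model step, which would additionally split nodes by LM history, is omitted.

```python
def generate_word_graph(source, phrase_table, max_phrase_len=4):
    """Build a translation word graph for a monotone segmentation of `source`.

    `phrase_table` maps source phrases (tuples of words) to lists of
    (target_phrase, score) pairs.  Nodes 0..J correspond to source positions;
    fresh nodes are created for word positions inside a target phrase.
    """
    edges = []                        # (from_node, to_node, word, score)
    next_node = len(source) + 1       # fresh nodes for intra-phrase words
    for j in range(len(source)):                                     # step 1: segmentation
        for k in range(j + 1, min(j + max_phrase_len, len(source)) + 1):
            src_phrase = tuple(source[j:k])
            for tgt_phrase, score in phrase_table.get(src_phrase, []):  # step 2: translation
                prev = j
                for i, word in enumerate(tgt_phrase):                   # step 3: split into words
                    last = (i == len(tgt_phrase) - 1)
                    nxt = k if last else next_node
                    if not last:
                        next_node += 1
                    edges.append((prev, nxt, word, score if i == 0 else 0.0))
                    prev = nxt
    return edges

# Invented toy phrase table and source sentence:
phrase_table = {("wo",): [(("I",), 0.5)],
                ("xihuan", "cha"): [(("like", "tea"), 0.7), (("likes", "tea"), 1.4)]}
print(generate_word_graph(["wo", "xihuan", "cha"], phrase_table))
```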
<Section position="3" start_page="192" end_page="193" type="sub_section"> <SectionTitle> 3.3 Pruning </SectionTitle>
<Paragraph position="0"> To adjust the size of the word graph to the desired density, we can reduce the word graph size using forward-backward pruning, which is well known in the speech recognition community, see e.g. (Mangu et al., 2000). This pruning method guarantees that the good strings (with respect to the model scores) remain in the word graph, whereas the bad ones are removed. The important point is that we compare full path scores and not only partial scores as, for instance, in the beam pruning method of (Ueffing et al., 2002).</Paragraph>
<Paragraph position="1"> The forward probabilities F(n) and backward probabilities B(n) of a node n are defined by the following recursive equations:</Paragraph>
<Paragraph position="2"> F(n) = Σ_{(n', n, w) ∈ E} F(n') · p(n', n, w) and B(n) = Σ_{(n, n', w) ∈ E} B(n') · p(n, n', w), where p(n, n', w) denotes the probability of the edge (n, n', w).</Paragraph>
<Paragraph position="3"> The forward probability of the root node and the backward probabilities of the final nodes are initialized with one. Using a topological sorting of the nodes, the forward and backward probabilities can be computed with linear time complexity. The posterior probability q(n, n', w) of an edge is defined as:</Paragraph>
<Paragraph position="4"> q(n, n', w) = F(n) · p(n, n', w) · B(n') / B(n_0)</Paragraph>
<Paragraph position="5"> The posterior probability of an edge is identical to the sum over the posterior probabilities of all full paths that contain this edge. Note that the backward probability of the root node B(n_0) is identical to the sum over all sentence probabilities in the word graph.</Paragraph>
<Paragraph position="6"> Let q* denote the maximum posterior probability of all edges and let t be a pruning threshold; then we prune an edge (n, n', w) if: q(n, n', w) < q* · t</Paragraph> </Section>
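A compact sketch of this forward-backward pruning, assuming the nodes are already given in topological order and the edge scores are probabilities rather than negative log-probabilities:

```python
from collections import defaultdict

def forward_backward_prune(nodes, edges, root, finals, threshold):
    """Keep only edges whose posterior probability is at least threshold * q_max.

    `nodes` must be topologically sorted; `edges` are tuples (n, n_prime, word, prob).
    This is a sketch of the method described above, not the original implementation.
    """
    out_edges = defaultdict(list)
    for e in edges:
        out_edges[e[0]].append(e)

    F = defaultdict(float)                      # forward probabilities
    F[root] = 1.0
    for n in nodes:
        for (_, n_prime, _, p) in out_edges[n]:
            F[n_prime] += F[n] * p

    B = defaultdict(float)                      # backward probabilities
    for n in finals:
        B[n] = 1.0
    for n in reversed(nodes):
        for (_, n_prime, _, p) in out_edges[n]:
            B[n] += B[n_prime] * p

    # edge posteriors q(n, n', w); B[root] is the sum over all sentence probabilities
    q = {e: F[e[0]] * e[3] * B[e[1]] / B[root] for e in edges}
    q_max = max(q.values())
    return [e for e in edges if q[e] >= q_max * threshold]

# Toy usage: two alternatives for the first word, one continuation.
nodes = [0, 1, 2]
edges = [(0, 1, "we", 0.6), (0, 1, "I", 0.4), (1, 2, "agree", 1.0)]
print(forward_backward_prune(nodes, edges, root=0, finals={2}, threshold=0.5))
```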
<Section position="4" start_page="193" end_page="193" type="sub_section"> <SectionTitle> 3.4 N-Best List Extraction </SectionTitle>
<Paragraph position="0"> In this section, we describe the extraction of the N-best translation candidates from a word graph.</Paragraph>
<Paragraph position="1"> (Ueffing et al., 2002) and (Mohri and Riley, 2002) both present an algorithm based on the same idea: use a modified A* algorithm with an optimal rest cost estimation. As rest cost estimation, the negated logarithm of the backward probabilities is used. The algorithm in (Ueffing et al., 2002) has two disadvantages: it does not eliminate duplicates, and the rest cost computation is suboptimal, as the described algorithm has an exponential worst-case complexity.</Paragraph>
<Paragraph position="2"> As mentioned in the previous section, the backward probabilities can be computed in linear time.</Paragraph>
<Paragraph position="3"> In (Mohri and Riley, 2002), the word graph is represented as a weighted finite-state automaton. The word graph is first determinized, i.e. the nondeterministic automaton is transformed into an equivalent deterministic automaton. This process removes the duplicates from the word graph. Out of this determinized word graph, the N best candidates are extracted. In (Mohri and Riley, 2002), ε-transitions, i.e. transitions that do not produce a word, are ignored.</Paragraph>
<Paragraph position="4"> These ε-transitions usually occur in the backing-off case of language models. The ε-transitions have to be removed before using the algorithm of (Mohri and Riley, 2002). In the presence of ε-transitions, two paths representing the same string are considered equal only if the ε-transitions are identical as well.</Paragraph> </Section> </Section>
<Section position="6" start_page="193" end_page="194" type="metho"> <SectionTitle> 4 Evaluation Criteria </SectionTitle> <Paragraph position="0"/>
<Section position="1" start_page="193" end_page="193" type="sub_section"> <SectionTitle> 4.1 String-To-String Criteria </SectionTitle>
<Paragraph position="0"> To evaluate the single-best translation hypotheses, we use the following string-to-string criteria: word error rate (WER), position-independent word error rate (PER) and the BLEU score. More details on these standard criteria can be found, for instance, in (Och, 2003).</Paragraph> </Section>
<Section position="2" start_page="193" end_page="194" type="sub_section"> <SectionTitle> 4.2 Graph-To-String Criteria </SectionTitle>
<Paragraph position="0"> To evaluate the quality of the word graphs, we generalize the string-to-string criteria to work on word graphs. We will use the well-known graph word error rate (GWER), see also (Ueffing et al., 2002). Additionally, we introduce two novel graph-to-string criteria, namely the position-independent graph word error rate (GPER) and the graph BLEU score (GBLEU). The idea of these graph-to-string criteria is to choose a sequence from the word graph and compute the corresponding string-to-string criterion for this specific sequence. The choice of the sequence is such that the criterion is the optimum over all possible sequences in the word graph, i.e. the minimum for GWER/GPER and the maximum for GBLEU.</Paragraph>
<Paragraph position="1"> The GWER is a generalization of the word error rate. It is a lower bound for the WER. It can be computed using a dynamic programming algorithm which is quite similar to the usual edit distance computation. Visiting the nodes of the word graph in topological order helps to avoid repeated computations.</Paragraph>
<Paragraph position="2"> The GPER is a generalization of the position-independent word error rate. It is a lower bound for the PER. The computation is not as straightforward as for the GWER.</Paragraph>
<Paragraph position="3"> In (Ueffing and Ney, 2004), a method for computing the string-to-string PER is presented. This method cannot be generalized for the graph-to-string computation in a straightforward way. Therefore, we will first describe an alternative computation of the string-to-string PER and then use this idea for the graph-to-string PER.</Paragraph>
<Paragraph position="4"> Now, we want to compute the number of position-independent errors for two strings. As the word order of the strings does not matter, we represent them as multisets A and B. To do this, it is sufficient to know how many words are in A but not in B, i.e. a := |A \ B|, and how many words are in B but not in A, i.e. b := |B \ A|. The numbers of substitutions, insertions and deletions are then:</Paragraph>
<Paragraph position="5"> substitutions = min(a, b), deletions = max(0, a - b) and insertions = max(0, b - a), taking A to be the reference and B the hypothesis; in total, there are max(a, b) errors.</Paragraph>
<Paragraph position="7"> It is obvious that there are either no insertions or no deletions. The PER is then computed as the number of errors divided by the length of the reference string.</Paragraph>
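This multiset computation can be written down directly with Python's Counter; the example strings below are invented.

```python
from collections import Counter

def per_errors(reference, hypothesis):
    """Position-independent errors from multiset differences.

    a = |A \\ B| (reference words missing from the hypothesis),
    b = |B \\ A| (hypothesis words not in the reference);
    substitutions = min(a, b), deletions = max(0, a - b),
    insertions = max(0, b - a), so the total is max(a, b).
    """
    A, B = Counter(reference), Counter(hypothesis)
    a = sum((A - B).values())
    b = sum((B - A).values())
    return max(a, b)

ref = "we will meet at noon".split()
hyp = "at noon we meet again".split()
print(per_errors(ref, hyp) / len(ref))   # PER = errors / reference length
```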
<Paragraph position="8"> Now, back to the graph-to-string PER computation. The information we need at each node of the word graph is the following: the remaining multiset of words of the reference string that have not yet been produced. We denote this multiset by C. The cardinality of this multiset will become the value a in the preceding notation. In addition to this multiset, we also need to count the number of words that we have produced on the way to this node but which are not in the reference string. The identity of these words is not important; we simply have to count them. This count will become the value b in the preceding notation. If we make a transition to a successor node along an edge labeled w, we remove the word w from the set of remaining reference words C or, if the word w is not in this set, we increase the count of words that are in the hypothesis but not in the reference.</Paragraph>
<Paragraph position="9"> To compute the number of errors on a graph, we use the auxiliary quantity Q(n, C), which is the count of the produced words that are not in the reference. We use the following dynamic programming recursion equations:</Paragraph>
<Paragraph position="10"> Q(n_0, C_0) = 0 and Q(n, C) = min_{(n', n, w) ∈ E} min{ Q(n', C + {w}), Q(n', C) + 1 }, where the first term is considered only if C + {w} is still contained in C_0 (the edge consumed a remaining reference word) and the second term only if w ∉ C (the edge produced a word that is not in the reference).</Paragraph>
<Paragraph position="11"> Here, n_0 denotes the root node of the word graph and C_0 denotes the multiset representation of the reference string. As already mentioned in Section 3.1, (n', n, w) denotes an edge from node n' to node n with word label w.</Paragraph>
<Paragraph position="12"> In the implementation, we use a bit vector to represent the set C for efficiency reasons. Note that in the worst case the size of the Q-table is exponential in the length of the reference string. However, in practice we found that in most cases the computation is quite fast.</Paragraph>
<Paragraph position="13"> The GBLEU score is a generalization of the BLEU score. It is an upper bound for the BLEU score. The computation is similar to the GPER computation. We traverse the word graph in topological order and store the following information: the counts of the matching n-grams and the length of the hypothesis, i.e. the depth in the word graph. Additionally, we need the multiset of reference n-grams that have not yet been produced.</Paragraph>
<Paragraph position="14"> To compute the BLEU score, the n-gram counts are collected over the whole test set. This results in a combinatorial problem for the computation of the GBLEU score. We process the test set sentence-wise and accumulate the n-gram counts. After each sentence, we take a greedy decision and choose the n-gram counts that, if combined with the accumulated n-gram counts, result in the largest BLEU score.</Paragraph>
<Paragraph position="15"> This gives a conservative approximation of the true GBLEU score.</Paragraph> </Section>
<Section position="3" start_page="194" end_page="194" type="sub_section"> <SectionTitle> 4.3 Word Graph Size </SectionTitle>
<Paragraph position="0"> To measure the word graph size, we use the word graph density, which we define as the number of edges in the graph divided by the source sentence length.</Paragraph> </Section> </Section> </Paper>