<?xml version="1.0" standalone="yes"?> <Paper uid="W06-2606"> <Title>Reranking Translation Hypotheses Using Structural Properties</Title> <Section position="4" start_page="41" end_page="43" type="metho"> <SectionTitle> 3 Framework </SectionTitle> <Paragraph position="0"> In the following sections, we review the theoretical framework of statistical machine translation using a direct approach. We then introduce supertagging and lightweight dependency analysis, link grammars, and a maximum-entropy based chunking technique.</Paragraph> <Section position="1" start_page="41" end_page="41" type="sub_section"> <SectionTitle> 3.1 Direct approach to SMT </SectionTitle> <Paragraph position="0"> In statistical machine translation, the best translation $\hat{e}_1^I$ of a source sentence $f_1^J$ is obtained by maximizing the posterior probability \[ \hat{e}_1^I = \operatorname*{argmax}_{e_1^I} \Pr(e_1^I \mid f_1^J) = \operatorname*{argmax}_{e_1^I} \Pr(f_1^J \mid e_1^I) \cdot \Pr(e_1^I) \tag{1} \]</Paragraph> <Paragraph position="2"> using Bayes decision rule. The first probability on the right-hand side of the equation denotes the translation model, whereas the second is the target language model.</Paragraph> <Paragraph position="3"> An alternative to this classical source-channel approach is the direct modeling of the posterior probability $\Pr(e_1^I \mid f_1^J)$, which is used here. Using a log-linear model (Och and Ney, 2002), we obtain \[ \Pr(e_1^I \mid f_1^J) = \frac{\exp\left(\sum_{m=1}^{M} \lambda_m h_m(e_1^I, f_1^J)\right)}{\sum_{{e'}_1^{I'}} \exp\left(\sum_{m=1}^{M} \lambda_m h_m({e'}_1^{I'}, f_1^J)\right)} \tag{2} \] where $\lambda_m$ are the scaling factors of the models denoted by feature functions $h_m(\cdot)$. The denominator represents a normalization factor that depends only on the source sentence $f_1^J$. Therefore, we can omit it during the search process, leading to the following decision rule: \[ \hat{e}_1^I = \operatorname*{argmax}_{e_1^I} \sum_{m=1}^{M} \lambda_m h_m(e_1^I, f_1^J) \tag{3} \]</Paragraph> <Paragraph position="5"> This approach is a generalization of the source-channel approach. It has the advantage that additional models $h(\cdot)$ can be easily integrated into the overall system. The model scaling factors $\lambda_1^M$ are trained according to the maximum entropy principle, e.g., using the GIS algorithm. Alternatively, one can train them with respect to the final translation quality measured by an error criterion (Och, 2003).
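The decision rule above amounts to picking, from a set of candidate translations, the one with the largest weighted sum of feature scores. The following Python sketch illustrates this for a small n-best list. The feature names (tm, lm, linkage), their values and the weights are hypothetical and chosen only for illustration; they are not taken from the system described here.

```python
def loglinear_score(features, weights):
    """Weighted sum of feature scores for one hypothesis."""
    return sum(weights[name] * value for name, value in features.items())


def rerank(nbest, weights):
    """Return the hypothesis with the highest log-linear score."""
    return max(nbest, key=lambda hyp: loglinear_score(hyp["features"], weights))


# Hypothetical two-entry n-best list; all numbers are made up.
nbest = [
    {"text": "how much is it to the",
     "features": {"tm": -4.0, "lm": -6.0, "linkage": 0.0}},
    {"text": "how much is it to the city centre",
     "features": {"tm": -4.5, "lm": -6.2, "linkage": 1.0}},
]
weights = {"tm": 1.0, "lm": 0.5, "linkage": 2.0}
best = rerank(nbest, weights)
```

Setting the weight of a feature to zero removes its influence, which makes it easy to compare rerankings with and without an additional model.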
For the results reported in this paper, we optimized the scaling factors with respect to a linear interpolation of word error rate (WER), position-independent word error rate (PER), BLEU and NIST score using the Downhill Simplex algorithm (Press et al., 2002).</Paragraph> </Section> <Section position="2" start_page="41" end_page="42" type="sub_section"> <SectionTitle> 3.2 Supertagging/LDA </SectionTitle> <Paragraph position="0"> Supertagging (Bangalore and Joshi, 1999) uses the Lexicalized Tree Adjoining Grammar formalism (LTAG) (XTAG Research Group, 2001). Tree Adjoining Grammars incorporate a tree-rewriting formalism using elementary trees that can be combined by two operations, namely substitution and adjunction, to derive more complex tree structures of the sentence considered. Lexicalization allows us to associate each elementary tree with a lexical item called the anchor. In LTAGs, every elementary tree has such a lexical anchor, also called head word. It is possible that there is more than one elementary structure associated with a lexical item, as e.g. for the case of verbs with different subcategorization frames.</Paragraph> <Paragraph position="1"> The elementary structures, called initial and auxiliary trees, hold all dependent elements within the same structure, thus imposing constraints on the lexical anchors in a local context. Basically, supertagging is very similar to part-of-speech tagging. Instead of POS tags, richer descriptions, namely the elementary structures of LTAGs, are annotated to the words of a sentence. For this purpose, they are called supertags in order to distinguish them from ordinary POS tags. The result is an &quot;almost parse&quot; because of the dependencies coded within the supertags. Usually, a lexical item can have many supertags, depending on the various contexts it appears in. Therefore, the local ambiguity is larger than for the case of POS tags.
An LTAG parser for this scenario can be very slow, i.e.</Paragraph> <Paragraph position="2"> its computational complexity is in $O(n^6)$, because of the large number of supertags, i.e. elementary trees, that have to be examined during a parse. In order to speed up the parsing process, we can apply n-gram models on a supertag basis in order to filter out incompatible descriptions and thus improve the performance of the parser. In (Bangalore and Joshi, 1999), a trigram supertagger with smoothing and back-off is reported that achieves an accuracy of 92.2% when trained on one million running words.</Paragraph> <Paragraph position="3"> There is another aspect to the dependencies coded in the elementary structures. We can use them to actually derive a shallow parse of the sentence in linear time. The procedure is presented in (Bangalore, 2000) and is called lightweight dependency analysis. The concept is comparable to chunking. The lightweight dependency analyzer (LDA) finds the arguments for the encoded dependency requirements. There exist two types of slots that can be filled. On the one hand, nodes marked for substitution (in α-trees) have to be filled by the complements of the lexical anchor. On the other hand, the foot nodes (i.e. nodes marked for adjunction in β-trees) take words that are being modified by the supertag. Figure 1 shows a tree derived by LDA on the sentence the food was very delicious from the C-Star'03 corpus (cf. Section 4.1).</Paragraph> <Paragraph position="4"> The supertagging and LDA tools are available from the XTAG research group website. As features considered for the reranking experiments we choose:</Paragraph> <Paragraph position="6"> * Supertagger output: directly use the log-likelihoods as feature score. This did not improve performance significantly, so the model was discarded from the final system.</Paragraph> <Paragraph position="7"> * LDA output: - dependency coverage: determine the number of covered elements, i.e.
where the dependency slots are filled to the left and right - separate features for the number of modifiers and complements determined by the LDA</Paragraph> </Section> <Section position="3" start_page="42" end_page="43" type="sub_section"> <SectionTitle> 3.3 Link grammar </SectionTitle> <Paragraph position="0"> Similar to the ideas presented in the previous section, link grammars also explicitly code dependencies between words (Sleator and Temperley, 1993). These dependencies are called links, which reflect the local requirements of each word. Several constraints have to be satisfied within the link grammar formalism to derive correct linkages, i.e.</Paragraph> <Paragraph position="1"> sets of links, of a sequence of words: 1. Planarity: links are not allowed to cross each other. 2. Connectivity: links suffice to connect all words of a sentence. 3. Satisfaction: linking requirements of each word are satisfied. An example of a valid linkage is shown in Figure 2. The link grammar parser that we use is freely available from the authors' website. Similar to LTAG, the link grammar formalism is lexicalized, which allows for enhancing the methods with probabilistic n-gram models (as is also the case for supertagging). In (Lafferty et al., 1992), the link grammar is used to derive a new class of</Paragraph> <Paragraph position="2"> language models that, in comparison to traditional n-gram LMs, incorporate capabilities for expressing long-range dependencies between words. The link grammar dictionary that specifies the words and their corresponding valid links currently holds approximately 60000 entries and handles a wide variety of phenomena in English. It is derived from newspaper texts. [Figure caption fragment: ... to the opening bracket denotes the type of chunk, whereas the corresponding POS tag is given after the word.]</Paragraph> <Paragraph position="3"> Within our reranking framework, we use link grammar features that express a possible well-formedness of the translation hypothesis.
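The planarity and connectivity constraints listed above can be checked mechanically once a set of links is given. The following Python sketch is an illustration only: the links for the example sentence are hypothetical (they are not the linkage shown in Figure 2), word positions are zero-based, and the satisfaction constraint is not modeled because it depends on the link grammar dictionary.

```python
from itertools import combinations


def is_planar(links):
    """True if no two links cross when drawn above the sentence."""
    spans = sorted(tuple(sorted(link)) for link in links)
    for (a, b), (c, d) in combinations(spans, 2):
        # spans are sorted, so c is never left of a; the links cross if
        # c lies strictly between a and b while d lies to the right of b
        if c in range(a + 1, b) and d not in range(a, b + 1):
            return False
    return True


def is_connected(n_words, links):
    """True if the links join all n_words positions into one component."""
    adjacency = {i: set() for i in range(n_words)}
    for a, b in links:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, stack = {0}, [0]
    while stack:
        for nxt in adjacency[stack.pop()] - seen:
            seen.add(nxt)
            stack.append(nxt)
    return len(seen) == n_words


words = "the food was very delicious".split()
# Hypothetical links: the-food, food-was, was-delicious, very-delicious.
links = [(0, 1), (1, 2), (2, 4), (3, 4)]
complete = is_planar(links) and is_connected(len(words), links)
```

A boolean such as complete corresponds to the kind of yes/no information a binary linkage feature carries, although a real parser also has to verify the linking requirements of every word.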
The simplest feature is a binary one stating whether the link grammar parser could derive a complete linkage or not, which should be a strong indicator of a syntactically correct sentence. Additionally, we added a normalized cost of the matching process, which turned out not to be very helpful for rescoring, so it was discarded.</Paragraph> </Section> <Section position="4" start_page="43" end_page="43" type="sub_section"> <SectionTitle> 3.4 ME chunking </SectionTitle> <Paragraph position="0"> Like the methods described in the two preceding sections, text chunking consists of dividing a text into syntactically correlated non-overlapping groups of words. Figure 3 again shows our example sentence illustrating this task. Chunks are represented as groups of words between square brackets. We employ the 11 chunk types as defined for the CoNLL-2000 shared task (Tjong Kim Sang and Buchholz, 2000).</Paragraph> <Paragraph position="1"> For the experiments, we apply a maximum-entropy based tagger which has been successfully evaluated on natural language understanding and named entity recognition (Bender et al., 2003).</Paragraph> <Paragraph position="2"> Within this tool, we directly factorize the posterior probability and determine the corresponding chunk tag for each word of an input sequence. We assume that the decisions depend only on a lim-</Paragraph> <Paragraph position="4"/> <Paragraph position="6"> where the step from Eq. 4 to 5 reflects our model assumptions.</Paragraph> <Paragraph position="7"> Furthermore, we have implemented a set of binary-valued feature functions for our system, including lexical, word and transition features, prior features, and compound features, cf. (Bender et al., 2003). We run simple count-based feature reduction and train the model parameters using the Generalized Iterative Scaling (GIS) algorithm (Darroch and Ratcliff, 1972). In practice, the training procedure tends to result in an overfitted model.
To avoid this, a smoothing method is applied where a Gaussian prior on the parameters is assumed (Chen and Rosenfeld, 1999).</Paragraph> <Paragraph position="8"> Within our reranking framework, we first use the ME-based tagger to produce the POS and chunk sequences for the different n-best list hypotheses. Given several n-gram models trained on the WSJ corpus for both POS and chunk sequences, we then rescore the n-best hypotheses and simply use the log-probabilities as additional features. In order to adapt our system to the characteristics of the data used, we build POS and chunk n-gram models on the training part of the corpus. The scores of these domain-specific models are also added to the n-best lists as features. The ME chunking approach does not model explicit syntactic linkages of words. Instead, it incorporates a statistical framework to exploit valid and syntactically coherent groups of words by additionally looking at the word classes.</Paragraph> </Section> </Section> <Section position="5" start_page="43" end_page="46" type="metho"> <SectionTitle> 4 Experiments </SectionTitle> <Paragraph position="0"> For the experiments, we use the translation system described in (Zens et al., 2005). Our phrase-based decoder uses several models during search that are interpolated in a log-linear way (as expressed in Eq. 3), such as phrase-based translation models, word-based lexicon models, a language model, a deletion model, a simple reordering model, and word and phrase penalties. A word graph containing the most likely translation hypotheses is generated during the search process. Out of this compact representation, we extract n-best lists as described in (Zens and Ney, 2005). These n-best lists serve as a starting point for our experiments.
The methods presented in Section 3 produce scores that are used as additional features for the n-best lists.</Paragraph> <Section position="1" start_page="44" end_page="44" type="sub_section"> <SectionTitle> 4.1 Corpora </SectionTitle> <Paragraph position="0"> The experiments are carried out on a subset of the Basic Travel Expression Corpus (BTEC) (Takezawa et al., 2002), as it is used for the supplied data track condition of the IWSLT evaluation campaign. BTEC is a multilingual speech corpus which contains tourism-related sentences similar to those that are found in phrase books. For the supplied data track, the training corpus contains 20000 sentences. Two test sets, C-Star'03 and IWSLT'04, are available for the language pairs Arabic-English, Chinese-English and Japanese-English. The corpus statistics are shown in Table 1. The average source sentence length is between seven and eight words for all languages. So the task is rather limited and very domain-specific. The advantage is that many different reranking experiments with varying feature function settings can be carried out easily and quickly in order to analyze the effects of the different models.</Paragraph> <Paragraph position="1"> In the following, we use the C-Star'03 set for development and tuning of the system's parameters. After that, the IWSLT'04 set is used as a blind test set in order to measure the performance of the models.</Paragraph> </Section> <Section position="2" start_page="44" end_page="45" type="sub_section"> <SectionTitle> 4.2 Rescoring experiments </SectionTitle> <Paragraph position="0"> The use of n-best lists in machine translation has several advantages. It alleviates the effects of the huge search space which is represented in word graphs by using a compact excerpt of the n best hypotheses generated by the system. Especially for limited domain tasks, the size of the n-best list can be rather small but still yield good oracle error rates.
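The oracle error rate of an n-best list is the error rate of its best hypothesis with respect to the reference. As a rough illustration, the following Python sketch computes the word error rate (WER) as word-level edit distance divided by reference length, and the oracle WER as the minimum over the list. The hypotheses are made up for this example, and the single-reference setup is a simplifying assumption.

```python
def edit_distance(hyp, ref):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # drop a hypothesis word
                            curr[j - 1] + 1,      # insert a reference word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def wer(hyp, ref):
    """Edit distance normalized by the reference length."""
    ref_tokens = ref.split()
    return edit_distance(hyp.split(), ref_tokens) / len(ref_tokens)


def oracle_wer(nbest, ref):
    """Lowest WER achievable by any hypothesis in the n-best list."""
    return min(wer(hyp, ref) for hyp in nbest)


ref = "how much is it to the city centre"
# Illustrative n-best list, ordered by (hypothetical) system score.
nbest = [
    "how much is it to the",
    "how much is it to the local call",
    "how much is it to the city",
]
single_best_wer = wer(nbest[0], ref)
best_possible_wer = oracle_wer(nbest, ref)
```

In this toy case the oracle WER is half the single-best WER, so a perfect reranker could at most halve the error on this sentence.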
Empirically, n-best lists should have an appropriate size such that the oracle error rate, i.e.</Paragraph> <Paragraph position="1"> the error rate of the best hypothesis with respect to an error measure (such as WER or PER), is approximately half the baseline error rate of the system. N-best lists are suitable for easily applying several rescoring techniques since the hypotheses are already fully generated. In comparison, word graph rescoring techniques need specialized tools which can traverse the graph accordingly. Since a node within a word graph allows for many histories, one can only apply local rescoring techniques, whereas for n-best lists, techniques can be used that consider properties of the whole sentence.</Paragraph> <Paragraph position="2"> For the Chinese-English and Arabic-English tasks, we set the n-best list size to n = 1500. For Japanese-English, n = 1000 produces oracle error rates that are deemed to be sufficiently low, namely 17.7% and 14.8% for WER and PER, respectively. The single-best output for Japanese-English has a word error rate of 33.3% and a position-independent word error rate of 25.9%.</Paragraph> <Paragraph position="3"> For the experiments, we add additional features to the initial models of our decoder that have been shown to be particularly useful in the past, such as IBM model 1 score, a clustered language model score and a word penalty that prevents the hypotheses from becoming too short. A detailed definition of these additional features is given in (Zens et al., 2005). [Table fragment: translation examples for the ... English test set (IWSLT'04): baseline system (BASE) vs. rescored hypotheses (RESC) and reference translation (REFE). REFE: Lenny, she has not come in? BASE: How much is it to the? RESC: How much is it to the local call? REFE: How much is it to the city centre?] Thus, the baseline we start with is</Paragraph> <Paragraph position="4"> already a very strong one. The log-linear interpolation weights $\lambda_m$ from Eq.
3 are directly optimized using the Downhill Simplex algorithm on a linear combination of WER (word error rate), PER (position-independent word error rate), NIST and BLEU score.</Paragraph> <Paragraph position="5"> In Table 2, we show the effect of adding the presented features successively to the baseline.</Paragraph> <Paragraph position="6"> Separate entries for experiments using supertagging/LDA and link grammars show that a combination of these syntactic approaches always yields some gain in translation quality (regarding BLEU score). The performance of the maximum-entropy based chunking is comparable. A combination of all three models still yields a small improvement. Table 3 shows some examples for the Chinese-English test set. The rescored translations are syntactically coherent, though semantic correctness cannot be guaranteed. On the test data, we achieve an overall improvement of 0.7%, 0.5% and 0.3% in BLEU score for Chinese-English, Japanese-English and Arabic-English, respectively (cf. Tables 4 and 5).</Paragraph> </Section> <Section position="3" start_page="45" end_page="46" type="sub_section"> <SectionTitle> 4.3 Discussion </SectionTitle> <Paragraph position="0"> From the tables, it can be seen that the use of syntactically motivated feature functions within a reranking concept helps to slightly reduce the number of translation errors of the overall translation system. Although the improvement on the IWSLT'04 set is only moderate, the results are nevertheless comparable to or better than those from (Och et al., 2004), where, starting from an IBM model 1 baseline, an additional improvement of only 0.4% BLEU was achieved using more complex methods.</Paragraph> <Paragraph position="1"> For the maximum-entropy based chunking approach, n-grams with n = 4 work best for the chunker that is trained on WSJ data.
The domain-specific rescoring model which results from the chunker being trained on the BTEC corpora turns out to prefer higher-order n-grams, with n = 6 or more. This might be an indicator of the domain-specific rescoring model successfully capturing more local context.</Paragraph> <Paragraph position="2"> The training of the other models, i.e. supertagging/LDA and link grammar, is also performed on</Paragraph> <Paragraph position="3"> out-of-domain data. Thus, further improvements should be possible if the models were adapted to the BTEC domain. This would require the preparation of an annotated corpus for the supertagger and a specialized link grammar, which are both time-consuming tasks. [Table caption fragment: ... (development set) and IWSLT'04 (test set).]</Paragraph> <Paragraph position="4"> The syntactically motivated methods (supertagging/LDA and link grammars) perform similarly to the maximum-entropy based chunker. It seems that both approaches successfully exploit structural properties of language. However, one outlier is ME chunking on the Chinese-English test data, where we observe a lower BLEU but a larger NIST score. For Arabic-English, the combination of all methods does not seem to generalize well on the test set. In that case, supertagging/LDA and link grammar outperform the ME chunker: the overall improvement is 1% absolute in terms of BLEU score.</Paragraph> </Section> </Section> </Paper>