<?xml version="1.0" standalone="yes"?> <Paper uid="P05-1068"> <Title>Context-dependent SMT Model using Bilingual Verb-Noun Collocation</Title>
<Section position="4" start_page="0" end_page="549" type="metho">
<Paragraph position="0"> |V_e|^2 * 2^J, where J is the number of source words and |V_e| is the size of the target vocabulary. Even though the number of possible translations of the last two words is much smaller than |V_e|^2, we still need further improvement. The main concern is the exponential explosion of the possible configurations of source words covered by a hypothesis. In order to reduce the number of possible configurations of source words, decoding algorithms based on A* search as well as the beam search algorithm have been proposed (Koehn et al., 2004; Och et al., 2001), which use heuristics for pruning implausible hypotheses.</Paragraph>
<Paragraph position="1"> Our approach to this problem examines the possibility of utilizing context information in a given language pair. Under a given target context, the corresponding source word of a given target word is almost deterministic. Conversely, if a translation pair is given, then the related target or source context is predictable. This implies that if we consider bilingual context information during decoding, we can reduce the computational complexity of the hypothesis search; specifically, we can reduce the possible configurations of source words as well as the number of possible target translations.</Paragraph>
<Paragraph position="2"> In this study, we present a statistical machine translation model as an alternative to the classical IBM-style model. The model is tightly coupled with the target language model and utilizes bilingual context information. It is designed not only to reduce the hypothesis search space by decreasing the translation ambiguities but also to improve translation performance. It works through reciprocal incorporation of source and target context: source words are determined by the context of previous and corresponding target words, and the next target words are predicted by the current translation pair. Accordingly, we do not need to consider any separate distortion model or language model as is the case with IBM-style models.</Paragraph>
<Paragraph position="3"> Under this framework, we propose a chunk-based translation model for more grammatical, fluent and accurate output. In order to alleviate the data sparseness problem in chunk-based translation, we use a stepwise back-off method in the order of a chunk, sub-parts of the chunk, and the word level.</Paragraph>
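The stepwise back-off can be pictured as a cascaded table lookup. The following is a minimal sketch, not the paper's implementation: the table names (chunk_table, head_table, tail_table, word_table) and the Chunk container are hypothetical, and real candidates would carry probabilities rather than plain strings.

    # Hypothetical sketch of the chunk -> head/tail -> word back-off order.
    from dataclasses import dataclass

    @dataclass
    class Chunk:
        surface: str   # full chunk string
        head: str      # content-word part
        tail: str      # functional-word part ("" if absent; "NUL" in the paper)
        words: list    # individual morphemes

    def translate_chunk(chunk, chunk_table, head_table, tail_table, word_table):
        """Candidate translations for one source chunk, backing off from the
        chunk level to the head/tail level to the word (morpheme) level."""
        if chunk.surface in chunk_table:                      # 1) chunk level
            return chunk_table[chunk.surface]
        head_cands = head_table.get(chunk.head)               # 2) head/tail level
        tail_cands = tail_table.get(chunk.tail, [""])
        if head_cands:
            return [h + t for h in head_cands for t in tail_cands]
        cands = [""]                                          # 3) word level
        for w in chunk.words:
            cands = [c + t for c in cands for t in word_table.get(w, [w])]
        return cands

The only point of the cascade is that a lower level is consulted when, and only for the part where, the higher level has no candidates.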
<Paragraph position="4"> Moreover, to deal with long-distance dependencies, we utilize verb-noun collocations that are automatically extracted by using chunk alignment and a monolingual dependency parser.</Paragraph>
<Paragraph position="5"> As a case study, we developed a Japanese-to-Korean translation model and performed experiments on the BTEC corpus.</Paragraph>
</Section>
<Section position="7" start_page="549" end_page="549" type="metho"> <SectionTitle> 2 Overview of Translation Model </SectionTitle>
<Paragraph position="0"> The goal of machine translation is to transfer the meaning of a source language sentence, f, into a target language sentence, e. In most types of statistical machine translation, the conditional probability Pr(e|f) is used to describe the correspondence between the two sentences. This model is used directly for translation by solving the following maximization problem:

  e* = argmax_e Pr(e|f) = argmax_e Pr(e,f) / Pr(f)   (3)

Since the source language sentence is given and the Pr(f) probability is applied to all possible corresponding target sentences, we can ignore the denominator in equation (3). As a result, the joint probability model can be used to describe the correspondence between the two sentences. We apply Markov chain rules to the joint probability model and obtain the following decomposed model:

  Pr(e,f) ≈ Π_{i=1..I} Pr(e_i | e_{i-1}, f_{a_{i-1}}) Pr(f_{a_i} | e_i, e_{i-1})   (4)

where a_i is the index of the source word that is aligned to the word e_i under the assumption of a fixed one-to-one alignment. In this model, we have two probabilities: the source word prediction probability under a given target language context, Pr(f_{a_i} | e_i, e_{i-1}), and the target word prediction probability, Pr(e_i | e_{i-1}, f_{a_{i-1}}). The target word prediction probability is used for selecting the target word that follows the previous target words. In order to make this more deterministic, we use bilingual context, i.e. the translation pair of the preceding target word. For a given target word, the corresponding source word is predicted by the source word prediction probability based on the current and preceding target words.</Paragraph>
<Paragraph position="1"> Since a target and a source word are predicted through reciprocal incorporation of source and target context from the beginning of a target sentence, the word order in the target sentence is determined automatically and the number of possible configurations of source words is decreased. Thus, we do not need to perform any computation for word re-ordering. Moreover, since correspondences are chosen on the basis of bilingual contextual evidence, translation ambiguities can be decreased. As a result, the proposed model is expected to reduce computational complexity during decoding as well as to improve performance.</Paragraph>
<Paragraph position="2"> Furthermore, since a word-based translation approach is often incapable of handling complicated expressions such as idiomatic expressions or complicated verb phrases, it often outputs nonsense translations. To avoid nonsense translations and to increase explanatory power, we incorporate structural aspects of the language into a chunk-based translation model. In our model, one source chunk is translated by exactly one target chunk, i.e., one-to-one chunk alignment. Thus we obtain:

  Pr(e,f) ≈ Π_{k=1..K} Pr(e_k | e_{k-1}, f_{a_{k-1}}) Pr(f_{a_k} | e_k, e_{k-1})   (5)

where K is the number of chunks in the source and the target sentence, and e_k and f_{a_k} now denote aligned target and source chunks.</Paragraph>
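To make the decomposition concrete, the following sketch scores one hypothesis under equation (4), assuming the two conditional distributions have already been estimated and stored as plain dictionaries. The names p_target and p_source and the BOS start symbol are illustrative, not from the paper, and unseen events receive a small floor probability only to keep the example self-contained.

    import math

    def hypothesis_score(pairs, p_target, p_source):
        """Log-score of a hypothesis under the decomposed model (4).

        pairs    : list of (e_i, f_ai) translation pairs in target order
        p_target : dict (e_prev, f_prev, e_i) -> Pr(e_i | e_prev, f_prev)
        p_source : dict (e_i, e_prev, f_ai)  -> Pr(f_ai | e_i, e_prev)
        """
        floor = 1e-7                       # hypothetical smoothing for unseen events
        e_prev, f_prev = "BOS", "BOS"      # sentence-initial context
        score = 0.0
        for e_i, f_ai in pairs:
            score += math.log(p_target.get((e_prev, f_prev, e_i), floor))
            score += math.log(p_source.get((e_i, e_prev, f_ai), floor))
            e_prev, f_prev = e_i, f_ai     # the new pair becomes the context
        return score

Because each source word is predicted jointly with its target word, scoring a hypothesis fixes both the target word order and the covered source words as a side effect, which is exactly the search-space reduction argued for above.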
</Section>
<Section position="9" start_page="550" end_page="551" type="metho"> <SectionTitle> 3 Chunk-based J/K Translation Model with Back-Off </SectionTitle>
<Paragraph position="0"> With the translation framework described above, we built a chunk-based J/K translation model as a case study. Since a chunk-based translation model suffers from severe data sparseness, it is often impossible to obtain any translation for a given source chunk. In order to alleviate this problem, we apply back-off translation models while taking linguistic characteristics into consideration.</Paragraph>
<Paragraph position="1"> Japanese and Korean are a very close language pair. Both are agglutinative and inflected languages in the word formation of a bunsetsu and an eojeol. A bunsetsu/eojeol consists of two sub-parts: the head part, composed of content words, and the tail part, composed of functional words agglutinated at the end of the head part. The head part carries the meaning of a given segment, while the tail part indicates the grammatical role of the head in a given sentence.</Paragraph>
<Paragraph position="2"> Putting this linguistic knowledge to practical use, we build a head-tail based translation model as a back-off version of the chunk-based translation model. We place the following constraints on this head-tail based translation model: - The head of a given source chunk corresponds to the head of a target chunk, and the tail of the source chunk corresponds to the tail of a target chunk. If a chunk does not have a tail part, we assign NUL to the tail of the chunk. - The head of a given chunk follows the tail of the preceding chunk, and the tail follows the head of the given chunk.</Paragraph>
<Paragraph position="3"> These constraints are designed to maintain the structural consistency of a chunk. Under these constraints, the head-tail based translation replaces each chunk-level factor in equation (5) by head-level and tail-level factors:

  Pr(e_k, f_{a_k} | e_{k-1}, f_{a_{k-1}}) ≈ Pr(e_k^H | e_{k-1}^T, f_{a_{k-1}}^T) Pr(f_{a_k}^H | e_k^H, e_{k-1}^T) Pr(e_k^T | e_k^H, f_{a_k}^H) Pr(f_{a_k}^T | e_k^T, e_k^H)   (6)

where e_k^H means the head of the k-th target chunk and e_k^T means the tail of the chunk.</Paragraph>
<Paragraph position="4"> In the worst case, even the head-tail based model may fail to obtain translations. In this case, we back off to a word-based translation model, in which the constraints of the head-tail based model are not applied. The chunk-based J/K translation framework with the back-off scheme can be summarized as follows: 1. Input a dependency-parsed sentence at the chunk level. 2. Apply the chunk-based translation model to the given sentence. 3. If one of the chunks does not have any corresponding translation: - divide the failed chunk into a head and a tail part, - back off the translation into the head-tail based translation model, - and if the head or tail still does not have any corresponding translation, apply a word-based translation model to the chunk.</Paragraph>
<Paragraph position="5"> Here, the back-off model is applied only to the part that failed to get translation candidates.</Paragraph>
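The head/tail division that the back-off model relies on can be read off POS tags, since the tail is the run of functional morphemes agglutinated at the end of a bunsetsu/eojeol. The sketch below is a hypothetical illustration; the tag names are not the actual tag set of the taggers used in the paper.

    # Illustrative split of a POS-tagged chunk into a content head and a functional tail.
    FUNCTIONAL_TAGS = {"particle", "ending", "auxiliary", "copula"}

    def split_head_tail(tagged_chunk):
        """tagged_chunk: list of (morpheme, pos) pairs for one bunsetsu/eojeol.
        Returns (head, tail); tail is 'NUL' when the chunk has no functional part."""
        head, tail = [], []
        for morpheme, pos in tagged_chunk:
            (tail if pos in FUNCTIONAL_TAGS else head).append(morpheme)
        return "".join(head), "".join(tail) or "NUL"

    # Example: a Korean eojeol "hakkyo-e" (school + directional particle)
    print(split_head_tail([("hakkyo", "noun"), ("e", "particle")]))  # ('hakkyo', 'e')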
<Section position="1" start_page="551" end_page="551" type="sub_section"> <SectionTitle> 3.1 Learning Chunk-based Translation </SectionTitle>
<Paragraph position="0"> We learn chunk alignments from a corpus that has been word-aligned by a training toolkit for word-based translation models: the Giza++ toolkit (Och and Ney, 2000) for the IBM models (Brown et al., 1993). For aligning chunk pairs, we consider word (bunsetsu/eojeol) sequences to be chunks if they are in an immediate dependency relationship in a dependency tree. To identify chunks, we use a word-aligned corpus in which the source language sentences are annotated with dependency parse trees by a dependency parser (Kudo et al., 2002) and the target language sentences are annotated with POS tags by a part-of-speech tagger (Rim, 2003). If a sequence of target words is aligned with the words in a single source chunk, the target word sequence is regarded as one chunk corresponding to the given source chunk. By applying this method to the corpus, we obtain a word- and chunk-aligned corpus (see Figure 1).</Paragraph>
<Paragraph position="1"> From the aligned corpus, we directly estimate the chunk translation probabilities of equation (5), Pr(e_k | e_{k-1}, f_{a_{k-1}}) and Pr(f_{a_k} | e_k, e_{k-1}), based on relative frequencies.</Paragraph>
</Section>
<Section position="2" start_page="551" end_page="551" type="sub_section"> <SectionTitle> 3.2 Decoding </SectionTitle>
<Paragraph position="0"> For efficient decoding, we implement a multi-stack decoder and a beam search with the A* algorithm. At each search level, the beam search moves through at most the n-best translation candidates, and a multi-stack is used for partial translations according to the translation cardinality. The output sentence is generated from left to right in the form of partial translations. Initially, we get n translation candidates for each source chunk with the beam size n. Every possible translation is sorted according to its translation probability. We start decoding with the initialized beams and the initial stack S_0, the top of which holds the initial empty hypothesis. The decoding algorithm is described in Table 1 (multi-stack decoding algorithm: extend a hypothesis with a translation pair, check the head-tail consistency, mark the source segment as covered, estimate the forward and backward scores, and push the new hypothesis onto the next stack).</Paragraph>
<Paragraph position="1"> In the decoding algorithm, estimating the backward score is complicated and its computational cost becomes too high because of the context consideration. Thus, in order to simplify this problem, we assume context-independence only for the backward score estimation. The backward score is estimated from the translation probabilities and language model scores of the uncovered segments. For each uncovered segment, we select the best translation with the highest score, obtained by multiplying the translation probability of the segment by its language model score. The translation probability and language model score are computed without considering context.</Paragraph>
<Paragraph position="2"> After estimating the forward and backward scores of each partial translation on a stack, we prune the hypotheses. In pruning, we first sort the partial translations on the stack according to their scores. If the gradient of the scores falls more steeply than a given threshold at the k-th translation, we prune the translations whose scores are lower than that of the k-th one. Moreover, if the number of remaining translations is larger than N, we keep only the top N translations. As a final translation, we output the single best translation.</Paragraph>
</Section>
</Section>
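The pruning step just described (sort by score, cut at a steep drop in the score gradient, then cap the stack at N) might look as follows. This is a sketch under the assumption that each hypothesis carries a combined forward plus estimated backward score; the exact form of the gradient test is not specified in the paper.

    def prune_stack(hypotheses, gradient_threshold, max_size):
        """Prune a stack of partial translations as described in Section 3.2.

        hypotheses         : list of (score, hypothesis) pairs
        gradient_threshold : drop everything after the first steep fall between
                             adjacent scores (hypothetical reading of the test)
        max_size           : hard cap N on the number of surviving hypotheses
        """
        ranked = sorted(hypotheses, key=lambda h: h[0], reverse=True)
        cutoff = len(ranked)
        for k in range(1, len(ranked)):
            if ranked[k - 1][0] - ranked[k][0] > gradient_threshold:
                cutoff = k                     # keep only hypotheses above the steep drop
                break
        return ranked[:min(cutoff, max_size)]  # additionally enforce the top-N cap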
<Section position="12" start_page="551" end_page="551" type="metho"> <SectionTitle> 4 Resolving Long-distance Dependency </SectionTitle>
<Paragraph position="0"> Since most current translation models take only the local context into account, they cannot account for long-distance dependencies. This often causes syntactically or semantically incorrect translations to be output. In this section, we describe how this problem can be solved. For handling the long-distance dependency problem, we utilize bilingual verb-noun collocations that are automatically acquired from the chunk-aligned bilingual corpora.</Paragraph>
<Section position="1" start_page="551" end_page="551" type="sub_section"> <SectionTitle> 4.1 Automatic Extraction of Bilingual Verb-Noun Collocations (BiVN) </SectionTitle>
<Paragraph position="0"> To automatically extract the bilingual verb-noun collocations, we utilize a monolingual dependency parser and the chunk alignment result. The basic concept is the same as that used in (Hwang et al., 2004): bilingual dependency parses are obtained by sharing the dependency relations of a monolingual dependency parser among the aligned chunks. Then bilingual verb sub-categorization patterns are acquired by navigating the bilingual dependency trees. A verb sub-categorization is the collocation of a verb and all of its argument/adjunct nouns, i.e. a verb-noun collocation (see Figure 1).</Paragraph>
<Paragraph position="1"> To acquire more reliable and general knowledge, we apply the following filtering method with a statistical χ² test and a unification operation: - Step 1. Select the reliable translation correspondences from all of the alignment pairs by a χ² test at a probability level of α1. - Step 2. Select reliable bilingual verb-noun collocations (BiVN) by a unification operation and a χ² test at a probability level of α2: a verb translation pair and a noun translation pair are unified into a bilingual verb-noun collocation if both of them are reliable pairs filtered in step 1 and they share the same verb pair.</Paragraph>
</Section>
<Section position="2" start_page="551" end_page="551" type="sub_section"> <SectionTitle> 4.2 Application of BiVN </SectionTitle>
<Paragraph position="0"> The acquired BiVN is used to evaluate the bilingual correspondence of a verb-noun pair whose members depend on each other and to select the correct translation. It can be applied to any verb-noun pair regardless of the distance between them in a sentence. Moreover, since the verb-noun relation in BiVN is bilingual knowledge, the senses of the corresponding verb and noun can be almost completely disambiguated by each other.</Paragraph>
<Paragraph position="1"> In our translation system, we apply BiVN during decoding as follows: 1. Pivot verbs and their dependents in a given dependency-parsed source sentence. 2. When extending a hypothesis, if one of the pivoted verb and noun pairs is covered and its corresponding translation pair is in BiVN, give a positive weight λ (> 1) to the hypothesis.</Paragraph>
</Section>
</Section>
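Step 1 is a standard independence test on co-occurrence counts. As an illustration, the following computes the 2x2 χ² statistic for an aligned (source, target) pair and keeps the pairs above the critical value; the function names and the use of the 2x2 formulation are assumptions, since the paper only states that a χ² test at level α1 is applied.

    def chi_square_2x2(pair_count, src_count, tgt_count, total):
        """Standard 2x2 chi-square statistic for an aligned (source, target) pair.

        pair_count : co-occurrence count of the pair in the aligned corpus
        src_count  : count of the source item; tgt_count: count of the target item
        total      : total number of aligned pairs
        """
        a = pair_count
        b = src_count - pair_count
        c = tgt_count - pair_count
        d = total - src_count - tgt_count + pair_count
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        if denom == 0:
            return 0.0
        return total * (a * d - b * c) ** 2 / denom

    # Keep only pairs whose statistic exceeds the critical value for the chosen level,
    # e.g. 3.84 for a 0.05 level with one degree of freedom.
    def reliable(pair_stats, critical_value=3.84):
        return [pair for pair, stat in pair_stats if stat > critical_value]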
<Section position="16" start_page="551" end_page="554" type="metho"> <SectionTitle> 5 Experiments </SectionTitle> <Paragraph position="0"/>
<Section position="1" start_page="551" end_page="551" type="sub_section"> <SectionTitle> 5.1 Corpus </SectionTitle>
<Paragraph position="0"> The corpus for the experiments was extracted from the Basic Travel Expression Corpus (BTEC), a collection of conversational travel phrases for Japanese and Korean (see Table 2). The entire corpus was split into two parts: 162,320 parallel sentences for training and 10,150 sentences for testing. The Japanese sentences were automatically dependency-parsed by CaboCha (Kudo et al., 2002) and the Korean sentences were automatically POS-tagged by KUTagger (Rim, 2003).</Paragraph>
</Section>
<Section position="2" start_page="551" end_page="554" type="sub_section"> <SectionTitle> 5.2 Translation Systems </SectionTitle>
<Paragraph position="0"> Four translation systems were implemented for evaluation: 1) a word-based IBM-style SMT system (WBIBM), 2) a chunk-based IBM-style SMT system (CBIBM), 3) a word-based LM tightly coupled SMT system (WBLMC), and 4) a chunk-based LM tightly coupled SMT system (CBLMC). To examine the effect of BiVN, BiVN was optionally used in each system.</Paragraph>
<Paragraph position="1"> The word-based IBM-style (WBIBM) system consisted of a word translation model and a bi-gram language model. The bi-gram language model was generated by using the CMU LM toolkit (Clarkson et al., 1997). Instead of using a fertility model, we allowed a multi-word target for a given source word if it was aligned with more than one word. We did not use any distortion model for word re-ordering, and we combined the language model and the translation model in a log-linear model. For decoding, we used a multi-stack decoder based on the A* algorithm, which is almost the same as that described in Section 3. The difference is the use of the language model for controlling the generation of target translations.</Paragraph>
<Paragraph position="2"> The chunk-based IBM-style (CBIBM) system consisted of a chunk translation model and a bi-gram language model. To alleviate the data sparseness problem of the chunk translation model, we applied the back-off method at the head-tail or morpheme level. The remaining conditions are the same as those for WBIBM.</Paragraph>
<Paragraph position="3"> The word-based LM tightly coupled (WBLMC) system was implemented for comparison with the chunk-based systems. Except for setting the translation unit to a morpheme, the conditions are the same as those for the proposed chunk-based translation system.</Paragraph>
<Paragraph position="4"> The chunk-based LM tightly coupled (CBLMC) system is the proposed translation system. A bi-gram language model was used for estimating the backward score.</Paragraph>
</Section>
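The baseline systems combine the translation model and the bi-gram language model log-linearly. A minimal sketch of that scoring is given below; the weights and the add-one smoothing are placeholders, since the paper reports neither (the bi-gram model itself was built with the CMU LM toolkit).

    import math

    def loglinear_score(tm_prob, lm_prob, tm_weight=1.0, lm_weight=1.0):
        """Log-linear combination of translation-model and language-model scores,
        as used by the baseline WBIBM/CBIBM systems; the weights are hypothetical."""
        return tm_weight * math.log(tm_prob) + lm_weight * math.log(lm_prob)

    def bigram_lm_prob(sentence, bigram_counts, unigram_counts, vocab_size):
        """Bi-gram probability of a target sentence with simple add-one smoothing
        (only illustrative; not the toolkit's actual smoothing)."""
        prob = 1.0
        prev = "BOS"
        for w in sentence:
            prob *= (bigram_counts.get((prev, w), 0) + 1) / (unigram_counts.get(prev, 0) + vocab_size)
            prev = w
        return prob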
<Section position="3" start_page="554" end_page="554" type="sub_section"> <SectionTitle> 5.3 Evaluation </SectionTitle>
<Paragraph position="0"> Translation evaluations were carried out on 510 sentences selected randomly from the test set. The metrics for the evaluations are as follows: PER (position-independent word error rate), which penalizes errors without considering word position (Niessen et al., 2000); mWER (multi-reference word error rate), which is based on the minimum edit distance between the target sentence and the sentences in the reference set (Niessen et al., 2000); BLEU, which is based on the n-gram precision of the translation results against the reference translations, with a penalty for overly short sentences (Papineni et al., 2001); and NIST, a weighted n-gram precision combined with a penalty for overly short sentences. For this evaluation, we made 10 multiple references available and computed all of the above criteria with respect to these multiple references.</Paragraph>
</Section>
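For reference, mWER can be computed as the word-level edit distance against the closest of the multiple references. The sketch below assumes normalization by the length of that closest reference; normalization conventions vary and the paper does not spell out which one was used.

    def edit_distance(hyp, ref):
        """Word-level Levenshtein distance between two token lists."""
        m, n = len(hyp), len(ref)
        d = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            d[i][0] = i
        for j in range(n + 1):
            d[0][j] = j
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0 if hyp[i - 1] == ref[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,        # deletion
                              d[i][j - 1] + 1,        # insertion
                              d[i - 1][j - 1] + cost) # substitution
        return d[m][n]

    def mwer(hypothesis, references):
        """Multi-reference WER: minimum edit distance over the reference set,
        normalized by the length of the closest reference."""
        best = min(references, key=lambda r: edit_distance(hypothesis, r))
        return edit_distance(hypothesis, best) / max(len(best), 1)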
<Section position="4" start_page="554" end_page="555" type="sub_section"> <SectionTitle> 5.4 Analysis and Discussion </SectionTitle>
<Paragraph position="0"> Table 3 shows the performance evaluation of each system. CBLMC outperformed CBIBM on all evaluation criteria. WBLMC showed much better performance than WBIBM on most of the evaluation criteria except for the BLEU score. An interesting point is that the performance of WBLMC is close to that of CBIBM in PER and mWER. The BLEU score of WBLMC is lower than that of CBIBM, but the NIST score of WBLMC is much better than that of CBIBM.</Paragraph>
<Paragraph position="1"> The reason the proposed model performed better than the IBM-style models is that the use of contextual information in CBLMC and WBLMC enabled the systems to reduce the translation ambiguities, which not only reduced the computational complexity during decoding but also made the translation more accurate and deterministic. In addition, the chunk-based translation systems outperformed the word-based systems. This is also strong evidence of the advantage of contextual information.</Paragraph>
<Paragraph position="2"> To evaluate the effectiveness of bilingual verb-noun collocations, we used the BiVN filtered with α1 = 0.05 and α2 = 0.1, for which the coverage is 64.86% on the test set and the average ambiguity is 2.99. We suffered a slight loss in speed by using the BiVN (see Table 4), but we could improve performance in all of the translation systems (see Table 3). In particular, the performance improvement of CBIBM with BiVN was remarkable. This is a positive sign that the BiVN is useful for handling the problem of long-distance dependency. From this result, we believe that if we increased the coverage and accuracy of BiVN, we could improve the performance much more.</Paragraph>
<Paragraph position="3"> Table 4 shows the translation speed of each system. For the evaluation of processing time, we used the same machine, with a Xeon 2.8 GHz CPU and 4 GB of memory, and measured the time of the best-performing configuration of each system. The chunk-based translation systems are much faster than the word-based systems, perhaps because the translation ambiguities of the chunk-based models are lower than those of the word-based models. However, the processing speed of the IBM-style models is higher than that of the proposed model. This tendency can be analyzed from two viewpoints: the decoding algorithm and the DB system used for parameter retrieval. Theoretically, the computational complexity of the proposed model is lower than that of the IBM models, and the use of a sorting and pruning algorithm for partial translations provides shorter search times in all systems. However, since the number of parameters of the proposed model is much larger than that of the IBM-style models, it takes a longer time to retrieve parameters. To decrease the processing time, we need to construct a more efficient DB system.</Paragraph>
</Section>
</Section>
</Paper>