<?xml version="1.0" standalone="yes"?> <Paper uid="P03-1041"> <Title>Effective Phrase Translation Extraction from Alignment Models</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Motivation </SectionTitle> <Paragraph position="0"> Alignment models associate words and their translations at the sentence level, creating a translation lexicon across the language pair. For each sentence pair, the model also presents the most likely association between each source and target word, forming an alignment map for each sentence pair in the training corpus. The most likely alignment pattern between a source and target sentence under the trained alignment model will be referred to as the maximum approximation, which under the HMM alignment model (Vogel et al., 1996) corresponds to the Viterbi path. A set of words in the source sentence associated with a set of words in the target sentence is considered a phrasal pair and forms a partition within the alignment map. Figure 1 shows a source and target sentence pair with points indicating alignment points.</Paragraph> <Paragraph position="1"> A phrasal translation pair within a sentence pair can be represented as the 4-tuple hypothesis $h = (i, l_s, j, l_t)$, representing an index $(i, j)$ and length $(l_s, l_t)$ within the source and the target sentence of the sentence pair $s$, respectively. The phrasal extraction task involves selecting phrasal hypotheses based on the alignment</Paragraph> <Paragraph position="2"> model (both the translation lexicon and the maximum approximation). The maximum approximation captures context at the sentence level, while the lexicon provides a corpus level translation estimate, motivating the alignment model as a starting point for phrasal extraction.
The extraction technique must be able to handle alignments that are only partially correct, as well as cases where sentence pairs have been incorrectly matched as parallel translations within the corpus. Accommodating a noisy corpus is an increasingly important component of the translation process, especially for languages where no manually aligned parallel corpus is available.</Paragraph> <Paragraph position="3"> Building a phrasal lexicon involves Generation, Scoring, and Pruning steps, corresponding to generating a set of candidate translation pairs, scoring them based on the translation model, and pruning them to account for noise within the data as well as the extraction process.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Generation </SectionTitle> <Paragraph position="0"> The generation step refers to the process of identifying source phrases that require translations and then extracting translations from the alignment model data. We begin by identifying all source language n-grams up to some length $n$ within the training corpus. When the test sentences that require translation are known, we can simply extract those n-grams that appear in the test sentences. For each of these n-grams, we create a set of candidate translations extracted from the corpus. The primary motivation for restricting the identification step to the test sentence n-grams is the saving in computational expense; the result is a phrasal translation source whose translation pairs are limited to the test sentences. For each source language n-gram within the pool, we have to find a set of candidate translations. The generation task is formally defined as finding $H_{\tilde{s}}$ in Equation (1)</Paragraph> <Paragraph position="2"> where $\tilde{s}$ is the source n-gram for which we are extracting translations, $H$ is the set of all partitions, and $s_i$ refers to the word at position $i$ in the source sentence $s$.
$H_{\tilde{s}}$ is then the set of all translations for source n-gram $\tilde{s}$, and $h$ is a specific translation hypothesis within this set. When considering only those hypothesis translations extracted from a particular sentence pair $s$, we use $H_{\tilde{s}}(s)$.</Paragraph> <Paragraph position="3"> We extract these candidates from the alignment map by examining each sentence pair where the source n-gram occurs, and extracting all possible target phrase translations using a sliding window approach. We extract candidate translations of phrase length $1$ to $N$, starting at offsets $0$ to $N - 1$. Figure 1 shows circular boxes indicating each potential partition region. One particular partition is indicated by the shading.</Paragraph> <Paragraph position="4"> Over all occurrences of the n-gram within the sentences as well as across sentences, a sizeable candidate pool is generated that attempts to cover the translated usage of the source n-gram $\tilde{s}$ within the corpus. This set is large, contains several spurious translations, and does not consider other source side n-grams within each sentence. The deliberate choice to avoid creating a consistent partitioning of the sentence pairs across n-grams reflects the ability to model partially correct alignments within sentences. The sliding window can be restricted to exclude word-to-word translations, i.e. $l_s \neq 1$, $l_t \neq 1$, if other sources are available that are known to be more accurate. Now that the candidate pool has been generated, it needs to be scored and pruned to reflect relative confidence between candidate translations and to remove spurious translations due to the sliding window approach.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Scoring </SectionTitle> <Paragraph position="0"> The candidate translations for the source n-gram now need to be scored and ranked according to some measure of confidence.
Each candidate translation pair defines a partition within the sentence map, and this partition can be scored for confidence in translation quality. We estimate translation confidence by measures from three models: estimates from the maximum approximation (alignment map), estimates from the word based translation lexicon, and language specific measures. Each of the scoring methods discussed below contributes to the final score under (2)</Paragraph> <Paragraph position="2"> hypothesis for a given source n-gram $\tilde{s}$. From now on we will refer to a $\mathit{Score}$ with regard to a particular $\tilde{s}$ implicitly.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Alignment Map </SectionTitle> <Paragraph position="0"> We define two kinds of scores from the alignment map, within sentence consistency and across sentence consistency, in order to represent local and global context effects.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Within Sentence </SectionTitle> <Paragraph position="0"> The partition defined by each candidate translation pair imposes constraints over the maximum approximation hypothesis for sentences in which it occurs. We evaluate the partition by examining its consistency with the maximum approximation hypothesis, considering the alignment points within the sentence. An alignment point $a = (x, y)$ (source, target) is said to be consistent if it occurs within the partition defined by $h = (i, l_s, j, l_t)$. $a_{x,y}$ is considered inconsistent in two cases.</Paragraph> <Paragraph position="2"> Each $h = (i, l_s, j, l_t)$ in $H_{\tilde{s}}(s)$ (where positions $i \ldots i + l_s$ define $\tilde{s}$) determines a set of consistent and inconsistent points. Figure 1
shows inconsistent points with respect to the shaded partition by drawing an X over the alignment point. The within sentence consistency scoring metric is defined in Equation (5).</Paragraph> <Paragraph position="4"> This measure represents the consistency of $h = (i, l_s, j, l_t)$ with the maximum approximation alignment for sentence pair $s$.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 Across Sentence </SectionTitle> <Paragraph position="0"> Several hypotheses within $H_{\tilde{s}}(s)$ are similar or identical to those in $H_{\tilde{s}}(s')$ where $s \neq s'$. We want to score hypotheses that are consistent across sentences higher than those that occur rarely, as the former are assumed to be the correct translations in context. We also want to account for different contexts across sentences; therefore we want to highlight similar translations, not simply exact matches. We use a word level Levenshtein distance to compare the target side hypotheses within $H_{\tilde{s}}$. Each element $h$ within $H_{\tilde{s}}$ (the complete candidate translation list for $\tilde{s}$) is assigned the average Levenshtein distance to all other elements as its across sentence consistency score, effectively performing a single pass average link clustering to identify the correct translations.</Paragraph> <Paragraph position="2"> where $LD$ calculates the Levenshtein distance between the target phrases within two hypotheses $h$ and</Paragraph> <Paragraph position="4"> The higher the $\mathit{Score}_{as}$, the more likely the hypothesis pair is a correct translation.
The clustering approach accounts for noise due to incorrect sentence alignment, as well as the different contexts in which a particular source n-gram can be used.</Paragraph> <Paragraph position="5"> As predicted by the formulation of this method, preference is given to shorter target translations. This effect can be countered by introducing a phrase length model to approximate the difference in phrase lengths across the language boundary. This will be discussed further as a language specific scoring method.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.4 Alignment Lexicon </SectionTitle> <Paragraph position="0"> The methods presented above used the maximum approximation to score candidate translation hypotheses. The translation lexicon generated by the IBM models provides translation estimates at the word level built on the complete training corpus.</Paragraph> <Paragraph position="1"> These corpus level estimates can be integrated into our scoring paradigm to balance the sentence level estimates from the alignment map methods.</Paragraph> <Paragraph position="2"> The translation lexicon provides a conditional probability estimate $p(s_x \mid t_y)$ for each $a = (x, y)$ ($s_x$ refers to the word at position $x$ in sentence $s$) within the maximum approximation. Depending on the direction in which the traditional IBM models are trained, we can condition on either the source or the target side, while joint probability models can give us a bidirectional estimate. These translation probability estimates are used to weight the $a = (x, y)$ within the methods described above. Instead of simply counting</Paragraph> <Paragraph position="4"> we sum the probability estimates $p(s_x \mid t_y)$ for each</Paragraph> <Paragraph position="6"> within the partition where alignment points are predicted by the maximum approximation.
The translation lexicon provides estimates at the word level, so we can construct a scoring measure for the complete region within $h = (i, l_s, j, l_t)$ that models the complete probability of the partition. The lexical scoring equation below models this effect.</Paragraph> <Paragraph position="8"> This method prefers longer target side phrases due to the sum over the target words within the partition. Although it would also prefer short source side phrases, we are only concerned with comparing hypothesis partitions for a given source n-gram $\tilde{s}$.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.5 Language Specific </SectionTitle> <Paragraph position="0"> The nature of the phrasal association between languages varies depending on the level of inflexion, morphology, and other factors. The predominant language specific correction to the scoring techniques discussed above models differences in phrase lengths across languages. For example, when comparing English and Chinese translations, we see that on average the English sentence is approximately 1.3 times longer (under our current segmentation in the small data track). To model these language specific effects, we introduce a phrase length scoring component that is based on the ratio of sentence lengths between languages. We build a sentence length model based on the DiffRatio statistic, defined as $\mathit{DiffRatio} =$</Paragraph> <Paragraph position="2"> source sentence length and $J$ is the target sentence length. Let $\mu_{DR}$ be the average $\mathit{DiffRatio}$</Paragraph> <Paragraph position="4"> the sentences in the corpus, and $\sigma^2_{DR}$ be the variance, thereby defining a normal distribution over the DiffRatio statistic.
Using the standard Z normalization technique under a normal distribution parameterized by $\mu_{DR}, \sigma^2_{DR}$, we can estimate the probability that a new DiffRatio calculated on the phrasal pair can be generated by the model, giving us the scoring estimate below.</Paragraph> <Paragraph position="6"> To improve the model we might consider examining known phrase translation pairs if such data is available. We explore the language specific difference further by noting that English phrases contain several function words that typically align to the empty Chinese word. We accounted for this effect within the scoring process by treating all target language (English) phrases that differed only by the function words on the phrase boundary as the same translation. Under this corrective strategy, the burden of selecting the appropriate hypothesis within the decoding process is moved towards the language model.</Paragraph> </Section> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Pruning </SectionTitle> <Paragraph position="0"> The list of candidate translations for each source n-gram $\tilde{s}$ is large, and must be pruned to select the most likely set of translations. This pruning is required to ensure that the decoding process remains computationally tractable. Simple threshold methods that rank hypotheses by their final score and only save the top $k$ hypotheses will not work here, since phrases differ in the number of possible correct translations they could have when used in different contexts. Given the score ordered set of candidate phrases $H_{\tilde{s}}$, we would like to label some subset as incorrect translations and remove them from the set.</Paragraph> <Paragraph position="1"> We approach this task as a density estimation problem where we need to separate the distribution of the incorrectly translated hypotheses from the distribution of the likely translations.
Instead of using the maximum likelihood criterion, we use the maximal separation criterion, i.e. selecting a splitting point within the scores to maximize the difference of the mean score between distributions as shown below.</Paragraph> <Paragraph position="3"> where $\mu_{&lt; \theta}$ is the mean score of those hypotheses with a score less than $\theta$, and $\mu_{\geq \theta}$ is the mean score of those hypotheses with a score greater than or equal to $\theta$. Once pruning is completed, we convert the scores into a probability measure conditioned on the source n-gram $\tilde{s}$ and assign the probability estimate as the translation probability $p(t \mid s)$ for the hypothesis $h$. As mentioned earlier, Och and Ney (2002) show that using direct translation estimates in the decoding process, as compared with calculating $p(s \mid t)$ as prescribed by the Bayesian framework, does not reduce translation quality. Our results corroborate these findings and we use (10) as the phrase level translation model estimate within our decoder.</Paragraph> </Section> <Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 6 Integration </SectionTitle> <Paragraph position="0"> Phrase translation pairs that are generated by the method described in this paper are finally scored with estimates of translation probability, which can be conditioned on the target language if necessary.</Paragraph> <Paragraph position="1"> These estimates fit cleanly into the decoding process, except for the issue of phrase length. Traditional word lexicons propose translations for one source word, while with phrase translations a single hypothesis pair can span several words in the source or target language. Comparing a path that uses a phrase with one that uses multiple words (even if the constituent words are the same) is difficult.
The word level pathway involves the product of several probabilities, whereas the phrasal path is represented by one probability score. Potential solutions are to introduce translation length models or to learn scaling factors for phrases of different lengths. Results in this paper have been generated by empirically determining a scaling factor that was inversely proportional to the length of the phrase, causing each translation to have a score comparable to the product of the word to word translations within the phrase.</Paragraph> </Section> <Section position="8" start_page="0" end_page="0" type="metho"> <SectionTitle> 7 HMM Phrase Extraction </SectionTitle> <Paragraph position="0"> In order to compare our method to a well understood phrase baseline, we present a method that extracts phrases by harvesting the Viterbi path from an HMM alignment model (Vogel et al., 1996). The HMM alignment model is computationally feasible even for very long sentences, and the phrase extraction method does not have limits on the length of the extracted target side phrase. For each source phrase ranging from positions $i_1$ to $i$ refers to an index in the target sentence pair. We calculate phrase translation probabilities (the scores for each extracted phrase) based on a statistical lexicon for the constituent words in the phrase. As the IBM1 alignment model gives the global optimum for the lexical probabilities, this is the natural choice. This leads to the phrase translation probability</Paragraph> <Paragraph position="2"> where $M$ and $N$ denote the lengths of the target phrase $\tilde{t}$ and the source phrase $\tilde{s}$, and the word probabilities $p(s_i \mid t_j)$ are estimated using the IBM1 word alignment model.
The phrases extracted by this method can be used directly within our in-house decoder without the significant changes that other phrase based methods could require.</Paragraph> </Section> <Section position="9" start_page="0" end_page="0" type="metho"> <SectionTitle> 8 Experimentation </SectionTitle> <Paragraph position="0"> IBM alignment models were trained up to Model 4 using GIZA (Al Onaizan et al., 1999) from Chinese to English and English to Chinese on two tracks of data. Figures describing the characteristics of each track as well as the test sentences are shown in Table (1). All the data were extracted from a newswire source. We applied our in-house segmentation toolkit to the Chinese data and performed basic preprocessing, which included lowercasing and tagging dates, times and numbers in both languages. Translation quality is evaluated by two metrics, NIST (MTEval, 2002) and BLEU (Papineni et al., 2001), both of which measure n-gram matches between the translated text and the reference translations. NIST is more sensitive to unigram precision due to its emphasis on high perplexity words.</Paragraph> <Paragraph position="1"> Four reference translations were available for each test sentence. We first compare against a system built using word level lexica only to reiterate the impact of phrase translation, and then show gains by our method over a system that utilizes phrases extracted by the HMM method. The word level system consisted of a hand crafted (Linguistic Data Consortium) bilingual dictionary and a statistical lexicon derived from training IBM Model 1. In our experiments we found that although training higher order IBM models does yield lower alignment error rates when measured against manually aligned sentences, the highest translation quality is achieved by using a lexicon extracted from the Model 1 alignment.
Experiments were run with a language model (LM) built on a 20 million word news source corpus using our in-house decoder, which performs monotone decoding without reordering. To implement our phrase extraction technique, the maximum approximation alignments were combined with the union operation as described in (Och et al., 1999), resulting in a dense but inaccurate alignment map as measured against a human aligned gold standard. Since bi-directional translation models are available, scoring was performed in both directions, using IBM Model 1 lexica for the within sentence scoring. The final phrase level scores computed in each direction were combined by a weighted average before the pruning step. Source side phrases were restricted to be of length 2 or higher since word lexica were available. Weights for each scoring metric were determined empirically against a validation set (alignment map scores were assigned the highest weighting). Table (2) shows results on the small data track, while Table (3) shows results on the large data track. The technique described in this paper is labelled Phrases in the tables. The results show that the phrase extraction method described in this paper contributes to statistically significant improvements over the baseline word and phrase level (HMM) systems. When compared against the HMM phrases, our technique shows statistically significant improvements. Statistical significance is evaluated by considering deviations in sentence level NIST scores over the 993 sentence test set, with a NIST improvement of 0.05 being statistically significant at the 0.01 alpha level. In combination with the HMM method, our technique delivers further gains, providing evidence that different kinds of phrases have been learnt by each method. The improvements caused by our methods are more apparent in the NIST score than in the BLEU score.
We predict that this effect is due to the language specific correction that treats target phrases with function words at the boundaries as the same phrase. This correction causes the burden to be placed on the language model to select the correct phrase instance from several possible translations. Correctly translating function words dramatically boosts the NIST measure, as it places emphasis on high perplexity words, i.e. those with diverse contexts.</Paragraph> </Section> </Paper>