The Duluth Word Alignment System

2 IBM Model 2

Model 2 is trained with sentence-aligned parallel corpora. However, our goal is to learn a model that can perform word alignment, and there are no examples of word alignments given in the training data. Thus, we must cast the training process as a missing data problem, where we learn about word alignments from corpora where only sentence (but not word) alignments are available. As is common with missing data problems, we use the Expectation-Maximization (EM) Algorithm (Dempster et al., 1977) to estimate the probabilities of word alignments in this model.

The objective of Model 2 is to estimate the probability that a given sentence pair is aligned a certain way. This is represented by $P(a \mid s, t)$, where $s$ is the source sentence, $t$ is the target sentence, and $a$ is the proposed word alignment for the sentence pair. However, since this probability can't be estimated directly from the training data, we must reformulate it so we can use the EM algorithm. From Bayes' Rule we arrive at:

\[
P(a \mid s, t) = \frac{P(a, t \mid s)}{\sum_{a'} P(a', t \mid s)} \quad (1)
\]

where $P(a, t \mid s)$ is the probability of a proposed alignment of the words in the target sentence to the words in the given source sentence. To estimate a probability for a particular alignment, we must estimate the numerator and then divide it by the sum of the probabilities of all possible alignments given the source sentence.

While clear in principle, there is usually a huge number of possible word alignments between a source and target sentence, so we can't simply estimate this for every possible alignment. Model 2 incorporates a distortion factor to limit the number of possible alignments that are considered. This factor defines the number of positions a source word may move when it is translated into the target sentence. For example, given a distortion factor of two, a source word could align with a word up to two positions to the left or right of the corresponding target word's position.

Model 2 is based on the probability of a source and target word being translations of each other, and the probability that words at particular source and target positions are translations of each other (without regard to what those words are). Thus, the numerator in Equation 1 is estimated as follows:

\[
P(a, t \mid s) = \prod_{j=1}^{m} t(t_j \mid s_{a_j}) \, a(a_j \mid j, l, m) \quad (2)
\]

The translation probability, $t(t_j \mid s_{a_j})$, is the likelihood that $t_j$, the target word at position $j$, is the translation of the given source word $s_{a_j}$ that occurs at position $a_j$. The alignment probability, $a(a_j \mid j, l, m)$, is the likelihood that position $a_j$ in the source sentence can align to a given position $j$ in the target sentence, where $l$ and $m$ are the given lengths of the source and target sentences.

The denominator in Equation 1 is the sum of the probabilities of all the possible alignments of a sentence pair. This can be estimated by taking the product of the sums of the translation and positional alignment probabilities:

\[
\sum_{a} P(a, t \mid s) = \prod_{j=1}^{m} \sum_{i=1}^{l} t(t_j \mid s_i) \, a(i \mid j, l, m) \quad (3)
\]

where $i$ represents a position in the source sentence and all the other terms are as described previously.

The EM algorithm begins by randomly initializing the translation and positional alignment probabilities in Equation 2. Then it estimates Equation 3 based on these values, which are then maximized for all the target words according to Equation 1. The re-estimated translation and positional alignment probabilities are normalized, and the EM algorithm repeats the above process for a predetermined number of iterations or until it converges.
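To make Equations 2 and 3 concrete, here is a minimal sketch in Perl (the language our system is implemented in; see Section 3). It is illustrative rather than our actual implementation: the names numerator, denominator, and the hash layouts of the translation table %t_prob and the positional table %a_prob are assumptions, and positions are 0-indexed in code where the equations are 1-indexed.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Equation 2: P(a, t | s) for one proposed alignment, as the product over
# target positions j of t(t_j | s_{a_j}) * a(a_j | j, l, m).
# Both probability tables are assumed to be fully populated.
sub numerator {
    my ($t_prob, $a_prob, $src, $tgt, $align) = @_;
    my ($l, $m) = (scalar @$src, scalar @$tgt);
    my $p = 1;
    for my $j (0 .. $m - 1) {
        my $i = $align->[$j];    # source position aligned to target position j
        $p *= $t_prob->{ $src->[$i] }{ $tgt->[$j] }
            * $a_prob->{"$i:$j:$l:$m"};
    }
    return $p;
}

# Equation 3: the sum over all alignments, computed as a product of sums
# so the alignments never have to be enumerated explicitly.
sub denominator {
    my ($t_prob, $a_prob, $src, $tgt) = @_;
    my ($l, $m) = (scalar @$src, scalar @$tgt);
    my $p = 1;
    for my $j (0 .. $m - 1) {
        my $sum = 0;
        for my $i (0 .. $l - 1) {
            $sum += $t_prob->{ $src->[$i] }{ $tgt->[$j] }
                  * $a_prob->{"$i:$j:$l:$m"};
        }
        $p *= $sum;
    }
    return $p;
}

# P(a | s, t) from Equation 1 is then numerator(...) / denominator(...).
```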
3 System Components

The Duluth Word Alignment System consists of two pre-processing programs (plain2snt and snt2matrix) and one that learns the word alignment model (model2). These are all implemented in Perl.

The plain2snt program converts raw sentence-aligned parallel text into the snt format, where each word type in the source and target text is represented as a unique integer. This program also outputs two vocabulary files for the source and target languages that list the word types and their integer values. This is closely modeled after what is done in the GIZA++ toolkit (Och and Ney, 2000b).

The snt2matrix program takes the snt file from plain2snt as input and outputs two files. The first is an adjacency list of possible word translations for each sentence pair. The second is a table of alignment positions that were observed in the training corpora. The value of the distortion factor determines which positions may be aligned with each other.

The model2 program implements IBM Model 2 as discussed in the previous section. It requires the vocabulary files, the snt file, the alignment positional probability file, and the adjacency list file created by the plain2snt and snt2matrix programs. It carries out the EM algorithm and estimates the probability of an alignment given the source and target sentences from the snt file. The model2 program outputs a file of word alignments for each of the training sentences and two files containing estimated values for the word translation and positional alignment probabilities. Finally, there is also a program (test) that word-aligns parallel text based on the output of the model2 program.
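As an illustration of the first pre-processing step, the following hypothetical sketch performs a plain2snt-style conversion. It assumes the GIZA++-style snt layout of three lines per sentence pair (a frequency count, then the source and target sentences as integer ids); the file handling and variable names are ours, not the actual program's.

```perl
#!/usr/bin/perl
# Hypothetical plain2snt-style conversion: map each word type to a unique
# integer and emit each sentence pair as three lines -- a frequency count,
# the source sentence as ids, and the target sentence as ids.
use strict;
use warnings;

my ($src_file, $tgt_file) = @ARGV;
open my $src_fh, '<', $src_file or die "$src_file: $!";
open my $tgt_fh, '<', $tgt_file or die "$tgt_file: $!";

my (%src_id, %tgt_id);
my ($next_src, $next_tgt) = (1, 1);

while (defined(my $s = <$src_fh>) and defined(my $t = <$tgt_fh>)) {
    my @s_ids = map { $src_id{$_} ||= $next_src++ } split ' ', $s;
    my @t_ids = map { $tgt_id{$_} ||= $next_tgt++ } split ' ', $t;
    print "1\n@s_ids\n@t_ids\n";
}

# The two vocabulary files would simply list each word type alongside its
# integer id, e.g. by iterating over %src_id and %tgt_id.
```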
4 Experimental Framework

The Duluth Word Alignment System participated in both the English-French (UMD-EF) and Romanian-English (UMD-RE) portions of the shared task on word alignment.

The UMD-RE models were trained using 49,284 sentence pairs of Romanian-English, the complete set of training data provided by the shared task organizers. It is made up of three different types of text: the novel 1984, by George Orwell, which contains 6,429 sentence pairs; the Romanian Constitution, which contains 967 sentence pairs; and a set of selected newspaper articles collected from the Internet, which contains 41,889 sentence pairs. The gold standard data used in the shared task consists of 248 manually word-aligned sentence pairs that were held out of the training process.

The UMD-EF models were trained using a 5% subset of the Aligned Hansards of the 36th Parliament of Canada (Hansards). The Hansards contains 1,254,001 sentence pairs, which is well beyond the quantity of data that our current system can train with. UMD-EF is trained on a balanced mixture of House and Senate debates and contains 49,393 sentence pairs. The gold standard data used in the shared task consists of 447 manually word-aligned sentence pairs that were held out of the training process.

The UMD-RE and UMD-EF models were trained for thirty iterations. Three different models were trained for each language pair, based on distortion factors of two, four, and six. The resulting models will be referred to as UMD-XX-2, UMD-XX-4, and UMD-XX-6, where 2, 4, and 6 are the distortion factors and XX is the language pair (either RE or EF).

5 Experimental Results

The shared task allowed for two different types of alignments, Sure and Probable. As their names suggest, a sure alignment is one that is judged to be very likely, while a probable alignment is somewhat less certain. The English-French gold standard data included S and P alignments, but our system does not make this distinction and only outputs S alignments.

Submissions to the shared task evaluation were scored using precision, recall, the F-measure, and the alignment error rate (AER). Precision is the number of correct alignments (C) out of the total number of alignments attempted by the system (S), while recall is the number of correct alignments (C) out of the total number of correct alignments (A) given in the gold standard. That is:

\[
\text{precision} = \frac{C}{S}, \qquad \text{recall} = \frac{C}{A}
\]

The F-measure is the harmonic mean of precision and recall:

\[
F = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}
\]

AER is defined by Och and Ney (2000a) and accounts for both Sure and Probable alignments in scoring.
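A minimal Perl sketch of this scoring follows. The "i-j" string encoding of alignment pairs and the sub name are assumptions made for illustration, not the shared task's actual evaluation code.

```perl
# Score proposed word alignments against a gold standard (Sure alignments
# only, matching our nonull scoring). Alignments are encoded here as "i-j"
# strings for source position i and target position j -- an assumed format.
sub score {
    my ($proposed, $gold) = @_;    # array refs of "i-j" pairs
    my %gold = map { $_ => 1 } @$gold;
    my $correct = grep { $gold{$_} } @$proposed;    # C
    my $precision = $correct / @$proposed;          # C / S
    my $recall    = $correct / @$gold;              # C / A
    my $f_measure = 2 * $precision * $recall / ($precision + $recall);
    return ($precision, $recall, $f_measure);
}
```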
The word alignment results attained by our models are shown in Table 1. We score and report our results as nonull, since our system does not include null alignments (source words that don't have a target translation). We also score relative to sure alignments only. During the shared task, systems were scored with and without null alignments in the gold standard, so our results correspond to those without.

[Table 1: precision, recall, F-measure, and AER for each model.]

It is apparent from Table 1 that the precision and recall of the models were not significantly affected by the distortion factor. Also, we note that the precision for the two language pairs is relatively similar. This may reflect the fact that we used approximately the same amount of training data for each language pair. However, note that the recall for English-French is much lower. We continue to investigate why this might be the case, but believe it may be because the training data we randomly selected from the Hansards was not representative of the gold standard data.

Finally, the alignment error rate (AER) is lower (and hence better) for English-French than for Romanian-English. However, note that the F-measure for Romanian-English is higher (and therefore better) than for English-French. While this may seem contradictory, AER factors both Sure and Probable alignments into its scoring, and only the English-French data included such alignments in its gold standard.

The models used for our official submission to the shared task led to somewhat puzzling results: as the number of iterations increased, the precision and recall continued to fall. Upon further investigation, an error was found: rather than estimating the alignment probability as shown in Equation 1, our system computed a different quantity. The results shown in Table 1 are based on a corrected version of the model, in which the accuracy of the results rose as the number of iterations increased and then reached a plateau near what is reported here.

Table 2 includes the official results as submitted to the shared task based on the flawed model. These are designated as UMD.EF.1, UMD.RE.1, and UMD.RE.2. They use distortion parameters of 2 or 4, and were only trained for 4 iterations.

[Table 2: precision, recall, F-measure, and AER for the official submissions.]

6 Future Work

The mystery of why our flawed implementation of Model 2 performed better in some respects than our later repaired version is our current focus of attention. First, we must determine whether our corrected Model 2 is really correct, and we are in the process of comparing it with existing implementations, most notably GIZA++. Second, we believe that the relatively small amount of training data might account for the somewhat unpredictable nature of these results. We will do experiments with larger amounts of training data to see if our new implementation improves.

However, we are currently unable to train our models in a practical amount of time (and memory) when there are more than 100,000 sentence pairs available. Clearly it is necessary to train on larger amounts of data, so we will be improving our implementation to make this possible. We are considering storing intermediate computations in a database such as Berkeley DB or NDBM in order to reduce the amount of memory our system consumes; a sketch of this option appears at the end of this section. We are also considering re-implementing our algorithms in the Perl Data Language (http://pdl.perl.org), a Perl module that is optimized for matrix and scientific computing.

Our ultimate objective is to extend the model so that it incorporates prior information about cognates or proper nouns that are not translated. Having this information included in the translation probabilities would provide reliable anchors around which other parameter estimates could be made.

Finally, having now had some experience with IBM Models 1 and 2, we will continue on to explore IBM Model 3. In addition, we will do further studies with Models 1 and 2 and compare the impact of distortion factors as we experiment with different amounts of training data and different languages.
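As promised above, here is a minimal sketch of the memory-reduction idea, assuming Perl's standard DB_File interface to Berkeley DB. The file name and the "src_id tgt_id" key layout are illustrative assumptions rather than a committed design.

```perl
#!/usr/bin/perl
# Sketch: keep the EM algorithm's expected translation counts on disk in a
# Berkeley DB file (via the standard DB_File module) rather than in memory.
# The file name and the "src_id tgt_id" key layout are assumptions.
use strict;
use warnings;
use DB_File;
use Fcntl;

tie my %t_count, 'DB_File', 'tcounts.db', O_RDWR | O_CREAT, 0666, $DB_HASH
    or die "cannot tie tcounts.db: $!";

# Accumulate a fractional count for source word 42 / target word 7,
# as would happen during the expectation step.
$t_count{'42 7'} += 0.25;

untie %t_count;
```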