<?xml version="1.0" standalone="yes"?> <Paper uid="W02-1012"> <Title>Extensions to HMM-based Statistical Word Alignment Models</Title> <Section position="7" start_page="0" end_page="0" type="evalu"> <SectionTitle> 6 Results </SectionTitle> <Paragraph position="0"> We present results on word-level alignment accuracy using the Hansards corpus. Our test data consist of 500 manually aligned sentences, the same data set used by (Och and Ney, 2000b).6 In the annotated sentences, every alignment between two words is labeled as either a sure (S) or possible (P) alignment, with S ⊆ P. We used the following quantity, called alignment error rate (AER), to evaluate the alignment quality of our models; it is also the evaluation metric used by (Och and Ney, 2000b): AER = 1 − (|A ∩ S| + |A ∩ P|) / (|A| + |S|), where A is the set of alignments produced by the model. We divided this annotated data into a validation set of 100 sentences and a final test set of 400 sentences. The validation set was used to select tuning parameters such as the smoothing parameter λ in Eqns. 4, 6 and 7. We report AER results on the final test set of 400 sentences. We experimented with training corpora of different sizes ranging from 5K to 50K sentences. We concentrated on small to medium data sets to assess the ability of our models to deal with sparse data.</Paragraph> <Paragraph position="1"> Table 1 shows the percentage of word types in the corpus that were seen fewer than the specified number of times. For example, in our 10K training corpus 48% of all word types were seen only once.
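Statistics like those in Table 1 can be tallied with a short sketch; the toy corpus, the `sparsity_profile` name, and the thresholds below are illustrative, not the paper's actual data or cutoffs:

```python
from collections import Counter

def sparsity_profile(sentences, thresholds=(2, 3)):
    """For each threshold k, return the fraction of word types
    seen fewer than k times in the corpus.

    `sentences` is an iterable of pre-tokenized sentences (lists of words).
    """
    counts = Counter(w for sent in sentences for w in sent)
    n_types = len(counts)
    return {k: sum(1 for c in counts.values() if c < k) / n_types
            for k in thresholds}

# toy corpus: le/chat/dort occur twice, chien/un once
corpus = [["le", "chat", "dort"], ["le", "chien", "dort"], ["un", "chat"]]
profile = sparsity_profile(corpus, thresholds=(2, 3))
# fraction of singleton types (count < 2) and of types with count < 3
```

A fraction near one half for k = 2, as in the paper's 10K corpus, means roughly half of all word types are singletons.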
As seen from the table, the data are quite sparse even for large corpora.</Paragraph> <Paragraph position="2"> The models we implemented and compare in this section are the following:
- Baseline is the baseline HMM model described in section 2
- Tags is an HMM model that includes tags for translation probabilities (section 5.1)
- SG is an HMM model that includes stay probabilities (section 5.3)
- Null is an HMM model that includes the new generation model for words by Null (section 5.4)
- Tags+Null, Tags+SG, and Tags+Null+SG are combinations of the above models
6 We want to thank Franz Och for sharing the annotated data with us.</Paragraph> <Paragraph position="3"> Table 2 shows AER results for our improved models on training corpora of increasing size. The model Null outperforms the baseline at every data set size, with the error reduction being larger for bigger training sets (up to 9.2% error reduction). The SG model reduces the baseline error rate by up to 10%. The model Tags reduces the error rate for the smallest dataset by 7.6%. The combination of Tags with the SG or Null models outperforms the individual models in the combination, since they address different problems and make orthogonal mistakes.</Paragraph> <Paragraph position="4"> The combination of SG and Tags reduces the baseline error rate by up to 16%, and the combination of Null and Tags reduces the error rate by up to 12.3%.</Paragraph> <Paragraph position="5"> All of these error reductions are statistically significant at the p < 0.05 level according to the paired t-test. The combination Tags+Null+SG further reduces the error rate. For small datasets, there seems to be a stronger overlap between the strengths of the Null and SG models, because some fertility-related phenomena can be accounted for by both models.
When an English word is wrongly aligned to several consecutive French words because of indirect association, while the correct alignment of some of them is to the empty word, both the Null and SG models can combat the problem: one by better modeling correspondence to Null, and the other by discouraging large fertilities.</Paragraph> <Paragraph position="6"> Figure 2 displays learning curves for three models: Och, Tags, and Tags+Null. Och is the HMM alignment model of (Och and Ney, 2000b). To obtain results from the Och model we ran GIZA++. Both the Tags and Och models use word classes.</Paragraph> <Paragraph position="7"> However, the word classes used in the latter are learned automatically from parallel bilingual corpora, while the classes used in the former are human-defined part-of-speech tags. Figure 2 shows that the Tags model outperforms the Och model when the training data size is small.</Paragraph> <Paragraph position="9"> As the training size increases, the Och model catches up with the Tags model and even surpasses it slightly. This suggests that when large amounts of parallel text are not available, monolingual part-of-speech classes can improve alignment quality more than automatically induced classes. When more data is available, automatically induced bilingual word classes seem to provide more improvement, but it remains to be explored whether the combination of part-of-speech knowledge with induction of bilingual classes would perform even better. The third curve in the figure, for Tags+Null, illustrates the relative improvement of the Null model over the Tags model as the training set size increases. We see that the performance gap between the two models becomes wider for larger training data sizes. This reflects the improved estimation of the generation probabilities for Null, which require target-word-specific parameters.
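The AER comparisons throughout this section follow the Och and Ney (2000) definition, AER = 1 − (|A ∩ S| + |A ∩ P|)/(|A| + |S|). A minimal sketch, assuming alignments are represented as sets of (English index, French index) pairs:

```python
def aer(sure, possible, predicted):
    """Alignment error rate of Och and Ney (2000).

    `sure` and `possible` are the gold S and P sets of (i, j) index
    pairs with S a subset of P; `predicted` is the model's alignment A.
    Lower is better; AER is 0 when A covers all of S and stays within P.
    """
    a = set(predicted)
    return 1.0 - (len(a & sure) + len(a & possible)) / (len(a) + len(sure))

# hypothetical three-word sentence pair, not from the paper's data
sure = {(0, 0), (1, 1)}
possible = sure | {(2, 1)}
predicted = {(0, 0), (1, 1), (2, 2)}
# |A∩S| = 2, |A∩P| = 2, |A| = 3, |S| = 2, so AER = 1 - 4/5
```

Note that predictions falling inside P but outside S are not penalized, which is why the annotators' sure/possible distinction matters.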
We used both the paired t-test and the Wilcoxon signed-rank test to show that the improvements are statistically significant.</Paragraph> <Paragraph position="10"> The signed-rank test uses the normalized test statistic z = (W+ − n(n+1)/4) / √(n(n+1)(2n+1)/24), where W+ is the sum of the ranks that have positive signs. Ties are assigned the average rank of the tied group. Since there are 400 test sentences, we have 400 paired samples, where the elements of each pair are the AERs of the models being compared.</Paragraph> <Paragraph position="11"> The difference between Och and Tags at 5K, 10K, and 15K is significant at the p < 0.05 level according to both tests. The difference between Och and Tags+Null is significant for all training set sizes at the p < 0.05 level.</Paragraph> <Paragraph position="12"> We also assessed the gains from using part-of-speech tags in the alignment probabilities according to the model described in section 5.2. Table 3 shows the error rate of the basic HMM alignment model as compared to an HMM model that conditions on tag sequences of source and target words in the neighborhood of the French word f_j and the English word e_{a_{j-1}}, for a training set size of 10K. The results we achieved showed an improvement of our model over a model that does not include conditioning on tags. The improvement in accuracy is best when using the current and previous French word parts of speech and does not increase when adding more conditioning information. The improvement from part-of-speech tag sequences for alignment probabilities was not as large as we had expected, however, which leads us to believe that more sophisticated models may be needed. In Figure 3 we compare the IBM-4 model to our SG+Tags model. Such a comparison makes sense because IBM-4 uses a fertility model for English words and SG approximates fertility modeling, and because IBM-4 uses word classes, as does our Tags model.
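The signed-rank procedure just described (positive-rank sum W+, average ranks for tied groups, normal approximation over the paired per-sentence AERs) can be sketched as follows; `wilcoxon_z` and the sample lists are illustrative, and zero differences are discarded per the usual convention, which the paper does not spell out:

```python
import math

def wilcoxon_z(aer_a, aer_b):
    """Normalized Wilcoxon signed-rank statistic for paired samples.

    Ranks the absolute paired differences, averaging ranks within tied
    groups, then normalizes W+ (the sum of ranks of positive differences)
    by its mean and variance under the null. Assumes at least one
    nonzero difference.
    """
    d = [a - b for a, b in zip(aer_a, aer_b) if a != b]
    n = len(d)
    # rank |d| ascending; tied groups share the average of their ranks
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    w_plus = sum(r for r, x in zip(ranks, d) if x > 0)
    mean = n * (n + 1) / 4
    var = n * (n + 1) * (2 * n + 1) / 24
    return (w_plus - mean) / math.sqrt(var)
```

With 400 paired per-sentence AERs, |z| beyond roughly 1.96 corresponds to significance at the p < 0.05 level under the two-sided normal approximation.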
For smaller training set sizes our model performs much better than IBM-4, but when more data is available IBM-4 becomes slightly better. This confirms the observation from Figure 2 that automatically induced bilingual classes perform better when trained on large amounts of data. Also, since our fertility model estimates one parameter for each English word while IBM-4 estimates as many parameters as the maximum fertility allowed, at small training set sizes our model's parameters can be estimated more reliably.</Paragraph> </Section> </Paper>