<?xml version="1.0" standalone="yes"?> <Paper uid="P04-1066"> <Title>Improving IBM Word-Alignment Model 1</Title> <Section position="9" start_page="1" end_page="3" type="evalu"> <SectionTitle> 8 Results </SectionTitle> <Paragraph position="0"> We report the performance of our different versions of Model 1 in terms of precision, recall, and alignment error rate (AER) as defined by Och and Ney (2003). These three performance statistics are defined as $\mathrm{recall} = \frac{|A \cap S|}{|S|}$, $\mathrm{precision} = \frac{|A \cap P|}{|A|}$, and $\mathrm{AER} = 1 - \frac{|A \cap S| + |A \cap P|}{|A| + |S|}$, where S denotes the annotated set of sure alignments, P denotes the annotated set of possible alignments, and A denotes the set of alignments produced by the model under test.</Paragraph> <Paragraph position="1"> We take AER, which is derived from F-measure, as our primary evaluation metric.</Paragraph> <Paragraph position="2"> The results of our evaluation are presented in Table 1. The columns of the table present (in order) a description of the model being tested, the AER on the trial data, the AER on the test data, test data recall, and test data precision, followed by the optimal values on the trial data for the LLR exponent, the initial (heuristic model) null-word weight, the null-word weight used in EM re-estimation, the add-n parameter value used in EM re-estimation, and the number of iterations of EM. &quot;NA&quot; means a parameter is not applicable in a particular model. As is customary, alignments to the null word are not explicitly counted.</Paragraph> <Paragraph position="3"> Results for the four principal versions of Model 1 are presented in bold. For each principal version, results of the corresponding ablation experiments are presented in standard type, giving the name of each omitted modification in parentheses.</Paragraph> <Paragraph position="4"> Probably the most striking result is that the heuristic model substantially reduces the AER compared to the standard or smoothed model, even without EM re-estimation. 
The combined model produces an additional substantial reduction in alignment error, using a single iteration of EM.</Paragraph> <Paragraph position="5"> The ablation experiments show how important the different modifications are to the various models. It is interesting to note that the importance of a given modification varies from model to model. For example, the re-estimation null-word weight makes essentially no contribution to the smoothed model. It can be tuned to reduce the error on the trial data, but the improvement does not carry over to the test data. The smoothed model with only the null-word weight and no add-n smoothing has essentially the same error as the standard model; and the smoothed model with add-n smoothing alone has essentially the same error as the smoothed model with both the null-word weight and add-n smoothing. On the other hand, the re-estimation null-word weight is crucial to the combined model. With it, the combined model has substantially lower error than the heuristic model without re-estimation; without it, for any number of EM iterations, the combined model has higher error than the heuristic model.</Paragraph> <Paragraph position="6"> Modifications are &quot;omitted&quot; by setting the corresponding parameter to a value that is equivalent to removing the modification from the model.</Paragraph> <Paragraph position="7"> A similar analysis shows that add-n smoothing is much less important in the combined model than in the smoothed model. The probable explanation for this is that add-n smoothing is designed to address over-fitting from many iterations of EM. 
While the smoothed model does require many EM iterations to reach its minimum AER, the combined model, with or without add-n smoothing, is at its minimum AER with only one EM iteration.</Paragraph> <Paragraph position="8"> Finally, we note that, while the initial null-word weight is crucial to the heuristic model without re-estimation, the combined model actually performs better without it. Presumably, the re-estimation null-word weight makes the initial null-word weight redundant. In fact, the combined model without the initial null-word weight has the lowest AER on both the trial and test data of any variation tested (note AERs in italics in Figure 1). The relative reduction in AER for this model is 29.9% compared to the standard model.</Paragraph> <Paragraph position="9"> We tested the significance of the differences in alignment error between each pair of our principal versions of Model 1 by looking at the AER for each sentence pair in the test set using a 2-tailed paired t test. The differences between all these models were significant at a level of $10^{-7}$ or better, except for the difference between the standard model and the smoothed model, which was &quot;significant&quot; at the 0.61 level, that is, not at all significant. The reason for this is probably the very different balance between precision and recall with the standard and smoothed models, which indicates that the models make quite different sorts of errors, making statistical significance hard to establish. This conjecture is supported by considering the smoothed model omitting the re-estimation null-word weight, which has substantially the same AER as the full smoothed model, but with a precision/recall balance much closer to the standard model. 
The 2-tailed paired t test comparing this model to the standard model showed significance at a level of better than 10 .</Paragraph> <Paragraph position="10"> We also compared the combined model with and without the initial null-word weight, and found that the improvement without the weight was significant at the 0.008 level.</Paragraph> </Section> </Paper>
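<!--
Illustrative sketch, not part of the original paper: the precision, recall,
and AER statistics of Och and Ney (2003) used in this section can be computed
from three sets of alignment links. The function names, the encoding of a link
as a (source index, target index) pair, and the choice to treat every sure
link as also being a possible link are assumptions made for this example.
The second helper shows how a relative reduction in AER, such as the one
reported against the standard model, would be calculated.

```python
def alignment_scores(sure, possible, predicted):
    """Return (precision, recall, aer) for one set of predicted links.

    sure, possible, predicted: iterables of (source_index, target_index)
    pairs. This pair encoding is an assumption of the sketch.
    """
    s = set(sure)
    # Assumption: sure links also count as possible links (S is a subset of P).
    p = set(possible) | s
    a = set(predicted)
    if not a or not s:
        raise ValueError("predicted and sure link sets must be non-empty")

    a_and_s = len(a.intersection(s))
    a_and_p = len(a.intersection(p))

    precision = a_and_p / len(a)
    recall = a_and_s / len(s)
    aer = 1.0 - (a_and_s + a_and_p) / (len(a) + len(s))
    return precision, recall, aer


def relative_reduction(baseline_aer, new_aer):
    """Fraction by which new_aer improves on baseline_aer."""
    return (baseline_aer - new_aer) / baseline_aer
```

A possible use is to pool the links of all sentence pairs in a test set into
the three sets, call alignment_scores once, and then pass the resulting AER
and a baseline AER to relative_reduction.
-->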