<?xml version="1.0" standalone="yes"?>
<Paper uid="P05-1059">
<Title>Stochastic Lexicalized Inversion Transduction Grammar for Alignment</Title>
<Section position="4" start_page="480" end_page="480" type="metho">
<SectionTitle> LITG </SectionTitle>
<Paragraph position="0"> We evaluated the LITG against the unlexicalized ITG.</Paragraph>
<Paragraph position="1"> A separate development set of hand-aligned sentence pairs was used to control overfitting. The subset of sentence pairs with up to 15 words in both languages was used for cross-validation in the first experiment; the subset with up to 25 words in both languages was used for the same purpose in the second experiment.</Paragraph>
<Paragraph position="2"> Table 1 compares results using the full (unpruned) model of unlexicalized ITG with the full model of lexicalized ITG.</Paragraph>
<Paragraph position="3"> The two models were initialized from uniform distributions over all rules and were trained until AER began to rise on the held-out cross-validation data, which turned out to be 4 iterations for ITG and 3 iterations for LITG.</Paragraph>
<Paragraph position="4"> The results of the second experiment are shown in Table 2. The performance of the full model of unlexicalized ITG is compared with the pruned model of lexicalized ITG using more training and evaluation data.</Paragraph>
<Paragraph position="5"> Under the same stopping criterion, we trained ITG for 3 iterations and the pruned LITG for 1 iteration. For comparison, we also include results from IBM Model 1 and Model 4; the numbers of training iterations for the IBM models were chosen as the turning points of AER on the cross-validation data.</Paragraph>
</Section>
<Section position="5" start_page="480" end_page="481" type="metho">
<SectionTitle> 4 Discussion </SectionTitle>
<Paragraph position="0"> As the numbers in Table 1 show, the full lexicalized model produced promising alignment results on sentence pairs with no more than 15 words on each side. However, due to its prohibitive O(n^8) computational complexity, our C++ implementation of the unpruned lexicalized model took more than 500 CPU hours, distributed over multiple machines, to finish one iteration of training, and the CPU time would grow to an unacceptable level if the average sentence length were doubled. Some form of pruning is therefore indispensable. Our pruned version of LITG kept the running time for one iteration under 1200 CPU hours, even though both the number of sentences and the average sentence length were more than doubled. To verify that the tic-tac-toe pruning technique is safe, we applied it to the unlexicalized ITG with the same beam ratio (10^-5) and found that the AER on the test data was unchanged. Whether the top-k lexical head pruning technique is equally safe, however, remains an open question. One noticeable implication of this technique for training is its reliance on initial lexical-pair probabilities that are sufficiently discriminative. The comparison of ITG and LITG in Table 2, together with the fact that AER began to rise after only one training iteration, suggests that keeping only a few distinct lexical heads caused convergence to a suboptimal set of parameters, a form of overfitting. In contrast, overfitting did not seem to be a problem for LITG in the unpruned experiment of Table 1, despite the much larger number of parameters for LITG than for ITG and the smaller training set.</Paragraph>
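To make the beam-ratio criterion concrete, the following is a minimal C++ sketch of a pruning pass that keeps only bitext cells scoring within a fixed ratio (such as 10^-5) of the best cell. It is not the authors' implementation: the type SpanScore, the function beamPrune, and the assumption that each cell already carries a precomputed figure of merit (e.g., from IBM Model 1 inside/outside estimates, as in tic-tac-toe pruning) are illustrative.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical figure of merit for one bitext cell: a source span paired
// with a target span. In tic-tac-toe pruning the score would come from
// IBM Model 1 estimates of the inside and outside regions of the cell.
struct SpanScore {
    int srcBegin, srcEnd;   // source-side span [srcBegin, srcEnd)
    int tgtBegin, tgtEnd;   // target-side span [tgtBegin, tgtEnd)
    double score;           // precomputed figure of merit (assumed given)
};

// Keep only cells whose score is within a factor of beamRatio of the
// best-scoring cell; everything else is dropped before parsing.
std::vector<SpanScore> beamPrune(const std::vector<SpanScore>& cells,
                                 double beamRatio /* e.g. 1e-5 */) {
    double best = 0.0;
    for (const SpanScore& c : cells) {
        best = std::max(best, c.score);
    }
    std::vector<SpanScore> kept;
    for (const SpanScore& c : cells) {
        if (c.score >= beamRatio * best) {
            kept.push_back(c);
        }
    }
    return kept;
}
```

Cells discarded by such a pass are never proposed as constituents during parsing, which is what keeps one training iteration tractable in the pruned setting described above.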
<Paragraph position="1"> We also note that for a pair of long sentences, the lexicalized binary bracketing parse tree has difficulty reflecting the inherent bilingual syntactic structure. In Figure 2, A(see/vois) echoes IP(see/vois) and B(see/vois) echoes VP(see/vois), meaning that IP(see/vois) is not inverted from English to French while its right child VP(see/vois) is inverted. For longer sentences, however, with more than 5 levels of bracketing and the same lexicalized nonterminal appearing repeatedly at different levels, the correspondences become less linguistically plausible. We believe the limitations of the bracketing grammar are another reason why lexicalization fails to improve the AER of longer sentence pairs.</Paragraph>
<Paragraph position="2"> The space of alignments considered by LITG is exactly the space considered by ITG, since the structural rules shared by the two grammars define the alignment space; the lexicalized ITG is simply designed to be more sensitive to lexical influence on the choice of inversions, so that it can find better alignments within that space (see the sketch after this section). Wu (1997) demonstrated that for sentence pairs of fewer than 16 words, the ITG alignment space has good coverage of all possibilities. Hence, it is reasonable to expect a better chance of improving alignment results for sentences of fewer than 16 words.</Paragraph>
</Section>
</Paper>
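As a rough illustration of the point that ITG and LITG share the same structural rules, and hence the same alignment space, here is a small C++ sketch. The types (Orientation, HeadPair, BracketNode) are hypothetical and not taken from the paper's implementation; they only show that lexicalization attaches a head word pair for probability conditioning without changing the set of reachable bracketings.

```cpp
#include <optional>
#include <string>
#include <utility>

// The two binary structural rules shared by ITG and LITG: the children of
// a span appear in the same order in both languages (Straight) or in
// reversed order on one side (Inverted). These rules alone determine which
// alignments are reachable, i.e. the alignment space.
enum class Orientation { Straight, Inverted };

// An aligned head word pair such as (see, vois). Illustrative type only.
using HeadPair = std::pair<std::string, std::string>;

// A node of a bracketing parse. Unlexicalized ITG leaves `head` empty;
// LITG attaches a head pair so that rule probabilities can be conditioned
// on it. The reachable tree shapes, and hence the alignments, are the same.
struct BracketNode {
    Orientation orientation;
    std::optional<HeadPair> head;  // e.g. {"see", "vois"} in LITG
    // child pointers omitted for brevity
};
```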