<?xml version="1.0" standalone="yes"?>
<Paper uid="P01-1067"> <Title>A Syntax-based Statistical Translation Model</Title>
<Section position="5" start_page="0" end_page="0" type="evalu"> <SectionTitle> 3 Experiment </SectionTitle>
<Paragraph position="0"> For our experiment, we trained the model on a small English-Japanese corpus. To evaluate performance, we examined the alignments produced by the learned model. For comparison, we also trained IBM Model 5 on the same corpus.</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Training </SectionTitle>
<Paragraph position="0"> We extracted 2121 translation sentence pairs from a Japanese-English dictionary. The sentences were mostly short: the average sentence length was 6.9 words for English and 9.7 for Japanese.</Paragraph>
<Paragraph position="1"> However, many rare words were used, which made the task difficult. The vocabulary size was 3463 tokens for English and 3983 tokens for Japanese, with 2029 English tokens and 2507 Japanese tokens occurring only once in the corpus.</Paragraph>
<Paragraph position="2"> Brill's part-of-speech (POS) tagger (Brill, 1995) and Collins' parser (Collins, 1999) were used to obtain parse trees for the English side of the corpus. The output of Collins' parser was modified in the following way. First, to reduce the number of parameters in the model, each node was re-labelled with the POS of the node's head word, and some POS labels were collapsed. For example, labels for different verb endings (such as VBD for -ed and VBG for -ing) were changed to the same label VB. This left 30 different node labels and 474 unique child-label sequences.</Paragraph>
<Paragraph position="3"> Second, a subtree was flattened if the node's head word was the same as its parent's head word. For example, (NN1 (VB NN2)) was flattened to (NN1 VB NN2) if the VB was a head word for both NN1 and NN2. This flattening was motivated by the different word orders of different languages: an English SVO structure is translated into SOV in Japanese, or into VSO in Arabic. These differences are modeled more easily by the flattened subtree (NN1 VB NN2) than by (NN1 (VB NN2)).</Paragraph>
<Paragraph position="4"> We ran 20 iterations of the EM algorithm as described in Section 2.2. (Note that the algorithm performs full EM counting, whereas the IBM models only permit counting over a subset of possible alignments.) IBM Model 5 was sequentially bootstrapped with Model 1, an HMM model, and Model 3 (Och and Ney, 2000). Each preceding model and the final Model 5 were trained with five iterations (20 iterations in total).</Paragraph>
</Section>
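The two tree modifications above can be pictured with a small sketch. This is an illustrative reconstruction rather than the authors' code: the tuple-based tree format, the helper names, and the partial label-collapse table are all assumptions.

    # Assumed tree format (for illustration only):
    #   internal node: (label, head_word, head_pos, children)
    #   leaf:          (pos_tag, word)
    COLLAPSE = {"VBD": "VB", "VBG": "VB", "VBN": "VB", "VBP": "VB", "VBZ": "VB"}

    def relabel(node):
        """Step 1: label every node with the (collapsed) POS of its head word."""
        if len(node) == 2:                               # leaf
            pos, word = node
            return (COLLAPSE.get(pos, pos), word)
        _, head_word, head_pos, children = node
        new_label = COLLAPSE.get(head_pos, head_pos)     # e.g. a VP headed by a VBD becomes VB
        return (new_label, head_word, head_pos, [relabel(c) for c in children])

    def flatten(node):
        """Step 2: splice a child into its parent when both share the same head word."""
        if len(node) == 2:                               # leaf
            return node
        label, head_word, head_pos, children = node
        new_children = []
        for child in (flatten(c) for c in children):
            if len(child) == 4 and child[1] == head_word:
                new_children.extend(child[3])            # promote the child's children
            else:
                new_children.append(child)
        return (label, head_word, head_pos, new_children)

    # For example, a parent whose children are NN1 and a verb-headed subtree
    # (VB NN2) is flattened to (NN1 VB NN2) when the parent and that subtree
    # share the same verb as head word.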
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Evaluation </SectionTitle>
<Paragraph position="0"> The training procedure resulted in tables of estimated model parameters. Table 1 in Section 2.1 shows some of those parameters obtained by the training above.</Paragraph>
<Paragraph position="1"> To evaluate performance, we let each model generate the most probable alignment of the training corpus (called the Viterbi alignment). The alignment shows how the learned model induces the internal structure of the training data.</Paragraph>
<Paragraph position="2"> Figure 2 shows alignments produced by our model and by IBM Model 5. Darker lines indicate that the particular alignment link was judged correct by humans. Three humans were asked to rate each alignment as okay (1.0 point), not sure (0.5 point), or wrong (0 points); the darkness of the lines in the figure reflects the human score. We computed the average score over the first 50 sentence pairs in the corpus, and also counted the number of perfectly aligned sentence pairs among those 50. Perfect means that all alignments in a sentence pair were judged okay by all the human judges.</Paragraph>
<Paragraph position="3"> [Figure 2: Viterbi alignments produced by our model and by IBM Model 5 for four example sentence pairs (he adores listening to music; hypocrisy is abhorrent to them; he has unusual ability in english; he was ablaze with anger); darker lines were judged correct by humans.]</Paragraph>
<Paragraph position="4"> The results were the following:

                   Alignment ave. score   Perfect sents
    Our Model             0.582                 10
    IBM Model 5           0.431                  0

</Paragraph>
<Paragraph position="5"> Our model obtained a better result than IBM Model 5. Note that there were no perfect alignments from the IBM Model. Errors made by the IBM Model were spread out over the whole set, while our errors were localized to particular sentences; we therefore expect our model to be easier to improve. Localized errors are also preferable if the TM is used for corpus preparation or filtering. We also measured the training perplexity of the models. The perplexity of our model was 15.79, and that of IBM Model 5 was 9.84. For reference, the perplexity after 5 iterations of Model 1 was 24.01. Perplexity values roughly indicate the predictive power of a model. Generally, lower perplexity means a better model, but it may also reflect over-fitting to the training data. Since the IBM models usually require millions of training sentences, the lower perplexity value for IBM Model 5 here is likely due to over-fitting.</Paragraph>
</Section> </Section> </Paper>
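The scoring scheme described in Section 3.2 can be sketched as follows. This is only an illustration: the data layout, the function names, and the choice to average over all links and judges (the paper does not say exactly how the per-sentence scores were aggregated) are assumptions.

    # ratings[s][j] is the list of per-link ratings that judge j gave for
    # sentence pair s; each rating is "okay", "not sure", or "wrong".
    POINTS = {"okay": 1.0, "not sure": 0.5, "wrong": 0.0}

    def average_score(ratings):
        """Mean link score over all judges and all rated sentence pairs."""
        scores = [POINTS[r] for sent in ratings for judge in sent for r in judge]
        return sum(scores) / len(scores)

    def perfect_count(ratings):
        """Sentence pairs in which every link is rated okay by every judge."""
        return sum(1 for sent in ratings
                   if all(r == "okay" for judge in sent for r in judge))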
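Training perplexity can be computed from per-sentence model probabilities roughly as below; the paper does not state its exact definition, so the per-word normalization and the log base here are assumptions.

    def perplexity(log2_probs, total_target_words):
        """Assumed per-word perplexity: 2 raised to the average negative
        log2 probability per target word."""
        return 2.0 ** (-sum(log2_probs) / total_target_words)

    # Example: two sentence pairs with log2 P(f|e) of -12.0 and -18.0 and a
    # total of 10 target words give a perplexity of 2 ** 3.0 = 8.0.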