<?xml version="1.0" standalone="yes"?> <Paper uid="P05-1059"> <Title>Stochastic Lexicalized Inversion Transduction Grammar for Alignment</Title> <Section position="3" start_page="479" end_page="480" type="intro"> <SectionTitle> 3 Experiments </SectionTitle> <Paragraph position="0"> We trained both the unlexicalized and the lexicalized ITGs on a parallel corpus of Chinese-English newswire text. The Chinese data were automatically segmented into tokens, and English capitalization was retained. We replaced words occurring only once with an unknown word token, resulting in a Chinese vocabulary of 23,783 words and an English vocabulary of 27,075 words.</Paragraph> <Paragraph position="1"> In the first experiment, we restricted ourselves to sentences of no more than 15 words in either language, resulting in a training corpus of 6,984 sentence pairs with a total of 66,681 Chinese words and 74,651 English words. In this experiment, we did not apply the pruning techniques for the lexicalized ITG.</Paragraph> <Paragraph position="2"> In the second experiment, we enabled the pruning techniques for the lexicalized ITG (LITG), setting the beam ratio for tic-tac-toe pruning to 10⁻⁵ and k for top-k pruning to 25. We ran the experiments on sentences up to 25 words long in both languages. The resulting training corpus had 18,773 sentence pairs with a total of 276,113 Chinese words and 315,415 English words.</Paragraph> <Paragraph position="3"> We evaluate our translation models in terms of agreement with human-annotated word-level alignments between the sentence pairs. 
For scoring the Viterbi alignments of each system against gold-standard annotated alignments, we use the alignment error rate (AER) of Och and Ney (2000), which measures agreement at the level of pairs of words:</Paragraph> <Paragraph position="4"> $\mathrm{AER} = 1 - \frac{|A \cap G_S| + |A \cap G_P|}{|A| + |G_S|}$ </Paragraph> <Paragraph position="5"> where A is the set of word pairs aligned by the automatic system, G_S is the set marked in the gold standard as &quot;sure&quot;, and G_P is the set marked as &quot;possible&quot; (including the &quot;sure&quot; pairs). In our Chinese-English data, only one type of alignment was marked, meaning that G_P = G_S.</Paragraph> <Paragraph position="6"> In our hand-aligned data, 20 sentence pairs are no longer than 15 words in both languages and were used as the test set for the first experiment, and 47 sentence pairs are no longer than 25 words in either language and were used to evaluate the pruned</Paragraph> </Section> </Paper>
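<!--
Editorial illustration, not part of the original paper. The alignment error
rate described in the section above can be sketched in a few lines of Python.
This is a minimal sketch under stated assumptions: alignment links are
represented as (source_index, target_index) tuples, the function name
alignment_error_rate and the toy data are hypothetical, and the formula is
the standard Och and Ney definition with a "sure" set and a "possible" set
that contains the sure links. When no separate possible set is given, the
sketch falls back to possible = sure, which matches the single-annotation
situation the text describes for the Chinese-English data.

```python
def alignment_error_rate(predicted, sure, possible=None):
    """Return 1 - (|A & S| + |A & P|) / (|A| + |S|) for sets of links."""
    a = set(predicted)
    s = set(sure)
    # The possible set includes the sure links by definition; with only one
    # annotation type available, it is identical to the sure set.
    p = s if possible is None else set(possible) | s
    denominator = len(a) + len(s)
    if denominator == 0:
        raise ValueError("AER is undefined when both link sets are empty")
    return 1.0 - (len(a & s) + len(a & p)) / denominator


# Toy example (hypothetical data): four predicted links, three of them correct.
predicted = {(0, 0), (1, 2), (2, 1), (3, 3)}
gold = {(0, 0), (1, 1), (2, 1), (3, 3)}
aer = alignment_error_rate(predicted, gold)
```

With a single annotation type the measure reduces to one minus the balanced
F-measure of the predicted links, so lower values indicate closer agreement
with the gold standard.
-->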