<?xml version="1.0" standalone="yes"?> <Paper uid="W06-3117"> <Title>Stochastic Inversion Transduction Grammars for Obtaining Word Phrases for Phrase-based Statistical Machine Translation</Title> <Section position="5" start_page="130" end_page="132" type="evalu"> <SectionTitle> 4 Experiments </SectionTitle> <Paragraph position="0"> The experiments in this section were carried out for the shared task proposed in this workshop. The task consisted of building a probabilistic phrase translation table for phrase-based statistical machine translation. Systems were evaluated by translation quality on an unseen test set. The experiments were carried out using the Europarl corpus (Koehn, 2005). Table 1 lists the language pairs and some statistics of the training corpora. The test set had a0 a6 a92a2a1a4a3 sentences.</Paragraph> <Section position="1" start_page="131" end_page="131" type="sub_section"> <SectionTitle> and Spanish (Es) </SectionTitle> <Paragraph position="0"> A common framework was provided to all the participants so that the results could be compared. The material provided comprised a training set, a language model, a baseline translation system (Koehn, 2004), and a word alignment. Participants could supplement these items with their own training corpus, sentence alignment, language model, or decoder. We used only the provided material for the experiments reported in this work. The BLEU score was used to measure the results.</Paragraph> <Paragraph position="1"> A SITG was obtained for every language pair in this section as described below. The SITG was used to parse paired sentences in the training sample by using the parsing algorithm described in Section 3.</Paragraph> <Paragraph position="2"> All pairs of word phrases that were derived from each internal node in the parse tree, except the root node, were considered for the phrase-based machine translation system. 
A translation table was obtained from paired word phrases by placing them in the appropriate order and counting the number of times that each pair appeared in the phrases. These counts were then normalized (Sanchez and Benedi, 2006).</Paragraph> </Section> <Section position="2" start_page="131" end_page="131" type="sub_section"> <SectionTitle> 4.1 Obtaining a SITG from an aligned corpus </SectionTitle> <Paragraph position="0"> For this experiment, a SITG was constructed for every language pair as follows. The alignment was used to compose lexical rules of the form $A \rightarrow x/y$. The probability of each rule was obtained by counting. Then, two additional rules of the form $A \rightarrow [AA]$ and $A \rightarrow \langle AA \rangle$ were added. The constructed SITG did not parse all the training sentences. Therefore, the model was smoothed by adding all the rules of the form $A \rightarrow x/\epsilon$ and $A \rightarrow \epsilon/y$ with low probability, so that all the training sentences could be parsed. The rules were then normalized.</Paragraph> <Paragraph position="1"> This SITG was used to obtain word phrases from the training corpus. Then, these word phrases were used by the Pharaoh system (Koehn, 2004) to translate the test set. Word phrases were limited to a maximum length; several limits were tested, and the best values ranged from 6 to 10. Table 2 shows the obtained results and the size of the translation table.</Paragraph> <Paragraph position="2"> [Table 2. Columns: Lang., BLEU. Caption: "... directions. The value in parentheses is the number of word phrases in the translation table (in millions)."] 
Note that better results were obtained when English was the target language.</Paragraph> </Section> <Section position="3" start_page="131" end_page="132" type="sub_section"> <SectionTitle> 4.2 Using bracketing information in the parsing </SectionTitle> <Paragraph position="0"> As Section 3 describes, the parsing algorithm for SITGs can be modified to take bracketed sentences into account. If the bracketing respects linguistically motivated structures, then aligned phrases with linguistic information can be used. Note that this approach requires high-quality parsed corpora. This problem can be reduced by using automatically learned parsers.</Paragraph> <Paragraph position="1"> This experiment was carried out to determine translation performance when structural information was incorporated into the parsing algorithm described in Section 3. We bracketed the English sentences of the Europarl corpus with an automatically learned parser, which was trained on bracketed strings obtained from the UPenn Treebank corpus. We then obtained word phrases according to the bracketing by using the same SITG that was described in the previous section. The obtained phrases were used with the Pharaoh system. Table 3 shows the results obtained in this experiment.</Paragraph> <Paragraph position="2"> [Table 3. Caption: "... directions when word phrases were obtained from a parsed corpus. The value in parentheses is the number of word phrases in the translation table (in millions)."] Note that the results decreased slightly in all cases. This may be because the bracketing imposed hard restrictions on the paired word phrases, and some of them were too forced. In addition, many sentences could not be parsed (up to 5% on average) due to the bracketing. 
However, incorporating bracketing information into the English sentences notably accelerated the parsing algorithm and thus the extraction of word phrases, which matters given the size of this corpus.</Paragraph> </Section> <Section position="4" start_page="132" end_page="132" type="sub_section"> <SectionTitle> 4.3 Combining word phrases </SectionTitle> <Paragraph position="0"> Finally, we considered the combination of both kinds of segments. Table 4 shows the results. They improved on the results of Table 2 when English was the target language, but not when English was the source language. One possible reason is that the two kinds of segments were different in nature, so the number of word phrases increased notably, especially in the English part.</Paragraph> <Paragraph position="1"> [Table 4. Columns: Lang., BLEU. Caption: "... directions when word phrases were obtained from a non-parsed corpus and a parsed corpus. The value in parentheses is the number of word phrases in the translation table (in millions)."]</Paragraph> </Section> </Section> </Paper>