<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-2112">
<Title>Word Alignment for Languages with Scarce Resources Using Bilingual Corpora of Other Language Pairs</Title>
<Section position="7" start_page="877" end_page="880" type="evalu">
<SectionTitle> 6 Experiments </SectionTitle>
<Paragraph position="0"> In this section, we compare different word alignment methods for Chinese-Japanese alignment. The &quot;Original&quot; method uses the original model trained on the small Chinese-Japanese corpus. The &quot;Basic Induced&quot; method uses the induced model that employs the basic translation probability without introducing cross-language word similarity. The &quot;Advanced Induced&quot; method uses the induced model that introduces the cross-language word similarity into the calculation of the translation probability. The &quot;Interpolated&quot; method interpolates the word alignment models of the &quot;Advanced Induced&quot; and &quot;Original&quot; methods.</Paragraph>
<Section position="1" start_page="879" end_page="879" type="sub_section">
<SectionTitle> 6.1 Data </SectionTitle>
<Paragraph position="0"> Three training corpora are used in this paper: a Chinese-Japanese (CJ) corpus, a Chinese-English (CE) corpus, and an English-Japanese (EJ) corpus. All three corpora are from the general domain. The Chinese and Japanese sentences in the data are automatically segmented into words. The statistics of the three corpora are shown in table 1, where &quot;# Source Words&quot; and &quot;# Target Words&quot; denote the number of words in the source and target sentences, respectively.</Paragraph>
<Paragraph position="1"> Besides the training data, we also have held-out data and testing data. The held-out data consists of 500 Chinese-Japanese sentence pairs, which are used to set the interpolation weights described in section 5. We use another 1,000 Chinese-Japanese sentence pairs as testing data; these pairs are included in neither the training data nor the held-out data. The alignment links in the held-out data and the testing data are manually annotated.</Paragraph>
<Paragraph position="2"> The testing data contains 4,926 alignment links.</Paragraph>
</Section>
<Section position="2" start_page="879" end_page="879" type="sub_section">
<SectionTitle> 6.2 Evaluation Metrics </SectionTitle>
<Paragraph position="0"> We use the same evaluation metrics as described in Wu et al. (2005), which are similar to those in (Och and Ney, 2000).</Paragraph>
<Paragraph position="1"> The difference lies in that Wu et al. (2005) took all alignment links as sure links.</Paragraph>
<Paragraph position="2"> If we use S_G to represent the set of alignment links identified by the proposed methods and S_R to denote the reference alignment set, the precision, recall, f-measure, and alignment error rate (AER) are calculated as shown in equations (18), (19), (20), and (21), respectively.</Paragraph>
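<Paragraph position="3"> A minimal reconstruction of equations (18)-(21), assuming the standard definitions with all alignment links taken as sure links and with S_G and S_R as introduced above, is:
\begin{align*}
\text{precision} &= \frac{|S_G \cap S_R|}{|S_G|} \tag{18} \\
\text{recall} &= \frac{|S_G \cap S_R|}{|S_R|} \tag{19} \\
\text{f-measure} &= \frac{2\,|S_G \cap S_R|}{|S_G| + |S_R|} \tag{20} \\
\text{AER} &= 1 - \text{f-measure} = 1 - \frac{2\,|S_G \cap S_R|}{|S_G| + |S_R|} \tag{21}
\end{align*}
Because all links are treated as sure links, the sure/possible distinction of Och and Ney (2000) collapses and the AER reduces to one minus the f-measure.</Paragraph>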
<Paragraph position="4"> It can be seen that the higher the f-measure is, the lower the alignment error rate is. Thus, we only report precision, recall, and AER scores in the evaluation results.</Paragraph>
</Section>
<Section position="3" start_page="879" end_page="880" type="sub_section">
<SectionTitle> 6.3 Experimental Results </SectionTitle>
<Paragraph position="0"> We use the held-out data described in section 6.1 to set the interpolation weights described in section 5. The weight λ_t is set to 0.3, λ_n is set to 0.1, λ for model 3 is set to 0.5, and λ for model 4 is set to 0.1. With these parameters, we obtain the lowest alignment error rate on the held-out data.</Paragraph>
<Paragraph position="1"> For each method described above, we perform bi-directional (source-to-target and target-to-source) word alignment and obtain two alignment results. From these two results, we produce a single alignment using the &quot;refined&quot; combination described in (Och and Ney, 2000). Thus, all of the results reported here are results of the &quot;refined&quot; combination. For model training, we use the GIZA++ toolkit, which is located at http://www.fjoch.com/GIZA++.html.</Paragraph>
<Paragraph position="2"> The evaluation results on the testing data are shown in table 2. From the results, it can be seen that both induced models perform better than the &quot;Original&quot; method, which only uses the limited Chinese-Japanese sentence pairs. The &quot;Advanced Induced&quot; method achieves a relative error rate reduction of 10.41% as compared with the &quot;Original&quot; method. Thus, with the Chinese-English corpus and the English-Japanese corpus, we can achieve good word alignment results even if no Chinese-Japanese parallel corpus is available. After introducing the cross-language word similarity into the translation probability, the &quot;Advanced Induced&quot; method achieves a relative error rate reduction of 7.40% as compared with the &quot;Basic Induced&quot; method. This indicates that the cross-language word similarity is effective in the calculation of the translation probability. Moreover, the &quot;Interpolated&quot; method further improves the results, achieving relative error rate reductions of 12.51% and 21.30% as compared with the &quot;Advanced Induced&quot; method and the &quot;Original&quot; method, respectively.</Paragraph>
</Section>
</Section>
</Paper>