<?xml version="1.0" standalone="yes"?> <Paper uid="W04-1121"> <Title>Aligning Bilingual Corpora Using Sentences Location Information*</Title> <Section position="7" start_page="0" end_page="0" type="evalu"> <SectionTitle> 6 Results and Discussion </SectionTitle>
<Paragraph position="0"> We use the real bilingual texts of the seventeenth chapter of the literary masterpiece &quot;Wuthering Heights&quot; as our test data. The basic information of the data is shown in Table 1.</Paragraph>
<Paragraph position="1">
English text size: 38.1K
Chinese text size: 25.1K
English sentence number: 273
Chinese sentence number: 277
Table 1: Basic information of the test data
In order to verify the validity of our algorithm, we implemented the classic length-based sentence alignment method using dynamic programming.</Paragraph>
<Paragraph position="2"> Precision is defined as: Precision = (number of correctly aligned sentence pairs) / (number of all aligned sentence pairs in the bilingual texts). The comparison results are presented in Table 2. Because the original bilingual texts have no obvious aligned paragraph boundaries, errors propagate easily in the length-based alignment method when the paragraphs are not strictly aligned. Its alignment results are then too poor to be usable. If we discard the original paragraph information and merge all the paragraphs of each text into one large paragraph, the length-based alignment method achieves a precision of 25.4%. This is mainly because English and Chinese do not belong to the same language family and differ greatly from each other. Our method, in contrast, selected 129 (1:1) sentence pairs as alignment anchors, which divide the bilingual text into aligned fragments. [Figure 5: Anchors selection in Bilingual Texts] The classic length-based method was then applied to these aligned fragments and achieved a high precision. 
Fig. 6 shows the distribution of the 129 selected anchors, which follows the same trend as that of all the (1:1) sentence beads. The only difference is how sparse the aligned pairs are.</Paragraph>
<Paragraph position="3"> In order to evaluate the adaptability of our method, we select texts with different themes and styles as the test set. We merge two bilingual news texts and two novel texts. The data information is shown in Table 3.</Paragraph>
<Paragraph position="4"> Our method is applied to the mixed data and achieves a precision of 86.9%. The result shows that this alignment method is theme independent.
English text size: 63.9K
Chinese text size: 41.5K
English sentence number: 510
Chinese sentence number: 526
Table 3: Basic information of the mixed test data
Haruno and Yamazaki (1996) tried to align short texts in structurally different languages, such as English and Japanese. In this paper, the aligned language pair, English and Chinese, also consists of structurally different languages. Our method achieves the highest precision when aligning short texts. A bilingual news text is selected as the test data. The result is shown in Table 4. There are two incorrectly aligned sentence pairs, which are caused by the lack of a corresponding translation.</Paragraph>
<Paragraph position="5">
English text size: 5.6K
Chinese text size: 3.4K
English sentence number: 40
Chinese sentence number: 38
It is difficult to obtain a large test set because doing so requires much manual work. We construct the test set by merging the aligned sentence pairs of an existing sentence-aligned bilingual corpus into two files. These two files can then serve as the test set. Here we merge 2000 aligned sentence pairs. The file information is as follows:
English text size: 200.3K
Chinese text size: 144.2K
English sentence number: 2069
Chinese sentence number: 2033
Table 5: Basic information of the large test data
From Table 4, it is evident that there are many different styles of sentence beads. 
The method is applied to this large test set and achieves a precision of 90.5%. The reason for the slight increase in precision is that this last test set is relatively clean and its sentence length distribution is relatively even. Overall, our method performs very well in aligning real bilingual texts. It shows high robustness and does not depend on the languages, text themes, or text length. This method can resolve the alignment problem for real text.</Paragraph> </Section> </Paper>