<?xml version="1.0" standalone="yes"?> <Paper uid="W04-1121"> <Title>Aligning Bilingual Corpora Using Sentences Location Information*</Title> <Section position="8" start_page="0" end_page="0" type="concl"> <SectionTitle> 7 Conclusion </SectionTitle> <Paragraph position="0"> This paper proposed a new method for fully aligning real bilingual texts using sentence location information, described in detail in Sections 3 and 4. The model was motivated by the observation that the location of a sentence pair of a given length is distributed similarly throughout the whole text. It uses (1:1) sentence beads, rather than high-frequency words, as candidate anchors. Both local and global location characteristics of a sentence pair are used to determine the probability that the pair is an alignment anchor.</Paragraph> <Paragraph position="1"> Every sentence pair is assigned an alignment value, calculated with the formal alignment function. The BAM process is then performed to obtain the alignment anchors. Compared with the traditional alignment method, this method effectively restrains the propagation of errors. Furthermore, it has shown strong robustness, even on poor-quality texts that include incorrect sentences. To further improve alignment accuracy, a sentence similarity measure based on an English-Chinese dictionary was applied; it does not require segmenting the Chinese sentences. The whole procedure costs little to implement.</Paragraph> <Paragraph position="2"> Additionally, we can adjust the alignment and similarity thresholds dynamically to obtain high-precision alignment anchors. For example, on the first test set we obtain only 105 (1:1) sentence beads, but the precision is 100%. 
We found that this method performs paragraph alignment very well while also preserving alignment precision.</Paragraph> <Paragraph position="3"> Of these pairs, about half of the total number of (1:1) sentence beads can even be extracted directly from the bilingual text to build a large-scale bilingual corpus, provided the original bilingual text is abundant.</Paragraph> <Paragraph position="4"> The remaining bilingual text can be kept as a spare resource. So far, we have obtained about 500,000 high-quality English-Chinese aligned sentence pairs.</Paragraph> <Paragraph position="5"> In the future, we hope to carry out further alignment on the basis of the current work and to extend the method to other language pairs.</Paragraph> </Section> </Paper>