<?xml version="1.0" standalone="yes"?>
<Paper uid="I05-4010">
  <Title>Harvesting the Bitexts of the Laws of Hong Kong From the Web</Title>
  <Section position="2" start_page="0" end_page="71" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Bitexts, also referred to as parallel texts or bilingual corpora, are collections of bilingual text pairs aligned at various levels of granularity, and they have been playing a critical role in the current development of machine translation technology. It is such large data sets that make empirical approaches to machine translation plausible; most of these approaches apply a variety of machine learning techniques to infer various types of translation knowledge from bitext data, in order to facilitate automatic translation and enhance translation quality. Large volumes of training data of this kind are indispensable for constructing statistical translation models (Brown et al., 1993; Melamed, 2000), acquiring bilingual lexicons (Gale and Church, 1991; Melamed, 1997), and building example-based machine translation (EBMT) systems (Nagao, 1984; Carl and Way, 2003; Way and Gough, 2003). They also provide a basis for inferring lexical connections between vocabularies in cross-language information retrieval (Davis and Dunning, 1995).</Paragraph>
    <Paragraph position="1"> Existing parallel corpora have demonstrated their particular value in empirical NLP research, e.g., the Canadian Hansard Corpus (Gale and Church, 1991b), HK Hansard (Wu, 1994), INTERSECT (Salkie, 1995), ENPC (Ebeling, 1998), the Bible parallel corpus (Resnik et al., 1999) and many others. The Web is being explored not only as a super corpus for NLP and linguistic research (Kilgarriff and Grefenstette, 2003) but also, more importantly for MT research, as a treasure trove for mining bitexts of various language pairs (Resnik, 1999; Chen and Nie, 2000; Nie and Cai, 2001; Nie and Chen, 2002; Resnik and Smith, 2003; Way and Gough, 2003). The Web has been a playground for many NLP researchers. More and more Web sites are found to have cloned their Web pages in several languages, aiming to convey information to audiences in different languages. This gives rise to a huge volume of valuable bilingual or multilingual resources freely available from the Web for research. What we need to do is harvest the right resources for the right applications. In this paper we present our recent work on harvesting English-Chinese parallel texts of the laws of Hong Kong from the Web and constructing a subparagraph-aligned bilingual corpus of about 20 million words. The bilingual texts of the laws are introduced in Section 2, with an emphasis on the hierarchy of HK's legislation texts and its numbering system, which can be utilized for text alignment down to the subparagraph level. Section 3 presents the basic methodology and technical details for harvesting and aligning bilingual Web page pairs, extracting content texts from the pages, and aligning text structures in terms of the text hierarchy by utilizing consistent intrinsic features in the Web pages and content texts. Section 4 presents an XML schema for encoding the alignment results and illustrates the display mode for browsing the aligned bilingual corpus.
Section 5 concludes the paper, highlighting the value of the corpus in terms of its volume, translation quality, specificity and comprehensiveness, and alignment granularity. Our future work on exploring the Web to harvest larger quantities of parallel bitexts is also briefly outlined.</Paragraph>
  </Section>
</Paper>