<?xml version="1.0" standalone="yes"?>
<Paper uid="C00-2159">
  <Title>A Bootstrapping Method for Extracting Bilingual Text Pairs</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> A parallel corpus is an important resource for corpus-based approaches to cross-language information retrieval (CLIR). These approaches use parallel corpora as statistical training data and then retrieve documents written in a language different from that of the query. One disadvantage of these approaches is the lack of resources: parallel corpora are not always readily available, and those that are available tend to be relatively small or to cover only a small number of subjects.</Paragraph>
    <Paragraph position="1"> A bilingual comparable corpus is a set of texts in two different languages from the same domain or on the same topic. Unlike a parallel corpus, its texts are composed independently in each language. It can be obtained more readily from the Internet or CD-ROM resources than a parallel corpus. Zanettin (1998) introduced several available bilingual comparable corpora, such as newspaper articles selected by date and subject code, medical articles from journals and textbooks, and articles for tourists from brochures and guides. Zanettin (1994) also reported that it is highly likely that much relevant information can be found across languages in a topic-related bilingual comparable corpus. In this paper, we propose a method for extracting bilingual text pairs that share the same information from a bilingual comparable corpus, and show that the resulting bilingual text pairs may be useful for corpus-based CLIR approaches when used as training data in place of a parallel corpus. Sheridan (1998) also proposed an approach to building a multilingual test collection from comparable corpora consisting of news articles.</Paragraph>
    <Paragraph position="2"> The idea is to reduce the work of manual relevance judgements by restricting the news articles to be examined to those from a span of a couple of days. The disadvantages of this approach are that it relies on time-sensitive texts, that the texts it obtains are constrained to referencing specific events, and that nontrivial work by humans is still necessary.</Paragraph>
    <Paragraph position="3"> In contrast, our goal is to extract bilingual text pairs automatically from any kind of bilingual comparable corpus.</Paragraph>
    <Paragraph position="4"> This paper is organized as follows. Section 2 introduces the basic idea for extracting relevant text pairs from a bilingual comparable corpus. Because our method is based on a corpus-based CLIR method, we review previous corpus-based CLIR approaches in Section 3. Section 4 describes the experimental procedure, its results, and an analysis of those results. Section 5 gives the conclusion.</Paragraph>
  </Section>
</Paper>