<?xml version="1.0" standalone="yes"?>
<Paper uid="C04-1156">
  <Title>Knowledge Intensive Word Alignment with KNOWA</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Aligning a text and its translation (also known as bitext) at the word level is a basic Natural Language Processing task that has found various applications in recent years. Word level alignments can be used to build bilingual concordances for human browsing, to feed machine learning-based translation algorithms, or as a basis for sense disambiguation algorithms or for automatic projection of linguistic annotations from one language to another.</Paragraph>
    <Paragraph position="1"> A number of word alignment algorithms have been presented in the literature; see for instance (Veronis, 2000) and (Melamed, 2001). Shared evaluation procedures have been established, although some evaluation details remain open issues (Ahrenberg et al., 2000).</Paragraph>
    <Paragraph position="2"> Most of the known alignment algorithms are statistics-based and do not exploit external linguistic resources, or use them only to a very limited extent. The main attraction of such algorithms is that they are language independent and require only a parallel corpus of reasonable size for training.</Paragraph>
    <Paragraph position="3"> However, word alignment can be used for different purposes and in different application scenarios; different alignment strategies produce different kinds of results (for instance in terms of precision/recall), which can be more or less suitable for the goal to be achieved. The requirement of having a parallel corpus of adequate size available for training the statistics-based algorithms may be difficult to meet, given that parallel corpora are a precious but often rare resource. For the most common languages, such as English, French, German, and Chinese, reference parallel corpora of adequate size are available, and indeed statistics-based algorithms are evaluated on such reference corpora. Unfortunately, if one needs to replicate on a different corpus the results obtained for the reference corpora, finding a parallel corpus of adequate size can be difficult even for the most common languages. Consider that one of the most appealing features of statistics-based algorithms is their ability to induce alignment models for bitexts belonging to very specific domains, an ability which seems to be out of reach for algorithms based on generic linguistic resources. However, for the statistics-based algorithms to achieve this objective, a parallel corpus for the specific domain needs to be available, a requirement that in some cases cannot be met easily.</Paragraph>
    <Paragraph position="4"> For these reasons, we claim that in some cases algorithms based on external linguistic resources, where such resources are available, can be a useful alternative to statistics-based algorithms. In the rest of this paper we compare the results obtained by a statistics-based algorithm and a linguistic resource-based algorithm when applied to the EuroCor and MultiSemCor English/Italian corpora.</Paragraph>
    <Paragraph position="5"> The statistics-based algorithm to be evaluated is described in (Och and Ney, 2003). For its evaluation we used an implementation by the authors themselves, called GIZA++, which is freely available to the scientific community (Och, 2003). The second algorithm to be evaluated relies crucially on a bilingual dictionary and a morphological analyzer. It is called KNOWA (KNowledge intensive Word Aligner) and has been developed at ITC-irst by the authors of this paper. The results of the comparative evaluation show that, given specific application goals, and given the availability of Italian/English resources, KNOWA obtains results that are comparable to or better than the results obtained with GIZA++.</Paragraph>
    <Paragraph position="6"> Section 2 describes the basic KNOWA algorithm. Sections 3 and 4 illustrate two enhanced versions of the KNOWA algorithm. Section 5 reports an experiment in which both KNOWA and GIZA++ are first applied to the alignment of a reference parallel corpus, EuroCor, and then to the MultiSemCor corpus. Section 6 adds some concluding remarks.</Paragraph>
  </Section>
</Paper>