<?xml version="1.0" standalone="yes"?>
<Paper uid="W98-1103">
  <Title>Using a Probabilistic Translation Model for Cross-Language Information Retrieval</Title>
  <Section position="2" start_page="0" end_page="18" type="intro">
    <SectionTitle>
1. Introduction
</SectionTitle>
    <Paragraph position="0"> Adequate text processing systems have become widely available for most natural languages. While English remains the dominant language on the Internet, the relative share of other languages now appears to be on the rise. The network has become truly multilingual. This situation has created an acute need for tools capable of performing language-sensitive search in multilingual databases. In particular, there is a need for tools capable of performing cross-language information retrieval (CLIR), that is, of matching an information query written in one particular language with documents that may be written in one or several different languages.</Paragraph>
    <Paragraph position="1"> Given such a need, the solution that immediately comes to mind is to translate the information query using a machine translation (MT) system, and to feed the resulting translation into a classical monolingual IR system.</Paragraph>
    <Paragraph position="2"> However, it should be stressed that MT and IR have widely divergent concerns. First, observe that MT systems are expected to produce syntactically correct translations, and that they tend to spend a lot of effort trying to attain that rather elusive goal. On the other hand, current IR systems tend not to care about grammar: for them, texts are mostly viewed as vectors of content words. Second, note that MT systems are expected to select one of the many translations that words may have. For example, in translating the English word &amp;quot;organic&amp;quot; the MT process will be led to select between the French words &amp;quot;organique&amp;quot; and &amp;quot;biologique&amp;quot;. Generally speaking, this selection process is very difficult, and MT systems often end up selecting the wrong target language equivalent. Here again, what the MT system is expected to do turns out to be unnecessary and maybe undesirable from an IR point of view. As a case in point, classical IR systems often perform a query expansion process by which certain query terms/words are mapped onto several equivalent or related index terms. Not surprisingly, such a process could well make provision for mapping the query word &amp;quot;organique&amp;quot; onto the two index terms &amp;quot;organique&amp;quot; and &amp;quot;biologique&amp;quot; so as to account for (partial) synonymy between these words.</Paragraph>
    <Paragraph position="3"> In other words, MT systems attempt to systematically eradicate translational ambiguity instead of taking advantage of it to capture synonymy relations.</Paragraph>
    <Paragraph position="4"> At the opposite end of the spectrum, MT can be replaced with a simple bilingual dictionary lookup. To that end, one can use either an ordinary general-purpose dictionary, a technical terminology database, or both.</Paragraph>
    <Paragraph position="5"> Because in any sizable dictionary most words receive many translations, the dictionary approach will in effect subject the query to a rather massive expansion process. The resulting target language query is likely to engender a lot of noise (irrelevant documents that get retrieved), mostly due to the fact that in each dictionary entry some of the translations can correspond to different meanings of the source language word. For example, the English word &amp;quot;drug&amp;quot; is translated in French as &amp;quot;drogue&amp;quot; (an illegal substance) or as &amp;quot;médicament&amp;quot; (a legal medicine) depending on the context. There is most often no explicit clue in the query that would allow one to choose the appropriate meaning.</Paragraph>
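To illustrate why dictionary lookup amounts to massive, unweighted expansion, here is a minimal sketch; the toy dictionary entries below are invented for illustration and do not come from any real resource:

```python
# Toy bilingual dictionary, invented for illustration only.
# Entries with multiple senses yield multiple translations.
TOY_DICT = {
    "drug": ["drogue", "medicament"],        # illegal substance vs. legal medicine
    "organic": ["organique", "biologique"],  # chemical vs. pesticide-free sense
}

def translate_query(query_terms, bilingual_dict):
    """Replace each source term by *all* of its listed translations.

    With no way to pick the right sense, every translation is kept,
    including ones from the wrong meaning -- the source of the noise
    described above."""
    target_terms = []
    for term in query_terms:
        # Unknown words are passed through unchanged.
        target_terms.extend(bilingual_dict.get(term, [term]))
    return target_terms
```

For the query ["organic", "drug"], this yields four target terms, two of which may belong to meanings the user never intended.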
    <Paragraph position="6"> Yet another approach is to determine translational equivalence automatically, on the basis of a corpus of parallel texts (that is, a corpus made up of source texts and their translations). One way of doing this is to start by establishing translation correspondences between units larger than words, typically sentences. There are now well-known methods for aligning the sentences of parallel corpora (Gale &amp; Church \[6\], Simard, Foster &amp; Isabelle \[10\]). Then, the translational equivalence of a given pair of words can be estimated by their degree of co-occurrence in parallel sentences. Compared to the previous approaches, this has the following advantages: - There is no need to acquire or to compile a bilingual dictionary or a complete MT system.</Paragraph>
    <Paragraph position="7"> - Word translations are made sensitive to the domain, as embodied by the training corpus.</Paragraph>
    <Paragraph position="8"> - As we will see below, it is relatively easy to obtain a suitable degree of query expansion based on translational ambiguity.</Paragraph>
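The co-occurrence idea can be sketched as follows, assuming sentence-aligned pairs are already available (the alignment step itself, e.g. Gale &amp; Church's method, is not shown); the simple conditional relative frequency used here is one plausible estimator, not necessarily the one the paper uses:

```python
from collections import Counter

def cooccurrence_counts(aligned_pairs):
    """Count how often a source word e and a target word f appear
    in the same aligned sentence pair; frequent co-occurrence is
    taken as evidence of translational equivalence."""
    pair_counts = Counter()  # (e, f) -> number of pairs containing both
    src_counts = Counter()   # e -> number of pairs containing e
    for src_sentence, tgt_sentence in aligned_pairs:
        src_words = set(src_sentence.split())
        tgt_words = set(tgt_sentence.split())
        for e in src_words:
            src_counts[e] += 1
            for f in tgt_words:
                pair_counts[(e, f)] += 1
    return pair_counts, src_counts

def equivalence_score(e, f, pair_counts, src_counts):
    """Relative frequency of f co-occurring with e, as a crude
    estimate of translational equivalence."""
    if src_counts[e] == 0:
        return 0.0
    return pair_counts[(e, f)] / src_counts[e]
```

On a toy corpus where "drug" and "drogue" always appear in the same aligned pairs, their score approaches 1, while unrelated word pairs score near 0.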
    <Paragraph position="9"> In the next section, we describe the structure of a probabilistic translation model that can calculate p(f_j|e), the probability of observing word f_j as part of the translation of sentence e. Given a query e, we can then select the n best-scoring words f_j as the set of index terms in the target language. This method will be compared to the other two mentioned above.</Paragraph>
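The selection step described above can be sketched as follows, assuming a translation probability table has already been trained; the per-word table and its probabilities below are invented for illustration, and summing per-word probabilities is a simplification of scoring f_j against the whole query e:

```python
def best_index_terms(query_words, trans_probs, n):
    """Score each candidate target word f by accumulating its
    translation probability over the query words e, then keep the
    n best-scoring f as target-language index terms."""
    scores = {}
    for e in query_words:
        for f, p in trans_probs.get(e, {}).items():
            scores[f] = scores.get(f, 0.0) + p
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:n]

# Toy probability table, invented for illustration only.
TOY_PROBS = {
    "organic": {"organique": 0.6, "biologique": 0.4},
    "drug": {"drogue": 0.5, "medicament": 0.5},
}
```

Note that choosing n larger than 1 per query word retains several translations of an ambiguous term, which is exactly the controlled query expansion the previous paragraphs argue for.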
  </Section>
</Paper>