File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/99/p99-1027_intro.xml
Size: 6,749 bytes
Last Modified: 2025-10-06 14:06:55
<?xml version="1.0" standalone="yes"?> <Paper uid="P99-1027"> <Title>Should we Translate the Documents or the Queries in Cross-language Information Retrieval?</Title> <Section position="2" start_page="0" end_page="208" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Should we translate the documents or the queries in cross-language information retrieval? The question is more subtle than the implied two alternatives. The need for translation has itself been questioned: although non-translation-based methods of cross-language information retrieval (CLIR), such as cognate-matching (Buckley et al., 1998) and cross-language Latent Semantic Indexing (Dumais et al., 1997), have been developed, the most common approaches have involved coupling information retrieval (IR) with machine translation (MT). (For convenience, we refer to dictionary-lookup techniques and interlingua (Diekema et al., 1999) as &quot;translation&quot; even if these techniques make no attempt to produce coherent or sensibly-ordered language; this distinction is important in other areas, but a stream of words is adequate for IR.) Translating the documents into the query's language(s) and translating the queries into the document's language(s) represent two extreme approaches to coupling MT and IR. These two approaches are neither equivalent nor mutually exclusive. They are not equivalent because machine translation is not an invertible operation. Query translation and document translation become equivalent only if each word in one language is translated into a unique word in the other language. In fact, machine translation tends to be a many-to-one mapping in the sense that finer shades of meaning are distinguishable in the original text than in the translated text. This effect is readily observed, for example, by machine translating the translated text back into the original language. These two approaches are not mutually exclusive, either. 
We find that a hybrid approach combining both directions of translation performs better than either direction alone. Thus our answer to the question posed by the title is both.</Paragraph> <Paragraph position="1"> Several arguments suggest that document translation should be competitive with or superior to query translation. First, MT is error-prone. Typical queries are short and may contain key words and phrases only once. When these are translated inappropriately, the IR engine has no chance to recover. Translating a long document offers the MT engine many more opportunities to translate key words and phrases. If only some of these are translated appropriately, the IR engine has at least a chance of matching these to query terms. Second, the tendency of MT engines to produce fewer distinct words than were contained in the original document (the output vocabulary is smaller than the input vocabulary) also indicates that machine translation should preferably be applied to the documents. Note the types of preprocessing in use by many monolingual IR engines: stemming (or morphological analysis) of documents and queries reduces the number of distinct words in the document index, while query expansion techniques increase the number of distinct words in the query.</Paragraph> <Paragraph position="2"> Query translation is probably the most common approach to CLIR. Since MT is frequently computationally expensive and the document sets in IR are large, query translation requires fewer computer resources than document translation. Indeed, it has been asserted that document translation is simply impractical for large-scale retrieval problems (Carbonell et al., 1997), or that document translation will only become practical in the future as computer speeds improve. 
In fact, we have developed fast MT algorithms (McCarley and Roukos, 1998) expressly designed for translating large collections of documents and queries in IR.</Paragraph> <Paragraph position="3"> Additionally, we have used them successfully on the TREC CLIR task (Franz et al., 1999). Commercially available MT systems have also been used in large-scale document translation experiments (Oard and Hackett, 1998). Previously, large-scale attempts to compare query translation and document translation approaches to CLIR (Oard, 1998) have suggested that document translation is preferable, but the results have been difficult to interpret. Note that in order to compare query translation and document translation, two different translation systems must be involved. For example, if queries are in English and documents are in French, then the query translation IR system must incorporate English-to-French translation, whereas the document translation IR system must incorporate French-to-English translation. Since familiar commercial MT systems are &quot;black box&quot; systems, the quality of translation is not known a priori. The present work avoids this difficulty by using statistical machine translation systems for both directions that are trained on the same training data using identical procedures. Our study of document translation is the largest comparative study of document and query translation of which we are currently aware. We also investigate both query and document translation for both translation directions within a language pair.</Paragraph> <Paragraph position="4"> We built and compared three information retrieval systems: one based on document translation, one based on query translation, and a hybrid system that used both translation directions. In fact, the &quot;score&quot; of a document in the hybrid system is simply the arithmetic mean of its scores in the query and document translation systems. We find that the hybrid system outperforms either one alone. 
Many different hybrid systems are possible because of a tradeoff between computer resources and translation quality.</Paragraph> <Paragraph position="5"> Given finite computer resources and a collection of documents much larger than the collection of queries, it might make sense to invest more computational resources into higher-quality query translation. We investigate this possibility in its limiting case: the quality of human translation exceeds that of MT; thus monolingual retrieval (queries and documents in the same language) represents the ultimate limit of query translation. Surprisingly, we find that the hybrid system involving fast document translation and monolingual retrieval continues to outperform monolingual retrieval. We thus conclude that the hybrid system of query and document translation will outperform a pure query translation system no matter how high the quality of the query translation.</Paragraph> </Section> </Paper>