<?xml version="1.0" standalone="yes"?> <Paper uid="H01-1033"> <Title>Improved Cross-Language Retrieval using Backoff Translation</Title> <Section position="3" start_page="3" end_page="3" type="metho"> <SectionTitle> 2. TRANSLATION LEXICONS </SectionTitle> <Paragraph position="0"> Our term-by-term translation technique (described below) requires a translation lexicon (henceforth tralex) in which each word f is associated with a ranked set fe</Paragraph> <Paragraph position="2"> g of translations. We used two translation lexicons in our experiments.</Paragraph> <Section position="1" start_page="3" end_page="3" type="sub_section"> <SectionTitle> 2.1 WebDict Tralex </SectionTitle> <Paragraph position="0"> We downloadeda freely available, manually constructedEnglish-French term list from the Web format. Since the WebDict translations appear in no particular order, we ranked the e i based on target language unigram statistics calculated over a large comparable corpus, the English portion of the Cross-LanguageEvaluation Forum (CLEF) collection, smoothed with statistics from the Brown corpus, a balanced corpus covering many genres of English. All single-word translations are ordered by decreasing unigram frequency, followed by all multi-word translations, and finally by any single-word entries not found in either corpus. This ordering has the effect of minimizing the effect of infrequent words in non-standard usages or of misspellings that sometimes appear in bilingual term lists.</Paragraph> </Section> <Section position="2" start_page="3" end_page="3" type="sub_section"> <SectionTitle> 2.2 STRAND Tralex </SectionTitle> <Paragraph position="0"> Our second lexical resource is a translation lexicon obtained fully automatically via analysisof parallel French-Englishdocuments from the Web. A collection of 3,378 document pairs was obtained using STRAND, our technique for mining the Web for bilingual text [7].</Paragraph> <Paragraph position="1"> These document pairs were aligned internally, using their HTML markup, to produce 63,094 aligned text &quot;chunks&quot; ranging in length from 2 to 30 words, 8 words on average per chunk, for a total of 500K words per side. Viterbi word-alignments for these paired chunks were obtained using the GIZA implementation of the IBM statistical translation models.</Paragraph> <Paragraph position="2"> An ordered set of translation pairs was obtained by treating each alignment link between words as a co-occurrence and scoring each word pair according to the likelihood ratio [2]. We then rank the translation alternatives in order of decreasing likelihood ratio score.</Paragraph> </Section> </Section> <Section position="4" start_page="3" end_page="3" type="metho"> <SectionTitle> 3. CLIR EXPERIMENTS </SectionTitle> <Paragraph position="0"> Ranked tralexes are particularly well suited to a simple ranked term-by-term translation approach. In our experiments, we use top-2 balanced document translation, in which we produce exactly two English terms for each French term. For terms with no known translation, the untranslated French term is generated twice (often appropriate for proper names). For French terms with one translation, that translation is generated twice. For French terms with two or more translations, we generate the first two translations in the tralex. 
<Paragraph position="1"> Benefits of the approach include simplicity and modularity -- notice that a lexicon containing ranked translations is the only requirement, and in particular that there is no need for access to the internals of the IR system or to the document collection in order to perform computations on term frequencies or weights. In addition, the approach is an effective one: in previous experiments we have found that this balanced translation strategy significantly outperforms the usual (unbalanced) technique of including all known translations [3].</Paragraph>
<Paragraph position="4"> We have also investigated the relationship between balanced translation and Pirkola's structured query formulation method [6].</Paragraph>
<Paragraph position="5"> For our experiments we used the CLEF-2000 French document collection (approximately 21 million words from articles in Le Monde).</Paragraph>
<Paragraph position="6"> Differences in the use of diacritics, case, and punctuation can inhibit matching between tralex entries and document terms, so we normalize the tralex and the documents by converting characters to lowercase and removing all diacritic marks and punctuation. We then translate the documents using the process described above, index the translated documents with the Inquery information retrieval system, and perform retrieval using "long" queries formulated by grouping all terms in the title, narrative, and description fields of each English topic description using Inquery's #sum operator. We report mean average precision on the 34 topics for which relevant French documents exist, based on the relevance judgments provided by CLEF.</Paragraph>
<Paragraph position="7"> We evaluated several strategies for using the WebDict and STRAND tralexes.</Paragraph>
<Section position="1" start_page="3" end_page="3" type="sub_section"> <SectionTitle> 3.1 WebDict Tralex </SectionTitle>
<Paragraph position="0"> Since a tralex may contain an eclectic mix of root forms and morphological variants, we use a four-stage backoff strategy to maximize coverage while limiting spurious translations (see the code sketch following this subsection): 1. Match the surface form of a document term to surface forms of French terms in the tralex.</Paragraph>
<Paragraph position="1"> 2. Match the stem of a document term to surface forms of French terms in the tralex.</Paragraph>
<Paragraph position="2"> 3. Match the surface form of a document term to stems of French terms in the tralex.</Paragraph>
<Paragraph position="3"> 4. Match the stem of a document term to stems of French terms in the tralex.</Paragraph>
<Paragraph position="4"> We used unsupervised induction of stemming rules based on the French collection to build the stemmer [5]. The process terminates as soon as a match is found at any stage, and the known translations for that match are generated. The process may produce an inappropriate morphological variant of a correct English translation, so we used Inquery's English kstem stemmer at indexing time to minimize the effect of that factor on retrieval effectiveness.</Paragraph> </Section>
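The four-stage control flow amounts to a chain of dictionary lookups. The following is a hypothetical Python sketch, not the authors' code: `stem_fr` stands in for the unsupervised French stemmer of [5], `build_stem_index` precomputes the stem-to-headword map needed by stages 3 and 4, and pooling translations across headwords that share a stem is our own assumption (the paper does not specify how such collisions are resolved):

```python
def build_stem_index(headwords, stem_fr):
    """Group tralex headwords by their stems (supports stages 3 and 4)."""
    index = {}
    for hw in headwords:
        index.setdefault(stem_fr(hw), []).append(hw)
    return index


def backoff_lookup(term, tralex, stem_index, stem_fr):
    """Four-stage backoff lookup; stops at the first stage that matches."""
    # Stage 1: surface form of the document term vs. tralex surface forms.
    if term in tralex:
        return tralex[term]
    stem = stem_fr(term)
    # Stage 2: stem of the document term vs. tralex surface forms.
    if stem in tralex:
        return tralex[stem]
    # Stage 3: surface form of the document term vs. tralex stems.
    if term in stem_index:
        # Pooling across headwords sharing the stem is our assumption.
        return [e for hw in stem_index[term] for e in tralex[hw]]
    # Stage 4: stem of the document term vs. tralex stems.
    if stem in stem_index:
        return [e for hw in stem_index[stem] for e in tralex[hw]]
    return []  # no match at any stage; the caller copies the French term
```

The ranked list this returns would then feed the balanced top-2 rule sketched earlier; an empty result corresponds to the case where the untranslated French term is generated twice.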
<Section position="2" start_page="3" end_page="3" type="sub_section"> <SectionTitle> 3.2 STRAND Tralex </SectionTitle>
<Paragraph position="0"> One limitation of a statistically derived tralex is that any term has some probability of aligning with any other term. Merely sorting translation alternatives in order of decreasing likelihood ratio will thus find some translation alternatives for every French term that appeared at least once in the set of parallel Web pages. To limit the introduction of spurious translations, we included only translation pairs with at least N co-occurrences in the set used to build the tralex. We performed runs with N = 1, 2, 3, using the four-stage backoff strategy described above.</Paragraph> </Section>
<Section position="3" start_page="3" end_page="3" type="sub_section"> <SectionTitle> 3.3 WebDict Merging using STRAND </SectionTitle>
<Paragraph position="0"> When two sources of evidence with different characteristics are available, a combination-of-evidence strategy can sometimes outperform either source alone. Our initial experiments indicated that the WebDict tralex was the better of the two (see below), so we adopted a reranking strategy in which the WebDict tralex was refined according to a voting strategy to which both the original WebDict and STRAND rankings contributed. We gave the top-ranked translation in each tralex a score of 100, the next a score of 99, and so on. We then summed the WebDict and STRAND scores for each translation, reranked the WebDict translations based on that sum, and then appended any STRAND-only translations for that French term. Thus, although both sources of evidence were weighted equally in the voting, STRAND-only evidence received lower precedence in the merged ranking. For French terms that appeared in only one tralex, we included those entries unchanged in the merged tralex. In this experiment run we used a threshold of N = 1, and applied the four-stage backoff strategy described above to the merged resource.</Paragraph> </Section>
<Section position="4" start_page="3" end_page="3" type="sub_section"> <SectionTitle> 3.4 WebDict Backoff to STRAND </SectionTitle>
<Paragraph position="0"> A possible weakness of our merging strategy is that inflected forms are more common in our STRAND tralex, while root forms are more common in our WebDict tralex. STRAND tralex entries that were copied unchanged into the merged tralex thus often matched in step 1 of the four-stage backoff strategy, preventing WebDict contributions from being used. With the WebDict tralex outperforming the STRAND tralex, this factor could hurt our results. As an alternative to merging, therefore, we also tried a simple backoff strategy in which we used the original WebDict tralex with the four-stage backoff strategy described above, to which we added a fifth stage in the event that fewer than two WebDict tralex matches were found: 5. Match the surface form of a document term to surface forms of French terms in the STRAND tralex.</Paragraph>
<Paragraph position="1"> We used a threshold of N = 2 for this experiment run.</Paragraph> </Section> </Section> </Paper>
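To make the reranking of Section 3.3 concrete, here is a hypothetical Python sketch of the vote-and-rerank merge as we read it (the paper publishes no code, and tie-breaking among equal vote sums is our assumption, resolved here by a stable sort in WebDict order):

```python
def merge_tralexes(webdict, strand):
    """Rank-based voting merge of two tralexes (sketch of Section 3.3).

    Both arguments map a French term to a ranked list of English
    translations. The top-ranked translation in each tralex scores 100,
    the next 99, and so on; WebDict translations are reranked by the
    summed scores, and STRAND-only translations are appended afterwards.
    """
    merged = {}
    for term in set(webdict) | set(strand):
        web = webdict.get(term)
        stra = strand.get(term)
        if web is None or stra is None:
            # Terms appearing in only one tralex are copied unchanged.
            merged[term] = list(web if web is not None else stra)
            continue
        web_score = {e: 100 - i for i, e in enumerate(web)}
        stra_score = {e: 100 - i for i, e in enumerate(stra)}
        # Rerank WebDict translations by summed votes; the stable sort
        # keeps WebDict order on ties (our assumption).
        reranked = sorted(
            web, key=lambda e: -(web_score[e] + stra_score.get(e, 0)))
        # STRAND-only translations get lower precedence: appended last.
        reranked += [e for e in stra if e not in web_score]
        merged[term] = reranked
    return merged
```

The backoff variant of Section 3.4 needs no merged resource at all: after the four WebDict stages, a fifth surface-form lookup into the STRAND tralex supplies translations only when fewer than two WebDict matches were found.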