<?xml version="1.0" standalone="yes"?> <Paper uid="P06-2095"> <Title>Using comparable corpora to solve problems difficult for human translators</Title> <Section position="3" start_page="0" end_page="739" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> There is no doubt that both professional and trainee translators need access to authentic data provided by corpora. With respect to polysemous lexical items, bilingual dictionaries list several translation equivalents for a headword, but words taken in context can be translated in many more ways than dictionaries indicate. For instance, the Oxford Russian Dictionary (ORD) lacks a translation for the Russian expression исчерпывающий ответ ('comprehensive answer'), while the Multitran Russian-English dictionary suggests that it can be translated as irrefragable answer. Yet this expression is extremely rare in English; on the Internet it occurs mostly in pages produced by Russian speakers.</Paragraph> <Paragraph position="1"> On the other hand, translations for polysemous words are too numerous to be listed for all possible contexts. For example, the entry for strong in ORD already has 57 subentries, and yet it fails to mention many word combinations frequent in the British National Corpus (BNC), such as strong {feeling, field, opposition, sense, voice}. Strong voice is also not listed in the Oxford French, German or Spanish Dictionaries.</Paragraph> <Paragraph position="2"> There has been surprisingly little research on computational methods for finding translation equivalents of words from the general lexicon.</Paragraph> <Paragraph position="3"> Practically all previous studies have concerned the detection of terminological equivalence. For instance, the Termight project at AT&amp;T aimed to develop a tool for semi-automatic acquisition of termbanks in the computer science domain (Dagan and Church, 1997).
There was also a study concerning the use of multilingual webpages to develop bilingual lexicons and termbanks (Grefenstette, 2002). However, neither study addressed translations of words from the general lexicon. Yet translators often have more difficulty with such general expressions, because their polysemy is reflected differently in the target language, which makes their translation dependent on context. Such variation is often not captured by dictionaries.</Paragraph> <Paragraph position="4"> Because of their importance, words from the general lexicon are studied by translation researchers, and comparable corpora are increasingly used in translation practice and training (Varantola, 2003). However, such studies are mostly confined to lexicographic exercises, which compare the contexts and functions of potential translation equivalents once they are known, for instance, absolutely vs. assolutamente in Italian (Partington, 1998). Such studies do not provide a computational model for finding appropriate translation equivalents for expressions that are not listed in dictionaries or whose listed translations are inadequate.</Paragraph> <Paragraph position="5"> Parallel corpora, consisting of original texts and their exact translations, provide a useful supplement to decontextualised translation equivalents listed in dictionaries. However, parallel corpora are not representative. Many of them contain only a few million words, which is simply too small to account for variations in the translation of moderately frequent words. Larger ones, such as the Europarl corpus, are restricted in their domain. For instance, all 14 instances of strong voice in the English section of Europarl are used in the sense of 'the opinion of a political institution'.
By contrast, the BNC contains 46 instances of strong voice, covering several different meanings.</Paragraph> <Paragraph position="6"> In this paper we propose a computational method for using comparable corpora to find translation equivalents for source language expressions that are considered difficult by trainee or professional translators. The model is based on detecting frequent multi-word expressions (MWEs) in the source and target languages and finding a mapping between them in comparable monolingual corpora, which are designed in a similar way in the two languages.</Paragraph> <Paragraph position="7"> This methodology is implemented in ASSIST, a tool that helps translators find solutions for difficult translation problems. The tool presents the results as lists of translation suggestions (usually 50 to 100 items) ordered alphabetically or by their frequency in target language corpora. Translators can skim through these lists and identify the example that is most appropriate in a given context.</Paragraph> <Paragraph position="8"> In the following sections we outline our approach, evaluate the output of the prototype of ASSIST and discuss future work.</Paragraph> </Section> </Paper>