<?xml version="1.0" standalone="yes"?> <Paper uid="C88-2084"> <Title>Lexical Transfer: Between a Source Rock and a Hard Target</Title> <Section position="3" start_page="411" end_page="411" type="metho"> <SectionTitle> 3. TUNING DISTORTION </SectionTitle> <Paragraph position="0"> A good methodology for testing lexical transfer must avoid the trap of &quot;tuning distortion&quot;. Tuning distortion refers to the misleading (distorted) results obtained from a machine translation system when its dictionaries and algorithms are adjusted (tuned) to a particular text. Almost any machine translation system can produce brilliant results when the same text is run through it again and again with successive tuning. The power of tuning is well-known and has been given a name in AI research, namely, defining a microworld. Corresponding to this power is the well-known difficulty of expanding a microworld system to function intelligently in a macroworld.</Paragraph> <Paragraph position="1"> In a machine translation system, difficulties arise when a tuned system is applied to a new text.</Paragraph> </Section> <Section position="4" start_page="411" end_page="411" type="metho"> <SectionTitle> 4. THE WORD LIST APPROACH </SectionTitle> <Paragraph position="0"> To avoid tuning distortion in a test of lexical transfer, one can build a dictionary from a word list without knowing what text will be supplied later, except that it will consist of words from the word list. This approach has significant advantages over supplying an arbitrary text and upgrading the dictionaries to handle it, because so long as the text is available there is conscious or unconscious tuning of the dictionary entries during the upgrade process.</Paragraph> <Paragraph position="1"> In the word list approach, all the words of the text are combined with a number of misleading words which make it difficult to tell what the subject field of the text is. The combined words are then sorted into alphabetical order and reduced to their basic forms (a sketch of this preparation step appears at the end of this section). The alphabetical word list is supplied to the machine translation dictionary updaters and the dictionaries are stabilized.</Paragraph> <Paragraph position="2"> Then the text is provided and immediately translated without any updates to the system and without any words missing from the dictionaries.</Paragraph> <Paragraph position="3"> If one argues that this method forces the dictionary updaters to consider too many possible collocations of each word in the list, one is simply complaining about the difficulty of handling real text. At least this method allows realistic testing of a system BEFORE its dictionaries have reached full size. If there is a problem in the system design, it is better to find out with dictionaries of one thousand words and all their collocations than after the dictionaries contain thirty thousand words.</Paragraph>
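The preparation step sketched above can be made concrete with a short example. The following Python sketch is purely illustrative (the paper does not describe any particular implementation, and a real test would use a proper lemmatizer rather than the crude suffix-stripping shown here): it combines the content words of a test text with misleading distractor words, reduces them to basic forms, and emits a sorted, deduplicated word list that hides the subject field of the text.

```python
# Illustrative sketch of the word-list preparation described above
# (hypothetical; the actual DLT test used its own tools).

def base_form(word: str) -> str:
    """Crude reduction to a basic form: lowercase and strip a few common
    English inflectional endings. A real system would use a proper lemmatizer."""
    w = word.lower()
    for suffix in ("ing", "ed", "es", "s"):
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w

def build_word_list(text_words, misleading_words):
    """Combine the text's content words with misleading (distractor) words,
    reduce everything to basic forms, and return a sorted, deduplicated list."""
    combined = list(text_words) + list(misleading_words)
    return sorted({base_form(w) for w in combined})

if __name__ == "__main__":
    content = ["hardware", "sheets", "scheduling", "benefits", "fields"]
    distractors = ["weapons", "concerts", "terrain", "anxiety", "bedding"]
    print(build_word_list(content, distractors))
```

The dictionary updaters would see only the resulting mixed list, never the text itself, which is what prevents the tuning distortion described in Section 3.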
</Section> <Section position="5" start_page="411" end_page="412" type="metho"> <SectionTitle> 5. SOME RESULTS FROM THE DLT TEST </SectionTitle> <Paragraph position="0"> The DLT machine translation project is a venture of the BSO company in Utrecht. The word list approach was used to test its lexical transfer phase even before the syntactic analysis phase was complete. This was done by manually analyzing the test sentences. The four test passages included over 2000 word tokens, which reduced to about 600 content word types, to which were added about 200 misleading words. The resulting word list of about 800 words was used to build dictionaries containing thousands of entries.</Paragraph> <Paragraph position="1"> After the texts were translated by the DLT system from English to French (during the first quarter of 1987), they were compared with official versions of the texts prepared by professional human translators at the CEC. This comparison revealed that many words matched the official language versions, some were acceptable synonyms and, as expected, some words were translated inappropriately. The DLT project is to be congratulated on the overall success of the experiment. The problem words discussed in this paper are not intended simply as criticism of DLT, but rather as observations that may be of interest to all machine translation researchers. Some inappropriate translations could easily be corrected by detecting predictable collocations.</Paragraph> <Paragraph position="2"> In the DLT test, the collocation software was not operational. For example, computer-assisted requires a particular translation of assisted (a small sketch of such collocation-driven selection appears below).</Paragraph> <Paragraph position="3"> Another problem is bring, which can sometimes be translated as faire venir but which is normally translated as prendre in the context of the expression bring x to y's consciousness. This requires syntactic transfer of a type the DLT project calls metataxis, which was not implemented for the test. In a recent issue of Language Monthly (December 1987, p. 7), it was reported that Peter Lau, of the Eurotra project, said at the 1987 Aslib conference that the real problem of machine translation is not the &quot;reduction of structural differences&quot; but rather the &quot;disambiguation of lexical entries&quot;. The DLT test focused on such lexical transfer problems.</Paragraph> <Paragraph position="4"> Some words of interest from the test are: hardware, area, sheet, practice, giving, perform, produce, schedule, concern, field, application, induced, lead, benefit, covers, bachelor, courses, duty, and form.</Paragraph> <Paragraph position="5"> For each of the above words, the DLT system produced a translation which was not appropriate to the context. These were not the only mistakes, but on the other hand, the DLT system translated the majority of the words (60 percent) acceptably, while a fourth (25 percent) were problems for one reason or another, with 15 percent in the gray area between acceptability and unacceptability.</Paragraph>
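To make the collocation point above concrete, the following sketch shows one way collocation rules could drive lexical selection for words such as assisted, hardware, and bring. The rules, context window, and French glosses are illustrative assumptions only; they are not the DLT dictionaries or its (unimplemented) collocation software.

```python
# Illustrative sketch of collocation-driven lexical selection
# (hypothetical rules and glosses; not the actual DLT dictionaries).

# Default (context-free) translations for a few of the problem words.
DEFAULT = {
    "assisted": "aidé",
    "hardware": "quincaillerie",   # tools / ironmongery sense
    "bring": "faire venir",
}

# Collocation rules: (word, trigger word in context) -> preferred translation.
COLLOCATIONS = {
    ("assisted", "computer"): "assisté",       # computer-assisted -> assisté par ordinateur
    ("hardware", "computer"): "matériel",      # computer hardware -> matériel
    ("bring", "consciousness"): "prendre",     # bring x to y's consciousness
}

def translate(word: str, context: list[str]) -> str:
    """Pick a translation for `word`, preferring a collocation rule whose
    trigger word appears in the surrounding context window."""
    for neighbour in context:
        rule = COLLOCATIONS.get((word, neighbour))
        if rule:
            return rule
    return DEFAULT.get(word, word)

print(translate("assisted", ["computer", "design"]))   # assisté
print(translate("hardware", ["store", "tools"]))       # quincaillerie
```

Such table lookup handles only the predictable collocations; as the examples below show, many of the test words require distinctions that no neighbouring word reliably signals.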
<Paragraph position="6"> The reader is invited to consider how these words would be handled in his or her own system, be it a machine translation, content analysis, or other natural language processing system. How would the proper distinctions be made, or an appropriate translation for these words be found, without being tuned to a particular text or sublanguage? Not surprisingly, the word hardware needs a special translation when referring to computer hardware. But in today's technical documents, there can be reference to computers but also to hardware in a more general sense, or in reference to tools or weapons. How can the appropriate selection be made without an enormous world model and a system which truly understands the text? (Shades of Bar-Hillel.) Another example is the word area, which can be translated région or partie. However, these two options are not interchangeable, and the distinction is subtle and not dependent on predictable collocations. A sheet can be a drap (on a bed), a feuille (of paper), or a lame, depending on context. Unfortunately for lexical transfer, the word sheet will not always be followed by a prepositional phrase indicating the composition of the sheet.</Paragraph> <Paragraph position="7"> A practice can be what a medical doctor does when treating people, what a musician does to get ready for a concert, or what is normally done in some endeavor. These three senses may be translated differently.</Paragraph> <Paragraph position="8"> The verb form giving can refer to a transfer of an object to someone or to a result (&quot;one plus three gives four&quot;). To perform can refer to one's normal duty or to a stage performance, and may be translated differently. Likewise, to produce can translate differently depending on whether one is talking about a play or a factory.</Paragraph> <Paragraph position="9"> The reader can use a standard dictionary to see the difficulties in the following words: schedule (timetable or price list), concern (interest or anxiety), field (literal area of terrain or figurative field of interest), application (treatment or level of effort), induced (social or electromagnetic pressure), lead (a wire or a sales contact), benefit (advantage or government payment), cover (lid or abstract limit), duty (obligation or import tax), and course (path or academic class).</Paragraph> <Paragraph position="10"> Two of the words in the list involve an element of poetic justice. Katz and Fodor distinguished the academic degree and unmarried man readings of bachelor with markers, but did not tell DLT how to distinguish between them when the word is encountered in text. And the translation of form depends on its content.</Paragraph> </Section> <Section position="6" start_page="412" end_page="413" type="metho"> <SectionTitle> 6. THE BILINGUAL DATA BASE </SectionTitle> <Paragraph position="0"> Preliminary to the DLT test, a corpus of texts was gathered to assist in dictionary development. A portion of the corpus was kept secret and the test passages were chosen from this portion.</Paragraph> <Paragraph position="1"> A larger portion was made available to the DLT project for lexical studies.</Paragraph> <Paragraph position="2"> The Waterloo concordance system was used to generate KWIC listings for the lexicographers to observe the various uses of words in actual texts.</Paragraph> <Paragraph position="3"> The bilingual data base used in the test was derived from public domain documents of the CEC and the United Nations, to avoid copyright problems.</Paragraph> <Paragraph position="4"> It consists of twenty documents, each with an English and a French version, on subjects ranging from migrant workers to the ESA Spacelab to the automobile industry to agriculture. The documents were first scanned using a Kurzweil OCR device. Then the disk files were hand-edited into 400 small synchronized files, 200 English and 200 French, representing a total of about eight megabytes of data (i.e. over one million words). As of this writing, the small files are being further proofed against the original documents, and the paragraphs or other logical units of the texts are being synchronized by editing in segment number marks. These marks are used by a simple preprocessing program to produce synchronized two-column bilingual output for indexing by WordCruncher, a new dynamic concordance system which has become available since the project began. This two-column format will facilitate the study of all occurrences of words or expressions, with the other-language segment automatically displayed, allowing the researcher to see quickly how that word or expression was translated in the corpus.</Paragraph>
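As an illustration of this kind of preprocessing, the following sketch pairs segment-marked English and French files and prints synchronized two-column output. The file names and the [[n]] segment-mark syntax are assumptions made for the example; the paper does not specify the actual mark format or the preprocessing program used with WordCruncher.

```python
# Hypothetical sketch of pairing segment-marked parallel files into
# two-column output; the actual mark syntax and preprocessing program
# used for WordCruncher are not described in the paper.
import re
import textwrap

MARK = re.compile(r"^\[\[(\d+)\]\]\s*")   # assumed mark format: [[42]] text...

def read_segments(path):
    """Return {segment_number: text} for a segment-marked file."""
    segments = {}
    current = None
    for line in open(path, encoding="latin-1"):
        m = MARK.match(line)
        if m:
            current = int(m.group(1))
            segments[current] = MARK.sub("", line).strip()
        elif current is not None:
            segments[current] += " " + line.strip()
    return segments

def two_column(en_path, fr_path, width=38):
    """Print English and French versions of each shared segment side by side."""
    en, fr = read_segments(en_path), read_segments(fr_path)
    for n in sorted(en.keys() & fr.keys()):
        left = textwrap.wrap(en[n], width) or [""]
        right = textwrap.wrap(fr[n], width) or [""]
        for l, r in zip(left + [""] * (len(right) - len(left)),
                        right + [""] * (len(left) - len(right))):
            print(f"{l:<{width}}  {r}")
        print()

# two_column("doc01.en.txt", "doc01.fr.txt")
```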
<Paragraph position="5"> The edited corpus is available, with the permission of BSO, for a modest fee to qualified researchers. It is hoped that the corpus, the word list methodology, and the results of the DLT test will be of use to others in the machine translation community.</Paragraph> </Section> </Paper>