<?xml version="1.0" standalone="yes"?> <Paper uid="W93-0301"> <Title>Robust Bilingual Word Alignment for Machine Aided Translation</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Aligning parallel texts has recently received considerable attention (Warwick et al., 1990; Brown et al., 1991a; Gale and Church, 1991b; Gale and Church, 1991a; Kay and Rosenschein, 1993; Simard et al., 1992; Church, 1993; Kupiec, 1993; Matsumoto et al., 1993). These methods have been used in machine translation (Brown et al., 1990; Sadler, 1989), terminology research and translation aids (Isabelle, 1992; Ogden and Gonzales, 1993), bilingual lexicography (Klavans and Tzoukermann, 1990), collocation studies (Smadja, 1992), word-sense disambiguation (Brown et al., 1991b; Gale et al., 1992) and information retrieval in a multilingual environment (Landauer and Littman, 1990).</Paragraph> <Paragraph position="1"> The information retrieval application may be of particular relevance to this audience. It would be highly desirable for users to be able to express queries in whatever language they choose and retrieve documents that may or may not have been written in the same language as the query. Landauer and Littman used SVD analysis (or Latent Semantic Indexing) on the Canadian Hansards, parliamentary debates that are published in both English and French, in order to estimate a kind of soft thesaurus. They then showed that these estimates could be used to retrieve documents appropriately in the bilingual condition where the query and the document were written in different languages. We have been most interested in the terminology application. How does Microsoft, or some other software vendor, want &quot;dialog box,&quot; &quot;text box,&quot; and &quot;menu box&quot; to be translated in their manuals?
Considerable time is spent on terminology questions, many of which have already been solved by other translators working on similar texts. It ought to be possible for a translator to point at an instance of &quot;dialog box&quot; in the English version of the Microsoft Windows manual and see how it was translated in the French version of the same manual. Alternatively, the translator can ask for a bilingual concordance as shown in Figure 1. A PC-based terminology reuse tool is being developed to do exactly this. The tool depends crucially on the results of an alignment program to determine which parts of the source text correspond with which parts of the target text.</Paragraph> <Paragraph position="2"> In working with the translators at AT&amp;T Language Line Services, a commercial translation service, we discovered that we needed to completely redesign our alignment programs in order to deal more effectively with texts supplied by Language Line's customers. All too often the texts are not available in electronic form, and may need to be scanned in and processed by an OCR (optical character recognition) device. Even if the texts are available in electronic form, it may not be worth the effort to clean them up by hand. Real texts are not like the Hansards; real texts are much smaller and not nearly as clean as the ideal texts that have been used in previous studies.</Paragraph> <Paragraph position="5"> Figure 1: Bilingual concordances for the word &quot;box&quot; are shown, identifying three different translations for the word: boite, case, zone. The concordances are selected from English and French versions of the Microsoft Windows manual (with some errors introduced by OCR). There are three lines of text for each instance of &quot;box&quot;: (1) English, (2) glosses, and (3) French. The glosses are selected from the French text (the third line), and are written underneath the corresponding English words, as identified by word_align.</Paragraph> <Paragraph position="6"> To deal with these robustness issues, Church (1993) developed a character-based alignment method called char_align. The method was intended as a replacement for sentence-based methods (e.g., Brown et al., 1991a; Gale and Church, 1991b; Kay and Rosenschein, 1993), which are very sensitive to noise. This paper describes a new program, called word_align, that starts with an initial &quot;rough&quot; alignment (e.g., the output of char_align or a sentence-based alignment method), and produces improved alignments by exploiting constraints at the word level.
The alignment algorithm consists of two steps: (1) estimate translation probabilities, and (2) use these probabilities to search for the most probable alignment path. The two steps are described in the following section.</Paragraph> </Section> </Paper>