<?xml version="1.0" standalone="yes"?> <Paper uid="A97-1050"> <Title>Semi-Automatic Acquisition of Domain-Specific Translation Lexicons</Title> <Section position="4" start_page="340" end_page="343" type="intro"> <SectionTitle> 2 SABLE </SectionTitle> <Paragraph position="0"> SABLE (Scalable Architecture for Bilingual LExicography) is a turn-key system for producing clean broad-coverage translation lexicons from raw, unaligned parallel texts (bitexts). Its design is modular and minimizes the need for language-specific components, with no dependence on genre or word order similarity, nor sentence boundaries or other &quot;anchors&quot; in the input.</Paragraph> <Paragraph position="1"> SABLE was designed with the following features in mind: * Independence from linguistic resources: SABLE does not rely on any language-specific resources other than tokenizers and a heuristic for identifying word pairs that are mutual translations, though users can easily reconfigure the system to take advantage of such resources as language-specific stemmers, part-of-speech taggers, and stop lists when they are available.</Paragraph> <Paragraph position="2"> * Black box functionality: Automatic acquisition of translation lexicons requires only that the user provide the input bitexts and identify the two languages involved.</Paragraph> <Paragraph position="3"> Robustness: The system performs well even in the face of omissions or inversions in transla- null tions.</Paragraph> <Paragraph position="4"> * Scalability: SABLE has been used successfully on input bitexts larger than 130MB.</Paragraph> <Paragraph position="5"> * Portability: SABLE was initially implemented for French/English, then ported to Spanish/English and to Korean/English. The porting process has been standardized and documented (Melamed, 1996c).</Paragraph> <Paragraph position="6"> The following is a brief description of SABLE's main components. A more detailed description of the entire system is available in (Melamed, 1997).</Paragraph> <Section position="1" start_page="340" end_page="341" type="sub_section"> <SectionTitle> 2.1 Mapping Bitext Correspondence </SectionTitle> <Paragraph position="0"> After both halves of the input bitext(s) have been tokenized, SABLE invokes the Smooth Injective Map Recognizer (SIMR) algorithm (Melamed, 1996a) and related components to produce a bitext map. A bi-text map is an injective partial function between the character positions in the two halves of the bitext.</Paragraph> <Paragraph position="1"> Each point of correspondence (x,y) in the bitext map indicates that the word centered around character position x in the first half of the bitext is a translation of the word centered around character position y in the second half. SIMR produces bitext maps a few points at a time, by interleaving a point generation phase and a point selection phase.</Paragraph> <Paragraph position="2"> SIMR is equipped with several &quot;plug-in&quot; matching heuristic modules which are based on cognates (Davis et al., 1995; Simard et al., 1992; Melamed, 1995) and/or &quot;seed&quot; translation lexicons (Chen, 1993). Correspondence points are generated using a subset of these matching heuristics; the particular subset depends on the language pair and the available resources. The matching heuristics all work at the word level, which is a happy medium between larger text units like sentences and smaller text units like character n-grams. 
<Paragraph position="3"> SIMR filters candidate points of correspondence using a geometric pattern recognition algorithm.</Paragraph> <Paragraph position="4"> The recognized patterns may contain non-monotonic sequences of points of correspondence, to account for word order differences between languages. The filtering algorithm can be efficiently interleaved with the point generation algorithm so that SIMR runs in linear time and space with respect to the size of the input bitext.</Paragraph>
[Figure 1 (caption fragment): points of correspondence that lie between the dashed boundaries count as cooccurrences.]
</Section> <Section position="2" start_page="341" end_page="341" type="sub_section"> <SectionTitle> 2.2 Translation Lexicon Extraction </SectionTitle> <Paragraph position="0"> Since bitext maps can represent crossing correspondences, they are more general than &quot;alignments&quot; (Melamed, 1996a). For the same reason, bitext maps allow a more general definition of token co-occurrence. Early efforts at extracting translation lexicons from bitexts deemed two tokens to co-occur if they occurred in aligned sentence pairs (Gale and Church, 1991). SABLE counts two tokens as co-occurring if their point of correspondence lies within a short distance δ of the interpolated bitext map in the bitext space, as illustrated in Figure 1. To ensure that interpolation is well-defined, minimal sets of non-monotonic points of correspondence are replaced by the lower left and upper right corners of their minimum enclosing rectangles (MERs).</Paragraph> <Paragraph position="1"> SABLE uses token co-occurrence statistics to induce an initial translation lexicon, using the method described in (Melamed, 1995). The iterative filtering module then alternates between estimating the most likely translations among word tokens in the bitext and estimating the most likely translations between word types. This re-estimation paradigm was pioneered by Brown et al. (1993). However, their models were not designed for human inspection, and though some have tried, it is not clear how to extract translation lexicons from their models (Wu and Xia, 1995). In contrast, SABLE automatically constructs an explicit translation lexicon, consisting of word type pairs that are not filtered out during the re-estimation cycle.</Paragraph>
[Figure 2 (caption fragment): &quot;... SABLE exhibit plateaus of likelihood.&quot;]
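As a concrete illustration of the cooccurrence criterion above, the sketch below checks whether a candidate point lies within distance δ of the bitext map interpolated between adjacent points. It assumes the points have already been made monotonic by the MER substitution, and it measures vertical distance in the bitext space for simplicity; the names and interface are assumptions, not SABLE's code.

    import bisect

    def cooccurs(map_points, x, y, delta):
        # map_points: monotonic (x, y) correspondence points, sorted by x.
        # Returns True if (x, y) lies within vertical distance delta of the
        # linearly interpolated bitext map. Sketch only.
        xs = [px for px, _ in map_points]
        i = bisect.bisect_left(xs, x)
        if i == 0 or i == len(map_points):
            return False  # outside the mapped region; boundary cases ignored
        (x0, y0), (x1, y1) = map_points[i - 1], map_points[i]
        y_interp = y0 + (y1 - y0) * (x - x0) / (x1 - x0)
        return abs(y - y_interp) <= delta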
<Paragraph position="2"> Neither of the translation lexicon construction modules pays any attention to word order, so they work equally well for language pairs with different word order.</Paragraph> </Section> <Section position="3" start_page="341" end_page="341" type="sub_section"> <SectionTitle> 2.3 Thresholding </SectionTitle> <Paragraph position="0"> Translation lexicon recall can be automatically computed with respect to the input bitext (Melamed, 1996b), so SABLE users have the option of specifying the recall they desire in the output. As always, there is a tradeoff between recall and precision; by default, SABLE will choose a likelihood threshold that is known to produce reasonably high precision.</Paragraph> <Paragraph position="1"> 3 Evaluation in a Technical Domain</Paragraph> </Section> <Section position="4" start_page="341" end_page="342" type="sub_section"> <SectionTitle> 3.1 Materials Evaluated </SectionTitle> <Paragraph position="0"> The SABLE system was run on a corpus comprising parallel versions of Sun Microsystems documentation (&quot;Answerbooks&quot;) in French (219,158 words) and English (191,162 words). As Melamed (1996b) observes, SABLE's output groups naturally according to &quot;plateaus&quot; of likelihood (see Figure 2). The translation lexicon obtained by running SABLE on the Answerbooks contained 6663 French-English content-word entries on the 2nd plateau or higher, including 5464 on the 3rd plateau or higher. Table 1 shows a sample of 20 entries selected at random from the Answerbook corpus output on the 3rd plateau and higher. Exact matches, such as cpio/cpio or clock/clock, comprised roughly 18% of the system's output.</Paragraph> <Paragraph position="1"> In order to eliminate likely general-usage entries from the initial translation lexicon, we automatically filtered out all entries that appeared in a French-English machine-readable dictionary (MRD) (Cousin, Allain, and Love, 1991). 4071 entries remained on or above the 2nd likelihood plateau, including 3135 on the 3rd likelihood plateau or higher. In previous experiments on the Hansard corpus of Canadian parliamentary proceedings, SABLE had uncovered valid general-usage entries that were not present in the Collins MRD (e.g. pointillés/dotted).</Paragraph> <Paragraph position="2"> Since entries obtained from the Hansard corpus are unlikely to include relevant technical terms, we decided to test the efficacy of a second filtering step, deleting all entries that had also been obtained by running SABLE on the Hansards. On the 2nd plateau or higher, 3030 entries passed both the Collins and the Hansard filters; 2224 remained on or above the 3rd plateau.</Paragraph> <Paragraph position="3"> Thus in total we evaluated four lexicons, derived from all combinations of two independent variables: cutoff (after the 2nd plateau vs. after the 3rd plateau) and Hansard filter (with vs. without). Evaluations were performed on a random sample of 100 entries from each lexicon variation, interleaving the four samples to obscure any possible regularities. From the evaluator's perspective, then, the task appeared to involve a single sample of 400 translation lexicon entries.</Paragraph>
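The two filtering steps just described reduce to set subtraction against the Collins MRD and the Hansard-derived lexicon, combined with a likelihood-plateau cutoff. A minimal sketch follows, under the assumption that entries carry their plateau number and that each filter is a set of word pairs; the representation is hypothetical, not SABLE's actual interface.

    def filter_entries(entries, mrd_pairs, hansard_pairs, min_plateau=3,
                       use_hansard_filter=True):
        # entries: (french, english, plateau) triples; higher plateau =
        # higher likelihood. Returns the entries surviving both filters.
        kept = []
        for fr, en, plateau in entries:
            if plateau < min_plateau:
                continue  # plateau cutoff: 2nd vs. 3rd plateau
            if (fr, en) in mrd_pairs:
                continue  # likely general-usage entry (in the Collins MRD)
            if use_hansard_filter and (fr, en) in hansard_pairs:
                continue  # also induced from the Hansards: not domain-specific
            kept.append((fr, en))
        return kept

Running this with min_plateau set to 2 or 3 and the Hansard filter on or off yields the four lexicon variations that were evaluated.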
</Section> <Section position="5" start_page="342" end_page="343" type="sub_section"> <SectionTitle> 3.2 Evaluation Procedure </SectionTitle> <Paragraph position="0"> Our assessment of the system was designed to approximate the post-processing that would be done in order to use this system for acquiring translation lexicons in a real-world setting, which would necessarily involve subjective judgments. We hired six fluent speakers of both French and English at the University of Maryland; they were briefed on the general nature of the task and given a data sheet containing the 400 candidate entries (pairs containing one French word and one English word) and a &quot;multiple choice&quot; style format for the annotations, along with the following instructions.</Paragraph> <Paragraph position="1"> 1. If the pair clearly cannot be of help in constructing a glossary, circle &quot;Invalid&quot; and go on to the next pair.</Paragraph> <Paragraph position="2"> 2. If the pair can be of help in constructing a glossary, choose one of the following:(1) V: The two words are of the &quot;plain vanilla&quot; type you might find in a bilingual dictionary.</Paragraph> <Paragraph position="3"> P: The pair is a case where a word changes its part of speech during translation. For example, &quot;to have protection&quot; in English is often translated as &quot;être protégé&quot; in Canadian parliamentary proceedings, so for that domain the pair protection/protégé would be marked P.</Paragraph> <Paragraph position="4"> I: The pair is a case where a direct translation is incomplete because the computer program only looked at single words. For example, if French &quot;immédiatement&quot; were paired with English &quot;right&quot;, you could select I because the pair is almost certainly the computer's best but incomplete attempt at pairing &quot;immédiatement&quot; with &quot;right away&quot;.</Paragraph> <Paragraph position="5"> 3. Then choose one or both of the following: * Specific. Leaving aside the relationship between the two words (your choice of P, V, or I), the word pair would be of use in constructing a technical glossary.</Paragraph> <Paragraph position="6"> * General. Leaving aside the relationship between the two words (your choice of P, V, or I), the word pair would be of use in constructing a general usage glossary.</Paragraph> <Paragraph position="7"> Notice that a word pair could make sense in both. For example, &quot;corbeille/wastebasket&quot; makes sense in the computer domain (in many popular graphical interfaces there is a wastebasket icon that is used for deleting files), but also in more general usage. So in this case you could in fact decide to choose both &quot;Specific&quot; and &quot;General&quot;. If you can't choose either &quot;Specific&quot; or &quot;General&quot;, chances are that you should reconsider whether or not to mark this word pair &quot;Invalid&quot;.</Paragraph> <Paragraph position="8"> (1) Since part-of-speech tagging was used in the version of SABLE that produced the candidates in this experiment, entries presented to the annotator also included a minimal form of part-of-speech information, e.g. distinguishing nouns from verbs.
The annotator was informed that these annotations were the computer's best attempt to identify the part of speech of each word; it was suggested that they could be used, if so desired, as a hint as to why the word pair had been proposed, and otherwise ignored.</Paragraph> <Paragraph position="9"> 4. If you're completely at a loss to decide whether or not the word pair is valid, just put a slash through the number of the example (the number at the beginning of the line) and go on to the next pair.</Paragraph> <Paragraph position="10"> Annotators also had the option of working electronically rather than on hardcopy.</Paragraph> <Paragraph position="11"> The assessment questionnaire was designed to elicit information primarily of two kinds. First, we were concerned with the overall accuracy of the method; that is, its ability to produce reasonable candidate entries, whether they be general or domain-specific. The &quot;Invalid&quot; category captures the system's mistakes on this dimension. We also explicitly annotated candidates that might be useful in constructing a translation lexicon but possibly require further elaboration. The V category captures cases that require minimal or no additional effort, and the P category covers cases where some additional work might need to be done to accommodate the part-of-speech divergence, depending on the application. The I category captures cases where the correspondence that has been identified may not apply directly at the single-word level, but nonetheless does capture potentially useful information. Daille et al. (1994) also note the existence of &quot;incomplete&quot; cases in their results, but collapse them together with &quot;wrong&quot; pairings.</Paragraph> <Paragraph position="12"> Second, we were concerned with domain specificity. Ultimately we intend to measure this in an objective, quantitative way by comparing term usage across corpora; however, for this study we relied on human judgments.</Paragraph> </Section> <Section position="6" start_page="343" end_page="343" type="sub_section"> <SectionTitle> 3.3 Use of Context </SectionTitle> <Paragraph position="0"> Melamed (1996b) suggests that evaluation of translation lexicons requires that judges have access to bilingual concordances showing the contexts in which proposed word pairs appear; however, out-of-context judgments would be easier to obtain in both experimental and real-world settings. In a preliminary evaluation, we had three annotators (one professional French/English translator and two graduate students at the University of Pennsylvania) perform a version of the annotation task just described: they annotated a set of entries containing the output of an earlier version of the SABLE system (one that used aligned sub-sentence fragments to define term co-occurrence; cf. Section 2.2). No bilingual concordances were made available to them.</Paragraph> <Paragraph position="1"> Analysis of the system's performance in this pilot study, however, as well as annotator comments in a post-study questionnaire, confirmed that context is quite important. In order to quantify its importance, we asked one of the pilot annotators to repeat the evaluation on the same items, this time giving her access to context in the form of bilingual concordances for each term pair. These concordances contained up to the first ten instances of that pair as used in context. For example, given the pair déplacez/drag, one instance in that pair's bilingual concordance would be: Maintenez SELECT enfoncé et déplacez le dossier vers l'espace de travail.</Paragraph> <Paragraph position="2"> Press SELECT and drag the folder onto the workspace background.</Paragraph>
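A bilingual concordance of the kind shown above can be approximated as follows. The sketch assumes the bitext has been reduced to aligned segment pairs (a simplification relative to SABLE's bitext maps) and uses hypothetical names throughout.

    def concordance(aligned_segments, fr_word, en_word, limit=10):
        # aligned_segments: (french_segment, english_segment) string pairs.
        # Collects up to `limit` contexts in which both words appear,
        # mirroring the ten-instance concordances shown to annotators.
        hits = []
        for fr_seg, en_seg in aligned_segments:
            if fr_word in fr_seg.split() and en_word in en_seg.split():
                hits.append((fr_seg, en_seg))
                if len(hits) == limit:
                    break
        return hits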
<Paragraph position="3"> The instructions for the in-context evaluation specified that the annotator should look at the context for every word pair, pointing out that &quot;word pairs may be used in unexpected ways in technical text and words you would not normally expect to be related sometimes turn out to be related in a technical context.&quot; Although we have data from only one annotator, Table 2 shows the clear differences between the two results.(2) In light of the results of the pilot study, therefore, our six annotators were given access to bilingual concordances for the entries they were judging and instructed in their use as just described.</Paragraph> </Section> </Section> </Paper>