<?xml version="1.0" standalone="yes"?>
<Paper uid="W01-1402">
  <Title>Overcoming the customization bottleneck using example-based MT</Title>
  <Section position="4" start_page="1" end_page="1" type="metho">
    <SectionTitle>
3 Logical form
</SectionTitle>
    <Paragraph position="0"> MSR-MT's broad-coverage parsers produce conventional phrase structure analyses augmented with grammatical relations (Heidorn et al. 2000). Syntactic analyses undergo further processing in order to derive logical forms (LFs), which are graph structures that describe labeled dependencies among content words in the original input. LFs normalize certain syntactic alternations (e.g. active/passive) and resolve both intrasentential anaphora and long-distance dependencies.</Paragraph>
    <Paragraph position="1"> MT has proven to be an excellent application for driving the development of our LF representation. The code that builds LFs from syntactic analyses is shared across all seven of the languages under development. This shared architecture greatly simplifies the task of aligning LF segments (section 4.2) from different languages, since superficially distinct constructions in two languages frequently collapse onto similar or identical LF representations. Even when two aligned sentences produce divergent LFs, the alignment and generation components can count on a consistent interpretation of the representational machinery used to build the two. Thus the meaning of the relation Topic, for instance, is consistent across all seven languages, although its surface realizations in the various languages vary dramatically.</Paragraph>
  </Section>
  <Section position="5" start_page="1" end_page="1" type="metho">
    <SectionTitle>
4 Training MSR-MT
</SectionTitle>
    <Paragraph position="0"> This section describes the two primary mechanisms used by MSR-MT to automatically extract translation mappings from parallel corpora and the repository in which they are stored.</Paragraph>
    <Section position="1" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
4.1 Statistical learning of single word- and multi-word associations
</SectionTitle>
      <Paragraph position="0"> The software domain that has been our primary research focus contains many words and phrases that are not included in our general-domain lexicons. Identifying translation correspondences between these unknown words and phrases across an aligned dataset can provide crucial lexical anchors for the alignment algorithm described in section 4.2.</Paragraph>
      <Paragraph position="1"> In order to identify these associations, source and target text are first parsed, and normalized word forms (lemmas) are extracted. In the multi-word case, English &amp;quot;captoid&amp;quot; processing is exploited to identify sequences of related, capitalized words. Both single word and multi-word associations are iteratively hypothesized and scored by the algorithm under certain constraints until a reliable set of each is obtained.</Paragraph>
      <Paragraph position="2"> Over the English/Spanish bilingual corpus used for the present work, 9,563 single word and 4,884 multi-word associations not already known to our system were identified using this method.</Paragraph>
      <Paragraph position="3"> Moore (2001) describes this technique in detail, while Pinkham &amp; Corston-Oliver (2001) describes its integration with MSR-MT and investigates its effect on translation quality.</Paragraph>
    </Section>
    <Section position="2" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
4.2 Logical form alignment
</SectionTitle>
      <Paragraph position="0"> As described in section 2, MSR-MT acquires transfer mappings by aligning pairs of LFs obtained from parsing sentence pairs in a bilingual corpus. The LF alignment algorithm first establishes tentative lexical correspondences between nodes in the source and target LFs using translation pairs from a bilingual lexicon. Our English/Spanish lexicon presently contains 88,500 translation pairs, whicharethenaugmentedwithsingleword translations acquired using the statistical method describedinsection4.1.Afterestablishing possible correspondences, the algorithm uses a small set of alignment grammar rules to align LF nodes according to both lexical and structural considerations and to create LF transfer mappings. The final step is to filter the mappings based on the frequency of their source and target sides. Menezes &amp; Richardson (2001) provides further details and an evaluation of the LF alignment algorithm.</Paragraph>
      <Paragraph position="1"> The English/Spanish bilingual training corpus, consisting largely of Microsoft manuals and help text, averaged 14.1 words per English sentence. A 2.5 million word sample of English data contained almost 40K unique word forms.</Paragraph>
      <Paragraph position="2"> The data was arbitrarily split in two for use in our Spanish-English and English-Spanish systems. The first sub-corpus contains over 208,000 sentence pairs and the second over 183,000 sentence pairs. Only pairs for which both Spanish and English parsers produce complete, spanning parses and LFs are currently used for alignment. Table 1 provides the number of pairs used and the number of transfer mappings extracted and used in each case.</Paragraph>
    </Section>
    <Section position="3" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
4.3 MindNet
</SectionTitle>
      <Paragraph position="0"> The repository into which transfer mappings from LF alignment are stored is known as MindNet. Richardson et al. (1998) describes how MindNet began as a lexical knowledge base containing LF-like structures that were produced automatically from the definitions and example sentences in machine-readable dictionaries. Later, MindNet was generalized, becoming an architecture for a class of repositories that can store and access LFs produced for a variety of expository texts, including but not limited to dictionaries, encyclopedias, and technical manuals.</Paragraph>
      <Paragraph position="1"> For MSR-MT, MindNet serves as the optimal example base, specifically designed to store and retrieve the linked source and target LF segments comprising the transfer mappings extracted during LF alignment. As part of daily regression testing for MSR-MT, all the sentence pairs in the combined English/Spanish corpus are parsed, the resulting spanning LFs are aligned, and a separate MindNet for each of the two directed language pairs is built from the LF transfer mappings obtained. These MindNets are about 7MB each in size and take roughly 6.5 hourseachtocreateona550MhzPC.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="1" end_page="1" type="metho">
    <SectionTitle>
5 Running MSR-MT
</SectionTitle>
    <Paragraph position="0"> MSR-MT translates sentences in four processing steps, which were illustrated in Figure 1 and outlined in section 2 above. These steps are detailed using a simple example in the following sections.</Paragraph>
    <Section position="1" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
5.1 Analysis
</SectionTitle>
      <Paragraph position="0"> The input source sentence is parsed with the same parser used on source text during MSR-MT's training. The parser produces an LF for the sentence, as described in section 3. For the example LF in Figure 2, the Spanish input sentence is Haga clic en el boton de opcion. In English, this is literally Make click in the button of option. In fluent, translated English, it is</Paragraph>
    </Section>
    <Section position="2" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
5.2 MindMeld
</SectionTitle>
      <Paragraph position="0"> The source LF produced by analysis is next matched by the MindMeld process to the source LF segments that are part of the transfer mappings stored in MindNet. Multiple transfer mappings may match portions of the source LF.</Paragraph>
      <Paragraph position="1"> MindMeld attempts to find the best set of matching transfer mappings by first searching for LF segments in MindNet that have matching lemmas, parts of speech, and other feature information. Larger (more specific) mappings are preferred to smaller (more general) mappings. In other words, transfers with context will be matched preferentially, but the system will fall back to the smaller transfers when no matching context is found. Among mappings of equal size, MindMeld prefers higher-frequency mappings. Mappings are also allowed to match overlapping portions of the source LF so long as they do not conflict in any way.</Paragraph>
      <Paragraph position="2"> After an optimal set of matching transfer mappings is found, MindMeld creates Links on nodes in the source LF to copies of the corresponding target LF segments retrieved from the mappings. Figure 3 shows the source LF for the example sentence with additional Links to target LF segments. Note that Links for multi-word mappings are represented by linking the root nodes (e.g., hacer and click)of the corresponding segments, then linking an asterisk (*) to the other source nodes participating in the multi-word mapping (e.g., usted and clic). Sublinks between corresponding individual source and target nodes of such a mapping (not shown in the figure) are also created for use during transfer.</Paragraph>
    </Section>
    <Section position="3" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
5.3 Transfer
</SectionTitle>
      <Paragraph position="0"> The responsibility of transfer is to take a linked LF from MindMeld and create a target LF that will be the basis for the target translation. This is accomplished through a top down traversal of the linked LF in which the target LF segments pointed to by Links on the source LF nodes are stitched together. When stitching together LF segments from possibly complex multi-word mappings, the sublinks set by MindMeld between individual nodes are used to determine correct attachment points for modifiers, etc.</Paragraph>
      <Paragraph position="1"> Default attachment points are used if needed.</Paragraph>
      <Paragraph position="2"> Also, a very small set of simple, general, hand-coded transfer rules (currently four for English to/from Spanish) may apply to fill current (and we hope, temporary) gaps in learned transfer mappings.</Paragraph>
      <Paragraph position="3"> In cases where no applicable transfer mapping was found during MindMeld, the nodes in the source LF and their relations are simply copied into the target LF. Default (i.e., most commonly occurring) single word translations may still be found in the MindNet for these nodes and inserted in the target LF, but if not, translations are obtained, if possible, from the same bilingual dictionary used during LF</Paragraph>
    </Section>
    <Section position="4" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
5.4 Generation
</SectionTitle>
      <Paragraph position="0"> A rule-based generation component maps from the target LF to the target string (Aikawa et al. 2001). The generation components for the target languages currently handled by MSR-MT are application-independent, having been designed to apply to a range of tasks, including question answering, grammar checking, and translation. In its application to translation, generation has no information about the source language for a given input LF, working exclusively with the information passed to it by the transfer component. It uses this information, in conjunction with a monolingual (target language) dictionary to produce its output. One generic generation component is thus sufficient for each language.</Paragraph>
      <Paragraph position="1"> In some cases, transfer produces an unmistakably &amp;quot;non-native&amp;quot; target LF. In order to correct some of the worst of these anomalies, a small set of source-language independent rules is applied prior to generation. The need for such rules reflects deficiencies in our current data-driven learning techniques during transfer.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="1" end_page="250" type="metho">
    <SectionTitle>
6 Evaluating MSR-MT
</SectionTitle>
    <Paragraph position="0"> In evaluating progress, we have found no effective alternative to the most obvious solution: periodic, blind human evaluations focused on translations of single sentences. The human raters used for these evaluations work for an independent agency and played no development role building the systems they test. Each language pair under active development is periodically subjected to the evaluation process described in this section.</Paragraph>
    <Section position="1" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
6.1 Evaluation Methodology
</SectionTitle>
      <Paragraph position="0"> For each evaluation, five to seven evaluators are asked to evaluate the same set of 200 to 250 blind test sentences. For each sentence, raters are presented with a reference sentence in the target language, which is a human translation of the corresponding source sentence. In order to maintain consistency among raters who may have different levels of fluency in the source language, raters are not shown the source sentence. Instead, they are presented with two machine-generated target translations presented in random order: one translation by the system to be evaluated (the experimental system), and another translation by a comparison system (the control system). The order of presentation of sentences is also randomized for each rater in order to eliminate any ordering effect.</Paragraph>
      <Paragraph position="1"> Raters are asked to make a three-way choice.</Paragraph>
      <Paragraph position="2"> For each sentence, raters may choose one of the two automatically translated sentences as the better translation of the (unseen) source sentence, assuming that the reference sentence represents a perfect translation, or, they may indicate that neither of the two is better. Raters are instructed to use their best judgment about the relative importance of fluency/style and accuracy/content preservation. We chose to use this simple three-way scale in order to avoid making any apriorijudgments about the relative importance of these parameters for subjective judgments of quality. The three-way scale also allows sentences to be rated on the same scale, regardless of whether the differences between output from system 1 and system 2 are substantial or negligible.</Paragraph>
      <Paragraph position="3"> The scoring system is similarly simple; each judgment by a rater is represented as 1 (sentence from experimental system judged better), 0 (neither sentence judged better), or -1 (sentence from control system judged better). For each sentence, the score is the mean of all raters' judgments; for each comparison, the score is the mean of the scores of all sentences.</Paragraph>
    </Section>
    <Section position="2" start_page="1" end_page="250" type="sub_section">
      <SectionTitle>
6.2 Evaluation results
</SectionTitle>
      <Paragraph position="0"> Although work on MSR-MT encompasses a number of language pairs, we focus here on the evaluation of just two, Spanish-English and English-Spanish. Training data was held constant for each of these evaluations.</Paragraph>
      <Paragraph position="1">  sentences This table summarizes two evaluations tracking progress in MSR-MT's Spanish-English (SE) translation quality over a seven month development period. The first evaluation, with seven raters, compared a September 2000 version of the system to a December 2000 version. The second evaluation, carried out by six raters, examined progress between December 2000 and April 2001.</Paragraph>
      <Paragraph position="2"> A score of -1 would mean that raters uniformly preferred the control system, while a score of 1 would indicate that all raters preferred the comparison system for all sentences. In each of these evaluations, all raters significantly preferred the comparison, or newer, version of MSR-MT, as reflected in the mean preference scores of 0.30 and 0.28. These numbers confirm that the system made considerable progress over a relatively short time span.</Paragraph>
      <Paragraph position="3">  output of Babelfish (http://world.altavista.com/). Three separate evaluations were performed, in order to track MSR-MT's progress over seven months. The first two evaluations involved seven raters, while the third involved six. The shift in the mean preference score from -0.23 to 0.32 shows clear progress against Babelfish; by the second evaluation, raters very slightly preferred MSR-MT in this domain. By April, all six raters strongly preferred MSR-MT.  The evaluations summarized in this table compared February and April 2001 versions of MSR-MT's English-Spanish (ES) output to the output of the Lernout &amp; Hauspie (L&amp;H) ES system (http://officeupdate.lhsl.com/) for 250 source sentences. Five raters participated in the first evaluation, and six in the second.</Paragraph>
      <Paragraph position="4"> The mean preference scores show that by April, MSR-MT was strongly preferred over L&amp;H. Interestingly, though, one rater who participated in both evaluations maintained a slight but systematic preference for L&amp;H's translations. Determining which aspects of the translations might have caused this rater to behave differently from the others is a topic for future investigation.</Paragraph>
    </Section>
    <Section position="3" start_page="250" end_page="250" type="sub_section">
      <SectionTitle>
6.3 Discussion
</SectionTitle>
      <Paragraph position="0"> These results document steady progress in the quality of MSR-MT's output over a relatively short time. By April 2001, both the SE and ES versions of the system had surpassed Babelfish in translation quality for this domain. While these versions of MSR-MT are the most fully developed, the other language pairs under development are also progressing rapidly.</Paragraph>
      <Paragraph position="1"> In interpreting our results, it is important to keep in mind that MSR-MT has been customized to the test domain, while the Babelfish and Lernout &amp; Hauspie systems have not.</Paragraph>
      <Paragraph position="2">  This certainly affects our results, and  Babelfish was chosen for these comparisons only after we experimentally compared its output to that of the related Systran system augmented with its computer domain dictionary. Surprisingly, the means that our comparisons have a certain asymmetry. As our work progresses, we hope to evaluate MSR-MT against a quality bar that is perhaps more meaningful: the output of a commercial system that has been handcustomized for a specific domain.</Paragraph>
      <Paragraph position="3"> The asymmetrical nature of our comparison cuts both ways, however. Customization produces better translations, and a system that can be automatically customized has an inherent advantage over one that requires laborious manual customization. Comparing an automatically-customized version of MSR-MT to a commercial system which has undergone years of hand-customization will represent a comparison that is at least as asymmetrical as those we have presented here.</Paragraph>
      <Paragraph position="4"> We have another, more concrete, purpose in regularly evaluating our system relative to the output of systems like Babelfish and L&amp;H: these commercial systems serve as (nearly) static benchmarks that allow us to track our own progress without reference to absolute quality.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML