<?xml version="1.0" standalone="yes"?>
<Paper uid="W01-1402">
  <Title>Overcoming the customization bottleneck using example-based MT</Title>
  <Section position="3" start_page="0" end_page="1" type="intro">
    <SectionTitle>
2 MSR-MT
</SectionTitle>
    <Paragraph position="0"> MSR-MT is a data-driven hybrid MT system, combining rule-based analysis and generation components with example-based transfer. The automatic alignment procedure used to create the example base relies on the same parser employed during analysis and also makes use of its own small set of rules for determining permissible alignments. Moderately sized bilingual dictionaries, containing only word pairs and their parts of speech, provide translation candidates for the alignment procedure and are also used as a backup source of translations during transfer. Statistical techniques supply additional translation pair candidates for alignment and identify certain multi-word terms for parsing and transfer.</Paragraph>
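To make the dictionary's role concrete, here is a minimal sketch of how a bilingual dictionary containing only word pairs and their parts of speech might supply translation candidates for the alignment procedure. The data and function names are illustrative assumptions, not taken from the paper.

```python
# Hypothetical bilingual dictionary: (source word, part of speech)
# mapped to a list of target-language candidates. Entries are invented
# Spanish examples for illustration only.
BILINGUAL_DICT = {
    ("file", "Noun"): ["archivo", "fichero"],
    ("open", "Verb"): ["abrir"],
    ("open", "Adj"): ["abierto"],
}

def translation_candidates(word, pos):
    """Return dictionary translation candidates for a (word, POS) pair."""
    return BILINGUAL_DICT.get((word, pos), [])
```

During alignment, such candidates constrain which source and target nodes may be linked; at transfer time the same entries serve as a backup when no learned mapping applies.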
    <Paragraph position="1"> The robust, broad-coverage parsers used by MSR-MT were created originally for monolingual applications and have been used in commercial grammar checkers.</Paragraph>
    <Paragraph position="2">  These parsers produce a logical form (LF) representation that is compatible across multiple languages (see section 3 below). Parsers now exist for seven languages (English, French, German, Spanish, Chinese, Japanese, and Korean), and active development continues to improve their accuracy and coverage.</Paragraph>
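A cross-linguistically compatible LF can be pictured as a graph of lemmas connected by labeled semantic relations. The sketch below is an assumption-laden illustration; the relation labels (Dsub, Dobj) are invented stand-ins, not the parsers' actual inventory.

```python
# Minimal sketch of a logical form (LF) node: a lemma plus labeled
# relations to child nodes. Relation names here are illustrative.
class LFNode:
    def __init__(self, lemma):
        self.lemma = lemma
        self.relations = {}  # relation label, e.g. "Dsub", to child LFNode

    def add(self, relation, child):
        self.relations[relation] = child
        return self

# "The user opens the file" rendered as a small LF graph
root = LFNode("open").add("Dsub", LFNode("user")).add("Dobj", LFNode("file"))
```

Because the representation abstracts away from surface word order and function words, LFs built by parsers for different languages can share the same shape, which is what makes cross-language alignment of LF segments feasible.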
    <Paragraph position="3">  Parsers for English, Spanish, French, and German provide linguistic analyses for the grammar checker in Microsoft Word.</Paragraph>
    <Paragraph position="4">  Generation components are currently being developed for English, Spanish, Chinese, and Japanese. Given the automated learning techniques used to create MSR-MT's transfer components, it should in theory be possible, given appropriate aligned bilingual corpora, to create MT systems for any language pair for which we have the necessary parsing and generation components. In practice, we have thus far created systems that translate into English from all other languages and that translate from English to Spanish, Chinese, and Japanese. We have experimented only preliminarily with Korean and with Chinese to Japanese.</Paragraph>
    <Paragraph position="5"> Results from our Spanish-English and English-Spanish systems are reported at the end of this paper. The bilingual corpus used to produce these systems comes from Microsoft manuals and help text. The sentence alignment of this corpus is the result of using a commercial translation memory (TM) tool during the translation process.</Paragraph>
    <Paragraph position="6"> The architecture of MSR-MT is presented in Figure 1. During training, source and target sentences from the aligned bilingual corpus are parsed to produce corresponding LFs. The normalized word forms resulting from parsing are also fed to a statistical word association learner (described in section 4.1), which outputs learned single word translation pairs as well as a special class of multi-word pairs. The LFs are then aligned with the aid of translations from a bilingual dictionary and the learned single word pairs (section 4.2). Transfer mappings that result from LF alignment, in the form of linked source and target LF segments, are stored in a special repository known as MindNet (section 4.3). Additionally, the learned multi-word pairs are added to the bilingual dictionary for possible backup use during translation and to the main parsing lexicon to improve parse quality in certain cases.</Paragraph>
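The training flow above can be sketched as a loop that collects aligned source/target LF segment pairs into a repository keyed by source segment, standing in for MindNet. This is a hedged sketch under simplifying assumptions: segments are flattened to tuples, and all names are illustrative.

```python
# Hedged sketch of the training phase: store linked source and target
# LF segments in a repository (a stand-in for MindNet). Segments are
# simplified to tuples for illustration.
def train(aligned_lf_pairs):
    """Collect source-to-target LF segment mappings from aligned pairs."""
    repository = {}
    for source_segment, target_segment in aligned_lf_pairs:
        repository.setdefault(source_segment, []).append(target_segment)
    return repository

mappings = train([
    (("open", "Dobj", "file"), ("abrir", "Dobj", "archivo")),
    (("open", "Dobj", "file"), ("abrir", "Dobj", "fichero")),
])
```

Keeping every observed target segment per source key mirrors the fact that LF alignment can yield competing transfer mappings, which the runtime matching step must later choose between.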
    <Paragraph position="7"> At runtime, MSR-MT's analysis parses source sentences with the same parser used for source text during the training phase (section 5.1). The resulting LFs then undergo a process known as MindMeld, which matches them against the LF transfer mappings stored in MindNet (section 5.2). MindMeld also links segments of source LFs with corresponding target LF segments stored in MindNet. These target LF segments are stitched together into a single target LF during transfer, and any translations for words or phrases not found during MindMeld are searched for in the updated bilingual dictionary and inserted in the target LF (section 5.3). Generation receives the target LF as input, from which it produces a target sentence (section 5.4).</Paragraph>
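The runtime behavior described above, matching source LF segments against stored mappings and falling back to the bilingual dictionary for anything unmatched, can be sketched as follows. Function and data names are assumptions for illustration, not the system's actual API.

```python
# Illustrative sketch of the transfer step: look up each source LF
# segment in the stored mappings; for misses, translate word by word
# from the (updated) bilingual dictionary as a backup.
def transfer(source_segments, repository, bilingual_dict):
    target = []
    for segment in source_segments:
        if segment in repository:
            target.append(repository[segment][0])  # take a stored mapping
        else:
            # backup: per-word dictionary lookup, keeping unknowns as-is
            target.append(tuple(bilingual_dict.get(w, w) for w in segment))
    return target
```

The stitched list of target segments corresponds to the single target LF that generation then renders as a target sentence.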
  </Section>
</Paper>