<?xml version="1.0" standalone="yes"?>
<Paper uid="N06-4004">
  <Title>MTTK: An Alignment Toolkit for Statistical Machine Translation</Title>
  <Section position="3" start_page="265" end_page="266" type="metho">
    <SectionTitle>
2 MTTK Components
</SectionTitle>
    <Paragraph position="0"> MTTK is a collection of C++ programs and Perl and shell scripts that can be used to build statistical alignment models from parallel text. Depending on the granularity of the text to be aligned, MTTK's functions are categorized into the following two main parts.</Paragraph>
    <Section position="1" start_page="265" end_page="265" type="sub_section">
      <SectionTitle>
2.1 Chunk Alignment
</SectionTitle>
      <Paragraph position="0"> Chunk alignment aims to extract sentence or sub-sentence pairs from parallel corpora. A chunk can be multiple sentences, a sentence, or a sub-sentence, as required by the application. Two alignment procedures are implemented: one is the widely used dynamic programming procedure that derives a monotone alignment of sentence segments (Gale and Church, 1991); the other is a divisive clustering procedure that begins by finding coarse alignments that are then iteratively refined by successive binary splitting (Deng et al., ). These two types of alignment procedures complement each other, and they can be used together to improve the overall sentence alignment quality.</Paragraph>
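The monotone dynamic programming procedure can be sketched as a small DP over chunk lengths. The following Python sketch is illustrative only: the bead inventory, bead penalties, and the variance constant in the length model are assumptions in the spirit of Gale and Church (1991), not MTTK's actual parameters.

```python
import math

def length_cost(src_len, tgt_len, ratio=1.0, variance=6.8):
    # Gaussian-style penalty on target/source length mismatch; the ratio and
    # variance constants here are illustrative placeholders.
    if src_len == 0 and tgt_len == 0:
        return 0.0
    delta = (tgt_len - src_len * ratio) / math.sqrt(variance * max(src_len, 1))
    return delta * delta

def align_chunks(src_lens, tgt_lens):
    # Monotone DP over chunk lengths, allowing 1-1, 1-0, 0-1, 2-1 and 1-2 beads.
    beads = [(1, 1, 0.0), (1, 0, 4.0), (0, 1, 4.0), (2, 1, 2.0), (1, 2, 2.0)]
    m, n = len(src_lens), len(tgt_lens)
    INF = float("inf")
    cost = [[INF] * (n + 1) for _ in range(m + 1)]
    back = [[None] * (n + 1) for _ in range(m + 1)]
    cost[0][0] = 0.0
    for i in range(m + 1):
        for j in range(n + 1):
            if cost[i][j] == INF:
                continue
            for di, dj, penalty in beads:
                if i + di <= m and j + dj <= n:
                    c = cost[i][j] + penalty + length_cost(
                        sum(src_lens[i:i + di]), sum(tgt_lens[j:j + dj]))
                    if c < cost[i + di][j + dj]:
                        cost[i + di][j + dj] = c
                        back[i + di][j + dj] = (i, j)
    # Trace back the lowest-cost path into (src_span, tgt_span) pairs.
    path, i, j = [], m, n
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        path.append(((pi, i), (pj, j)))
        i, j = pi, pj
    return list(reversed(path))
```

Three well-matched chunks yield three 1-1 beads, while two short source chunks against one long target chunk are merged by the 2-1 bead.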
      <Paragraph position="1"> When translation lexicons are not available, chunk alignment can be performed using length-based statistics. This can usually serve as a starting point for sentence alignment. Alignment quality can be further improved when the chunking procedure is based on translation lexicons estimated with the IBM Model-1 alignment model (Brown et al., 1993). The MTTK toolkit also generates an alignment score for each chunk pair, which can be used in post-processing, for example to filter out aligned segments of dubious quality.</Paragraph>
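The lexicon-based scoring and filtering step can be illustrated as a length-normalized Model-1 log-probability per chunk pair, thresholded to discard dubious pairs. All names, the floor probability, and the threshold below are illustrative assumptions, not MTTK's interface.

```python
import math

def model1_score(src_tokens, tgt_tokens, lexicon, floor=1e-9):
    # Length-normalized Model-1 log-probability of the target chunk given the
    # source chunk; `lexicon` maps (src_word, tgt_word) to t(tgt|src).
    # None stands in for the empty (NULL) source word; the floor is illustrative.
    src = [None] + list(src_tokens)
    total = 0.0
    for t in tgt_tokens:
        p = sum(lexicon.get((s, t), floor) for s in src) / len(src)
        total += math.log(p)
    return total / max(len(tgt_tokens), 1)

def filter_chunks(chunk_pairs, lexicon, threshold=-8.0):
    # Keep only chunk pairs whose score clears a (tunable) threshold.
    return [(s, t) for s, t in chunk_pairs
            if model1_score(s, t, lexicon) >= threshold]
```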
    </Section>
    <Section position="2" start_page="265" end_page="266" type="sub_section">
      <SectionTitle>
2.2 Word and Phrase Alignment
</SectionTitle>
      <Paragraph position="0"> After a collection of sentence or sub-sentence pairs is extracted via chunk alignment procedures, statistical word and phrase alignment models can be estimated with EM algorithms. MTTK provides implementations of various alignment models, including IBM Model-1, Model-2 (Brown et al., 1993), the HMM-based word-to-word alignment model (Vogel et al., 1996; Och and Ney, 2003), and the HMM-based word-to-phrase alignment model (Deng and Byrne, 2005). After model parameters are estimated, the Viterbi word alignments can be derived. A novel computation performed by MTTK is the generation of model-based phrase pair posterior distributions (Deng and Byrne, 2005), which plays an important role in extracting phrase-to-phrase translation probabilities.</Paragraph>
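For concreteness, the EM estimation of the simplest of these models, IBM Model-1, can be sketched as follows. This is an illustrative toy implementation, not the MTTK code.

```python
from collections import defaultdict

def train_model1(bitext, iterations=5):
    # EM for IBM Model-1: estimate t(f|e) from a list of
    # (src_words, tgt_words) sentence pairs.
    t = defaultdict(lambda: 1e-6)
    tgt_vocab = {f for _, fs in bitext for f in fs}
    uniform = 1.0 / len(tgt_vocab)
    # Uniform initialization over co-occurring word pairs.
    for es, fs in bitext:
        for e in es:
            for f in fs:
                t[(e, f)] = uniform
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        # E-step: collect expected alignment counts.
        for es, fs in bitext:
            for f in fs:
                z = sum(t[(e, f)] for e in es)
                for e in es:
                    c = t[(e, f)] / z
                    count[(e, f)] += c
                    total[e] += c
        # M-step: re-normalize counts into translation probabilities.
        for (e, f), c in count.items():
            t[(e, f)] = c / total[e]
    return t
```

On a two-sentence toy corpus, EM already resolves the ambiguity of which source word explains each target word.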
    </Section>
  </Section>
  <Section position="4" start_page="266" end_page="266" type="metho">
    <SectionTitle>
3 MTTK Features
</SectionTitle>
    <Paragraph position="0"> MTTK is designed to process huge amounts of parallel text. Model parameter estimation can be carried out in parallel during EM training using multiple CPUs. The entire parallel text is split into parts.</Paragraph>
    <Paragraph position="1"> During each E-step, statistics are collected in parallel over the parts, while in the M-step these statistics are merged together to update the model parameters for the next iteration. This parallel implementation not only reduces model training time significantly but also avoids memory usage issues that arise in processing millions of sentence pairs, since each E-step need only save and process the co-occurrence statistics that appear in its part of the parallel text. This enables building a single model from many millions of sentence pairs.</Paragraph>
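The split-and-merge scheme can be sketched as follows, using Model-1 counts as the example statistics. The loop over parts stands in for parallel dispatch to separate CPUs or jobs, and all names are illustrative, not MTTK's.

```python
from collections import defaultdict

def e_step_counts(part, t, floor=1e-6):
    # E-step over a single corpus part: expected Model-1 counts for that part only.
    count, total = defaultdict(float), defaultdict(float)
    for es, fs in part:
        for f in fs:
            z = sum(t.get((e, f), floor) for e in es)
            for e in es:
                c = t.get((e, f), floor) / z
                count[(e, f)] += c
                total[e] += c
    return count, total

def parallel_em_iteration(parts, t):
    # One EM iteration with a split E-step and a merged M-step; in practice each
    # part's E-step would run on its own CPU.
    merged_count, merged_total = defaultdict(float), defaultdict(float)
    for part in parts:
        count, total = e_step_counts(part, t)
        for k, v in count.items():
            merged_count[k] += v
        for e, v in total.items():
            merged_total[e] += v
    # M-step: normalize the merged statistics into updated parameters.
    return {(e, f): v / merged_total[e] for (e, f), v in merged_count.items()}
```

Because the merged counts are identical however the corpus is split, training on one part or on many parts yields the same updated model.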
    <Paragraph position="2"> Another feature of MTTK is language independence. Linguistic knowledge is not required during model training, although when it is available, performance can be improved. Statistical parameters are estimated and learned automatically from data in an unsupervised way. To accommodate language diversity, there are several parameters in MTTK that can be tuned for individual applications to optimize performance.</Paragraph>
  </Section>
  <Section position="5" start_page="266" end_page="266" type="metho">
    <SectionTitle>
4 A Typical Application of MTTK in
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="266" end_page="266" type="sub_section">
      <SectionTitle>
Parallel Text Alignment
</SectionTitle>
      <Paragraph position="0"> A typical example of using MTTK is given in Figure 2. It starts with a collection of document pairs.</Paragraph>
      <Paragraph position="1"> During pre-processing, documents are normalized and tokenized into token sequences. This pre-processing is carried out before using MTTK and is usually language dependent, requiring, for example, segmenting Chinese characters into words or applying morphological analysis to Arabic word sequences. Statistical models are then built from scratch.</Paragraph>
      <Paragraph position="2"> Chunk alignment begins with length statistics that can be obtained simply by counting the number of tokens in each language. The chunk alignment procedure then applies dynamic programming to derive a sentence alignment. After sorting the generated sentence pairs by their probabilities, high quality sentence pairs are selected and used to train a translation lexicon. Using this lexicon as input to the next round of chunk alignment, more and better sentence pairs can be extracted, which in turn serve as training material for a better translation lexicon. This bootstrapping procedure identifies high quality sentence pairs in an iterative fashion.</Paragraph>
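The bootstrapping procedure reduces to a generic score-select-retrain loop. In the sketch below, `score_fn` and `train_fn` are hypothetical stand-ins for MTTK's chunk scorer and lexicon trainer; the round count and keep fraction are illustrative knobs.

```python
def bootstrap(pairs, score_fn, train_fn, rounds=3, keep_fraction=0.8):
    # Generic score-select-retrain skeleton. score_fn(pair, model) scores a
    # candidate pair under the current model (None on the first round), and
    # train_fn(kept_pairs) builds the next model; both are hypothetical
    # placeholders, not the MTTK interface.
    model, kept = None, list(pairs)
    for _ in range(rounds):
        ranked = sorted(pairs, key=lambda p: score_fn(p, model), reverse=True)
        kept = ranked[:max(1, int(len(ranked) * keep_fraction))]
        model = train_fn(kept)
    return model, kept
```

With a toy length-ratio scorer, the loop keeps the well-matched pairs and discards the wildly mismatched ones.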
      <Paragraph position="3"> To maximize the number of training words for building word and phrase alignment models, long sentence pairs are then processed further using the divisive clustering procedure, which derives chunk pairs at the sub-sentence level. This provides additional translation training pairs that would otherwise be discarded as being overly long.</Paragraph>
      <Paragraph position="4"> Once all usable chunk pairs are identified in the chunk alignment procedure, word alignment model training starts with IBM Model-1. Model complexity increases gradually to Model-2, then to the HMM-based word-to-word alignment model, and finally to the HMM-based word-to-phrase alignment model (Deng and Byrne, 2005). With these models, word alignments can be obtained using the Viterbi algorithm, and phrase pair posterior distributions can be computed in building a phrase translation table. In published experiments we have found that MTTK generates alignments of quality comparable to those generated by GIZA++, where alignment quality is measured both directly in terms of Alignment Error Rate relative to human word alignments and indirectly through the translation performance of systems constructed from the alignments (Deng and Byrne, 2005). We have used MTTK as the basis of translation systems entered into the recent NIST Arabic-English and Chinese-English MT Evaluations as well as the TC-STAR Chinese-English MT evaluation (NIST, 2005; TC-STAR, 2005).</Paragraph>
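As an illustration of the Viterbi step for the simplest model in the progression: because Model-1 treats target positions independently, its Viterbi alignment reduces to a per-word argmax over source words (the HMM models require a genuine Viterbi pass over alignment sequences). A minimal sketch, with an assumed t-table keyed by (src_word, tgt_word):

```python
def viterbi_model1(src_tokens, tgt_tokens, t, floor=1e-9):
    # Model-1 Viterbi alignment: link each target word to the source word
    # maximizing t(f|e); None denotes the NULL link when no source word
    # beats the (illustrative) floor probability.
    links = []
    for j, f in enumerate(tgt_tokens):
        best_i, best_p = None, floor
        for i, e in enumerate(src_tokens):
            p = t.get((e, f), 0.0)
            if p > best_p:
                best_i, best_p = i, p
        links.append((best_i, j))
    return links
```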
    </Section>
  </Section>
</Paper>