<?xml version="1.0" standalone="yes"?> <Paper uid="N06-4004"> <Title>MTTK: An Alignment Toolkit for Statistical Machine Translation</Title> <Section position="2" start_page="0" end_page="265" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Parallel text alignment procedures attempt to identify translation equivalences within collections of translated documents. This can be done at various levels. At the finest level, this involves the alignment of words and phrases within two sentences that are known to be translations (Brown et al., 1993; Och and Ney, 2003; Vogel et al., 1996; Deng and Byrne, 2005). Another task is the identification and alignment of sentence-level segments within document pairs that are known to be translations (Gale and Church, 1991); this is referred to as sentence-level alignment, although it may also involve the alignment of sub-sentential segments (Deng et al., ) as well as the identification of long segments in either document which are not translations. There is also document-level alignment, which involves the identification of translated document pairs in a collection of documents in multiple languages. As an example, Figure 1 shows parallel Chinese/English text that is aligned at the sentence, word, and phrase levels.</Paragraph> <Paragraph position="1"> Parallel text plays a crucial role in multi-lingual natural language processing research. In particular, statistical machine translation systems require collections of sentence pairs (or sentence fragment pairs) as the basic ingredients for building statistical word and phrase alignment models. However, with the increasing availability of parallel text, human-created alignments are expensive and often unaffordable for practical systems, even at a small scale. High quality automatic alignment of parallel text has therefore become indispensable. In addition to good alignment quality, several other properties are also desirable in automatic alignment systems.
Ideally, these should be general-purpose and language independent, capable of aligning very different languages, such as English, French, Chinese, German and Arabic, to give a few examples of current interest. If the alignment system is based on statistical models, the model parameters should be estimated from scratch, in an unsupervised manner, from whatever parallel text is available. To process millions of sentence pairs, these models need to be capable of generalization and the alignment and estimation algorithms should be computationally efficient. Finally, since noisy mismatched text is often found in real data, such as parallel text mined from web pages, automatic alignment needs to be robust.</Paragraph> <Paragraph position="2"> There are systems available for these purposes, notably the GIZA++ (Och and Ney, 2003) toolkit and the Champollion Toolkit (Ma et al., 2004).</Paragraph> <Paragraph position="3"> [Figure 1: Parallel Chinese/English text. English side: "It is necessary to resolutely remove obstacles in rivers and lakes." "It is necessary to strengthen monitoring and forecast work and scientifically dispatch people and materials." "It is necessary to take effective measures and try by every possible means to provide precision forecast." "Before the flood season comes, it is necessary to seize the time to formulate plans for forecasting floods and to carry out work with clear [...]" Caption: lines denote the segmentations of a sentence alignment and arrows denote a word-level mapping.]</Paragraph> <Paragraph position="9"> This demo introduces MTTK, the Machine Translation Toolkit. The toolkit can be used to train statistical models and perform parallel text alignment at different levels.
Target applications include not only machine translation, but also bilingual lexicon induction, cross-lingual information retrieval and other multi-lingual applications.</Paragraph> </Section> </Paper>