<?xml version="1.0" standalone="yes"?>
<Paper uid="E06-2002">
  <Title>A Web-based Demonstrator of a Multi-lingual Phrase-based Translation System</Title>
  <Section position="3" start_page="0" end_page="91" type="metho">
    <SectionTitle>
2 SMT System Description
2.1 Log-Linear Model
</SectionTitle>
    <Paragraph position="0"> Given a string f in the source language, the goal of the statistical machine translation is to select the string e in the target language which maximizes the posterior distribution Pr(e  |f). By introducing the hidden word alignment variable a, the following approximate optimization criterion can be applied for that purpose:</Paragraph>
    <Paragraph position="2"> Exploiting the maximum entropy (Berger et al., 1996) framework, the conditional distribution Pr(e,a  |f) can be determined through suitable real valued functions (called features) hr(e,f,a),r = 1...R, and takes the parametric form:</Paragraph>
    <Paragraph position="4"> The ITC-irst system (Chen et al., 2005) is based on a log-linear model which extends the original IBM Model 4 (Brown et al., 1993) to phrases (Koehn et al., 2003; Federico and Bertoldi, 2005). In particular, target strings e are built from sequences of phrases ~e1 ... ~el. For each target phrase ~e the corresponding source phrase within the source string is identified through three random quantities: the fertility ph, which establishes its length; the permutation pii, which sets its first position; the tablet ~f, which tells its word string. Notice that target phrases might have fertility equal to zero, hence they do not translate any  source word. Moreover, uncovered source positions are associated to a special target word (null) according to specific fertility and permutation random variables.</Paragraph>
    <Paragraph position="5"> The resulting log-linear model applies eight feature functions whose parameters are either estimated from data (e.g. target language models, phrase-based lexicon models) or empirically fixed (e.g. permutation models). While feature functions exploit statistics extracted from monolingual or word-aligned texts from the training data, the scaling factors l of the log-linear model are estimated on the development data by applying a minimum error training procedure (Och, 2004).</Paragraph>
    <Section position="1" start_page="91" end_page="91" type="sub_section">
      <SectionTitle>
2.2 Decoding Strategy
</SectionTitle>
      <Paragraph position="0"> The translation of an input string is performed by the SMT system in two steps. In the first pass a beam search algorithm (decoder) computes a word graph of translation hypotheses. Hence, either the best translation hypothesis is directly extracted from the word graph and output, or an N-best list of translations is computed (Tran et al., 1996). The N-best translations are then re-ranked by applying additional features and the top ranking translation is finally output.</Paragraph>
      <Paragraph position="1"> The decoder exploits dynamic programming, that is the optimal solution is computed by expanding and recombining previously computed partial theories. A theory is described by its state which is the only information needed for its expansion. Expanded theories sharing the same state are recombined, that is only the best scoring one is stored for further expansions. In order to output a word graph of translations, backpointers to all expanded theories are mantained, too.</Paragraph>
      <Paragraph position="2"> To cope with the large number of generated theories some approximations are introduced during the search: less promising theories are pruned off (beam search) and a new source position is selected by limiting the number of vacant positions on the left-hand and the distance from the left most vacant position (re-ordering constraints).</Paragraph>
    </Section>
    <Section position="2" start_page="91" end_page="91" type="sub_section">
      <SectionTitle>
2.3 Phrase extraction and model training
</SectionTitle>
      <Paragraph position="0"> Training of the phrase-based translation model requires a parallel corpus provided with word-alignments in both directions, i.e. from source to target positions, and viceversa. This pre-processing step can be accomplished by applying the GIZA++ toolkit (Och and Ney, 2003) that provides Viterbi alignments based on IBM Model-4.</Paragraph>
      <Paragraph position="1"> Starting from the parallel training corpus, provided with direct and inverted alignments, the so-called union alignment (Och and Ney, 2003) is computed.</Paragraph>
      <Paragraph position="2"> Phrase-pairs are extracted from each sentence pair which correspond to sub-intervals of the source and target positions, J and I, such that the union alignment links all positions of J into I and all positions of I into J. In general, phrases are extracted with maximum length in the source and target defined by the parameters Jmax and Imax. All such phrase-pairs are efficiently computed by an algorithm with complexity O(lImaxJ2max) (Cettolo et al., 2005).</Paragraph>
      <Paragraph position="3"> Given all phrase-pairs extracted from the training corpus, lexicon probabilities and fertility probabilities are estimated.</Paragraph>
      <Paragraph position="4"> Target language models (LMs) used by the decoder and rescoring modules are, respectively, estimated from 3-gram and 4-gram statistics by applying the modified Kneser-Ney smoothing method (Goodman and Chen, 1998). LMs are estimated with an in-house software toolkit which also provides a compact binary representation of the LM which is used by the decoder.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="91" end_page="92" type="metho">
    <SectionTitle>
3 Demo Architecture
</SectionTitle>
    <Paragraph position="0"> Figure 1 shows the two-layer architecture of the demo. At the bottom lie the programs that provide the actual translation services: for each language-pair a wrapper coordinates the activity of a specialized pre-processing tool and a MT decoder. The translation programs run on a grid-based cluster of high-end PCs to optimize the processing speed.</Paragraph>
    <Paragraph position="1"> All the wrappers communicate with the MT front-end whose main task is to forward translation requests to the appropriate language-pair wrapper and to report an error in case of wrong requests (e.g. unsupported language-pair). It is worth noticing here that a new language-pair can be easily added to the system with a minimal intervention on the code of the MT front-end.</Paragraph>
    <Paragraph position="2"> At the top of the architecture are the programs that provide the interface with the user. This layer is separated from the translation layer (hosted by internal machines only) by means of a firewall.</Paragraph>
    <Paragraph position="3"> The user interface is implemented as a Web page in which a translation request (a source sentence and a language-pair) is input by means of an HTML form. The cgi script invocated by the form manages the interaction with the MT front-end.</Paragraph>
    <Paragraph position="4">  language-pair a set of programs (in particular the MT decoder) provides the translation service. The request issued by the user on the Web page is sent by the cgi script to the MT front-end. The translation is then performed on the appropriate language-pair service and the output sent back to the Web browser.</Paragraph>
    <Paragraph position="5"> When a user issues a translation request after filling the form fields, the cgi script sends the request to the MT front-end and waits for its reply. The input sentence is then forwarded to the wrapper of the appropriate language-pair. After a pre-processing step, the actual translation is performed by the specific MT decoder. The output in the target language is then sent back to the user's Web browser through the chain in the reverse order.</Paragraph>
    <Paragraph position="6"> From a technical point of view, the inter-process communication is realized by means of standard TCP-IP sockets. As far as the encoding of texts is concerned, all the languages are encoded in UTF8: this allows to manage the processing phase in an uniform way and to render graphically different character sets.</Paragraph>
  </Section>
  <Section position="5" start_page="92" end_page="92" type="metho">
    <SectionTitle>
4 The supported language-pairs
</SectionTitle>
    <Paragraph position="0"> Although there is no theoretical limit to the number of supported language-pairs, the current version of the demo provides translations to English from three source languages: Arabic, Chinese and Spanish. For demonstration purpose, three different application domains are covered too.</Paragraph>
    <Paragraph position="1"> Arabic-to-English (Tourism) The Arabic-to-English system has been trained with the data provided by the International Workshop on Spoken Language Translation 2005 The context is that of the Basic Traveling Expression Corpus (BTEC) task (Takezawa et al., 2002). BTEC is a multilingual speech corpus which contains sentences coming from phrase books for tourists. Training set includes 20k sentences containing 159K Arabic and 182K English running words; vocabulary size is 18K for Arabic, 7K for English.</Paragraph>
    <Paragraph position="2"> Chinese-to-English (Newswire) The Chinese-to-English system has been trained with the data provided by the NIST MT Evaluation Campaign 2005 , large-data condition. In this case parallel data are mainly news-wires provided by news agencies. Training set includes 71M Chinese and 77M English running words; vocabulary size is 157K for Chinese, 214K for English.</Paragraph>
    <Paragraph position="3"> Spanish-to-English (European Parliament) The Spanish-to-English system has been trained with the data provided by the Evaluation Campaign 2005 of the European integrated project TC-STAR4. The context is that of the speeches of the European Parliament Plenary sessions (EPPS) from April 1996 to October 2004. Training set for the Final Text Edition transcriptions includes 31M Spanish and 30M English running words; vocabulary size is 140K for Spanish, 94K for English.</Paragraph>
  </Section>
  <Section position="6" start_page="92" end_page="93" type="metho">
    <SectionTitle>
5 The Web-based Interface
</SectionTitle>
    <Paragraph position="0"> Figure 2 shows a snapshot of the Web-based interface of the demo - the URL has been removed to make this submission anonymous. In the upper part of the page the user provides the two information required for the translation: the source sentence can be input in a 80x5 textarea html structure, while the language-pair can be selected by means of a set a radio-buttons. The user can reset the input area or send the translation request by means of standard reset and submit buttons. Some examples of bilingual sentences are provided in the lower part of the page.</Paragraph>
    <Paragraph position="1">  The user provides the sentence to be translated in the desired language-pair. Some examples of bilingual sentences are also available to the user. The output of a translation request is simple: the requested source sentence, the translation in the target language and the selected language-pair are presented to the user. Figure 3 shows an example of an Arabic sentence translated into English.</Paragraph>
    <Paragraph position="2"> We plan to extend the interface with the possibility for the user to ask additional information about the translation - e.g. the number of explored theories or the score of the first-best translation.</Paragraph>
  </Section>
class="xml-element"></Paper>