<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-2037">
  <Title>Low-cost Enrichment of Spanish WordNet with Automatically Translated Glosses: Combining General and Specialized Models</Title>
  <Section position="3" start_page="287" end_page="288" type="intro">
    <SectionTitle>
2 Background
</SectionTitle>
    <Paragraph position="0"> Current state-of-the-art SMT systems are based on ideas borrowed from the Communication Theory field. Brown et al. (1988) suggested that MT can be statistically approximated to the transmission of information through a noisy channel. Given a sentence f = f1..fn (distorted signal), it is possible to approximate the sentence e = e1..em (original signal) which produced f. We need to estimate P(e|f), the probability that a translator produces f as a translation of e. By applying Bayes' rule it is decomposed into: P(e|f) = P(f|e)[?]P(e)P(f) .</Paragraph>
    <Paragraph position="1"> To obtain the string e which maximizes the translation probability for f, a search in the probability space must be performed. Because the denominator is independent of e, we can ignore it for the purpose of the search: e = argmaxeP(f|e) [?] P(e). This last equation devises three components in a SMT system. First, a language model that estimates P(e). Second, a translation model representing P(f|e). Last, a decoder responsible for performing the arg-max search. Language models are typically estimated from large mono-lingual corpora, translation models are built out from parallel corpora, and decoders usually perform approximate search, e.g., by using dynamic programming and beam search.</Paragraph>
    <Paragraph position="2"> However, in word-based models the modeling of the context in which the words occur is very weak. This problem is significantly alleviated by phrase-based models (Och, 2002), which represent nowadays the state-of-the-art in SMT.</Paragraph>
    <Section position="1" start_page="287" end_page="288" type="sub_section">
      <SectionTitle>
2.1 System Construction
</SectionTitle>
      <Paragraph position="0"> Fortunately, there is a number of freely available tools to build a phrase-based SMT system. We used only standard components and techniques for our basic system, which are all described below.</Paragraph>
      <Paragraph position="1"> The SRI Language Modeling Toolkit (SRILM) (Stolcke, 2002) supports creation and evaluation of a variety of language models. We build trigram language models applying linear interpolation and Kneser-Ney discounting for smoothing.</Paragraph>
      <Paragraph position="2"> In order to build phrase-based translation models, a phrase extraction must be performed on a word-aligned parallel corpus. We used the GIZA++ SMT Toolkit4 (Och and Ney, 2003) to generate word alignments We applied the phrase-extract algorithm, as described by Och (2002), on the Viterbi alignments output by GIZA++. We work with the union of source-to-target and target-to-source alignments, with no heuristic refinement. Phrases up to length five are considered. Also, phrase pairs appearing only once are discarded, and phrase pairs in which the source/target phrase was more than three times longer than the target/source phrase are ignored. Finally, phrase pairs are scored by relative frequency. Note that no smoothing is performed.</Paragraph>
      <Paragraph position="3"> Regarding the arg-max search, we used the Pharaoh beam search decoder (Koehn, 2004a), which naturally fits with the previous tools.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>