<?xml version="1.0" standalone="yes"?>
<Paper uid="C04-1059">
  <Title>Language Model Adaptation for Statistical Machine Translation with Structured Query Models</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
2 LM Adaptation via Sentence Retrieval
</SectionTitle>
    <Paragraph position="0"> Our language model adaptation is an unsupervised data augmentation approach guided by query models. Given a baseline statistical machine translation system, language model adaptation is carried out in the following steps:</Paragraph>
    <Paragraph position="2"> 1. Generate a set of initial translation hypotheses H = {h_1, ..., h_n} for the source sentences s, using either the baseline MT system with the background language model or the translation model alone
2. Use H to build a query
3. Use the query to retrieve relevant sentences from the large corpus
4. Build a specific language model from the retrieved sentences
5. Interpolate the specific language model with the background language model
6. Re-translate the sentences s with the adapted language model
Figure 1: Adaptation Algorithm
The specific language model Pr_s(w|h) is then interpolated with the background language model Pr_b(w|h): Pr(w|h) = λ Pr_s(w|h) + (1 - λ) Pr_b(w|h).</Paragraph>
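The adaptation loop above can be sketched in a few lines of Python. This is only an illustrative skeleton: every component function (translate, build_query, retrieve, build_lm, interpolate_lm) is a hypothetical placeholder standing in for the real decoder, retrieval engine, and LM toolkit, not the authors' implementation.

```python
def adapt_and_retranslate(source, translate, build_query, retrieve,
                          build_lm, interpolate_lm):
    """One round of LM adaptation, following the steps of Figure 1.

    Every argument except `source` is a caller-supplied component;
    the names here are illustrative placeholders only.
    """
    hypotheses = translate(source)             # step 1: initial hypotheses H
    query = build_query(hypotheses)            # step 2: build query from H
    sentences = retrieve(query)                # step 3: retrieve relevant sentences
    specific_lm = build_lm(sentences)          # step 4: build specific LM
    adapted_lm = interpolate_lm(specific_lm)   # step 5: mix with background LM
    return translate(source, lm=adapted_lm)    # step 6: re-translate with adapted LM
```

The decoder is invoked twice: once with the background LM to seed the query, and once with the adapted LM to produce the final translation.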
    <Paragraph position="4"> The interpolation factor λ can be estimated by cross validation or a grid search. As an alternative to using translations from the baseline system, we will also describe an approach that uses partial translations of the source sentence, obtained with the translation model only. In this case, no full translation needs to be carried out in the first step; only information from the translation model is used.</Paragraph>
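As a concrete illustration of the grid search for the interpolation factor, the sketch below (assumed code with toy probabilities, not the authors' implementation) picks the weight that maximizes log-likelihood on a held-out list of per-word probability pairs:

```python
import math

def interpolate(p_specific, p_background, lam):
    # Adapted probability: lam * P_specific + (1 - lam) * P_background
    return lam * p_specific + (1.0 - lam) * p_background

def grid_search_lambda(heldout, steps=10):
    """heldout: list of (p_specific, p_background) pairs, one per
    held-out word. Returns the grid point maximizing held-out
    log-likelihood of the interpolated model."""
    best_lam, best_ll = 0.0, float("-inf")
    for i in range(steps + 1):
        lam = i / steps
        ll = sum(math.log(interpolate(ps, pb, lam)) for ps, pb in heldout)
        if ll > best_ll:
            best_lam, best_ll = lam, ll
    return best_lam
```

When the specific LM assigns higher probability to every held-out word, the search drives λ toward 1; mixed evidence yields an interior optimum.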
    <Paragraph position="5"> Our approach focuses on query model building, using different levels of knowledge representations from the hypothesis set or from the translation model itself. The quality of the query models is crucial to the adapted language model's performance. Three bag-of-words query models are proposed and explained in the following sections.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Sentence Retrieval Process
</SectionTitle>
      <Paragraph position="0"> In our sentence retrieval process, the standard tf/idf (term frequency and inverse document frequency) term weighting scheme is used. The queries are built from the translation hypotheses. We follow (Eck, et al., 2004) in considering each sentence in the monolingual corpus as a document, as they have shown that this gives better results compared to retrieving entire news stories.</Paragraph>
      <Paragraph position="1"> Both the query and the sentences in the text corpus are converted into vectors by assigning a term weight to each word. The cosine similarity between the query vector and each sentence vector is then computed as their normalized inner product. All sentences are ranked by their similarity to the query, and the most similar sentences are used as the data for building the specific language model. In our experiments we use different numbers of similar sentences, ranging from one to several thousand.</Paragraph>
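The retrieval step can be sketched as follows. This is an assumed minimal implementation (not the authors' code): each corpus sentence is treated as a document, the query and sentences become tf-idf vectors, and sentences are ranked by cosine similarity with the query.

```python
import math
from collections import Counter

def tfidf_vector(tokens, idf):
    # Term weight = raw term frequency * inverse document frequency
    tf = Counter(tokens)
    return {w: tf[w] * idf.get(w, 0.0) for w in tf}

def cosine(u, v):
    # Cosine similarity: inner product normalized by vector lengths
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def retrieve(query_tokens, corpus, k=2):
    """corpus: list of tokenized sentences (each a 'document').
    Returns the k sentences most similar to the query."""
    n = len(corpus)
    df = Counter(w for sent in corpus for w in set(sent))
    idf = {w: math.log(n / df[w]) for w in df}
    qv = tfidf_vector(query_tokens, idf)
    scored = [(cosine(qv, tfidf_vector(s, idf)), s) for s in corpus]
    scored.sort(key=lambda x: -x[0])
    return [s for _, s in scored[:k]]
```

Words occurring in every sentence get idf zero and so contribute nothing to the ranking, which is the intended effect of the idf component.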
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Bag-of-words Query Models
</SectionTitle>
      <Paragraph position="0"> Different query models are designed to guide the data augmentation efficiently. We first define &quot;bag-of-words&quot; models, based on different levels of knowledge collected from the hypotheses of the statistical machine translation engine.</Paragraph>
      <Paragraph position="1">  The first-best hypothesis is the Viterbi path in the search space returned by the statistical machine translation decoder. It is the optimal hypothesis the statistical machine translation system can generate with the given translation and language models, subject to the applied pruning strategy. Ignoring word order, the hypothesis is converted into a bag-of-words representation, which is then used as a query.</Paragraph>
      <Paragraph position="3"> Each word in the query is weighted by the frequency of its occurrence in the hypothesis.</Paragraph>
      <Paragraph position="4"> The first-best hypothesis is the actual translation we want to improve, and usually it captures enough correct word translations to secure a sound adaptation process. But it can miss some informative translation words, which could lead to better-adapted language models.</Paragraph>
      <Paragraph position="5">  Similar to the first-best hypothesis, the n-best hypothesis list is converted into a bag-of-words representation. Words that occur in several translation hypotheses are simply repeated in the bag-of-words representation.</Paragraph>
      <Paragraph position="7"> The query vocabulary is the combined vocabulary of all n-best hypotheses, and each word is weighted by the frequency of its occurrence in the n-best hypothesis list.</Paragraph>
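Both bag-of-words query models can be illustrated in a few lines of Python (a sketch under an assumed tokenization, not the authors' code): word order is discarded, and words repeated across hypotheses accumulate counts.

```python
from collections import Counter

def bag_of_words_query(hypotheses):
    """hypotheses: list of tokenized translation hypotheses.
    With a single hypothesis this yields the first-best query; with
    the full n-best list, words shared by several hypotheses are
    counted once per occurrence, so frequent translation words
    dominate the query."""
    return Counter(w for hyp in hypotheses for w in hyp)
```

Passing only the first-best hypothesis gives the first query model; passing the whole n-best list gives the second.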
    </Section>
  </Section>
</Paper>