<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-0305">
  <Title>Reducing Parameter Space for Word Alignment</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Word Alignment algorithm
</SectionTitle>
    <Paragraph position="0"> We use IBM Model 4 (Brown et al., 1993) as the basis for our word alignment system. The model is implemented in the public software package GIZA++ (Och and Ney, 2000). We use the default parameters provided with the package; namely, training is bootstrapped from Model 1 (five iterations), the HMM model (five iterations), Model 3 (two iterations), and Model 4 (four iterations).</Paragraph>
    <Paragraph position="1"> IBM Model 4 is a conditional generative model, which generates an English sentence (and a word alignment) given a foreign sentence (French or Romanian, in our experiments here). In the generative process, each English word $e$ is duplicated $\phi$ times according to the probabilities given by the fertility table $n(\phi \mid e)$. Each duplicated English word $e$ is then translated to a French (or Romanian) word $f$ according to the probabilities given by the translation table $t(f \mid e)$. The position of $f$ in the French sentence is then moved from the position of $e$ in the English sentence by an offset $\Delta j$. The probability of $\Delta j$ is given by the distortion table $d(\Delta j \mid \mathcal{A}(e), \mathcal{B}(f))$, which is conditioned on the word classes $\mathcal{A}(e)$ and $\mathcal{B}(f)$. In GIZA++, the word classes are automatically detected by a bilingual clustering algorithm.</Paragraph>
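    <Paragraph> To make the generative story above concrete, the following sketch walks a single English word through the fertility, translation, and distortion steps. It is a minimal toy illustration only: the table contents, the word-class lookups, and the sampling helper are hypothetical placeholders, not GIZA++ internals (which, for example, treat distortion of head words and non-head words differently).
```python
import random

# Hypothetical toy parameter tables (real tables are estimated by GIZA++ with EM).
fertility = {"sanctions": {0: 0.05, 1: 0.85, 2: 0.10}}               # n(phi | e)
translation = {"sanctions": {"sanctions": 0.9, "mesures": 0.1}}      # t(f | e)
distortion = {("C_e1", "C_f1"): {-1: 0.2, 0: 0.5, 1: 0.3}}           # d(offset | A(e), B(f))
eng_class = {"sanctions": "C_e1"}                                    # A(e)
fr_class = {"sanctions": "C_f1", "mesures": "C_f1"}                  # B(f)

def sample(dist):
    """Draw one outcome from a {value: probability} table."""
    r, acc = random.random(), 0.0
    for value, p in dist.items():
        acc += p
        if r <= acc:
            return value
    return value  # fall back to the last value on rounding error

def generate(english_word, english_position):
    """Toy Model-4-style generative step for one English word."""
    phi = sample(fertility[english_word])               # 1. choose fertility
    output = []
    for _ in range(phi):
        f = sample(translation[english_word])           # 2. choose a translation
        classes = (eng_class[english_word], fr_class[f])
        offset = sample(distortion[classes])            # 3. choose a distortion offset
        output.append((f, english_position + offset))   # tentative French position
    return output

print(generate("sanctions", english_position=7))
```
</Paragraph>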
    <Paragraph position="2"> The translation table $t(f \mid e)$ dominates the parameter space as the vocabulary size grows. In this paper, we focus on how to reduce the size of the $t(f \mid e)$ table. We apply two additional methods, lemmatization and bilingual lexicon extraction, described below. We expect two advantages from reducing the model parameter space. One is lower memory usage, which allows us to use more training data. The other is an improved data-to-parameter ratio, and therefore better alignment accuracy.</Paragraph>
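    <Paragraph> As a rough, back-of-the-envelope illustration of why $t(f \mid e)$ dominates, the snippet below compares the nominal sizes of the translation and distortion tables. The vocabulary sizes, offset range, and class counts are hypothetical and chosen only for the arithmetic; GIZA++ in practice stores only co-occurring word pairs.
```python
# Hypothetical sizes, for illustration only.
V_e, V_f = 60_000, 80_000              # English / French vocabulary sizes
num_offsets = 21                       # e.g. distortion offsets in [-10, 10]
num_classes = 50                       # word classes per language

t_table = V_e * V_f                                 # t(f | e): one cell per word pair
d_table = num_classes * num_classes * num_offsets   # d(offset | A(e), B(f))

print(f"t(f|e) cells:     {t_table:,}")
print(f"distortion cells: {d_table:,}")
# Shrinking the vocabularies (e.g. by lemmatization) shrinks t(f|e) quadratically,
# while the class-based distortion table stays tiny by comparison.
```
</Paragraph>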
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Reducing the Parameter Space
</SectionTitle>
    <Paragraph position="0"> To reduce the model parameter space, we apply the following two methods: a rule-based word lemmatizer and a statistical bilingual lexicon extraction algorithm.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Word Lemmatizer
</SectionTitle>
      <Paragraph position="0"> We use a word lemmatizer program (XRCE, 2003), which converts inflected word forms into their root forms. We preprocess the training and test corpora with the lemmatizer. Figures 1 and 2 show examples of how the lemmatizer works.</Paragraph>
      <Paragraph position="1"> [Figure 1: lemmatization example]
English (original):   it would have been easy to say that these sanctions have to be followed rather than making them voluntary .
English (lemmatized): it would have be easy to say that these sanction have to be follow rather than make them voluntary .
French (original):    il aurait été facile de dire que il faut appliquer ces sanctions à le lieu de les rendre facultatives .
French (lemmatized):  il avoir être facile de dire que il falloir appliquer ce sanction à le lieu de le rendre facultatif .
[Figure 2: lemmatization example]
English (original):   this is being done to ensure that our children will receive a pension under the cpp .
English (lemmatized): this be be do to ensure that we child will receive a pension under the cpp .
French (original):    cela permettra à nos enfants de pouvoir bénéficier de le régime de pensions de le canada .
French (lemmatized):  cela permettre à notre enfant de pouvoir bénéficier de le régime de pension de le canada .
Applying the lemmatizer reduces the parameter space for the alignment algorithm by reducing the vocabulary size. Nouns (and, for French, adjectives) with different gender and number forms are grouped into the same word. Verbs with different tenses (present, past, etc.) and aspects (-ing, -ed, etc.) are mapped to the same root word.</Paragraph>
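      <Paragraph> A small sketch of the effect described above: mapping inflected forms to a shared root shrinks the vocabulary that the translation table must cover. The tiny lemma dictionary is a hypothetical stand-in for the XRCE lemmatizer.
```python
# Hypothetical lemma dictionary standing in for the rule-based lemmatizer.
LEMMA = {
    "been": "be", "is": "be", "being": "be", "sanctions": "sanction",
    "followed": "follow", "making": "make", "children": "child",
}

def lemmatize(tokens):
    """Replace each token by its root form when the lemmatizer knows one."""
    return [LEMMA.get(tok, tok) for tok in tokens]

sentence = ("it would have been easy to say that these sanctions "
            "have to be followed rather than making them voluntary .").split()

original_vocab = set(sentence)
lemmatized_vocab = set(lemmatize(sentence))
print(len(original_vocab), "->", len(lemmatized_vocab))  # the vocabulary shrinks
```
</Paragraph>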
      <Paragraph position="2"> In particular, French verbs have many different conjugations: Some verb variants appear only once or twice in a corpus, and the statistics for those rare words are unreliable. Thus, we expect to improve the model accuracy by treating those variants as the same word.</Paragraph>
      <Paragraph position="3"> On the other hand, there is a danger that lemmatization may lose useful information provided by the inflected form of a word. In particular, special words such as do and be may have different usage patterns for each variant (e.g., done vs. doing). In that case, lemmatization may actually hurt the performance.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Bilingual Lexical Extraction
</SectionTitle>
      <Paragraph position="0"> Another additional component we use is a bilingual lexicon extraction algorithm. We run the algorithm over the same training data and obtain a list of word translation pairs. The extracted word-pair list is used as additional training data for GIZA++, which biases the alignment model parameters. This does not actually reduce the parameter space, but if the bias is taken to the extreme (e.g., some of the model parameters are fixed to zero), it reduces the parameter space in effect.</Paragraph>
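      <Paragraph> One simple way to realize this bias, sketched below under the assumption that GIZA++ is trained from plain parallel text files, is to append each extracted word pair to the bitext as an extra one-word "sentence pair", optionally several times. The file names and duplication factor are hypothetical.
```python
def append_lexicon_to_bitext(lexicon_pairs, src_path, tgt_path, copies=1):
    """Append extracted (source_word, target_word) pairs to the parallel
    training files as one-word sentence pairs, duplicated `copies` times."""
    with open(src_path, "a", encoding="utf-8") as src, \
         open(tgt_path, "a", encoding="utf-8") as tgt:
        for _ in range(copies):
            for e_word, f_word in lexicon_pairs:
                src.write(e_word + "\n")
                tgt.write(f_word + "\n")

# Hypothetical usage: bias the model toward two extracted pairs.
pairs = [("child", "enfant"), ("pension", "pension")]
append_lexicon_to_bitext(pairs, "train.en", "train.fr", copies=2)
```
</Paragraph>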
      <Paragraph position="1"> For the bilingual lexicon extraction, we use a word alignment model different from the IBM models. The purpose of using a different model is to extract 1-to-1 word translation pairs more reliably. The model (described below) assumes that each translation sentence pair is preprocessed so that each sentence is reduced to a sequence of content words.</Paragraph>
      <Paragraph position="2"> To select content words, we apply a part-of-speech tagger and remove non-content words (such as determiners and prepositions). As the model focuses on the alignment of content words, we expect better performance than the IBM models for extracting content word translation pairs.</Paragraph>
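      <Paragraph> A minimal sketch of the content-word filtering step, assuming an English POS tagger is available; NLTK's default tagger is used here purely as a stand-in for the taggers actually used, and the chosen tag prefixes are an assumption.
```python
import nltk  # assumes nltk and its default POS-tagger data are installed

# Keep only open-class (content) tags: nouns, verbs, adjectives, adverbs.
CONTENT_PREFIXES = ("NN", "VB", "JJ", "RB")

def content_words(sentence):
    """Return the content words of a tokenized English sentence."""
    tagged = nltk.pos_tag(sentence.split())
    return [word for word, tag in tagged if tag.startswith(CONTENT_PREFIXES)]

print(content_words("this is being done to ensure that our children will receive a pension"))
```
</Paragraph>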
      <Paragraph position="3"> We give here a brief description of the bilingual lexicon extraction method we use. This method takes a parallel corpus as input and produces a probabilistic bilingual lexicon. Our approach relies on the word-to-word translation lexicon obtained from parallel corpora following the method described in (Hull, 1999), which is based on the word-to-word alignment presented in (Hiemstra, 1996).</Paragraph>
      <Paragraph position="4"> We first represent co-occurrences between words across translations by a matrix, the rows of which represent the source language words, the columns the target language words, and the elements of the matrix the expected alignment frequencies (EAFs) for the words appearing in the corresponding row and column. Empty words are added in both languages in order to deal with words with no equivalent in the other language.</Paragraph>
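      <Paragraph> A sketch of how such an initial co-occurrence matrix could be built from content-word sentence pairs, with an explicit empty word added on each side; the data structures and the EMPTY token name are assumptions made only for illustration.
```python
from collections import defaultdict

EMPTY = "<EMPTY>"  # hypothetical empty-word token for words with no equivalent

def initial_eaf_matrix(sentence_pairs):
    """Build initial expected-alignment-frequency estimates from raw
    co-occurrence counts of source/target content words, plus empty words."""
    eaf = defaultdict(float)  # (source_word, target_word) -> count
    for src_words, tgt_words in sentence_pairs:
        for s in src_words + [EMPTY]:
            for t in tgt_words + [EMPTY]:
                if s == EMPTY and t == EMPTY:
                    continue  # no need to pair the two empty words with each other
                eaf[(s, t)] += 1.0
    return eaf

pairs = [(["child", "receive", "pension"], ["enfant", "recevoir", "pension"])]
print(len(initial_eaf_matrix(pairs)))  # number of non-zero matrix cells
```
</Paragraph>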
      <Paragraph position="5"> The estimation of the expected alignment frequency is based on the Iterative Proportional Fitting Procedure (IPFP) presented in (Bishop et al., 1975). This iterative procedure updates the current estimate $f^{(k)}_{ij}$ of the EAF of source word $i$ with target word $j$, using the following two-stage equations:
$$ f^{(k,1)}_{ij} = f^{(k-1)}_{ij} \cdot \frac{s_i}{f^{(k-1)}_{i\cdot}} \qquad (1) $$
$$ f^{(k)}_{ij} = f^{(k,1)}_{ij} \cdot \frac{s_j}{f^{(k,1)}_{\cdot j}} \qquad (2) $$
where $f_{i\cdot}$ and $f_{\cdot j}$ are the current estimates of the row and column marginals, $s$ is a pair of aligned sentences containing words $i$ and $j$, and $s_i$ and $s_j$ are the observed frequencies of words $i$ and $j$ in $s$. The initial estimates $f^{(0)}_{ij}$ are the observed frequencies of co-occurrences, obtained by considering each pair of aligned sentences and by incrementing the alignment frequencies accordingly. The sequence of updates eventually converges, and the EAFs are then normalized (by dividing each element $f_{ij}$ by the row marginal $f_{i\cdot}$), so as to yield probabilistic translation lexicons, in which each source word is associated with a target word through a score.</Paragraph>
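      <Paragraph> The row/column adjustments can be sketched as follows. This is a minimal dense-matrix version of the standard IPFP adjustment toward target marginals (here the observed in-sentence frequencies), not the implementation actually used; the variable names and toy data are assumptions.
```python
def ipfp_step(eaf, row_targets, col_targets):
    """One two-stage IPFP update of a dense EAF matrix (list of rows):
    scale rows to match row_targets, then columns to match col_targets."""
    # Stage 1: row adjustment.
    for i, row in enumerate(eaf):
        row_sum = sum(row)
        if row_sum > 0:
            scale = row_targets[i] / row_sum
            eaf[i] = [x * scale for x in row]
    # Stage 2: column adjustment.
    for j in range(len(eaf[0])):
        col_sum = sum(row[j] for row in eaf)
        if col_sum > 0:
            scale = col_targets[j] / col_sum
            for row in eaf:
                row[j] *= scale
    return eaf

def normalize_rows(eaf):
    """Turn converged EAFs into a probabilistic lexicon: divide by row marginals."""
    return [[x / sum(row) if sum(row) > 0 else 0.0 for x in row] for row in eaf]

# Toy example: 2 source words x 2 target words, observed frequencies of 1 each.
matrix = [[1.0, 1.0], [1.0, 1.0]]
for _ in range(10):  # iterate until (approximately) converged
    matrix = ipfp_step(matrix, row_targets=[1.0, 1.0], col_targets=[1.0, 1.0])
print(normalize_rows(matrix))
```
</Paragraph>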
      <Paragraph position="10"> Using the bilingual lexicon thus obtained, we apply a simple heuristic, based on the best-match criterion described in (Gaussier et al., 2000), to align lexical words within sentences. We then count how many times two given words are aligned in this way, and normalize the counts so as to obtain our final probabilistic translation lexicon.</Paragraph>
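      <Paragraph> A sketch of the within-sentence alignment and counting step is given below. The greedy highest-score matching is one plausible reading of a best-match heuristic and is not claimed to be the exact criterion of (Gaussier et al., 2000); the data structures are assumptions.
```python
from collections import defaultdict

def greedy_best_match(src_words, tgt_words, lexicon_score):
    """Greedily link each source word with its highest-scoring free target word."""
    candidates = sorted(
        ((lexicon_score.get((s, t), 0.0), s, t) for s in src_words for t in tgt_words),
        reverse=True)
    used_src, used_tgt, links = set(), set(), []
    for score, s, t in candidates:
        if score > 0 and s not in used_src and t not in used_tgt:
            links.append((s, t))
            used_src.add(s)
            used_tgt.add(t)
    return links

def build_final_lexicon(sentence_pairs, lexicon_score):
    """Count best-match links over the corpus and normalize per source word."""
    counts, totals = defaultdict(float), defaultdict(float)
    for src_words, tgt_words in sentence_pairs:
        for s, t in greedy_best_match(src_words, tgt_words, lexicon_score):
            counts[(s, t)] += 1.0
            totals[s] += 1.0
    return {(s, t): c / totals[s] for (s, t), c in counts.items()}
```
</Paragraph>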
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Experiments
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 English-French shared task
</SectionTitle>
      <Paragraph position="0"> In the English-French shared task, we examined the effect of the word lemmatizer. Table 1 shows the results.1 1In the table, AER stands for Alignment Error Rate computed without null-aligned words, and AERn is the AER computed with null-aligned words. See the workshop shared-task guidelines for the definition of AER. Mem is the memory required to run GIZA++.</Paragraph>
      <Paragraph position="1"> Due to resource constraints, we used only a portion of the corpus provided by the shared task organizers. Most of our English-French experiments were carried out with half (10 million) or a quarter (5 million) of the training corpus. We ran three different systems (nolem, base, and delem) with several different parameter settings. The system nolem is plain GIZA++; for nolem, we only lowercased the training and test corpora. In base and delem, the corpora were preprocessed by the lemmatizer. In the base system, the lemmatizer was applied blindly to all words, while in delem, only rare words were lemmatized.</Paragraph>
      <Paragraph position="2"> As seen in Table 1, applying the lemmatizer blindly (base) hurt the performance. We hypothesized that the lemmatizer hurts more when the corpus size is larger.</Paragraph>
      <Paragraph position="3"> In fact, the Trial AER was better for base-ef-56k than for nolem-ef-56k. We then tested the performance when we lemmatized only rare words, using a word frequency threshold to decide whether or not to lemmatize. For example, delem-ef-2-280k lemmatized a word if it appeared fewer than two times in the training corpus. In general, selective lemmatization (delem-ef-*-280k) works better than complete lemmatization (base-ef-280k). For some thresholds (delem-ef-{100,1000}-280k), the Test AER was slightly better than with no lemmatization (nolem-ef-280k). However, from this experiment it is not clear where the threshold should be set; we are now investigating this issue.</Paragraph>
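      <Paragraph> The selective (delem-style) lemmatization can be sketched as follows; the threshold semantics mirror the description above (lemmatize a word only when its training-corpus frequency falls below the threshold), and the lemmatizer lookup and toy corpus are hypothetical stand-ins.
```python
from collections import Counter

def selective_lemmatize(corpus_tokens, lemma_of, threshold):
    """Lemmatize only tokens whose surface-form frequency in the training
    corpus is below `threshold`; frequent forms are kept as-is."""
    freq = Counter(corpus_tokens)
    return [lemma_of.get(tok, tok) if freq[tok] < threshold else tok
            for tok in corpus_tokens]

# Hypothetical lemmatizer lookup and a tiny corpus, for illustration only.
lemma_of = {"sanctions": "sanction", "followed": "follow"}
tokens = "the sanctions were followed the sanctions were respected".split()
print(selective_lemmatize(tokens, lemma_of, threshold=2))
```
</Paragraph>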
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Romanian-English shared task
</SectionTitle>
      <Paragraph position="0"> In the Romanian-English shared task, we examined how the bilingual lexicon extraction method affects the performance. Table 2 shows the results.</Paragraph>
      <Paragraph position="1"> We ran three systems, nolem, base, and trilex, for this task. The first two systems are the same as in the English-French shared task, except that we use a lemmatizer only for English.2 The system trilex uses an additional bilingual lexicon for training GIZA++. The lexicon was extracted by the algorithm described in Section 3.2. We tried different values of a probability threshold to decide which extracted lexicon entries are used; the threshold is applied to the estimated word translation probability given by the extraction algorithm. We also tested the effect of duplicating the additional lexicon 2, 5, 10, or 20 times, to further bias the model parameters.</Paragraph>
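      <Paragraph> The threshold-and-duplicate scheme can be sketched as follows; the threshold value, lexicon entries, and duplication factor are hypothetical, and the duplicated list would then be appended to the GIZA++ training data as in the sketch of Section 3.2.
```python
def select_and_duplicate(lexicon, prob_threshold=0.1, copies=2):
    """Keep extracted (source, target, probability) entries whose estimated
    translation probability clears the threshold, then duplicate them."""
    kept = [(e, f) for e, f, p in lexicon if p >= prob_threshold]
    return kept * copies

# Hypothetical extracted lexicon with estimated translation probabilities.
lexicon = [("copil", "child", 0.62), ("pensie", "pension", 0.35), ("de", "of", 0.04)]
extra_pairs = select_and_duplicate(lexicon, prob_threshold=0.1, copies=2)
print(extra_pairs)  # the low-probability entry is filtered out
```
</Paragraph>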
      <Paragraph position="2"> As our extraction method currently assumes word lemmatization, we only compare the trilex results with the base systems. As seen in Table 2, performance was better when the extracted lexicon entries were added to the training data (e.g., base-er-all vs. trilex-er-all-01). The lexicon duplication worked best when the lexicon was duplicated only twice, i.e., duplicating the additional lexicon too many times hurt the performance. For the probability threshold, a lower setting (i.e., adding more words) worked better. Due to time constraints, we did not test even lower thresholds.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Discussion
</SectionTitle>
    <Paragraph position="0"> As we expected, the lemmatizer reduced the memory requirement, and it improved the word alignment accuracy when applied only to infrequent words. The behavior under different frequency thresholds for deciding whether or not to lemmatize is still unclear, and we are now investigating this issue.</Paragraph>
    <Paragraph position="1"> Adding extracted bilingual lexicons to the training data also showed some improvement in the alignment accuracy. Due to our experimental setup, we were unable to carry out this experiment with selective lemmatization; we plan to try such an experiment soon. 2We do not have a Romanian lemmatizer, but we used a part-of-speech tagger for Romanian by Dan Tufis to extract the bilingual lexicon.</Paragraph>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 Conclusion
</SectionTitle>
    <Paragraph position="0"> We presented our experimental results for the workshop shared task, using IBM Model 4 as a baseline and a word lemmatizer and a bilingual lexicon extraction algorithm as additional components. Both components showed some improvement over the baseline, and the results suggest the need for careful parameter settings.</Paragraph>
  </Section>
</Paper>