<?xml version="1.0" standalone="yes"?>
<Paper uid="N06-1001">
  <Title>Capitalizing Machine Translation</Title>
  <Section position="4" start_page="0" end_page="1" type="metho">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> ... YOUR CHANGES TO /HOME/DOC .", where all words are in all upper-case. Without looking into the case of the MT input, we can hardly get the correct capitalization result.</Paragraph>
    <Paragraph position="1"> Although monolingual capitalization models in previous work can apply to MT output, a bilingual model is more desirable. This is because MT outputs usually strongly preserve case from the input, and because monolingual capitalization models do not always perform as well on badly translated text as on well-formed syntactic texts.</Paragraph>
    <Paragraph position="2"> In this paper, we present a bilingual capitalization model for capitalizing machine translation outputs using conditional random fields (CRFs) (Lafferty et al., 2001). This model exploits case information from both the input sentence (source) and the output sentence (target) of the MT system. We define a series of feature functions to incorporate capitalization knowledge into the model.</Paragraph>
    <Paragraph position="3"> Experimental results are shown in terms of BLEU scores of a phrase-based SMT system with the capitalization model incorporated, and in terms of capitalization precision. Experiments are performed on both French and English targeted MT systems with large-scale training data. Our experimental results show that the CRF-based bilingual capitalization model performs better than a strong baseline capitalizer that uses a trigram language model.</Paragraph>
  </Section>
  <Section position="5" start_page="1" end_page="1" type="metho">
    <SectionTitle>
2 Related Work
</SectionTitle>
    <Paragraph position="0"> A simple capitalizer is the 1-gram tagger: the case of a word is always the most frequent one observed in training data, with the exception that the sentence-initial word is always capitalized. A 1-gram capitalizer is usually used as a baseline for capitalization experiments (Lita et al., 2003; Kim and Woodland, 2004; Chelba and Acero, 2004).</Paragraph>
    <Paragraph position="1"> Lita et al. (2003) view capitalization as a lexical ambiguity resolution problem, where the lexical choices for each lowercased word happen to be its different surface forms. For a lowercased sentence e, a trigram language model is used to find the best capitalization tag sequence T that maximizes p(T,e) = p(E), resulting in a case-sensitive sentence E. Besides local trigrams, sentence-level contexts like sentence-initial position are employed as well.</Paragraph>
    <Paragraph position="2"> Chelba and Acero (2004) frame capitalization as a sequence labeling problem, where, for each low- null by most statistical MT systems.</Paragraph>
    <Paragraph position="3"> ercased sentence e, they find the label sequence T that maximizes p(Tje). They use a maximum entropy Markov model (MEMM) to combine features of words, cases and context (i.e., tag transitions). Gale et al. (1994) report good results on capitalizing 100 words. Mikheev (1999) performs capitalization using simple positional heuristics.</Paragraph>
  </Section>
  <Section position="6" start_page="1" end_page="2" type="metho">
    <SectionTitle>
3 Monolingual Capitalization Scheme
</SectionTitle>
    <Paragraph position="0"> Translation and capitalization are usually performed in two successive steps because removing case information from the training of translation models substantially reduces both the source and target vocabulary sizes. Smaller vocabularies lead to a smaller translation model with fewer parameters to learn.</Paragraph>
    <Paragraph position="1"> For example, if we do not remove the case information, we will have to deal with at least nine probabilities for the English-French word pair (click, cliquez). This is because either &amp;quot;click&amp;quot; or &amp;quot;cliquez&amp;quot; can have at least three tags (IU, AL, AU), and thus three surface forms. A smaller translation model requires less training data, and can be estimated more accurately than otherwise from the same amount of training data. A smaller translation model also means less memory usage.</Paragraph>
    <Paragraph position="2"> Most statistical MT systems employ the monolingual capitalization scheme as shown in Figure 1. In this scheme, the translation model and the target language model are trained from the lowercased corpora. The capitalization model is trained from the case-sensitive target corpus. In decoding, we first turn input into lowercase, then use the decoder to generate the lowercased translation, and finally ap- null ply the capitalization model to recover the case of the decoding output.</Paragraph>
    <Paragraph position="3"> The monolingual capitalization scheme makes many errors as shown in Table 1. Each cell in the table contains the MT-input and the MT-output.</Paragraph>
    <Paragraph position="4"> These errors are due to the capitalizer does not have access to the source sentence.</Paragraph>
    <Paragraph position="5"> Regardless, estimating mixed-cased translation models, however, is a very interesting topic and worth future study.</Paragraph>
  </Section>
  <Section position="7" start_page="2" end_page="4" type="metho">
    <SectionTitle>
4 Bilingual Capitalization Model
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="2" end_page="3" type="sub_section">
      <SectionTitle>
4.1 The Model
</SectionTitle>
      <Paragraph position="0"> Our probabilistic bilingual capitalization model exploits case information from both the input sentence to the MT system and the output sentence from the system (see Figure 2). An MT system translates a capitalized sentence F into a lowercased sentence e.</Paragraph>
      <Paragraph position="1"> A statistical MT system can also provide the alignment A between the input F and the output e; for example, a statistical phrase-based MT system could provide the phrase boundaries in F and e, and also the alignment between the phrases.1  aries.</Paragraph>
      <Paragraph position="2"> The bilingual capitalization algorithm recovers the capitalized sentence E from e, according to the input sentence F, and the alignment A. Formally, we look for the best capitalized sentence E[?] such</Paragraph>
      <Paragraph position="4"> where GEN(e) is a function returning the set of possible capitalized sentences consistent with e. Notice that e does not appear in p(EjF,A) because we can uniquely obtain e from E. p(EjF,A) is the capitalization model of concern in this paper.2 To further decompose the capitalization model p(EjF,A), we make some assumptions. As shown in Figure 3, input sentence F, capitalized output E, and their alignment can be viewed as a graph. Vertices of the graph correspond to words in F and E. An edge connecting a word in F and a word in E corresponds to a word alignment. An edge between two words in E represents the dependency between them captured by monolingual n-gram language models. We also assume that both E and F have phrase boundaries available (denoted by the square brackets), and that A is the phrase alignment.</Paragraph>
      <Paragraph position="5"> In Figure 3, ~Fj is the j-th phrase of F, ~Ei is the i-th phrase of E, and they align to each other. We do not require a word alignment; instead we find it reasonable to think that a word in ~Ei can be aligned to any adapted to syntax-based machine translation, too. To this end, the translational correspondence is described within a translation rule, i.e., (Galley et al., 2004) (or a synchronous production), rather than a translational phrase pair; and the training data will be derivation forests, instead of the phrase-aligned bilingual corpus.</Paragraph>
      <Paragraph position="6"> 2The capitalization model p(E|F, A) itself does not require the existence of e. This means that in principle this model can also be viewed as a capitalized translation model that performs translation and capitalization in an integrated step. In our paper, however, we consider the case where the machine translation output e is given, which is reflected by the the fact that GEN(e) takes e as input in Formula 1.</Paragraph>
      <Paragraph position="7">  word in ~Fj. A probabilistic model defined on this graph is a Conditional Random Field. Therefore, it is natural to formulate the bilingual capitalization model using CRFs:3</Paragraph>
      <Paragraph position="9"> on this capitalization model, the decoder in the capitalizer looks for the best E[?] such that</Paragraph>
      <Paragraph position="11"/>
    </Section>
    <Section position="2" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
4.2 Parameter Estimation
</SectionTitle>
      <Paragraph position="0"> Following Roark et al. (2004), Lafferty et al. (2001) and Chen and Rosenfeld (1999), we are looking for the set of feature weights l maximizing the regularized log-likelihood LLR(l) of the training data</Paragraph>
      <Paragraph position="2"> The second term at the right-hand side of Formula 5 is a zero-mean Gaussian prior on the parameters. s is the variance of the Gaussian prior dictating the cost of feature weights moving away from the mean -- a smaller value of s keeps feature weights closer to the mean. s can be determined by linear search on development data.4 The use of the Gaussian prior term in the objective function has been found effective in avoiding overfitting, leading to consistently better results. The choice of LLR as an objective function can be justified as maximum a-posteriori (MAP) training within a Bayesian approach (Roark et al., 2004).</Paragraph>
    </Section>
    <Section position="3" start_page="3" end_page="4" type="sub_section">
      <SectionTitle>
4.3 Feature Functions
</SectionTitle>
      <Paragraph position="0"> We define features based on the alignment graph in Figure 3. Each feature function is defined on a word.</Paragraph>
      <Paragraph position="1"> Monolingual language model feature. The monolingual LM feature of word Ei is the logarithm of the probability of the n-gram ending at</Paragraph>
      <Paragraph position="3"> p should be appropriately smoothed such that it never returns zero.</Paragraph>
      <Paragraph position="4"> Capitalized translation model feature. Suppose E phrase &amp;quot;Click OK&amp;quot; is aligned to F phrase &amp;quot;Cliquez OK&amp;quot;. The capitalized translation model feature of &amp;quot;Click&amp;quot; is computed as log p(ClickjCliquez)+log p(ClickjOK). &amp;quot;Click&amp;quot; is assumed to be aligned to any word in the F phrase.</Paragraph>
      <Paragraph position="5"> The larger the probability that &amp;quot;Click&amp;quot; is translated from an F word, i.e., &amp;quot;Cliquez&amp;quot;, the more chances that &amp;quot;Click&amp;quot; preserves the case of &amp;quot;Cliquez&amp;quot;. Formally, for word Ei, and an aligned phrase pair ~El and ~Fm, where Ei 2 ~El, the capitalized translation model feature of Ei is</Paragraph>
      <Paragraph position="7"> needs smoothing to avoid returning zero, and is estimated from a word-aligned bilingual corpus.</Paragraph>
      <Paragraph position="8"> Capitalization tag translation feature. The feature value of E word &amp;quot;Click&amp;quot; aligning to F phrase &amp;quot;Cliquez OK&amp;quot; is log p(IUjIU)p(clickjcliquez) + log p(IUjAU)p(clickjok). We see that this feature is less specific than the capitalized translation model feature. It is computed in terms of the tag translation probability and the lowercased word translation probability. The lowercased word translation probability, i.e., p(clickjok), is used to decide how much of the tag translation probability, i.e., p(IUjAU), will contribute to the final decision. The smaller the word translation probability, i.e., p(clickjok), is, the smaller the chance that the surface form of &amp;quot;click&amp;quot;  preserves case from that of &amp;quot;ok&amp;quot;. Formally, this feature is defined as</Paragraph>
      <Paragraph position="10"> p(eij ~fm,k) is the t-table over lowercased word pairs, which is the usual &amp;quot;t-table&amp;quot; in a SMT system.</Paragraph>
      <Paragraph position="11"> p(t(Ei)jt( ~Fm,k)) is the probability of a target capitalization tag given a source capitalization tag and can be easily estimated from a word-aligned bilingual corpus. This feature attempts to help when fcap[?]t1 fails (i.e., the capitalized word pair is unseen). Smoothing is also applied to both p(eij ~fm,k) and p(t(Ei)jt( ~Fm,k)) to handle unseen words (or word pairs).</Paragraph>
      <Paragraph position="12"> Upper-case translation feature. Word Ei is in all upper case if all words in the corresponding F phrase ~Fm are in upper case. Although this feature can also be captured by the capitalization tag translation feature in the case where an AU tag in the input sentence is most probably preserved in the output sentence, we still define it to emphasize its effect. This feature aims, for example, to translate &amp;quot;ABC XYZ&amp;quot; into &amp;quot;UUU VVV&amp;quot; even if all words are unseen.</Paragraph>
      <Paragraph position="13"> Initial capitalization feature. An E word is initially capitalized if it is the first word that contains letters in the E sentence. For example, for sentence &amp;quot; Please click the button&amp;quot; that starts with a bullet, the initial capitalization feature value of word &amp;quot;please&amp;quot; is 1 because &amp;quot; &amp;quot; does not contain a letter. Punctuation feature template. An E word is initially capitalized if it follows a punctuation mark.</Paragraph>
      <Paragraph position="14"> Non-sentence-ending punctuation marks like commas will usually get negative weights.</Paragraph>
      <Paragraph position="15"> As one can see, our features are &amp;quot;coarse-grained&amp;quot; (e.g., the language model feature). In contrast, Kim and Woodland (2004) and Roark et al. (2004) use &amp;quot;fine-grained&amp;quot; features. They treat each n-gram as a feature for, respectively, monolingual capitalization and language modeling. Feature weights tuned at a fine granularity may lead to better accuracy, but they require much more training data, and result in much slower training speed, especially for large-scale learning problems. Coarse-grained features enable us to efficiently get the feature values from a very large training corpus, and quickly tune the weights on small development sets. For example, we can train a bilingual capitalization model on a 70 million-word corpus in several hours with the coarse-grained features presented above, but in several days with fine-grained n-gram count features.</Paragraph>
    </Section>
    <Section position="4" start_page="4" end_page="4" type="sub_section">
      <SectionTitle>
4.4 The GEN Function
</SectionTitle>
      <Paragraph position="0"> Function GEN generates the set of case-sensitive candidates from a lowercased token. For example GEN(mt) = fmt,mT,Mt,MTg. The following heuristics can be used to reduce the range of GEN. The returned set of GEN on a lower-cased token w is the union of: (i) fw,AU(w),IU(w)g, (ii) fvjv is seen in training data and AL(v) = wg, and (iii) f ~Fm,kjAL( ~Fm,k) = AL(w)g. The heuristic (iii) is designed to provide more candidates for w when it is translated from a very strange input word ~Fm,k in the F phrase ~Fm that is aligned to the phrase that w is in. This heuristic creates good capitalization candidates for the translation of URLs, file names, and file paths.</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="4" end_page="5" type="metho">
    <SectionTitle>
5 Generating Phrase-Aligned Training
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="4" end_page="5" type="sub_section">
      <SectionTitle>
Data
</SectionTitle>
      <Paragraph position="0"> Training the bilingual capitalization model requires a bilingual corpus with phrase alignments, which are usually produced from a phrase aligner. In practice, the task of phrase alignment can be quite computationally expensive as it requires to translate the entire training corpus; also a phrase aligner is not always available. We therefore generate the training data using a na&amp;quot;ive phrase aligner (NPA) instead of resorting to a real one.</Paragraph>
      <Paragraph position="1"> The input to the NPA is a word-aligned bilingual corpus. The NPA stochastically chooses for each sentence pair one segmentation and phrase alignment that is consistent with the word alignment. An aligned phrase pair is consistent with the word alignment if neither phrase contains any word aligning to a word outside the other phrase (Och and Ney, 2004). The NPA chunks the source sentence into phrases according to a probabilistic distribution over source phrase lengths. This distribution can be obtained from the trace output of a phrase-based MT  decoder on a small development set. The NPA has to retry if the current source phrase cannot find any consistent target phrase. Unaligned target words are attached to the left phrase. Heuristics are employed to prevent the NPA from not coming to a solution.</Paragraph>
      <Paragraph position="2"> Obviously, the NPA is a special case of the phrase extractor in (Och and Ney, 2004) in that it considers only one phrase alignment rather than all possible ones.</Paragraph>
      <Paragraph position="3"> Unlike a real phrase aligner, the NPA need not wait for the training of the translation model to finish, making it possible for parallelization of translation model training and capitalization model training. However, we believe that a real phrase aligner may make phrase alignment quality higher.</Paragraph>
    </Section>
  </Section>
  <Section position="9" start_page="5" end_page="6" type="metho">
    <SectionTitle>
6 Experiments
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="5" end_page="5" type="sub_section">
      <SectionTitle>
6.1 Settings
</SectionTitle>
      <Paragraph position="0"> We conducted capitalization experiments on three language pairs: English-to-French (E!F) with a bilingual corpus from the Information Technology (IT) domain; French-to-English (F!E) with a bilingual corpus from the general news domain; and Chinese-to-English (C!E) with a bilingual corpus from the general news domain as well. Each language pair comes with a training corpus, a development corpus and two test sets (see Table 2). Test-Precision is used to test the capitalization precision of the capitalizer on well-formed sentences drawn from genres similar to those used for training. Test-BLEU is used to assess the impact of our capitalizer on end-to-end translation performance; in this case, the capitalizer may operate on ungrammatical sentences. We chose to work with these three language pairs because we wanted to test our capitalization model on both English and French target MT systems and in cases where the source language has no case information (such as in Chinese).</Paragraph>
      <Paragraph position="1"> We estimated the feature functions, such as the log probabilities in the language model, from the training set. Kneser-Ney smoothing (Kneser and Ney, 1995) was applied to features fLM, fcap*t1, and fcap*tag*t1. We trained the feature weights of the CRF-based bilingual capitalization model using the development set. Since estimation of the feature weights requires the phrase alignment information, we efficiently applied the NPA on the development set.</Paragraph>
      <Paragraph position="2"> We employed two LM-based capitalizers as baselines for performance comparison: a unigram-based capitalizer and a strong trigram-based one. The unigram-based capitalizer is the usual baseline for capitalization experiments in previous work. The trigram-based baseline is similar to the one in (Lita et al., 2003) except that we used Kneser-Ney smoothing instead of a mixture.</Paragraph>
      <Paragraph position="3"> A phrase-based SMT system (Marcu and Wong, 2002) was trained on the bitext. The capitalizer was incorporated into the MT system as a post-processing module -- it capitalizes the lowercased MT output. The phrase boundaries and alignments needed by the capitalizer were automatically inferred as part of the decoding process.</Paragraph>
    </Section>
    <Section position="2" start_page="5" end_page="6" type="sub_section">
      <SectionTitle>
6.2 BLEU and Precision
</SectionTitle>
      <Paragraph position="0"> We measured the impact of our capitalization model in the context of an end-to-end MT system using BLEU (Papineni et al., 2001). In this context, the capitalizer operates on potentially ill-formed, MTproduced outputs.</Paragraph>
      <Paragraph position="1"> To this end, we first integrated our bilingual capitalizer into the phrase-based SMT system as a post-processing module. The decoder of the MT system was modified to provide the capitalizer with the case-preserved source sentence, the lowercased translation, and the phrase boundaries and their alignments. Based on this information, our bilingual capitalizer recovers the case information of the lowercased translation, outputting a capitalized target sentence. The case-restored machine translations were evaluated against the target test-BLEU set. For comparison, BLEU scores were also computed for an MT system that used the two LM-based baselines.</Paragraph>
      <Paragraph position="2"> We also assessed the performance of our capitalizer on the task of recovering case information for well-formed grammatical texts. To this end, we used the precision metric that counted the number of cor- null rectly capitalized words produced by our capitalizer on well-formed, lowercased input precision = #correctly capitalized words#total words (9) To obtain the capitalization precision, we implemented the capitalizer as a standalone program. The inputs to the capitalizer were triples of a case-preserved source sentence, a lowercased target sentence, and phrase alignments between them. The output was the case-restored version of the target sentence. In this evaluation scenario, the capitalizer output and the reference differ only in case information -- word choices and word orders between them are the same. Testing was conducted on Test-Precision. We applied the NPA to the Test-Precision set to obtain the phrases and their alignments because they were needed to trigger the features in testing. We used a Test-Precision set that was different from the Test-BLEU set because word alignments were by-products only of training of translation models on the MT training data and we could not put the Test-BLEU set into the MT training data. Rather than implementing a standalone word aligner, we randomly divided the MT training data into three non-overlapping sets: Test-Precision set, CRF capitalizer training set and dev set.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>