<?xml version="1.0" standalone="yes"?>
<Paper uid="J98-4003">
  <Title>Machine Transliteration</Title>
  <Section position="3" start_page="600" end_page="602" type="metho">
    <SectionTitle>
2. A Modular Learning Approach
</SectionTitle>
    <Paragraph position="0"> Bilingual glossaries contain many entries mapping katakana phrases onto English phrases, e.g., (aircraft carrier ~ sT ~ ~ ~ 1- ~ -~- ~J T ). It is possible to automatically analyze such pairs to gain enough knowledge to accurately map new katakana phrases that come along, and this learning approach travels well to other language pairs. A naive approach to finding direct correspondences between English letters and katakana  Computational Linguistics Volume 24, Number 4 symbols, however, suffers from a number of problems. One can easily wind up with a system that proposes iskrym as a back-transliteration of aisukuriimu. Taking letter frequencies into account improves this to a more plausible-looking isclim. Moving to real words may give is crime: the i corresponds to ai, the s corresponds to su, etc. Unfortunately, the correct answer here is ice cream.</Paragraph>
    <Paragraph position="1"> After initial experiments along these lines, we stepped back and built a generative model of the transliteration process, which goes like this:  An English phrase is written.</Paragraph>
    <Paragraph position="2"> A translator pronounces it in English.</Paragraph>
    <Paragraph position="3"> The pronunciation is modified to fit the Japanese sound inventory.</Paragraph>
    <Paragraph position="4"> The sounds are converted into katakana.</Paragraph>
    <Paragraph position="5"> Katakana is written.</Paragraph>
    <Paragraph position="6"> This divides our problem into five subproblems. Fortunately, there are techniques for coordinating solutions to such subproblems, and for using generative models in the reverse direction. These techniques rely on probabilities and Bayes' theorem. Suppose we build an English phrase generator that produces word sequences according to some probability distribution P(w). And suppose we build an English pronouncer that takes a word sequence and assigns it a set of pronunciations, again probabilistically, according to some P(plw). Given a pronunciation p, we may want to search for the word sequence w that maximizes P(wlp ). Bayes' theorem lets us equivalently maximize P(w) * P(plw), exactly the two distributions we have modeled.</Paragraph>
    <Paragraph position="7"> Extending this notion, we settled down to build five probability distributions:  Given a katakana string o observed by OCR, we want to find the English word sequence w that maximizes the sum, over all e, j, and k, of P(w). P(elw). P(jle). P(klj) * P(olk) Following Pereira and Riley (1997), we implement P(w) in a weighted finite-state acceptor (WFSA) and we implement the other distributions in weighted finite-state transducers (WFSTs). A WFSA is a state/transition diagram with weights and symbols on the transitions, making some output sequences more likely than others. A WFST is a WFSA with a pair of symbols on each transition, one input and one output. Inputs and outputs may include the empty symbol C/. Also following Pereira and Riley (1997), we have implemented a general composition algorithm for constructing an integrated model P(xlz) from models P(xly ) and P(y\[z), treating WFSAs as WFSTs with identical inputs and outputs. We use this to combine an observed katakana string with each  Knight and Graehl Machine Transliteration of the models in turn. The result is a large WFSA containing all possible English translations.</Paragraph>
    <Paragraph position="8"> We have implemented two algorithms for extracting the best translations. The first is Dijkstra's shortest-path graph algorithm (Dijkstra 1959). The second is a recently discovered k-shortest-paths algorithm (Eppstein 1994) that makes it possible for us to identify the top k translations in efficient O(m + n log n + kn) time, where the WFSA contains n states and m arcs.</Paragraph>
    <Paragraph position="9"> The approach is modular. We can test each engine independently and be confident that their results are combined correctly. We do no pruning, so the final WFSA contains every solution, however unlikely. The only approximation is the Viterbi one, which searches for the best path through a WFSA instead of the best sequence (i.e., the same sequence does not receive bonus points for appearing more than once).</Paragraph>
  </Section>
  <Section position="4" start_page="602" end_page="608" type="metho">
    <SectionTitle>
3. Probabilistic Models
</SectionTitle>
    <Paragraph position="0"> This section describes how we designed and built each of our five models. For consistency, we continue to print written English word sequences in italics (golf ball), English sound sequences in all capitals (G AA L F B A0 L), Japanese sound sequences in lower case (g o r u h u b o o r u)and katakana sequences naturally (~,,~7,~--)t,).</Paragraph>
    <Section position="1" start_page="602" end_page="602" type="sub_section">
      <SectionTitle>
3.1 Word Sequences
</SectionTitle>
      <Paragraph position="0"> The first model generates scored word sequences, the idea being that ice cream should score higher than ice creme, which should score higher than aice kreem. We adopted a simple unigram scoring method that multiplies the scores of the known words and phrases in a sequence. Our 262,000-entry frequency list draws its Words and phrases from the Wall Street Journal corpus, an on-line English name list, and an on-line gazetteer of place names, l A portion of the WFSA looks like this: los / 0.000087 federal / 0.001~ angeleP D month / 0.000992 An ideal word sequence model would look a bit different. It would prefer exactly those strings which are actually grist for Japanese transliterators. For example, people rarely transliterate auxiliary verbs, but surnames are often transliterated. We have approximated such a model by removing high-frequency words like has, an, are, am, were, their, and does, plus unlikely words corresponding to Japanese sound bites, like coup and oh.</Paragraph>
      <Paragraph position="1"> We also built a separate word sequence model containing only English first and last names. If we know (from context) that the transliterated phrase is a personal name, this model is more precise.</Paragraph>
    </Section>
    <Section position="2" start_page="602" end_page="603" type="sub_section">
      <SectionTitle>
3.2 Words to English Sounds
</SectionTitle>
      <Paragraph position="0"> The next WFST converts English word sequences into English sound sequences. We use the English phoneme inventory from the on-line CMU Pronunciation Dictio- null Computational Linguistics Volume 24, Number 4 nary, minus the stress marks. 2 This gives a total of 40 sounds, including 14 vowel sounds (e.g., AA, AE, UN), 25 consonant sounds (e.g., K, HH, R), plus one special symbol (PAUSE). The dictionary has pronunciations for 110,000 words, and we organized a tree-based WFST from it: E:E ff:z Note that we insert an optional PAUSE between word pronunciations. We originally thought to build a general letter-to-sound WFST (Divay and Vitale 1997), on the theory that while wrong (overgeneralized) pronunciations might occasionally be generated, Japanese transliterators also mispronounce words. However, our letter-to-sound WFST did not match the performance of Japanese transliterators, and it turns out that mispronunciations are modeled adequately in the next stage of the cascade.</Paragraph>
    </Section>
    <Section position="3" start_page="603" end_page="606" type="sub_section">
      <SectionTitle>
3.3 English Sounds to Japanese Sounds
</SectionTitle>
      <Paragraph position="0"> Next, we map English sound sequences onto Japanese sound sequences. This is an inherently information-losing process, as English R and L sounds collapse onto Japanese r, the 14 English vowel sounds collapse onto the 5 Japanese vowel sounds, etc. We face two immediate problems:  1. What is the target Japanese sound inventory? 2. How can we build a WFST to perform the sequence mapping?  An obvious target inventory is the Japanese syllabary itself, written down in katakana (e.g., = ) or a roman equivalent (e.g., ni). With this approach, the English sound K corresponds to one of ~ (ka), ~ (ki), ~ (ku), ~r (ke), or = (ko), depending on its context. Unfortunately, because katakana is a syllabary, we would be unable to express an obvious and useful generalization, namely that English K usually corresponds to Japanese k, independent of context. Moreover, the correspondence of Japanese katakana writing to Japanese sound sequences is not perfectly one-to-one (see Section 3.4), so an independent sound inventory is well-motivated in any case. Our Japanese sound inventory includes 39 symbols: 5 vowel sounds, 33 consonant sounds (including doubled consonants like kk), and one special symbol (pause). An English sound sequence like (P R 0W PAUSE S AA K ER) might map onto a Japanese sound sequence like (p u r o pause s a kk a a). Note that long Japanese vowel sounds  Knight and Graehl Machine Transliteration are written with two symbols (a a) instead of just one (aa). This scheme is attractive because Japanese sequences are almost always longer than English sequences. Our WFST is learned automatically from 8,000 pairs of English/Japanese sound sequences, e.g., ((S AA K ER) ~ (s a kk a a) ). We were able to produce these pairs by manipulating a small English-katakana glossary. For each glossary entry, we converted English words into English sounds using the model described in the previous section, and we converted katakana words into Japanese sounds using the model we describe in the next section. We then applied the estimation-maximization (EM) algorithm (Baum 1972; Dempster, Laird, and Rubin 1977) to generate symbol-mapping probabilities, shown in Figure 2. Our EM training goes like this: 1. For each English/Japanese sequence pair, compute all possible alignments between their elements. In our case, an alignment is a drawing that connects each English sound with one or more Japanese &amp;quot; sounds, such that all Japanese sounds are covered and no lines cross. For example, there are two ways to align the pair ((L 0W) &lt;-&gt; (r o o)): L OW L OW / /k \ r o o r o o In this case, the alignment on the left is intuitively preferable. The algorithm learns such preferences.</Paragraph>
      <Paragraph position="1">  2. For each pair, assign an equal weight to each of its alignments, such that those weights sum to 1. In the case above, each alignment gets a weight of 0.5.</Paragraph>
      <Paragraph position="2"> 3. For each of the 40 English sounds, count up instances of its different mappings, as observed in all alignments of all pairs. Each alignment contributes counts in proportion to its own weight.</Paragraph>
      <Paragraph position="3"> 4. For each of the 40 English sounds, normalize the scores of the Japanese sequences it maps to, so that the scores sum to 1. These are the symbol-mapping probabilities shown in Figure 2.</Paragraph>
      <Paragraph position="4"> 5. Recompute the alignment scores. Each alignment is scored with the product of the scores of the symbol mappings it contains. Figure 3 shows sample alignments found automatically through EM training.</Paragraph>
      <Paragraph position="5"> 6. Normalize the alignment scores. Scores for each pair's alignments should sum to 1.</Paragraph>
      <Paragraph position="6"> 7. Repeat 3--6 until the symbol-mapping probabilities converge.</Paragraph>
      <Paragraph position="7">  We then build a WFST directly from the symbol-mapping probabilities:  lower case), as learned by estimation-maximization. Only mappings with conditional probabilities greater than 1% are shown, so the figures may not sum to 1. We have also built models that allow individual English sounds to be &amp;quot;swallowed&amp;quot; (i.e., produce zero Japanese sounds). However, these models are expensive to compute (many more alignments) and lead to a vast number of hypotheses during WFST composition. Furthermore, in disallowing &amp;quot;swallowing,&amp;quot; we were able to automatically remove hundreds of potentially harmful pairs from our training set, e.g., ((B AA R B ER SH AA P) ~ (b a a b a a)). Because no alignments are possible, such pairs are skipped by the learning algorithm; cases like these must be solved by dictionary  Alignments between English and Japanese sound sequences, as determined by EM training. Best alignments are shown for the English words biscuit, divider, and filter. lookup anyway. Only two pairs failed to align when we wished they had--both involved turning English Y UW into Japanese u, as in ((Y uw K AH L EY L IY) ~ (u k ur e r e)).</Paragraph>
      <Paragraph position="8"> Note also that our model translates each English sound without regard to context. We have also built context-based models, using decision trees recoded as WFSTs. For example, at the end of a word, English T is likely to come out as (t o) rather than (t). However, context-based models proved unnecessary for back-transliteration. They are more useful for English-to-Japanese forward transliteration.</Paragraph>
    </Section>
    <Section position="4" start_page="606" end_page="607" type="sub_section">
      <SectionTitle>
3.4 Japanese Sounds to Katakana
</SectionTitle>
      <Paragraph position="0"> To map Japanese sound sequences like (m o o t a a) onto katakana sequences like (~-- ~- ), we manually constructed two WFSTs. Composed together, they yield an integrated WFST with 53 states and 303 arcs, producing a katakana inventory containing 81 symbols, including the dot-separator (-). The first WFST simply merges long Japanese vowel sounds into new symbols aa, ii, uu, ee, and oo. The second WFST maps Japanese sounds onto katakana symbols. The basic idea is to consume a whole syllable worth of sounds before producing any katakana. For example: o: pause: * / 0. 7  Computational Linguistics Volume 24, Number 4 This fragment shows one kind of spelling variation in Japanese: long vowel sounds (oo) are usually written with a long vowel mark (~-) but are sometimes written with repeated katakana ( ~- ). We combined corpus analysis with guidelines from a Japanese textbook (Jorden and Chaplin 1976) to turn up many spelling variations and unusual katakana symbols:  and so on.</Paragraph>
      <Paragraph position="1"> Spelling variation is clearest in cases where an English word like switch shows up transliterated variously ( :z 4 7 ~-, :z 4 ~, ~, :z ~ 4 ~ ~ ) in different dictionaries. Treating these variations as an equivalence class enables us to learn general sound mappings even if our bilingual glossary adheres to a single narrow spelling convention. We do not, however, generate all katakana sequences with this model; for example, we do not output strings that begin with a subscripted vowel katakana. So this model also serves to ter out some ill-formed katakana sequences, possibly proposed by optical character recognition.</Paragraph>
    </Section>
    <Section position="5" start_page="607" end_page="608" type="sub_section">
      <SectionTitle>
3.5 Katakana to OCR
</SectionTitle>
      <Paragraph position="0"> Perhaps uncharitably, we can view optical character recognition (OCR) as a device that garbles perfectly good katakana sequences. Typical confusions made by our commercial OCR system include '~ for ~-', ~ for ~, T for 7, and 7 for 7&amp;quot;. To generate pre-OCR text, we collected 19,500 characters worth of katakana words, stored them in a e, and printed them out. To generate post-OCR text, we OCR'd the printouts. We then ran the EM algorithm to determine symbol-mapping (&amp;quot;garbling&amp;quot;) probabilities.</Paragraph>
      <Paragraph position="1"> Here is part of that table:  Knight and Graehl Machine Transliteration This model outputs a superset of the 81 katakana symbols, including spurious quote marks, alphabetic symbols, and the numeral 7. 3</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="608" end_page="608" type="metho">
    <SectionTitle>
4. A Sample Back-transliteration
</SectionTitle>
    <Paragraph position="0"> We can now use the models to do a sample back-transliteration. We start with a katakana phrase as observed by OCR. We then serially compose it with the models, in reverse order. Each intermediate stage is a WFSA that encodes many possibilities. The final stage contains all back-transliterations suggested by the models, and we finally extract the best one.</Paragraph>
    <Paragraph position="1"> We start with the masutaazutoonamento problem from Section 1. Our OCR observes: null vx~-x'F-T} w F This string has two recognition errors: ~ (ku) for 9 (ta), and ~ (chi) for 9- (na). We turn the string into a chained 12-state/11-arc WFSA and compose it with the P(klo ) model. This yields a fatter 12-state/15-arc WFSA, which accepts the correct spelling at a lower probability. Next comes the P(jlk) model, which produces a 28-state/31-arc WFSA whose highest-scoring sequence is: masutaazut oochiment o Next comes P(elj ), yielding a 62-state/241-arc WFSA whose best sequence is: M AE S T AE AE DH UH T A0 A0 CH IH M EH N T A0 Next to last comes P(wle ), which results in a 2982-state/4601-arc WFSA whose best sequence (out of roughly three hundred million) is: masters tone am ent awe This English string is closest phonetically to the Japanese, but we are willing to trade phonetic proximity for more sensical English; we rescore this WFSA by composing it with P(w) and extract the best translation: masters tournament Other Section I examples (aasudee and robaato shyoon renaado) are translated correctly as earth day and robert sean leonard.</Paragraph>
    <Paragraph position="2"> We may also be interested in the k best translations. In fact, after any composition, we can inspect several high-scoring sequences using the algorithm of Eppstein (1994). Given the following katakana input phrase:</Paragraph>
  </Section>
  <Section position="6" start_page="608" end_page="609" type="metho">
    <SectionTitle>
Footnote 3
</SectionTitle>
    <Paragraph position="0"> A more thorough OCR model would train on a wide variety of fonts and photocopy distortions. In practice, such degradations can easily overwhelm even the better OCR systems.</Paragraph>
    <Paragraph position="1">  Inspecting the k-best list is useful for diagnosing problems with the models. If the right answer appears low in the list, then some numbers are probably off somewhere. If the right answer does not appear at all, then one of the models may be missing a word or suffer from some kind of brittleness. A k-best list can also be used as input to a later context-based disambiguator, or as an aid to a human translator.</Paragraph>
  </Section>
class="xml-element"></Paper>