
<?xml version="1.0" standalone="yes"?>
<Paper uid="P96-1018">
  <Title>High-Performance Bilingual Text Alignment Using Statistical and Dictionary Information</Title>
  <Section position="3" start_page="0" end_page="132" type="metho">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Corpus-based approaches based on bilingual texts are promising for various applications(i.e., lexical knowledge extraction (Kupiec, 1993; Matsumoto et al., 1993; Smadja et al., 1996; Dagan and Church, 1994; Kumano and Hirakawa, 1994; Haruno et al., 1996), machine translation (Brown and others, 1993; Sato and Nagao, 1990; Kaji et al., 1992) and information retrieval (Sato, 1992)). Most of these works assume voluminous aligned corpora.</Paragraph>
    <Paragraph position="1"> Many methods have been proposed to align bilingual corpora. One of the major approaches is based on the statistics of simple features such as sentence length in words (Brown and others, 1991) or in characters (Gale and Church, 1993). These techniques are widely used because they can be implemented in an efficient and simple way through dynamic programing. However, their main targets are rigid translations that are almost literal translations. In addition, the texts being aligned were structurally similar European languages (i.e., English-French, English-German).</Paragraph>
    <Paragraph position="2"> The simple-feature based approaches don't work in flexible translations for structurally different languages such as Japanese and English, mainly for the following two reasons. One is the difference in the character types of the two languages. Japanese has three types of characters (Hiragana, Katakana, and Kanji), each of which has different amounts of information. In contrast, English has only one type of characters. The other is the grammatical and rhetorical difference of the two languages. First, the systems of functional (closed) words are quite different from language to language. Japanese has a quite different system of closed words, which greatly influence the length of simple features. Second, due to rhetorical difference, the number of multiple match (i.e., 1-2, 1-3, 2-1 and so on) is more than that among European languages. Thus, it is impossible in general to apply the simple-feature based methods to Japanese-English translations.</Paragraph>
    <Paragraph position="3"> One alternative alignment method is the lexicon-based approach that makes use of the word-correspondence knowledge of the two languages.</Paragraph>
    <Paragraph position="4"> (Church, 1993) employed n-grams shared by two languages. His method is also effective for Japanese-English computer manuals both containing lots of the same alphabetic technical terms. However, the method cannot be applied to general translations in structurally different languages. (Kay and Roscheisen, 1993) proposed a relaxation method to iteratively align bilingual texts using the word correspondences acquired during the alignment process. Although the method works well among European languages, the method does not work in aligning structurally different languages. In Japanese-English translations, the method does not capture enough word correspondences to permit alignment.</Paragraph>
    <Paragraph position="5"> As a result, it can align only some of the two texts.</Paragraph>
    <Paragraph position="6"> This is mainly because the syntax and rhetoric are  greatly differ in the two languages even in literal translations. The number of confident word correspondences of words is not enough for complete alignment. Thus, the problem cannot be addressed as long as the method relies only on statistics. Other methods in the lexicon-based approach embed lexical knowledge into stochastic models (Wu, 1994; Chen, 1993), but these methods were tested using rigid translations.</Paragraph>
    <Paragraph position="7"> To tackle the problem, we describe in this paper a text alignment system that uses both statistics and bilingual dictionaries at the same time. Bilingual dictionaries are now widely available on-line due to advances in CD-ROM technologies. For example, English-Spanish, English-French, English-German, English-Japanese, Japanese-French, Japanese-Chinese and other dictionaries are now commercially available. It is reasonable to make use of these dictionaries in bilingual text alignment. The pros and cons of statistics and online dictionaries are discussed below. They show that statistics and on-line dictionaries are complementary in terms of bilingual text alignment.</Paragraph>
    <Paragraph position="8"> Statistics Merit Statistics is robust in the sense that it can extract context-dependent usage of words and that it works well even if word segmentation 1 is not correct.</Paragraph>
    <Paragraph position="9"> Statistics Demerit The amount of word correspondences acquired by statistics is not enough for complete alignment.</Paragraph>
    <Paragraph position="10"> Dictionaries Merit They can contain the information about words that appear only once in the corpus.</Paragraph>
    <Paragraph position="11"> Dictionaries Demerit They cannot capture context-dependent keywords in the corpus and are weak against incorrect word segmentation.</Paragraph>
    <Paragraph position="12"> Entries in the dictionaries differ from author to author and are not always the same as those in the corpus.</Paragraph>
    <Paragraph position="13"> Our system iteratively aligns sentences by using statistical and on-line dictionary word correspondences. The characteristics of the system are as follows. null * The system performs well and is robust for various lengths (especially short) and various genres of texts.</Paragraph>
    <Paragraph position="14"> * The system is very economical because it assumes only online-dictionaries of general use and doesn't require the labor-intensive construction of domain-specific dictionaries.</Paragraph>
    <Paragraph position="15"> * The system is extendable by registering statistically acquired word correspondences into user dictionaries.</Paragraph>
    <Paragraph position="16"> 1In Japanese, there are no explicit delimiters between words. The first task for alignment is , therefore, to divide the text stream into words.</Paragraph>
    <Paragraph position="17"> We will treat hereafter Japanese-English translations although the proposed method is language independent. null The construction of the paper is as follows. First, Section 2 offers an overview of our alignment system. Section 3 describes the entire alignment algorithm in detail. Section 4 reports experimental results for various kinds of Japanese-English texts including newspaper editorials, scientific papers and critiques on economics. The evaluation is performed from two points of view: precision-recall of alignment and word correspondences acquired during alignment.</Paragraph>
    <Paragraph position="18"> Section 5 concerns related works and Section 6 concludes the paper.</Paragraph>
    <Paragraph position="19">  input to the system is a pair of Japanese and English texts, one the translation of the other. First, sentence boundaries are found in both texts using finite state transducers. The texts are then part-of-speech (POS) tagged and separated into original form words z. Original forms of English words are determined by 80 rules using the POS information. From the word sequences, we extract only nouns, adjectives, adverbs verbs and unknown words (only in Japanese) because Japanese and English closed words are different and impede text alignment. These pre-processing operation can be easily implemented with regular expressions.</Paragraph>
    <Paragraph position="20"> 2We use in this phase the JUMAN morphological analyzing system (Kurohashi et al., 1994) for tagging Japanese texts and Brill's transformation-based tagger (Brill, 1992; Brill, 1994) for tagging English texts (JU-MAN: ftp://ftp.aist-nara.ac.jp/pub/nlp/tools/juman/ Brih ftp://ftp.cs.jhu.edu/pub/brill). We would like to thank all people concerned for providing us with the tools.</Paragraph>
    <Paragraph position="21">  The initial state of the algorithm is a set of already known anchors (sentence pairs). These are determined by article boundaries, section boundaries and paragraph boundaries. In the most general case, initial anchors are only the first and final sentence pairs of both texts as depicted in Figure 2. Possible sentence correspondences are determined from the anchors. Intuitively, the number of possible correspondences for a sentence is small near anchors, while large between the anchors. In this phase, the most important point is that each set of possible sentence correspondences should include the correct correspondence.</Paragraph>
    <Paragraph position="22"> The main task of the system is to find anchors from the possible sentence correspondences by using two kinds of word correspondences: statistical word correspondences and word correspondences as held in a bilingual dictionary 3. By using both correspondences, the sentence pair whose correspondences exceeds a pre-defined threshold is judged as an anchor. These newly found anchors make word correspondences more precise in the subsequent session. By repeating this anchor setting process with threshold reduction, sentence correspondences are gradually determined from confident pairs to nonconfident pairs. The gradualism of the algorithm makes it robust because anchor-setting errors in the last stage of the algorithm have little effect on over-all performance. The output of the algorithm is the alignment result (a sequence of anchors) and word correspondences as by-products.</Paragraph>
  </Section>
  <Section position="4" start_page="132" end_page="134" type="metho">
    <SectionTitle>
3 Algorithms
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="132" end_page="133" type="sub_section">
      <SectionTitle>
3.1 Statistics Used
</SectionTitle>
      <Paragraph position="0"> In this section, we describe the statistics used to decide word correspondences. From many similarity metrics applicable to the task, we choose mutual information and t-score because the relaxation of parameters can be controlled in a sophisticated manner. Mutual information represents the similarity on the occurrence distribution and t-score represents the confidence of the similarity. These two parameters permit more effective relaxation than the single parameter used in conventional methods(Kay and Roscheisen, 1993).</Paragraph>
      <Paragraph position="1"> Our basic data structure is the alignable sentence matrix (ASM) and the anchor matrix (AM).</Paragraph>
      <Paragraph position="2"> ASM represents possible sentence correspondences and consists of ones and zeros. A one in ASM indicates the intersection of the column and row constitutes a possible sentence correspondence. On the contrary, AM is introduced to represent how a sentence pair is supported by word correspondences.</Paragraph>
      <Paragraph position="3"> The i-j Element of AM indicates how many times the corresponding words appear in the i-j sentence pair. As alignment proceeds, the number of ones in ASM reduces, while the elements of AM increase.</Paragraph>
      <Paragraph position="4"> Let pi be a sentence set comprising the ith Japanese sentence and its possible English correspondences as depicted in Figure 3. For example, P2 is the set comprising Jsentence2, Esentence2 and Esentencej, which means Jsentence2 has the possibility of aligning with Esentence2 or Esentencej.</Paragraph>
      <Paragraph position="5"> The pis can be directly derived from ASM.</Paragraph>
      <Paragraph position="6">  We introduce the contingency matrix (Fung and Church, 1994) to evaluate the similarity of word occurrences. Consider the contingency matrix shown Table 1, between Japanese word wjp n and English word Weng. The contingency matrix shows: (a) the number of pis in which both wjp, and w~ng were found, (b) the number of pis in which just w~.g was found, (c) the number of pis in which just wjp, was  found, (d) the number of pis in which neither word was found. Note here that pis overlap each other and w~,~ 9 may be double counted in the contingency matrix. We count each w~,,~ only once, even if it occurs more than twice in pls.</Paragraph>
      <Paragraph position="7">  If Wjpn and weng are good translations of one another, a should be large, and b and c should be small. In contrast, if the two are not good translations of each other, a should be small, and b and c should be large. To make this argument more precise, we introduce mutual information: log prob(wjpn, Weng) prob( w p. )prob( won9 ) The probabilities are: a+c a+c prob(wjpn) - a T b + c W d - Y a+b a+b pr ob( w eng ) a+b+c+d - M a a prob( wjpn , Weng ) -- a+b+c+d- M Unfortunately, mutual information is not reliable when the number of occurrences is small. Many words occur just once which weakens the statistics approach. In order to avoid this, we employ t-score, defined below, where M is the number of Japanese sentences. Insignificant mutual information values are filtered out by thresholding t-score. For example, t-scores above 1.65 are significant at the p &gt; 0.95 confidence level.</Paragraph>
      <Paragraph position="9"/>
    </Section>
    <Section position="2" start_page="133" end_page="134" type="sub_section">
      <SectionTitle>
3.2 Basic Alignment Algorithm
</SectionTitle>
      <Paragraph position="0"> Our basic algorithm is an iterative adjustment of the Anchor Matrix (AM) using the Alignable Sentence Matrix (ASM). Given an ASM, mutual information and t-score are computed for all word pairs in possible sentence correspondences. A word combination exceeding a predefined threshold is judged as a word correspondence. In order to find new anchors, we combine these statistical word correspondences with the word correspondences in a bilingual dictionary.</Paragraph>
      <Paragraph position="1"> Each element of AM, which represents a sentence pair, is updated by adding the number of word correspondences in the sentence pair. A sentence pair containing more than a predefined number of corresponding words is determined to be a new anchor.</Paragraph>
      <Paragraph position="2"> The detailed algorithm is as follows.</Paragraph>
      <Paragraph position="3">  This step constructs the initial ASM. If the texts contain M and N sentences respectively, the ASM is an M x N matrix. First, we decide a set of anchors using article boundaries, section boundaries and so on. In the most general case, initial anchors are the first and last sentences of both texts as depicted in Figure 2. Next, possible sentence correspondences are generated. Intuitively, true correspondences are close to the diagonal linking the two anchors. We construct the initial ASM using such a function that pairs sentences near the middle of the two anchors with as many as O(~/~) (L is the number of sentences existing between two anchors) sentences in the other text because the maximum deviation can be stochastically modeled as O(~rL) (Kay and Roscheisen, 1993). The initial ASM has little effect on the alignment performance so long as it contains all correct sentence correspondences.</Paragraph>
      <Paragraph position="4">  This step constructs an AM when given an ASM and a bilingual dictionary. Let thigh, tlow, Ihigh and Izow be two thresholds for t-score and two thresholds for mutual information, respectively. Let ANC be the minimal number of corresponding words for a sentence pair to be judged as an anchor.</Paragraph>
      <Paragraph position="5"> First, mutual information and t-score are computed for all word pairs appearing in a possible sentence correspondence in ASM. We use hereafter the word correspondences whose mutual information exceeds Itow and whose t-score exceeds ttow. For all possible sentence correspondences Jsentencei and Esentencej (any pair in ASM), the following operations are applied in order.</Paragraph>
      <Paragraph position="6">  1. If the following three conditions hold, add 3 to the i-j element of AM. (1) Jsentencei and Esentencej contain a bilingual dictionary word correspondence (wjpn and w,ng). (2) w~na does not occur in any other English sentence that is a possible translation of Jsentencei. (3) Jsentencei and Esentencej do not cross any sentence pair that has more than ANC word correspondences.</Paragraph>
      <Paragraph position="7"> 2. If the following three conditions hold, add 3 to the i-j element of AM. (1) Jsentencei and Esentencej contain a stochastic word correspondence (wjpn and w~na) that has mutual information Ihig h and whose t-score exceeds thigh. (2) w~g does not occur in any other English sentence that is a possible translation of Jsentencei. (3) Jsentencei and Esentencej do not cross any sentence pair that has more than ANC word correspondences.</Paragraph>
      <Paragraph position="8"> 3. If the following three conditions hold, add 1  to the i-j element of AM. (1) Jsentencei and Esentencej contain a stochastic word correspondence (wjp~ and we~g) that has mutual  information Itoto and whose t-score exceeds ttow. (2) w~na does not occur in any other English sentence that is a possible translation of Jsentencei. (3) Jsentencei and Esentencej does not cross any sentence pair that has more than ANC word correspondences.</Paragraph>
      <Paragraph position="9"> The first operation deals with word correspondences in the bilingual dictionary. The second operation deals with stochastic word correspondences which are highly confident and in many cases involve domain specific keywords. These word correspondences are given the value of 3. The third operation is introduced because the number of highly confident corresponding words are too small to align all sentences. Although word correspondences acquired by this step are sometimes false translations of each other, they play a crucial role mainly in the final iterations phase. They are given one point.</Paragraph>
      <Paragraph position="10">  This step adjusts ASM using the AM constructed by the above operations. The sentence pairs that have at least ANC word correspondences are determined to be new anchors. By using the new set of anchors, a new ASM is constructed using the same method as used for initial ASM construction.</Paragraph>
      <Paragraph position="11"> Our algorithm implements a kind of relaxation by gradually reducing flow, Izow and ANC, which enables us to find confident sentence correspondences first. As a result, our method is more robust than dynamic programing techniques against the shortage of word-correspondence knowledge.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="134" end_page="135" type="metho">
    <SectionTitle>
4 Experimental Results
</SectionTitle>
    <Paragraph position="0"> In this section, we report the result of experiments on aligning sentences in bilingual texts and on statistically acquired word correspondences. The texts for the experiment varied in length and genres as summarized in Table 2. Texts 1 and 2 are editorials taken from 'Yomiuri Shinbun' and its English version 'Daily Yomiuri'. This data was distributed electrically via a WWW server 4. The first two texts clarify the systems's performance on shorter texts. Text  data.</Paragraph>
    <Paragraph position="1"> ~We obtained the data from paper version of the magazine by using OCR. We would like to thank Nikkei Science Co. for permitting us to use the data.</Paragraph>
    <Paragraph position="2"> categories of matches by manual alignment and indicate the difficulty of the task.</Paragraph>
    <Paragraph position="3"> Our evaluation focuses on much smaller texts than those used in other study(Brown and others, 1993; Gale and Church, 1993; Wu, 1994; Fung, 1995; Kay and Roscheisen, 1993) because our main targets are well-separated articles. However, our method will work on larger and noisy sets too, by using word anchors rather than using sentence boundaries as segment boundaries. In such a case, the method constructing initial ASM needs to be modified.</Paragraph>
    <Paragraph position="4"> We briefly report here the computation time of our method. Let us consider Text 4 as an example. After 15 seconds for full preprocessing, the first iteration took 25 seconds with tto~ = 1.55 and Izow = 1.8. The rest of the algorithm took 20 seconds in all. This experiment was performed on a SPARC Station 20 Model tIS21. From the result, we may safely say that our method can be applied to voluminous corpora.</Paragraph>
    <Section position="1" start_page="134" end_page="135" type="sub_section">
      <SectionTitle>
4.1 Sentence Alignment
</SectionTitle>
      <Paragraph position="0"> Table 3 shows the performance on sentence alignments for the texts in Table 2. Combined, Statistics and Dictionary represent the methods using both statistics and dictionary, only statistics and only dictionary, respectively. Both Combined and Dictionary use a CD-ROM version of a Japanese-English dictionary containing 40 thousands entries. Statistics repeats the iteration by using statistical corresponding words only. This is identical to Kay's method (Kay and Roscheisen, 1993) except for the statistics used. Dictionary performs the iteration of the algorithm by using corresponding words of the bilingual dictionary. This delineates the coverage of the dictionary. The parameter setting used for each method was the optimum as determined by empirical tests.</Paragraph>
      <Paragraph position="1"> In Table 3, PRECISION delineates how many of the aligned pairs are correct and RECALL delineates how many of the manual alignments we included in systems output. Unlike conventional sentencechunk based evaluations, our result is measured on the sentence-sentence basis. Let us consider a 3-1 matching. Although conventional evaluations can make only one error from the chunk, three errors may arise by our evaluation. Note that our evaluation is more strict than the conventional one, especially for difficult texts, because they contain more complex matches.</Paragraph>
      <Paragraph position="2"> For Text 1 and Text 2, both the combined method and the dictionary method perform much better than the statistical method. This is obviously because statistics cannot capture wordcorrespondences in the case of short texts.</Paragraph>
      <Paragraph position="3"> Text 3 is easy to align in terms of both the complexity of the alignment and the vocabularies used. All methods performed well on this text.</Paragraph>
      <Paragraph position="4"> For Text 4, Combined and Statistics perform  much better than Dictionary. The reason for this is that Text 4 concerns brain science and the bilingual dictionaries of general use did not contain domain specific keywords. On the other hand, the combined and statistical methods well capture the keywords as described in the next section. Note here that Combined performs better than Statistics in the case of longer texts, too. There is clearly a limitation in the amount of word correspondences that can be captured by statistics. In summary, the performance of Combined is better than either Statistics or Dictionary for all texts, regardless of text length and the domain.</Paragraph>
      <Paragraph position="5"> correspondences were not used.</Paragraph>
      <Paragraph position="6"> Although these word correspondences are very effective for sentence alignment task, they are unsatisfactory when regarded as a bilingual dictionary. For example, ' 7 7 Y ~' ~ ~ ~n.MR I ' in Japanese is the translation of 'functional MRI'. In Table 4, the correspondence of these compound nouns was captured only in their constituent level. (Haruno et al., 1996) proposes an efficient n-gram based method to extract bilingual collocations from sentence aligned bilingual corpora.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="135" end_page="136" type="metho">
    <SectionTitle>
5 Related Work
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="135" end_page="136" type="sub_section">
      <SectionTitle>
4.2 Word Correspondence
</SectionTitle>
      <Paragraph position="0"> In this section, we will demonstrate how well the proposed method captured domain specific word correspondences by using Text 4 as an example. Table 4 shows the word correspondences that have high mutual information. These are typical keywords concerning the non-invasive approach to human brain analysis. For example, NMR, MEG, PET, CT, MRI and functional MRI are devices for measuring brain activity from outside the head. These technical terms are the subjects of the text and are essential for alignment. However, none of them have their own entry in the bilingual dictionary, which would strongly obstruct the dictionary method.</Paragraph>
      <Paragraph position="1"> It is interesting to note that the correct Japanese translation of 'MEG' is ' ~{i~i~\]'. The Japanese morphological analyzer we used does not contain an entry for ' ~i~i\[~' and split it into a sequence of three characters ' ~',' ~' and ' \[\]'. Our system skillfully combined ' ~i' and ' \[\]' with 'MEG', as a result of statistical acquisition. These word correspondences greatly improved the performance for Text 4. Thus, the statistical method well captures the domain specific keywords that are not included in general-use bilingual dictionaries. The dictionary method would yield false alignments if statistically acquired word Sentence alignment between Japanese and English was first explored by Sato and Murao (Murao, 1991).</Paragraph>
      <Paragraph position="2"> They found (character or word) length-based approaches were not appropriate due to the structural difference of the two languages. They devised a dynamic programming method based on the number of corresponding words in a hand-crafted bilingual dictionary. Although some results were promising, the method's performance strongly depended on the domain of the texts and the dictionary entries.</Paragraph>
      <Paragraph position="3"> (Utsuro et al., 1994) introduced a statistical post-processing step to tackle the problem. He first applied Sato's method and extracted statistical word correspondences from the result of the first path.</Paragraph>
      <Paragraph position="4"> Sato's method was then reiterated using both the acquired word correspondences and the hand-crafted dictionary. His method involves the following two problems. First, unless the hand-crafted dictionary contains domain specific key words, the first path yields false alignment, which in turn leads to false statistical correspondences. Because it is impossible in general to cover key words in all domains, it is inevitable that statistics and hand-crafted bilingual dictionaries must be used at the same time.</Paragraph>
      <Paragraph position="5">  The proposed method involves iterative alignment which simultaneously uses both statistics and a bilingual dictionary.</Paragraph>
      <Paragraph position="6"> Second, their score function is not reliable especially when the number of corresponding words contained in corresponding sentences is small. Their method selects a matching type (such as 1-1, 1-2 and 2-1) according to the number of word correspondences per contents word. However, in many cases, there are a few word translations in a set of corresponding sentences. Thus, it is essential to decide sentence alignment on the sentence-sentence basis.</Paragraph>
      <Paragraph position="7"> Our iterative approach decides sentence alignment level by level by counting the word correspondences between a Japanese and an English sentence.</Paragraph>
      <Paragraph position="8"> (Fung and Church, 1994; Fung, 1995) proposed methods to find Chinese-English word correspondences without aligning parallel texts. Their motivation is that structurally different languages such as Chinese-English and Japanese-English are difficult to align in general. Their methods bypassed aligning sentences and directly acquired word correspondences. Although their approaches are robust for noisy corpora and do not require any information source, aligned sentences are necessary for higher level applications such as well-grained translation template acquisition (Matsumoto et as., 1993; Smadja et al., 1996; Haruno et al., 1996) and example-based translation (Sato and Nagao, 1990). Our method performs accurate alignment for such use by combining the detailed word correspondences: statistically acquired word correspondences and those from a bilingual dictionary of general use.</Paragraph>
      <Paragraph position="9"> (Church, 1993) proposed char_align that makes use of n-grams shared by two languages. This kind of matching techniques will be helpful in our dictionary-based approach in the following situation: Entries of a bilingual dictionary do not completely match the word in the corpus but partially do. By using the matching technique, we can make the most of the information compiled in bilingual dictionaries.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>