<?xml version="1.0" standalone="yes"?> <Paper uid="J93-1004"> <Title>A Program for Aligning Sentences in Bilingual Corpora</Title> <Section position="1" start_page="0" end_page="0" type="metho"> <SectionTitle> William A. Gale AT&T Bell Laboratories Kenneth W. Church* AT&T Bell Laboratories </SectionTitle>
<Paragraph position="0"> Researchers in both machine translation (e.g., Brown et al. 1990) and bilingual lexicography (e.g., Klavans and Tzoukermann 1990) have recently become interested in studying bilingual corpora, bodies of text such as the Canadian Hansards (parliamentary proceedings), which are available in multiple languages (such as French and English). One useful step is to align the sentences, that is, to identify correspondences between sentences in one language and sentences in the other language.</Paragraph>
<Paragraph position="1"> This paper will describe a method and a program (align) for aligning sentences based on a simple statistical model of character lengths. The program uses the fact that longer sentences in one language tend to be translated into longer sentences in the other language, and that shorter sentences tend to be translated into shorter sentences. A probabilistic score is assigned to each proposed correspondence of sentences, based on the scaled difference of lengths of the two sentences (in characters) and the variance of this difference. This probabilistic score is used in a dynamic programming framework to find the maximum likelihood alignment of sentences.</Paragraph>
<Paragraph position="2"> It is remarkable that such a simple approach works as well as it does. An evaluation was performed based on a trilingual corpus of economic reports issued by the Union Bank of Switzerland (UBS) in English, French, and German. The method correctly aligned all but 4% of the sentences. Moreover, it is possible to extract a large subcorpus that has a much smaller error rate. By selecting the best-scoring 80% of the alignments, the error rate is reduced from 4% to 0.7%. There were more errors on the English-French subcorpus than on the English-German subcorpus, showing that error rates will depend on the corpus considered; however, both were small enough to hope that the method will be useful for many language pairs.</Paragraph>
<Paragraph position="3"> To further research on bilingual corpora, a much larger sample of Canadian Hansards (approximately 90 million words, half in English and half in French) has been aligned with the align program and will be available through the Data Collection Initiative of the Association for Computational Linguistics (ACL/DCI). In addition, in order to facilitate replication of the align program, an appendix is provided with detailed C code of the more difficult core of the align program.</Paragraph> </Section>
<Section position="2" start_page="0" end_page="78" type="metho"> <SectionTitle> 1. Introduction </SectionTitle>
<Paragraph position="0"> Researchers in both machine translation (e.g., Brown et al.
1990) and bilingual lexicography (e.g., Klavans and Tzoukermann 1990) have recently become interested in studying bilingual corpora, bodies of text such as the Canadian Hansards (parliamentary debates), which are available in multiple languages (such as French and English).</Paragraph>
<Paragraph position="1"> Table 1 Input to alignment program.</Paragraph>
<Paragraph position="2"> English: According to our survey, 1988 sales of mineral water and soft drinks were much higher than in 1987, reflecting the growing popularity of these products. Cola drink manufacturers in particular achieved above-average growth rates. The higher turnover was largely due to an increase in the sales volume. Employment and investment levels also climbed. Following a two-year transitional period, the new Foodstuffs Ordinance for Mineral Water came into effect on April 1, 1988. Specifically, it contains more stringent requirements regarding quality consistency and purity guarantees.</Paragraph>
<Paragraph position="3"> French: Quant aux eaux minérales et aux limonades, elles rencontrent toujours plus d'adeptes. En effet, notre sondage fait ressortir des ventes nettement supérieures à celles de 1987, pour les boissons à base de cola notamment. La progression des chiffres d'affaires résulte en grande partie de l'accroissement du volume des ventes. L'emploi et les investissements ont également augmenté. La nouvelle ordonnance fédérale sur les denrées alimentaires concernant entre autres les eaux minérales, entrée en vigueur le 1er avril 1988 après une période transitoire de deux ans, exige surtout une plus grande constance dans la qualité et une garantie de la pureté.</Paragraph>
<Paragraph position="4"> The sentence alignment task is to identify correspondences between sentences in one language and sentences in the other language. This task is a first step toward the more ambitious task of finding correspondences among words. The input is a pair of texts such as Table 1. The output identifies the alignment between sentences. Most English sentences match exactly one French sentence, but it is possible for an English sentence to match two or more French sentences. The first two English sentences in Table 2 illustrate a particularly hard case where two English sentences align to two French sentences. No smaller alignments are possible because the clause "... sales ... were higher ..." in the first English sentence corresponds to (part of) the second French sentence. The next two alignments below illustrate the more typical case where one English sentence aligns with exactly one French sentence. The final alignment matches two English sentences to a single French sentence. These alignments agreed with the results produced by a human judge. Aligning sentences is just a first step toward constructing a probabilistic dictionary (Table 3) for use in aligning words in machine translation (Brown et al.
1990), or for constructing a bilingual concordance (Table 4) for use in lexicography (Klavans and Tzoukermann 1990).</Paragraph>
<Paragraph position="5"> Although there has been some previous work on sentence alignment (e.g., Brown, Lai, and Mercer 1991 [at IBM]; Kay and Röscheisen [this issue; at Xerox]; and Catizone, Russell, and Warwick, in press [at ISSCO]), the alignment task remains a significant obstacle preventing many potential users from reaping many of the benefits of bilingual corpora, because the proposed solutions are often unavailable, unreliable, and/or computationally prohibitive.</Paragraph>
<Section position="1" start_page="76" end_page="77" type="sub_section"> <SectionTitle> Table 2 Output from the alignment program. </SectionTitle>
<Paragraph position="1"> English: According to our survey, 1988 sales of mineral water and soft drinks were much higher than in 1987, reflecting the growing popularity of these products. Cola drink manufacturers in particular achieved above-average growth rates. French: Quant aux eaux minérales et aux limonades, elles rencontrent toujours plus d'adeptes. En effet, notre sondage fait ressortir des ventes nettement supérieures à celles de 1987, pour les boissons à base de cola notamment.</Paragraph>
<Paragraph position="2"> English: The higher turnover was largely due to an increase in the sales volume. French: La progression des chiffres d'affaires résulte en grande partie de l'accroissement du volume des ventes.</Paragraph>
<Paragraph position="3"> English: Employment and investment levels also climbed. French: L'emploi et les investissements ont également augmenté.</Paragraph>
<Paragraph position="4"> English: Following a two-year transitional period, the new Foodstuffs Ordinance for Mineral Water came into effect on April 1, 1988. Specifically, it contains more stringent requirements regarding quality consistency and purity guarantees. French: La nouvelle ordonnance fédérale sur les denrées alimentaires concernant entre autres les eaux minérales, entrée en vigueur le 1er avril 1988 après une période transitoire de deux ans, exige surtout une plus grande constance dans la qualité et une garantie de la pureté.</Paragraph> </Section>
<Paragraph position="6"> Table 3 An entry in a probabilistic dictionary (from Brown et al. 1990).</Paragraph>
<Paragraph position="7"> Most of the previous work on sentence alignment has yet to be published. Kay's draft (Kay and Röscheisen, this issue), for example, was written more than two years ago and is still unpublished. Similarly, the IBM work is also several years old, but not very well documented in the published literature; consequently, there has been a lot of unnecessary subsequent work at ISSCO and elsewhere. 2 The method we describe has the same sentence-length basis as does that of Brown, Lai, and Mercer, while the two differ considerably from the lexical approaches tried by Kay and Röscheisen and by Catizone, Russell, and Warwick. The feasibility of other methods has varied greatly. Kay's approach is apparently quite slow. At least, with the currently inefficient implementation, it might take hours
to align a single Scientific American article (Kay, personal communication). It ought to be possible to achieve fairly reasonable results with much less computation. The IBM algorithm is much more efficient since they were able to extract nearly 3 million pairs of sentences from Hansard materials in 10 days of running time on an IBM Model 3090 mainframe computer with access to 16 megabytes of virtual memory (Brown, Lai, and Mercer 1991).</Paragraph>
<Paragraph position="8"> 2 After we finished most of this work, it came to our attention that the IBM MT group has at least four papers that mention sentence alignment. Brown et al. (1988a,b) start from a set of aligned sentences, suggesting that they had a solution to the sentence alignment problem back in 1988. Brown et al. (1990) mention that sentence lengths formed the basis of their method. The draft by Brown, Lai, and Mercer (1991) describes their process without giving equations.</Paragraph>
<Paragraph position="9"> Table 4 A bilingual concordance.</Paragraph>
<Paragraph position="10"> bank/banque ("money" sense)
it could also be a place where we would have a bank of experts. SENT i know several people who a
être le lieu où se retrouverait une espèce de banque d'experts. SENT je connais plusieurs pers
f finance (mr. wilson) and the governor of the bank of canada have frequently on behalf of the ca
es finances (m. wilson) et le gouverneur de la banque du canada ont fréquemment utilisé au co
reduced by over 800 per cent in one week through bank action. SENT there was a haberdasher who wou
us de 800 p. 100 en une semaine à cause d'une banque. SENT voilà un chemisier qui aurait appr
bank/banc ("place" sense)
h a forum. SENT such was the case in the georges bank issue which was settled between canada and th
entre les états-unis et le canada à propos du banc de george. SENT c'est dans le but de ré
i did. SENT he said the nose and tail of the bank were surrendered by this government. SENT th
gouvernement avait cédé les extrémités du banc. SENT en fait, lors des négociations de 1
he fishing privileges on the nose and tail of the bank went down the tube before we even negotiated
les privilèges de pêche aux extrémités du banc ont été liquidés avant même qu'on ai</Paragraph>
<Paragraph position="11"> The evaluation of results has been absent or rudimentary. Kay gives positive examples of the alignment process, but no counts of error rates. Brown, Lai, and Mercer (1991) report that they achieve a 0.6% error rate when the algorithm suggests aligning one sentence with one sentence. However, they do not characterize its performance overall or on the more difficult cases.</Paragraph>
<Paragraph position="12"> Since the research community has not had access to a practical sentence alignment program, we thought that it would be helpful to describe such a program (align) and to evaluate its results. In addition, a large sample of Canadian Hansards (approximately 90 million words, half in French and half in English) has been aligned with the align program and has been made available to the general research community through the Data Collection Initiative of the Association for Computational Linguistics (ACL/DCI). In order to facilitate replication of the align program, an appendix is provided with detailed C code of the more difficult core of the align program.</Paragraph>
<Paragraph position="13"> The align program is based on a very simple statistical model of character lengths. The model makes use of the fact that longer sentences in one language tend to be translated into longer sentences in the other language, and that shorter sentences tend to be translated into shorter sentences.
A probabilistic score is assigned to each proposed pair of sentences, based on the ratio of lengths of the two sentences (in characters) and the variance of this ratio. This probabilistic score is used in a dynamic programming framework in order to find the maximum likelihood alignment of sentences.</Paragraph>
<Paragraph position="14"> It is remarkable that such a simple approach can work as well as it does. An evaluation was performed based on a trilingual corpus of 15 economic reports issued by the Union Bank of Switzerland (UBS) in English, French, and German (14,680 words, 725 sentences, and 188 paragraphs in English and corresponding numbers in the other two languages). The method correctly aligned all but 4% of the sentences. Moreover, it is possible to extract a large subcorpus that has a much smaller error rate. By selecting the best-scoring 80% of the alignments, the error rate is reduced from 4% to 0.7%. There were more errors on the English-French subcorpus than on the English-German subcorpus, showing that error rates will depend on the corpus considered; however, both were small enough for us to hope that the method will be useful for many language pairs. We believe that the error rate is considerably lower in the Canadian Hansards because the translations are more literal.</Paragraph> </Section>
<Section position="3" start_page="78" end_page="78" type="metho"> <SectionTitle> 2. Paragraph Alignment </SectionTitle>
<Paragraph position="0"> The sentence alignment program is a two-step process. First paragraphs are aligned, and then sentences within a paragraph are aligned. It is fairly easy to align paragraphs in our trilingual corpus of Swiss banking reports since the boundaries are usually clearly marked. However, there are some short headings and signatures that can be confused with paragraphs. Moreover, these short "pseudo-paragraphs" are not always translated into all languages. On a corpus this small the paragraphs could have been aligned by hand. It turns out that "pseudo-paragraphs" usually have fewer than 50 characters and that real paragraphs usually have more than 100 characters. We used this fact to align the paragraphs automatically, checking the result by hand.</Paragraph>
<Paragraph position="1"> The procedure correctly aligned all of the English and German paragraphs. However, one of the French documents was badly translated and could not be aligned because of the omission of one long paragraph and the duplication of a short one. This document was excluded for the purposes of the remainder of this experiment.</Paragraph>
<Paragraph position="2"> We will show below that paragraph alignment is an important step, so it is fortunate that it is not particularly difficult. In aligning the Hansards, we found that paragraphs were often already aligned. For robustness, we decided to align paragraphs within certain fairly reliable regions (denoted by certain Hansard-specific formatting conventions) using the same method as that described below for aligning sentences within each paragraph.</Paragraph> </Section>
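The 50- and 100-character thresholds above amount to a very small filter for separating real paragraphs from headings and signatures. The following is a minimal sketch of that filter, not the code actually used for the UBS corpus; the function and type names, and the decision to flag the 50-100 character gray area for hand checking, are our own assumptions.

#include <string.h>

/* Classify a block of text by character count, following the observation that
   headings and signatures ("pseudo-paragraphs") usually have fewer than 50
   characters while real paragraphs usually have more than 100.  Blocks in the
   50-100 character gray area are flagged for checking by hand.
   Illustrative sketch only; the names and the gray-area policy are assumptions. */
enum block_kind { PSEUDO_PARAGRAPH, UNCERTAIN, REAL_PARAGRAPH };

static enum block_kind classify_block(const char *text)
{
    size_t n = strlen(text);

    if (n < 50)
        return PSEUDO_PARAGRAPH;   /* likely a heading or signature */
    if (n > 100)
        return REAL_PARAGRAPH;     /* likely a genuine paragraph    */
    return UNCERTAIN;              /* check by hand                 */
}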
<Section position="4" start_page="78" end_page="80" type="metho"> <SectionTitle> 3. A Dynamic Programming Framework </SectionTitle>
<Paragraph position="0"> Now, let us consider how sentences can be aligned within a paragraph. The program makes use of the fact that longer sentences in one language tend to be translated into longer sentences in the other language, and that shorter sentences tend to be translated into shorter sentences. 3 A probabilistic score is assigned to each proposed pair of sentences, based on the ratio of lengths of the two sentences (in characters) and the variance of this ratio. This probabilistic score is used in a dynamic programming framework in order to find the maximum likelihood alignment of sentences. The following striking figure could easily lead one to this approach. Figure 1 shows that the lengths (in characters) of English and German paragraphs are highly correlated (.991).</Paragraph>
<Paragraph position="1"> Figure 1 Paragraph lengths are highly correlated. The horizontal axis shows the length of English paragraphs, while the vertical scale shows the lengths of the corresponding German paragraphs. Note that the correlation is quite large (.991).</Paragraph>
<Paragraph position="2"> 3 We will have little to say about how sentence boundaries are identified. Identifying sentence boundaries is not always as easy as it might appear for reasons described in Liberman and Church (in press). It would be much easier if periods were always used to mark sentence boundaries; but unfortunately, many periods have other purposes. In the Brown Corpus, for example, only 90% of the periods are used to mark sentence boundaries; the remaining 10% appear in numerical expressions, abbreviations, and so forth. In the Wall Street Journal, there is even more discussion of dollar amounts and percentages, as well as more use of abbreviated titles such as Mr.; consequently, only 53% of the periods in the Wall Street Journal are used to identify sentence boundaries. For the UBS data, a simple set of heuristics was used to identify sentence boundaries. The dataset was sufficiently small that it was possible to correct the remaining mistakes by hand. For a larger dataset, such as the Canadian Hansards, it was not possible to check the results by hand. We used the same procedure that is used in Church (1988). This procedure was developed by Kathryn Baker (unpublished).</Paragraph>
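Footnote 3 notes that sentence boundaries in the UBS data were found with a simple set of heuristics and that periods alone are unreliable markers (90% of periods end sentences in the Brown Corpus, only 53% in the Wall Street Journal). The sketch below is not the Church (1988) / Baker procedure used for the Hansards; it is a deliberately naive period rule with a small number and abbreviation check, included only to illustrate the kind of heuristic involved and why its output still needs hand correction. The function name and the abbreviation list are our own assumptions.

#include <ctype.h>
#include <string.h>

/* Return nonzero if the punctuation mark at text[i] plausibly ends a sentence.
   Deliberately naive: a period, question mark, or exclamation point followed
   by white space and an upper-case letter counts, unless the period sits
   inside a number (e.g., "4.5") or ends one of a few listed abbreviations.
   Illustrative sketch only. */
static int ends_sentence(const char *text, size_t i)
{
    static const char *abbrev[] = { "Mr.", "Mrs.", "Dr.", "St.", "e.g.", "i.e." };
    size_t k;

    if (text[i] != '.' && text[i] != '?' && text[i] != '!')
        return 0;
    if (text[i] == '.' && isdigit((unsigned char)text[i + 1]))
        return 0;                                  /* 4.5, 1.06, ...      */
    for (k = 0; k < sizeof abbrev / sizeof abbrev[0]; k++) {
        size_t n = strlen(abbrev[k]);
        if (i + 1 >= n && strncmp(text + i + 1 - n, abbrev[k], n) == 0)
            return 0;                              /* "Mr." and friends   */
    }
    return isspace((unsigned char)text[i + 1]) &&
           isupper((unsigned char)text[i + 2]);
}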
<Paragraph position="3"> Dynamic programming is often used to align two sequences of symbols in a variety of settings, such as genetic code sequences from different species, speech sequences from different speakers, gas chromatograph sequences from different compounds, and geologic sequences from different locations (Sankoff and Kruskal 1983). We could expect these matching techniques to be useful, as long as the order of the sentences does not differ too radically between the two languages. Details of the alignment techniques differ considerably from one application to another, but all use a distance measure to compare two individual elements within the sequences and a dynamic programming algorithm to minimize the total distances between aligned elements within two sequences. We have found that the sentence alignment problem fits fairly well into this framework, though it is necessary to introduce a fairly interesting innovation into the structure of the distance measure.</Paragraph>
<Paragraph position="4"> Kruskal and Liberman (1983) describe distance measures as belonging to one of two classes: trace and time-warp. The difference becomes important when a single element of one sequence is being matched with multiple elements from the other. In trace applications, such as genetic code matching, the single element is matched with just one of the multiple elements, and all of the others will be ignored. In contrast, in time-warp applications such as speech template matching, the single element is matched with each of the multiple elements, and the single element will be used in multiple matches. Interestingly enough, our application does not fit into either of Kruskal and Liberman's classes because our distance measure needs to compare the single element with an aggregate of the multiple elements.</Paragraph> </Section>
<Section position="5" start_page="80" end_page="83" type="metho"> <SectionTitle> 4. The Distance Measure </SectionTitle>
<Paragraph position="0"> It is convenient for the distance measure to be based on a probabilistic model so that information can be combined in a consistent way. Our distance measure is an estimate of -log Prob(match | δ), where δ depends on l1 and l2, the lengths of the two portions of text under consideration. The log is introduced here so that adding distances will produce desirable results.</Paragraph>
<Paragraph position="1"> This distance measure is based on the assumption that each character in one language, L1, gives rise to a random number of characters in the other language, L2. We assume these random variables are independent and identically distributed with a normal distribution. The model is then specified by the mean, c, and variance, s^2, of this distribution: c is the expected number of characters in L2 per character in L1, and s^2 is the variance of the number of characters in L2 per character in L1. We define δ to be (l2 - l1c)/sqrt(l1 s^2) so that it has a normal distribution with mean zero and variance one (at least when the two portions of text under consideration actually do happen to be translations of one another).</Paragraph>
<Paragraph position="2"> Figure 2 is a check on the assumption that δ is normally distributed. The figure is constructed using the parameters c and s^2 estimated for the program.</Paragraph>
<Paragraph position="3"> Figure 2 Delta is approximately normal. The horizontal axis shows δ, while the vertical scale shows the empirical density of δ for the hand-aligned regions as points, and a normal (0, 1) density plot (lines) for comparison. The empirical density is slightly more peaked than normal (and its mean is not quite zero), but the differences are small enough for the purposes of the algorithm.</Paragraph>
<Paragraph position="4"> Figure 3 Variance is modeled proportional to length. The horizontal axis plots the length of English paragraphs, while the vertical axis shows the square of the difference of English and German lengths, an estimate of variance. The plot indicates that variance increases with length, as predicted by the model. The line shows the result of a robust regression analysis. Five extreme points lying above the top of this figure have been suppressed since they did not contribute to the robust regression.</Paragraph>
<Paragraph position="5"> The parameters c and s^2 are determined empirically from the UBS data. We could estimate c by counting the number of characters in German paragraphs, then dividing by the number of characters in corresponding English paragraphs. We obtain 81105/73481 ≈ 1.1. The same calculation on French and English paragraphs yields c ≈ 72302/68450 ≈ 1.06 as the expected number of French characters per English character.
As will be explained later, performance does not seem to be very sensitive to these precise language-dependent quantities, and therefore we simply assume the language-independent value c ≈ 1, which simplifies the program considerably. This value would clearly be inappropriate for English-Chinese alignment, but it seems likely to be useful for most pairs of European languages.</Paragraph>
<Paragraph position="6"> s^2 is estimated from Figure 3. The model assumes that s^2 is proportional to length. The constant of proportionality is determined by the slope of the robust regression line shown in the figure. The result for English-German is s^2 = 7.3, and for English-French is s^2 = 5.6. Again, we will see that the difference in the two slopes is not too important. Therefore, we can combine the data across languages, and adopt the simpler language-independent estimate s^2 ≈ 6.8, which is what is actually used in the program.</Paragraph>
<Paragraph position="7"> We now appeal to Bayes Theorem to estimate Prob(match | δ) as a constant times Prob(δ | match) Prob(match). The constant can be ignored since it is the same for all proposed matches. The conditional probability is estimated as Prob(δ | match) = 2(1 - Prob(|δ|)), where Prob(|δ|) is the probability that a random variable, z, with a standardized (mean zero, variance one) normal distribution, is no larger than |δ|. That is, Prob(|δ|) = (1/sqrt(2π)) ∫_{-∞}^{|δ|} e^(-z^2/2) dz, so that 2(1 - Prob(|δ|)) is the probability that z has magnitude at least as large as |δ|.</Paragraph>
<Paragraph position="8"> The program computes δ directly from the lengths of the two portions of text, l1 and l2, and the two parameters, c and s^2. That is, δ = (l2 - l1c)/sqrt(l1 s^2). Then, Prob(|δ|) is computed by integrating a standard normal distribution (with mean zero and variance one). Many statistics textbooks include a table for computing this. The code in the appendix uses the pnorm function, which is based on an approximation described by Abramowitz and Stegun (1964; p. 932, equation 26.2.17).</Paragraph>
<Paragraph position="9"> The prior probability of a match, Prob(match), is fit with the values in Table 5, which were determined from the hand-marked UBS data. We have found that a sentence in one language normally matches exactly one sentence in the other language (1-1). Three additional possibilities are also considered: 1-0 (including 0-1), 2-1 (including 1-2), and 2-2. Table 5 shows all four possibilities.</Paragraph>
<Paragraph position="10"> This completes the discussion of the distance measure. Prob(match | δ) is computed as an (irrelevant) constant times Prob(δ | match) Prob(match). Prob(match) is computed using the values in Table 5. Prob(δ | match) is computed by assuming that Prob(δ | match) = 2(1 - Prob(|δ|)), where δ has a standard normal distribution. We first calculate δ as (l2 - l1c)/sqrt(l1 s^2), and then Prob(|δ|) is computed by integrating a standard normal distribution. See the function two_side_distance in the appendix for an example of a C code implementation of these calculations.</Paragraph>
<Paragraph position="11"> The distance function d, represented in the program as two_side_distance, is defined in a general way to allow for insertions, deletions, substitutions, etc. The function takes four arguments: x1, y1, x2, y2.</Paragraph>
<Paragraph position="12">
1. Let d(x1, y1; 0, 0) be the cost of substituting x1 with y1,
2. d(x1, 0; 0, 0) be the cost of deleting x1,
3. d(0, y1; 0, 0) be the cost of insertion of y1,
4. d(x1, y1; x2, 0) be the cost of contracting x1 and x2 to y1,
5. d(x1, y1; 0, y2) be the cost of expanding x1 to y1 and y2, and
6. d(x1, y1; x2, y2) be the cost of merging x1 and x2 and matching with y1 and y2.</Paragraph>
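To make the preceding pieces concrete, the sketch below computes the cost of a 1-1 match: δ = (l2 - l1c)/sqrt(l1 s^2) with the language-independent values c = 1 and s^2 = 6.8 adopted above, pnorm approximating the standard normal cumulative distribution with Abramowitz and Stegun's equation 26.2.17, and the cost taken as -log Prob(δ | match) - log Prob(match). This is a sketch in the spirit of the appendix code, not a copy of it; the function name one_to_one_cost, the probability floor, and the 1-1 prior of 0.89 (a value commonly used in reimplementations of this method, not one quoted in this section) are our own assumptions.

#include <math.h>

/* Standard normal cumulative distribution function for z >= 0, using the
   polynomial approximation of Abramowitz and Stegun (1964), equation 26.2.17. */
static double pnorm(double z)
{
    double t = 1.0 / (1.0 + 0.2316419 * z);
    return 1.0 - 0.3989423 * exp(-z * z / 2.0) *
        ((((1.330274429 * t - 1.821255978) * t + 1.781477937) * t
          - 0.356563782) * t + 0.319381530) * t;
}

/* Cost of aligning a portion of l1 characters with a portion of l2 characters
   as a 1-1 match: -log Prob(delta | match) - log Prob(match), where
   delta = (l2 - l1*c)/sqrt(l1*s2) and Prob(delta | match) = 2(1 - pnorm(|delta|)).
   c = 1 and s2 = 6.8 are the language-independent values adopted in the text;
   the 1-1 prior of 0.89 is an assumed value, not quoted in this section. */
static double one_to_one_cost(int l1, int l2)
{
    const double c = 1.0, s2 = 6.8, prior_1_1 = 0.89;
    double delta, pd;

    if (l1 == 0 && l2 == 0)
        return 0.0;                        /* nothing to compare */
    if (l1 == 0)
        l1 = 1;                            /* avoid division by zero */
    delta = (l2 - l1 * c) / sqrt(l1 * s2);
    pd = 2.0 * (1.0 - pnorm(fabs(delta))); /* Prob(delta | match) */
    if (pd < 1e-10)
        pd = 1e-10;                        /* floor keeps the log finite */
    return -log(pd) - log(prior_1_1);
}

The full two_side_distance treats the 1-0, 2-1, and 2-2 cases the same way, presumably comparing a sentence against the summed character counts of the aggregated sentences (the "aggregate" mentioned in Section 3) and substituting the corresponding prior from Table 5.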
<Paragraph position="13"> 5. The Dynamic Programming Algorithm</Paragraph>
<Paragraph position="14"> The algorithm is summarized in the following recursion equation. Let si, i = 1, ..., I, be the sentences of one language, and tj, j = 1, ..., J, be the translations of those sentences in the other language. Let d be the distance function described in the previous section, and let D(i, j) be the minimum distance between sentences s1, ..., si and their translations t1, ..., tj, under the maximum likelihood alignment. D(i, j) is computed by minimizing over six cases (substitution, deletion, insertion, contraction, expansion, and merger) which, in effect, impose a set of slope constraints. That is, D(i, j) is defined by the following recurrence with the initial condition D(0, 0) = 0.</Paragraph>
<Paragraph position="15"> D(i, j) = min of:
D(i, j-1) + d(0, tj; 0, 0) [insertion],
D(i-1, j) + d(si, 0; 0, 0) [deletion],
D(i-1, j-1) + d(si, tj; 0, 0) [substitution],
D(i-1, j-2) + d(si, tj; 0, tj-1) [expansion],
D(i-2, j-1) + d(si, tj; si-1, 0) [contraction],
D(i-2, j-2) + d(si, tj; si-1, tj-1) [merger].</Paragraph> </Section> </Paper>
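As a companion to the recurrence above, here is a minimal sketch of the dynamic programming loop. It assumes a cost function d(x1, y1, x2, y2) following the six-argument convention of Section 4 (for example, built from the one-to-one cost sketched earlier); the array layout, the fixed MAXSENTS bound, and the function names are our own assumptions, and the backpointer bookkeeping needed to recover the actual alignment is omitted.

#include <float.h>

#define MAXSENTS 500   /* assumed upper bound on sentences per paragraph */

/* d(x1, y1; x2, y2): cost of aligning x1 (and x2, if nonzero) characters of
   language 1 with y1 (and y2, if nonzero) characters of language 2, in the
   six-argument convention of Section 4.  Defined elsewhere. */
double d(int x1, int y1, int x2, int y2);

/* Fill D(i, j), the minimum total distance between sentences s1, ..., si
   (character lengths ls[0..i-1]) and t1, ..., tj (lengths lt[0..j-1]),
   by minimizing over the six cases of the recurrence above. */
static double min_distance(const int *ls, int I, const int *lt, int J)
{
    static double D[MAXSENTS + 1][MAXSENTS + 1];   /* assumes I, J <= MAXSENTS */
    int i, j;

    for (i = 0; i <= I; i++) {
        for (j = 0; j <= J; j++) {
            double best = (i == 0 && j == 0) ? 0.0 : DBL_MAX;
            double cand;

            if (j > 0) {                           /* insertion (0-1)     */
                cand = D[i][j - 1] + d(0, lt[j - 1], 0, 0);
                if (cand < best) best = cand;
            }
            if (i > 0) {                           /* deletion (1-0)      */
                cand = D[i - 1][j] + d(ls[i - 1], 0, 0, 0);
                if (cand < best) best = cand;
            }
            if (i > 0 && j > 0) {                  /* substitution (1-1)  */
                cand = D[i - 1][j - 1] + d(ls[i - 1], lt[j - 1], 0, 0);
                if (cand < best) best = cand;
            }
            if (i > 0 && j > 1) {                  /* expansion (1-2)     */
                cand = D[i - 1][j - 2] + d(ls[i - 1], lt[j - 1], 0, lt[j - 2]);
                if (cand < best) best = cand;
            }
            if (i > 1 && j > 0) {                  /* contraction (2-1)   */
                cand = D[i - 2][j - 1] + d(ls[i - 1], lt[j - 1], ls[i - 2], 0);
                if (cand < best) best = cand;
            }
            if (i > 1 && j > 1) {                  /* merger (2-2)        */
                cand = D[i - 2][j - 2] + d(ls[i - 1], lt[j - 1], ls[i - 2], lt[j - 2]);
                if (cand < best) best = cand;
            }
            D[i][j] = best;
        }
    }
    return D[I][J];   /* backpointers would record which of the six cases won */
}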