<?xml version="1.0" standalone="yes"?> <Paper uid="E95-1010"> <Title>Text Alignment in the Real World: Improving Alignments of Noisy Translations Using Common Lexical Features, String Matching Strategies and N-Gram Comparisons</Title> <Section position="3" start_page="67" end_page="68" type="metho"> <SectionTitle> 2 A General Approach </SectionTitle> <Paragraph position="0"> The byte-length ratio methods are very general in that they rely only upon a heuristic segmentation procedure to divide a text into sentence-level chunks.</Paragraph> <Paragraph position="1"> Although determining sentence boundaries can be problematic across languages, simple assumptions appear to work well even for comparisons between European and Oriental languages, primarily because the segmentation heuristic is uniformly applied to each document; an "undersegmented" section can therefore combine to match single blocks in the opposite language as necessary.</Paragraph> <Paragraph position="2"> Less general would be a method that relied on deep analysis of the source texts to determine appropriate boundaries for alignment blocks. A model that accounted for all of the formatting discrepancies, comparative rescalings of sentence or phrase length due to the economy of the language expression, and other properties that may define a corpus will not necessarily be appropriate to other corpora or to text in general.</Paragraph> <Paragraph position="3"> We chose to remain as general as possible in our investigation of alignment methods. In particular, the heuristics for text segmentation regarded periods followed by a space, a newline (paragraph boundaries) or a tab as a sentence boundary for both English and Spanish texts (Figure 1). Multiple periods separated by spaces were ignored for alignment segmentation to allow for ellipsis. This approach did not, therefore, regard abbreviations as a unique class of textual event. The end result was an extremely simplistic segmentation.</Paragraph> [Figure 1: an example document pair from the PAHO corpus; the English sample reads in part: "It describes the situation of malaria in the Region in 1990, summarizing the information obtained from the Governments in response to the questionnaire sent to them annually."] </Section>
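This segmentation heuristic is simple enough to state in a few lines of code. The following is a minimal sketch of the rule as described above, not the authors' implementation; the regular expression and the helper name segment are illustrative assumptions.

```python
import re

# Sentence boundary: a period followed by a space, a newline (paragraph
# boundary) or a tab. Runs of periods separated by spaces (ellipses, dot
# leaders) are skipped, so abbreviations receive no special treatment.
# Hypothetical pattern and helper, written to match the rule as described.
_BOUNDARY = re.compile(r'(?<!\. )\.[ \t\n]+(?!\.)')

def segment(text):
    """Split a document into sentence-level alignment blocks."""
    return [block.strip() for block in _BOUNDARY.split(text) if block.strip()]
```

Because the same heuristic is applied uniformly to both languages, undersegmented stretches simply yield blocks that the aligner can merge as needed.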
<Section position="4" start_page="68" end_page="68" type="metho"> <SectionTitle> 3 The PAHO Corpus: Noisy, Problematic Texts </SectionTitle> <Paragraph position="0"> The PAHO texts serve as an important counterpart to our translator's workstation, Norm (Ogden, 1993). During the translation process, translators can access many different resources, including a variety of on-line dictionaries, reference works and parallel texts. The parallel texts include examples of translations that different translators have compiled in the past and serve as a series of examples of how to translate words and phrases in a particular context. The PAHO texts also serve as a basis for our multi-lingual information retrieval system (Davis and Dunning, 1995; Dunning and Davis, 1993a, 1993b). Robust strategies for processing and aligning large parallel corpora automatically are therefore a critical component of our ongoing research.</Paragraph> <Paragraph position="1"> In the PAHO corpus, many of the texts are well-behaved, with similar tokenization at the boundaries delineating paragraphs. But some are extremely noisy, with added text in the English or Spanish document that lacks a counterpart in the parallel document. Formatting conventions differ in many cases: multiple periods delimit contents listings in one document while spaces serve a similar role in the other, and table and reference formats differ between the two texts. Another formatting problem is the addition of significant runs of whitespace and newlines that simply do not occur in the parallel text. The document pair shown in Figure 1 is representative of the quality of the PAHO texts.</Paragraph> </Section> <Section position="5" start_page="68" end_page="68" type="metho"> <SectionTitle> 4 Features and Alignment </SectionTitle> <Paragraph position="0"> One of the most striking features of English-Spanish translations is the fact that native English speakers with little knowledge of Spanish appear able to identify parallel texts with remarkable accuracy.</Paragraph> <Paragraph position="1"> The reason appears to be the large number of cognate terms that Spanish and English translations share, especially technical terms, and other lexical features such as numbers and proper names that may appear with similar placement and frequency across two parallel texts. The work by Simard, Foster and Isabelle (1993) as well as Church (1993) demonstrated that cognate-matching strategies can be highly effective in aligning text. Native English speakers with limited Spanish appear to be capable of aligning even noisy texts like many of the PAHO documents, with the difficulty causing a decrease in the speed of alignment rather than in its accuracy. From these observations, we examined five different sources of information for alignment discrimination: byte-length ratios, n-gram matching, ordered string comparisons, number matching and translation residues. The analyses of these information sources are presented in sections 5.1 through 5.5.</Paragraph> <Paragraph position="2"> For each method, a hand-aligned document from the PAHO corpus that was problematic for byte-ratio methods was used for evaluation, first for comparing the method's score distribution between random blocks and the hand-aligned set, then for performing realignments of the documents. The document was quite long for the PAHO set, containing about 1400 lines of text and 360 alignment blocks in the English document and 1000 lines and 297 blocks in the Spanish text. In this particular pair, the English text had nearly 400 lines of extraneous data appended to its end that were not in the Spanish document, increasing the error potential for byte-length methods.</Paragraph> </Section> <Section position="6" start_page="68" end_page="72" type="metho"> <SectionTitle> 5 Improving Alignments </SectionTitle> <Paragraph position="0"> We used a modified and extended version of Gale and Church's byte-ratio algorithm (1991) as a basis for an improved alignment algorithm. The standard algorithm derives a penalty from a ratio of the byte lengths of two potential aligned blocks and augments the penalty with a factor based on the frequency of matches between blocks in one language that equate to a block or blocks in the second language. The byte-ratio penalty is the measurement-conditioned, or a posteriori, probability of a match, while the frequency of block matches gives the a priori probability of the same match.</Paragraph>
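As a point of reference, the standard penalty can be written as the sum of these two terms. This is a sketch of the familiar Gale and Church (1991) decomposition in the notation introduced below, not a formula taken from this paper:

```latex
% Standard byte-ratio penalty for one candidate pairing: an a posteriori
% (measurement-conditioned) term plus an a priori (match-frequency) term.
\mathrm{penalty}(\alpha \leftrightarrow \beta)
  = -\log P(\delta \mid \alpha \leftrightarrow \beta)
    - \log P(\alpha \leftrightarrow \beta)
```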
<Paragraph position="1"> Our version of the basic algorithm differs in the mechanics of memory management (we use skip lists to improve the performance of the dynamic programming, for example), includes both positive and negative information about the probability of a given match, and fuses multiple sources of information for evaluating alignment probabilities.</Paragraph> <Paragraph position="2"> For two documents, $D_1$ and $D_2$, consisting of $n$ and $m$ alignment blocks, respectively, with $a_i \leq n$ and $b_j \leq m$, an alignment, $A$, is a set consisting of $a_i \ldots a_{i+l} \leftrightarrow b_j \ldots b_{j+p}$ pairs. For compactness, we will write this as $\alpha_{i,l} \leftrightarrow \beta_{j,p}$.</Paragraph> <Paragraph position="3"> Following the Gale and Church approach, we choose an alignment that maximizes the probability over all possible alignments: $\arg\max_A [P(A \mid D_1, D_2)]$. If we assume that the probabilities of individually aligned block pairs in an alignment are independent, the above equation becomes a product of the pairwise probabilities $P(\alpha_{i,l} \leftrightarrow \beta_{j,p} \mid D_1, D_2)$ over the pairs in $A$.</Paragraph> <Paragraph position="4"> Further assuming that the individual probabilities of aligning two blocks, $P(\alpha_{i,l} \leftrightarrow \beta_{j,p} \mid D_1, D_2)$, are dependent on features in the text described by a series of feature scores, $\delta_k$, the above equation expands into Equation 1 in Figure 2.</Paragraph> <Paragraph position="5"> Now, for each of the feature scoring functions, the a posteriori probabilities can be calculated from Bayes' rule as shown in Equation 2, Figure 2, which, given an approximation of the joint a posteriori probabilities by assuming independence, produces Equation 3. Note that the term in the denominator of Equation 3 reflects the statistics of both the positive and the negative information for the alignment. In Gale and Church's original work, the denominator term was assumed to be a constant over the range of $\delta$, and therefore could be safely ignored during the maximization of probabilities over the alignment set. In reality, this assumption holds only for a uniform distribution of $P(\delta_k \mid \lnot(\alpha_{i,l} \leftrightarrow \beta_{j,p}))$, and is perhaps not even true in that case due to the scaling properties of the logarithm when the maximization problem above is converted to a minimization problem (below).</Paragraph> <Paragraph position="6"> In any case, the probability of a given value of $\delta$ occurring is not merely dependent on the probability of that score in the hand-aligned set, but on the comparative probabilities of the score for the hand-aligned set and for a set of randomly chosen alignment blocks. Clearly, if a value of $\delta$ is equally likely for both the hand-aligned and random sets, then the measurement cannot contribute to the decision process. Equation 3 presents a very general approach to the fusion of multiple sources of information about alignment probabilities. Each of the sources contributes to the overall probability of an alignment, but is in turn scaled by the total probability of a given score occurring over the entire set of possible alignments.</Paragraph> <Paragraph position="7"> We can convert the maximization of probabilities into a minimization of penalties by taking the negative logarithm of Equation 2 and substituting Equation 3, where $\theta_1$, $\theta_2$ and $\theta_3$ are as given in Figure 2, Equations 5, 6 and 7; Equation 8 in the same figure is the result.</Paragraph>
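Figure 2 is not reproduced here. A plausible reconstruction of Equations 1 through 3 from the surrounding prose, whose exact typography in the original may differ, is:

```latex
% Eq. 1 (reconstructed): independence over block pairs, each conditioned
% on its feature scores \delta_1, ..., \delta_K:
\arg\max_A \prod_{(\alpha_{i,l} \leftrightarrow \beta_{j,p}) \in A}
    P(\alpha_{i,l} \leftrightarrow \beta_{j,p} \mid \delta_1, \ldots, \delta_K)

% Eq. 2 (reconstructed): Bayes' rule for a single pairing:
P(\alpha \leftrightarrow \beta \mid \delta_1, \ldots, \delta_K)
  = \frac{P(\delta_1, \ldots, \delta_K \mid \alpha \leftrightarrow \beta)\,
          P(\alpha \leftrightarrow \beta)}
         {P(\delta_1, \ldots, \delta_K)}

% Eq. 3 (reconstructed): independence of the scores, with the denominator
% expanded over the positive (aligned) and negative (non-aligned) cases:
P(\alpha \leftrightarrow \beta \mid \delta_1, \ldots, \delta_K)
  = \frac{\prod_k P(\delta_k \mid \alpha \leftrightarrow \beta)\,
          P(\alpha \leftrightarrow \beta)}
         {\prod_k \bigl[P(\delta_k \mid \alpha \leftrightarrow \beta)\,
            P(\alpha \leftrightarrow \beta)
          + P(\delta_k \mid \lnot(\alpha \leftrightarrow \beta))\,
            P(\lnot(\alpha \leftrightarrow \beta))\bigr]}
```

Taking the negative logarithm of the reconstructed Equation 3 yields the additive penalty, corresponding to Equation 8, that the dynamic program minimizes.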
<Paragraph position="8"> The feature functions, $\delta_k$, are derived in our approach from estimates of the probabilities of byte-length differences, number-matching scores and string-match scores.</Paragraph> <Paragraph position="9"> The Bayesian prior, $P(\alpha_{i,l} \leftrightarrow \beta_{j,p})$, can be estimated as per Gale and Church (1991) by assuming that it is equal to the frequency of distinct n-m matches in the training set.</Paragraph> <Section position="1" start_page="70" end_page="71" type="sub_section"> <SectionTitle> 5.1 Byte-length Ratios, $\delta_1$ </SectionTitle> <Paragraph position="0"> The probability of an alignment based on byte-length ratios is $P(\delta_1 \mid \alpha_{i,l} \leftrightarrow \beta_{j,p}) = P(\delta_1(l(\alpha_{i,l}), l(\beta_{j,p})))$, where $l()$ is the byte-length function. The distribution is assumed to be a Gaussian random variable derived from the block-length differences in the hand-aligned set. Following Gale and Church (1991), the slope of the average of the length differences describes the average number of Spanish characters generated per English character. Assuming that the distribution is approximately Gaussian, we can normalize it to mean 0 and variance 1, resulting in: $\delta_1(l(\alpha_{i,l}), l(\beta_{j,p})) = \frac{l(\beta_{j,p}) - l(\alpha_{i,l})\,c}{\sqrt{l(\alpha_{i,l})\,\sigma^2}}$, where $c = E(l(\beta_{j,p}) / l(\alpha_{i,l})) = 0.99$ and $\sigma^2 \approx 0.16$ is the observed variance. The histogram in Figure 3a shows the actual distribution of the hand-aligned data set.</Paragraph> <Paragraph position="1"> The shape of the histogram is approximately Gaussian. The distribution of the corresponding random segments is shown in Figure 3b. Note that the distribution of the random set has a significantly higher standard deviation than the corresponding hand-aligned set. This diagram, as well as Figure 4 for the n-gram approach, indicates the statistical quality of the information provided by the scores: good sources of information produce a marked difference between the two distributions, while comparatively poor sources show little or no difference.</Paragraph> </Section> <Section position="2" start_page="70" end_page="71" type="sub_section"> <SectionTitle> 5.2 4-gram Matching, $\delta_2$ </SectionTitle> <Paragraph position="0"> Cognates in English and Spanish often have short runs of letters in common. A measure that counts the number of matching n-grams in two strings is an unordered comparison of similarities within the strings; in this way, runs of letters in common between cognate terms are measured. We used an efficient n-gram matching algorithm that requires a single scan of each string, followed by two sorts and a linear-time list comparison to count matches. The resulting score was normalized by the total number of n-grams between the strings. Formally, for two strings $e_1 e_2 \ldots e_p$ and $s_1 s_2 \ldots s_q$, the n-gram match count, $K_n$, is given by $K_n = \sum_i \sum_j m(e_i \ldots e_{i+n-1},\, s_j \ldots s_{j+n-1}) \,/\, [(p - n + 1) + (q - n + 1)]$, where $m()$ is the matching function, equal to 1 only for equivalent n-grams and 0 otherwise.</Paragraph> <Paragraph position="1"> We chose to use 4-gram scores for the alignment algorithm, $\delta_2 = K_4$. The distributions of the 4-gram counts were computed for both the hand-aligned and random alignment blocks. Figure 4 shows the resulting distributions. The results suggest that, on the whole, the use of n-gram methods should be considered for improving alignments that contain lexically similar cognates. Being unordered comparisons, however, they cannot exploit any intrinsic sequencing of lexical elements.</Paragraph>
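The counting procedure just described (one scan per string, two sorts and a linear-time comparison) can be sketched as follows. This is an illustrative reimplementation rather than the authors' code, and it pairs each n-gram at most once during the merge, which is one reasonable reading of the matching function:

```python
def ngram_match_score(e, s, n=4):
    """Normalized count of matching n-grams between two strings (delta_2)."""
    # One scan of each string to collect its n-grams, then two sorts.
    eg = sorted(e[i:i + n] for i in range(len(e) - n + 1))
    sg = sorted(s[j:j + n] for j in range(len(s) - n + 1))
    # Linear-time merge over the sorted lists to count matches.
    i = j = matches = 0
    while i < len(eg) and j < len(sg):
        if eg[i] == sg[j]:
            matches += 1
            i += 1
            j += 1
        elif eg[i] < sg[j]:
            i += 1
        else:
            j += 1
    # Normalize by the total number of n-grams between the strings.
    total = len(eg) + len(sg)
    return matches / total if total else 0.0
```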
</Section> <Section position="3" start_page="71" end_page="71" type="sub_section"> <SectionTitle> 5.3 Ordered String Comparisons, $\delta_3$ </SectionTitle> <Paragraph position="0"> The value of unordered comparisons like n-gram matching may be enhanced by ordered comparisons.</Paragraph> <Paragraph position="1"> An ordered comparison can reduce the noise associated with matching unrelated n-grams at opposite ends of parallel alignment blocks. We chose to evaluate a simple string-matching scheme as a possible method for improving alignment performance. The scheme compares the two alignment blocks character-by-character, skipping over sections in one block that do not match in the opposite, thus primarily penalizing the inclusion of dissimilar text segments in either block. The resulting sum of the matches is scaled by the sum of the lengths of the two blocks. In comparison with the random block scoring, the distribution of the hand-aligned data set had a greater number of matches with high string-match scores.</Paragraph> </Section> <Section position="4" start_page="71" end_page="71" type="sub_section"> <SectionTitle> 5.4 Number Matching, $\delta_4$ </SectionTitle> <Paragraph position="0"> The PAHO texts are distinguished by a number of textual features, especially the fact that they are all in some way related to Latin American health issues.</Paragraph> <Paragraph position="1"> The preponderance of the documents are technical reports on epidemiology, proceedings from meetings and conferences, and compendiums of resources and citations. Within these documents, numbers occur regularly. The string-matching technique suggested that if a class of lexical distinction could be matched directly, the alignments might be significantly improved. Numbers are sufficiently general that we felt we were not violating the spirit of the restriction on generality by using a number-matching scheme.</Paragraph> <Paragraph position="2"> For each alignment block pair, the number-matching algorithm extracted all numbers. The total number of exact matches between the number sets from each alignment block was then normalized by the sizes of both sets of numbers. This approach has several drawbacks, such as the differences in the format of numbers between Spanish and English: in Spanish, for example, commas are used instead of decimal points.</Paragraph> <Paragraph position="3"> These distinctions were ignored, however, to preserve the generality of the algorithm. This generality will potentially extend to other languages, including Asiatic languages, which tend to use Arabic numerals to represent numbers. The distributions of both the hand-aligned and random block scoring showed a substantial mass of very low scores.</Paragraph> <Paragraph position="4"> It should be noted that numbers are simply a special case of cognates and certainly contribute to the n-gram scores. Adding number-matching strategies therefore only enhances the n-gram results.</Paragraph>
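A minimal sketch of this scoring, under the stated design, follows. The number pattern and the symmetric normalization are assumptions; the paper specifies only that exact matches are counted and normalized by the sizes of both number sets:

```python
import re
from collections import Counter

# Matches integers and numbers with either decimal points or decimal
# commas; format differences between English and Spanish are deliberately
# NOT reconciled, preserving the language-generality of the method.
_NUMBER = re.compile(r'\d+(?:[.,]\d+)*')

def number_match_score(block_a, block_b):
    """Normalized count of exact number matches between two blocks (delta_4)."""
    nums_a = _NUMBER.findall(block_a)
    nums_b = _NUMBER.findall(block_b)
    if not nums_a or not nums_b:
        return 0.0
    # Count exact matches, pairing each occurrence at most once.
    ca, cb = Counter(nums_a), Counter(nums_b)
    matches = sum(min(ca[k], cb[k]) for k in ca)
    return 2.0 * matches / (len(nums_a) + len(nums_b))
```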
</Section> <Section position="5" start_page="71" end_page="72" type="sub_section"> <SectionTitle> 5.5 Translation Residues </SectionTitle> <Paragraph position="0"> Despite the fact that non-Spanish speakers can often achieve success at aligning English documents with Spanish texts, the added knowledge of someone with an understanding of both Spanish and English is a benefit and should facilitate alignment. To evaluate the role of translation-based alignment scoring, the Collins Spanish-English and English-Spanish bilingual dictionaries were used to produce a score equal to the residue from a translation attempt of the terms in potential aligning blocks.</Paragraph> <Paragraph position="1"> Given a set of English terms, $e_i$, and Spanish terms, $s_j$, from two blocks, the translation operation, $T()$, generates a set of terms in the opposite language by stemming each term and retrieving the terms that the stemmed word translates to in Collins. The residue, $R$, is then a penalty equal to the normalized number of terms in each translation set that do not have a match in the opposite set: $R = \frac{|T(E) \setminus S| + |T(S) \setminus E|}{|T(E)| + |T(S)|}$, where $E$ and $S$ denote the English and Spanish term sets. In comparison tests, the distributions of scores between random Spanish and English blocks, and between the hand-aligned sets, were surprisingly similar, making a statistical discrimination of proper alignments difficult. We believe that dictionary-based discrimination performs poorly primarily due to the noisy nature of the dictionary we used. It was initially thought that the subsenses and usage patterns given for each term would aid discrimination by providing a stronger basis for matches between true parallel blocks. Instead, the added terms beyond the critical primary sense in the dictionary had high hit rates with usage terms throughout the dictionary. The result was a noisy translation set that robbed the residue measure of discriminatory power. These results discouraged us from including the $R$ measure in the error function for the dynamic programming system, although we suspect that improved dictionaries may ultimately provide better discrimination. It may also be possible to apply a kill list to the dictionary to reduce the number of high-frequency terms in each definition, increasing the relevancy of the overall residue measure.</Paragraph> </Section> </Section> <Section position="7" start_page="72" end_page="72" type="metho"> <SectionTitle> 6 Implementation </SectionTitle> <Paragraph position="0"> The fact that our formulation of the alignment probability for two blocks depends on both the positive and the negative information about the alignment means that the probability density functions can be used directly in the algorithm. Specifically, the distributions shown in Figures 3 and 4, as well as the distributions for the ordered string comparisons and number comparisons, were loaded into the algorithm as histograms. During the dynamic programming operation, probability scores were determined by direct look-up of the $\delta$ scores in the appropriate histogram, with some weighted averaging performed for values between the boundaries of the histogram bars for smoothing. This approach eliminated the need to estimate a distribution function for the rather non-Gaussian distributions that are assumed to underlie the experimental data. Using this approach, the byte-length ratios could be simplified by not assuming a Gaussian-like distribution and instead using the histograms of byte-length probabilities directly. For comparison, however, we chose to use the Gale and Church derivation without modifying $\delta_1$.</Paragraph> </Section>
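The histogram look-up might be realized along the following lines; this is a minimal sketch, with the bin representation and the linear interpolation between neighboring bins assumed rather than taken from the paper.

```python
import bisect

class ScoreHistogram:
    """Empirical distribution of one feature score, stored as a histogram.

    Probabilities are read off by direct look-up, with linear interpolation
    between neighboring bin centers standing in for the paper's weighted
    averaging between histogram bars.
    """

    def __init__(self, bin_centers, probabilities):
        self.centers = bin_centers   # sorted midpoints of the histogram bars
        self.probs = probabilities   # estimated P(score falls in each bar)

    def prob(self, score):
        i = bisect.bisect_left(self.centers, score)
        if i == 0:
            return self.probs[0]
        if i == len(self.centers):
            return self.probs[-1]
        # Weighted average of the two surrounding bars for smoothing.
        left, right = self.centers[i - 1], self.centers[i]
        w = (score - left) / (right - left)
        return (1.0 - w) * self.probs[i - 1] + w * self.probs[i]
```

</Paper>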