<?xml version="1.0" standalone="yes"?>
<Paper uid="P97-1039">
  <Title>A Portable Algorithm for Mapping Bitext Correspondence</Title>
  <Section position="4" start_page="0" end_page="305" type="metho">
    <SectionTitle>
2 Bitext Geometry
</SectionTitle>
    <Paragraph position="0"> A bitext (Harris, 1988) comprises two versions of a text, such as a text in two different languages.</Paragraph>
    <Paragraph position="1"> Translators create a bitext each time they translate a text. Each bitext defines a rectangular bitext space, as illustrated in Figure 1. The width and height of the rectangle are the lengths of the two component texts, in characters. The lower left corner of the rectangle is the origin of the bitext space and represents the two texts' beginnings. The upper right corner is the terminus and represents the texts' ends. The line between the origin and the  terminus is the main diagonal. The slope of the main diagonal is the bitext slope.</Paragraph>
    <Paragraph position="2"> Each bitext space contains a number of true points of correspondence (TPCs), other than the origin and the terminus. For example, if a token at position p on the x-axis and a token at position q on the y-axis are translations of each other, then the coordinate (p, q) in the bitext space is a TPC 2. TPCs also exist at corresponding boundaries of text units such as sentences, paragraphs, and chapters.</Paragraph>
    <Paragraph position="3"> Groups of TPCs with a roughly linear arrangement in the bitext space are called chains.</Paragraph>
    <Paragraph position="4"> Bitext maps are 1-to-1 functions in bitext spaces. A complete set of TPCs for a particular bitext is called a true bitext map (TBM). The purpose of a bitext mapping algorithm is to produce bitext maps that are the best possible approximations of each bitext's TBM.</Paragraph>
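The geometry above can be sketched in a few lines. This is an illustrative sketch, not code from the paper; the function names are mine, and token positions follow the paper's footnote defining a token's position as the mean position of its characters:

```python
def token_position(start, length):
    """Position of a token, defined as the mean position of its
    characters, which occupy offsets start .. start + length - 1."""
    return start + (length - 1) / 2.0

def bitext_slope(width, height):
    """Slope of the main diagonal of a bitext space whose width and
    height are the two component texts' lengths in characters."""
    return height / width
```

For example, a 4-character token starting at offset 10 sits at position 11.5, and a bitext whose halves are 200 and 100 characters long has slope 0.5.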
  </Section>
  <Section position="5" start_page="305" end_page="307" type="metho">
    <SectionTitle>
3 SIMR
</SectionTitle>
    <Paragraph position="0"> SIMR builds bitext maps one chain at a time. The search for each chain alternates between a generation phase and a recognition phase. The generation phase begins in a small rectangular region of the bitext space, whose diagonal is parallel to the main diagonal. Within this search rectangle, SIMR generates all the points of correspondence that satisfy the supplied matching predicate, as explained in Section 3.1. In the recognition phase, SIMR calls the chain recognition heuristic to find suitable chains among the generated points. If no suitable chains are found, the search rectangle is proportionally expanded and the generation-recognition cycle is repeated. (Since distances in the bitext space are measured in characters, the position of a token is defined as the mean position of its characters.)</Paragraph>
    <Paragraph position="1"> The rectangle keeps expanding until at least one acceptable chain is found. If more than one chain is found in the same cycle, SIMR accepts the one whose points are least dispersed around its least-squares line. Each time SIMR accepts a chain, it selects another region of the bitext space to search for the next chain.</Paragraph>
    <Paragraph position="2"> SIMR employs a simple heuristic to select regions of the bitext space to search. To a first approximation, TBMs are monotonically increasing functions. This means that if SIMR finds one chain, it should look for others either above and to the right or below and to the left of the one it has just found. All SIMR needs is a place to start the trace. A good place to start is at the beginning: Since the origin of the bitext space is always a TPC, the first search rectangle is anchored at the origin. Subsequent search rectangles are anchored at the top right corner of the previously found chain, as shown in Figure 2.</Paragraph>
    <Paragraph position="3"> [Figure 2: The expanding-rectangle search strategy. The search rectangle is anchored at the top right corner of the previously found chain. Its diagonal remains parallel to the main diagonal.]</Paragraph>
    <Paragraph position="4"> The expanding-rectangle search strategy makes SIMR robust in the face of TBM discontinuities.</Paragraph>
    <Paragraph position="5"> Figure 2 shows a segment of the TBM that contains a vertical gap (an omission in the text on the x-axis). As the search rectangle grows, it will eventually intersect with the TBM, even if the discontinuity is quite large (Melamed, 1996b). The noise filter described in Section 3.3 prevents SIMR from being led astray by false points of correspondence.</Paragraph>
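The generation-recognition cycle can be sketched as follows. This is a sketch under stated assumptions: `generate` stands in for the matching predicate of Section 3.1, `recognize` for the chain recognition heuristic of Section 3.2 (assumed to return (dispersal, chain) pairs), and the initial width and growth factor are illustrative values, not ones given in the paper:

```python
def find_next_chain(anchor, slope, generate, recognize,
                    start_width=16.0, growth=2.0):
    """Alternate generation and recognition in an expanding search
    rectangle until at least one acceptable chain is found."""
    width = start_width
    while True:
        # Search rectangle anchored at `anchor`, with its diagonal
        # kept parallel to the main diagonal of the bitext space.
        x0, y0 = anchor
        rect = (x0, y0, x0 + width, y0 + width * slope)
        chains = recognize(generate(rect))
        if chains:
            # Accept the chain whose points are least dispersed
            # around its least-squares line.
            return min(chains)[1]
        width *= growth  # proportional expansion
```

Because the rectangle expands until a chain is found, the loop tolerates large gaps in the TBM, matching the robustness argument above.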
    <Section position="1" start_page="305" end_page="306" type="sub_section">
      <SectionTitle>
3.1 Point Generation
</SectionTitle>
      <Paragraph position="0"> SIMR generates candidate points of correspondence in the search rectangle using one of its matching predicates. A matching predicate is a heuristic for deciding whether a given pair of tokens are likely to be mutual translations. The two kinds of information that a matching predicate can rely on most often are cognates and translation lexicons.</Paragraph>
      <Paragraph position="1"> Two tokens in a bitext are cognates if they have the same meaning and similar spellings. In the non-technical Canadian Hansards (parliamentary debate transcripts available in English and in French), cognates can be found for roughly one quarter of all text tokens (Melamed, 1995). Even distantly related languages like English and Czech will share a large number of cognates in the form of proper nouns.</Paragraph>
      <Paragraph position="2"> Cognates are more common in bitexts from more similar language pairs, and from text genres where more word borrowing occurs, such as technical texts. When dealing with language pairs that have dissimilar alphabets, the matching predicate can employ phonetic cognates (Melamed, 1996a). When one or both of the languages involved is written in pictographs, cognates can still be found among punctuation and digit strings. However, cognates of this last kind are usually too sparse to suffice by themselves. When the matching predicate cannot generate enough candidate correspondence points based on cognates, its signal can be strengthened by a translation lexicon. Translation lexicons can be extracted from machine-readable bilingual dictionaries (MRBDs), in the rare cases where MRBDs are available. In other cases, they can be constructed automatically or semi-automatically using any of several methods (Fung, 1995; Melamed, 1996c; Resnik &amp; Melamed, 1997). Since the matching predicate need not be perfectly accurate, the translation lexicons need not be either.</Paragraph>
      <Paragraph position="3"> Matching predicates can take advantage of other information besides cognates and translation lexicons. For example, a list of faux amis is a useful complement to a cognate matching strategy (Macklovitch, 1995). A stop list of function words is also helpful. Function words are translated inconsistently and make unreliable points of correspondence (Melamed, 1996a).</Paragraph>
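A matching predicate of the kind described here might be sketched as below. The cognate test uses the Longest Common Subsequence Ratio (Melamed, 1995), mentioned in Section 6; the 0.7 threshold and the exact combination of lexicon, stop list, and faux amis list are illustrative assumptions, not the paper's settings:

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of strings a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb
                       else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def lcsr(a, b):
    """Longest Common Subsequence Ratio (Melamed, 1995)."""
    return lcs_len(a, b) / max(len(a), len(b))

def match(tok_x, tok_y, lexicon=frozenset(), stop_words=frozenset(),
          faux_amis=frozenset(), threshold=0.7):
    """Sketch of a matching predicate: cognates via an LCSR threshold,
    backed by a translation lexicon, filtered by stop words and a list
    of faux amis."""
    if tok_x in stop_words or tok_y in stop_words:
        return False          # function words are unreliable
    if (tok_x, tok_y) in faux_amis:
        return False          # similar spelling, different meaning
    if (tok_x, tok_y) in lexicon:
        return True           # known mutual translations
    return lcsr(tok_x.lower(), tok_y.lower()) >= threshold
```

For instance, "gouvernement" and "government" share a subsequence of length 10, giving an LCSR of 10/12, which clears the illustrative threshold.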
    </Section>
    <Section position="2" start_page="306" end_page="306" type="sub_section">
      <SectionTitle>
3.2 Point Selection
</SectionTitle>
      <Paragraph position="0"> As illustrated in Figure 2, even short sequences of TPCs form characteristic patterns. Most chains of TPCs have the following properties:
* Linearity: TPCs tend to line up straight.
* Low Variance of Slope: The slope of a TPC chain is rarely much different from the bitext slope.</Paragraph>
      <Paragraph position="1"> * Injectivity: No two points in a chain of TPCs can have the same x- or y-co-ordinates.
SIMR's chain recognition heuristic exploits these properties to decide which chains in the search rectangle might be TPC chains.</Paragraph>
      <Paragraph position="2"> The heuristic involves three parameters: chain size, maximum point dispersal, and maximum angle deviation. A chain's size is simply the number of points it contains. The heuristic considers only chains of exactly the specified size whose points are injective. The linearity of these chains is tested by measuring the root mean squared distance of the chain's points from the chain's least-squares line. If this distance exceeds the maximum point dispersal threshold, the chain is rejected. Next, the angle of each chain's least-squares line is compared to the arctangent of the bitext slope. If the difference exceeds the maximum angle deviation threshold, the chain is rejected. These filters can be efficiently combined so that SIMR's expected running time and memory requirements are linear in the size of the input bitext (Melamed, 1996a).</Paragraph>
      <Paragraph position="3"> The chain recognition heuristic pays no attention to whether chains are monotonic. Non-monotonic TPC chains are quite common, because even languages with similar syntax like French and English have well-known differences in word order. For example, English (adjective, noun) pairs usually correspond to French (noun, adjective) pairs. Such inversions result in TPCs arranged like the middle two points in the &amp;quot;previous chain&amp;quot; of Figure 2. SIMR has no problem accepting the inverted points.</Paragraph>
      <Paragraph position="4"> If the order of words in a certain text passage is radically altered during translation, SIMR will simply ignore the words that &amp;quot;move too much&amp;quot; and construct chains out of those that remain more stationary. The maximum point dispersal parameter limits the width of accepted chains, but nothing limits their length. In practice, the chain recognition heuristic often accepts chains that span several sentences. The ability to analyze non-monotonic points of correspondence over variable-size areas of bitext space makes SIMR robust enough to use on translations that are not very literal.</Paragraph>
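The three tests of the chain recognition heuristic — injectivity, linearity, and angle deviation — can be sketched as below. This is a sketch, not the paper's implementation; the thresholds are supplied by the caller, since the paper tunes them by simulated annealing (Section 4.3):

```python
import math

def least_squares_line(points):
    """Slope and intercept of the least-squares line through points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    slope = sxy / sxx
    return slope, my - slope * mx

def accept_chain(points, bitext_slope, max_dispersal, max_angle_deg):
    """Apply the injectivity, linearity, and angle-deviation tests."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    if len(set(xs)) < len(points) or len(set(ys)) < len(points):
        return False  # not injective: repeated x- or y-co-ordinate
    slope, intercept = least_squares_line(points)
    # RMS distance of the points from the least-squares line,
    # measured perpendicular to that line.
    rms = math.sqrt(sum((y - (slope * x + intercept)) ** 2
                        for x, y in points) / len(points))
    rms /= math.hypot(1.0, slope)
    if rms > max_dispersal:
        return False  # points too dispersed: fails linearity
    angle = abs(math.atan(slope) - math.atan(bitext_slope))
    return math.degrees(angle) <= max_angle_deg
```

Note that monotonicity is deliberately not tested, so locally inverted point pairs (as in adjective-noun reorderings) pass through unchanged.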
    </Section>
    <Section position="3" start_page="306" end_page="307" type="sub_section">
      <SectionTitle>
3.3 Noise Filter
</SectionTitle>
      <Paragraph position="0"> Points of correspondence among frequent token types often line up in rows and columns, as illustrated in Figure 3. Token types like the English article &amp;quot;a&amp;quot; can produce one or more correspondence points for almost every sentence in the opposite text.</Paragraph>
      <Paragraph position="1"> Only one point of correspondence in each row and column can be correct; the rest are noise. A noise filter can make it easier for SIMR to find TPC chains.</Paragraph>
      <Paragraph position="2"> Other bitext mapping algorithms mitigate this source of noise either by assigning lower weights to</Paragraph>
      <Paragraph position="3"> correspondence points associated with frequent token types (Church, 1993) or by deleting frequent token types from the bitext altogether (Dagan et al., 1993). However, a token type that is relatively frequent overall can be rare in some parts of the text. In those parts, the token type can provide valuable clues to correspondence. On the other hand, many tokens of a relatively rare type can be concentrated in a short segment of the text, resulting in many false correspondence points. The varying concentration of identical tokens suggests that more localized noise filters would be more effective. SIMR's localized search strategy provides a vehicle for a localized noise filter.</Paragraph>
      <Paragraph position="4"> The filter is based on the maximum point ambiguity level parameter. For each point p = (x, y), let X be the number of points in column x within the search rectangle, and let Y be the number of points in row y within the search rectangle. Then the ambiguity level of p is X + Y - 2. In particular, if p is the only point in its row and column, then its ambiguity level is zero. The chain recognition heuristic ignores points whose ambiguity level is too high. What makes this a localized filter is that only points within the search rectangle count toward each other's ambiguity level. The ambiguity level of a given point can change when the search rectangle expands or moves.</Paragraph>
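The ambiguity-level computation is simple enough to sketch directly; the function name is mine, and `points` are assumed to be the candidate points inside the current search rectangle:

```python
from collections import Counter

def filter_ambiguous(points, max_ambiguity):
    """Drop points whose ambiguity level X + Y - 2 exceeds the maximum,
    where X and Y count points sharing the column and row of the point
    within the current search rectangle."""
    cols = Counter(x for x, _ in points)
    rows = Counter(y for _, y in points)
    return [(x, y) for x, y in points
            if cols[x] + rows[y] - 2 <= max_ambiguity]
```

A point alone in its row and column has ambiguity level zero and always survives; points in crowded rows or columns, like those generated by the article "a", are discarded.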
      <Paragraph position="5"> The noise filter ensures that false points of correspondence are very sparse, as illustrated in Figure 4. Even if one chain of false points of correspondence slips by the chain recognition heuristic, the expanding rectangle will find its way back to the TBM before the chain recognition heuristic accepts another</Paragraph>
      [Figure 4: True points of correspondence are much more dense than false points of correspondence. A good signal-to-noise ratio prevents SIMR from getting lost.]
      <Paragraph position="6"> chain. If the matching predicate generates a reasonably strong signal then the signal-to-noise ratio will be high and SIMR will not get lost, even though it is a greedy algorithm with no ability to look ahead.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="307" end_page="308" type="metho">
    <SectionTitle>
4 Porting to New Language Pairs
</SectionTitle>
    <Paragraph position="0"> SIMR can be ported to a new language pair in three steps.</Paragraph>
    <Section position="1" start_page="307" end_page="307" type="sub_section">
      <SectionTitle>
4.1 Step 1: Construct Matching Predicate
</SectionTitle>
      <Paragraph position="0"> The original SIMR implementation for French/English included matching predicates that could use cognates and/or translation lexicons. For language pairs in which lexical cognates are frequent, a cognate-based matching predicate should suffice.</Paragraph>
      <Paragraph position="1"> In other cases, a &amp;quot;seed&amp;quot; translation lexicon may be used to boost the number of candidate points produced in the generation phase of the search. The SIMR implementation for Spanish/English uses only cognates. For Korean/English, SIMR takes advantage of punctuation and number cognates but supplements them with a small translation lexicon.</Paragraph>
    </Section>
    <Section position="2" start_page="307" end_page="308" type="sub_section">
      <SectionTitle>
4.2 Step 2: Construct Axis Generators
</SectionTitle>
      <Paragraph position="0"> In order for SIMR to generate candidate points of correspondence, it needs to know what token pairs correspond to co-ordinates in the search rectangle.</Paragraph>
      <Paragraph position="1"> It is the axis generator's job to map the two halves of the bitext to positions on the x- and y-axes of the bitext space, before SIMR starts searching for chains. This mapping should be done with the matching predicate in mind.</Paragraph>
      <Paragraph position="2"> If the matching predicate uses cognates, then every word that might have a cognate in the other half of the bitext should be assigned its own axis position. This rule applies to punctuation and numbers as well as to &quot;lexical&quot; cognates. In the case of lexical cognates, the axis generator typically needs to invoke a language-specific tokenization program to identify words in the text. Writing such a program may constitute a significant part of the porting effort, if no such program is available in advance. The effort may be lessened, however, by the realization that it is acceptable for the tokenization program to overgenerate, just as it is acceptable for the matching predicate. For example, when tokenizing German text, it is not necessary for the tokenizer to know which words are compounds. A word that has another word as a substring should result in one axis position for the substring and one for the superstring. When lexical cognates are not being used, the axis generator only needs to identify punctuation, numbers, and those character strings in the text which also appear on the relevant side of the translation lexicon. It would be pointless to plot other words on the axes because the matching predicate could never match them anyway. Therefore, for languages like Chinese and Japanese, which are written without spaces between words, tokenization boils down to string matching. In this manner, SIMR circumvents the difficult problem of word identification in these languages.</Paragraph>
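For the string-matching case, an axis generator might be sketched as follows. This is an illustrative sketch: `vocabulary` is assumed to hold the relevant side of the translation lexicon plus punctuation and digit strings, and each match is plotted at the mean position of its characters, per the footnote in Section 3:

```python
def axis_positions(text, vocabulary):
    """Map every occurrence of a vocabulary string in `text` to an axis
    position.  Overlapping matches are all kept, since overgeneration
    is acceptable for the matching predicate."""
    positions = []
    for term in vocabulary:
        start = text.find(term)
        while start != -1:
            # Mean character position of the matched token.
            positions.append((term, start + (len(term) - 1) / 2.0))
            start = text.find(term, start + 1)
    return sorted(positions, key=lambda p: p[1])
```

Because every occurrence is plotted, a substring and its superstring each get their own axis position, as the paragraph above requires.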
    </Section>
    <Section position="3" start_page="308" end_page="308" type="sub_section">
      <SectionTitle>
4.3 Step 3: Re-optimize Parameters
</SectionTitle>
      <Paragraph position="0"> The last step in the porting process is to re-optimize SIMR's numerical parameters. The four parameters described in Section 3 interact in complicated ways, and it is impossible to find a good parameter set analytically. It is easier to optimize these parameters empirically, using simulated annealing (Vidal, 1993).</Paragraph>
      <Paragraph position="1"> Simulated annealing requires an objective function to optimize. The objective function for bitext mapping should measure the difference between the TBM and maps produced with the current parameter set. In geometric terms, the difference is a distance. The TBM consists of a set of TPCs. The error between a bitext map and each TPC can be defined as the horizontal distance, the vertical distance, or the distance perpendicular to the main diagonal. The first two alternatives would minimize the error with respect to only one language or the other. The perpendicular distance is a more robust average. In order to penalize large errors more heavily, root mean squared (RMS) distance is minimized instead of mean distance.</Paragraph>
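The objective function described above can be sketched as below. The form of `bitext_map` (any function from x to y, e.g. an interpolated map) is an assumption of this sketch, as is the conversion of the vertical error to its component perpendicular to the main diagonal:

```python
import math

def rms_perpendicular_error(tpcs, bitext_map, slope):
    """RMS distance, measured perpendicular to the main diagonal,
    between each true point of correspondence and the bitext map."""
    errs = []
    for x, y in tpcs:
        vertical = bitext_map(x) - y
        # Component of the vertical error perpendicular to a main
        # diagonal of the given slope.
        errs.append(vertical / math.hypot(1.0, slope))
    # RMS rather than mean distance, to penalize large errors heavily.
    return math.sqrt(sum(e * e for e in errs) / len(errs))
```

Minimizing the horizontal or vertical distance instead would privilege one language; the perpendicular distance treats both halves of the bitext symmetrically.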
      <Paragraph position="2"> (Multi-word expressions in the translation lexicon are treated just like any other character string.)</Paragraph>
      <Paragraph position="3"> The most tedious part of the porting process is the construction of TBMs against which SIMR's parameters can be optimized and tested. The easiest way to construct these gold standards is to extract them from pairs of hand-aligned text segments: The final character positions of each segment in an aligned pair are the co-ordinates of a TPC. Over the course of two porting efforts, I have developed and refined tools and methods that allow a bilingual annotator to construct the required TBMs very efficiently from a raw bitext. For example, a tool originally designed for automatic detection of omissions in translations (Melamed, 1996b) was adapted to detect misalignments.</Paragraph>
    </Section>
    <Section position="4" start_page="308" end_page="308" type="sub_section">
      <SectionTitle>
4.4 Porting Experience Summary
</SectionTitle>
      <Paragraph position="0"> Table 1 summarizes the amount of time invested in each new language pair. The estimated times for building axis generators do not include the time spent to build the English axis generator, which was part of the original implementation. Axis generators need to be built only once per language, rather than once per language pair.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="308" end_page="309" type="metho">
    <SectionTitle>
5 Evaluation
</SectionTitle>
    <Paragraph position="0"> SIMR was evaluated on hand-aligned bitexts of various genres in three language pairs. None of these test bitexts were used anywhere in the training or porting procedures. Each test bitext was converted to a set of TPCs by noting the pair of character positions at the end of each aligned pair of text segments. The test metric was the root mean squared distance, in characters, between each TPC and the interpolated bitext map produced by SIMR, where the distance was measured perpendicular to the main diagonal.</Paragraph>
    <Paragraph position="1"> The results are presented in Table 2.</Paragraph>
    <Paragraph position="2"> The French/English part of the evaluation was performed on bitexts from the publicly available BAF corpus created at CITI (Simard &amp; Plamondon, 1996). SIMR's error distribution on the &quot;parliamentary debates&quot; bitext in this collection is given in the accompanying table; comparable error distributions are reported in (Church, 1993) and in (Dagan et al., 1993). SIMR's RMS error on this bitext was 5.7 characters. Church's char_align algorithm (Church, 1993) is the only algorithm that does not use sentence boundary information for which comparable results have been reported; char_align's RMS error on this bitext was 57 characters, exactly ten times higher.</Paragraph>
    <Paragraph position="3"> Two teams of researchers have reported results on the same &amp;quot;parliamentary debates&amp;quot; bitext for algorithms that map correspondence at the sentence level (Gale &amp; Church, 1991a; Simard et al., 1992).</Paragraph>
    [Table: SIMR's error distribution on the French/English &quot;parliamentary debates&quot; bitext — error range in characters, number of test points, fraction of test points.]
    <Paragraph position="4"> Both of these algorithms use sentence boundary information. Melamed (1996a) showed that sentence boundary information can be used to convert SIMR's output into sentence alignments that are more accurate than those obtained by either of the other two approaches.</Paragraph>
    <Paragraph position="5"> The test bitexts in the other two language pairs were created when SIMR was being ported to those languages. The Spanish/English bitexts were drawn from the on-line Sun MicroSystems Solaris AnswerBooks. The Korean/English bitexts were provided and hand-aligned by Young-Suk Lee of MIT's Lincoln Laboratories. Although it is not possible to compare SIMR's performance on these language pairs to the performance of other algorithms, Table 2 shows that the performance on other language pairs is no worse than performance on French/English.</Paragraph>
  </Section>
  <Section position="8" start_page="309" end_page="310" type="metho">
    <SectionTitle>
6 Which Text Units to Map?
</SectionTitle>
    <Paragraph position="0"> Early bitext mapping algorithms focused on sentences (Kay &amp; Röscheisen, 1993; Debili &amp; Sammouda, 1992). Although sentence maps do not have sufficient resolution for some important bitext applications (Melamed, 1996b; Macklovitch, 1995), sentences were an easy starting point, because their order rarely changes during translation. Therefore, sentence mapping algorithms need not worry about crossing correspondences. In 1991, two teams of researchers independently discovered that sentences can be accurately aligned by matching sequences with similar lengths (Gale &amp; Church, 1991a; Brown et al., 1991).</Paragraph>
    <Paragraph position="1"> Soon thereafter, Church (1993) found that bitext mapping at the sentence level is not an option for noisy bitexts found in the real world. Sentences are often difficult to detect, especially where punctuation is missing due to OCR errors. More importantly, bitexts often contain lists, tables, titles, footnotes, citations and/or mark-up codes that foil sentence alignment methods. Church's solution was to look at the smallest of text units -- characters -- and to use digital signal processing techniques to grapple with the much larger number of text units that might match between the two halves of a bitext. Characters match across languages only to the extent that they participate in cognates. Thus, Church's method is only applicable to language pairs with similar alphabets.</Paragraph>
    <Paragraph position="2"> The main insight of the present work is that words are a happy medium-sized text unit at which to map bitext correspondence. By situating word positions in a bitext space, the geometric heuristics of sentence alignment algorithms can be exploited equally well at the word level. The cognate heuristic of the character-based algorithms works better at the word level, because cognateness can be defined more precisely in terms of words, e.g. using the Longest Common Subsequence Ratio (Melamed, 1995). Several other matching heuristics can only be applied at the word level, including the localized noise filter in Section 3.3, lists of stop words and lists of faux amis (Macklovitch, 1995). Most importantly, translation lexicons can only be used at the word level. SIMR can employ a small hand-constructed translation lexicon to map bitexts in any pair of languages, even when the cognate heuristic is not applicable and sentences cannot be found. The particular combination of heuristics described in Section 3 can certainly be improved on, but research into better bitext mapping algorithms is likely to be most fruitful at the word level.</Paragraph>
  </Section>
</Paper>