<?xml version="1.0" standalone="yes"?> <Paper uid="W96-0201"> <Title>A Geometric Approach to Mapping Bitext Correspondence</Title> <Section position="7" start_page="109000" end_page="109000" type="evalu"> <SectionTitle> 4. Alignment </SectionTitle> <Paragraph position="0"> SIMR has no idea that words are often used to make sentences. It just outputs a series of corresponding token positions, leaving users free to draw their own conclusions about how the texts' larger units correspond. However, many existing translators' tools and machine translation strategies are based on aligned sentences. What can SIMR do for them? There are several papers in the literature about bitext alignment. The algorithms that seem to work best rely on the high correlation between the lengths of corresponding sentences (Brown et al. 1991, Gale &amp; Church 1991). However, these algorithms can fumble in bitext sections that contain many sentences of very similar length, such as a vote record. The only way to ensure a correct alignment in such regions is to look at the words. For this reason, Chen (1993) adds a statistical translation model to the Brown et al. alignment algorithm, and Wu (1994) adds a translation lexicon to the Gale &amp; Church alignment algorithm.</Paragraph> <Paragraph position="1"> A set of points of correspondence leads to alignment more directly than a translation model or a translation lexicon, because points of correspondence are a relation between token instances, not between token types. Moreover, a set of correspondence points, supplemented with sentence boundary information, expresses sentence correspondence, which is a richer representation than sentence alignment. Figure 9 illustrates how sentence boundaries form a grid over the bitext space.3 Each cell in the grid represents the intersection of two sentences, one from each component text. 
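As a concrete sketch of the grid construction just described (my illustration, not code from the paper), token-level correspondence points can be mapped to sentence-pair cells with a binary search over sentence boundaries; the token positions and boundary lists below are hypothetical:

```python
import bisect

def cells_from_points(points, x_bounds, y_bounds):
    """Map token-level correspondence points to sentence-pair cells.

    points: (x, y) token positions output by a SIMR-style bitext mapper.
    x_bounds / y_bounds: sorted sentence-final token positions in the two
    component texts, so sentence i ends at token x_bounds[i].
    Returns the set of cells (i, j), meaning sentence i corresponds
    with sentence j.
    """
    cells = set()
    for x, y in points:
        i = bisect.bisect_right(x_bounds, x)  # sentence containing token x
        j = bisect.bisect_right(y_bounds, y)  # sentence containing token y
        cells.add((i, j))
    return cells

# Toy example: three sentences per text, one point per diagonal cell,
# plus one off-diagonal point, so sentence 0 corresponds with 0 and 1.
points = [(2, 3), (7, 9), (12, 14), (3, 8)]
x_bounds = [5, 10, 15]   # hypothetical sentence-final token positions
y_bounds = [6, 12, 18]
print(sorted(cells_from_points(points, x_bounds, y_bounds)))
```

A point landing anywhere inside a cell marks the whole sentence pair as corresponding, which is all the alignment step below needs.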
A point of correspondence inside cell (X,y) indicates that some token in sentence X corresponds with some token in sentence y; i.e. sentences X and y correspond. Thus, Figure 9 indicates that sentence e corresponds with sentences G and H.</Paragraph> <Paragraph position="2"> In contrast to a correspondence relation, &quot;an alignment is a segmentation of the two texts such that the nth segment of one text is the translation of the nth segment of the other.&quot; (Simard et al. 1992) For example, given the token correspondences in Figure 9, the segment (G, H) should be aligned with the segment (e, f). If sentences (X1,...,Xn) align with sentences (y1,...,yn), then ((X1,...,Xn),(y1,...,yn)) is an aligned block. In geometric terms, aligned blocks are rectangular regions of the bitext space, such that the sides of the rectangles coincide with sentence boundaries, and such that no two rectangles overlap either vertically or horizontally. The aligned blocks in Figure 9 are outlined with solid lines.</Paragraph> <Paragraph position="3"> SIMR's initial output has more expressive power than the alignment that can be derived from it. One illustration of this difference is that sentence correspondence can express inversions, but sentence alignment cannot. Inversions occur surprisingly often in real bitexts, even for sentence-size text units. Figure 9 provides another illustration. If, instead of the point in cell (H,e), there were a point in cell (G,f), the correct alignment for that region would still be ((G, H), (e, f)). If there were points of correspondence in both (H,e) and (G,f), the correct alignment would still be the same. Yet, the three cases are clearly different. 
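The geometric definition of an alignment can be checked mechanically. The following sketch (mine, not the paper's) tests whether a set of sentence-index rectangles satisfies the no-overlap property, and shows why an inversion cannot be expressed as an alignment:

```python
def is_valid_alignment(blocks):
    """Check the geometric definition of alignment: aligned blocks are
    rectangles whose row ranges and column ranges never overlap, so the
    nth segment on one side pairs with the nth segment on the other.

    blocks: list of (x_lo, x_hi, y_lo, y_hi) sentence-index rectangles.
    """
    blocks = sorted(blocks)
    for a, b in zip(blocks, blocks[1:]):
        # After sorting, each block must end strictly before the next
        # begins, on both axes; an inversion or overlap fails here.
        if b[0] > a[1] and b[2] > a[3]:
            continue
        return False
    return True

# Two well-ordered blocks form a valid alignment...
print(is_valid_alignment([(0, 1, 0, 1), (2, 2, 2, 3)]))   # valid
# ...but an inversion (sentence 0 pairs with 1, sentence 1 with 0) cannot
# be expressed as an alignment, only as a correspondence relation.
print(is_valid_alignment([(0, 0, 1, 1), (1, 1, 0, 0)]))   # invalid
```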
If a lexicographer wanted to see a word in sentence G in its bilingual context, it would be useful to know whether sentence f is relevant.</Paragraph> <Paragraph position="4"> Converting from sentence correspondence to sentence alignment is of dubious practical value.</Paragraph> <Paragraph position="5"> Nevertheless, in order to facilitate comparison of the geometric approach with other alignment algorithms, I have designed the Geometric Sentence Alignment (GSA) algorithm to reduce sets of correspondence points to alignments.</Paragraph> <Paragraph position="6"> 3 The techniques presented in this section can be applied equally well to paragraphs, lists of items, or any other text units for which boundary information is available.</Paragraph> <Paragraph position="7"> [Figure caption: Each cell in the grid represents the intersection of two sentences, one from each component text. A point of correspondence inside cell (X, y) indicates that some token in sentence X corresponds with some token in sentence y; i.e. the sentences X and y correspond. So, for example, sentence E corresponds with sentence d. The aligned blocks are outlined with solid lines.]</Paragraph> <Paragraph position="9"> The algorithm's first step is to perform a transitive closure over the input correspondence relation. For instance, if the input contains (G,e), (H,e), and (H,f), then GSA adds the pairing (G,f). Next, GSA forces all segments to be contiguous: if sentence Y corresponds with sentences x and z, but not y, the pairing (Y,y) is added. In geometric terms, these two operations arrange all cells that contain points of correspondence into non-overlapping rectangles, while adding as few cells as possible. The result is an alignment relation.</Paragraph> <Paragraph position="10"> A complete set of TPCs, together with appropriate boundary information, guarantees a perfect alignment. Alas, the points of correspondence postulated by SIMR are neither complete nor noise-free. 
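The two operations, transitive closure and gap-filling, can be sketched together: sentences linked through a shared partner fall into one connected component, each component's bounding rectangle fills any gaps, and rectangles that still overlap in rows or columns are merged. This is a minimal illustration of the geometry, not Melamed's implementation:

```python
def blocks_from_cells(cells):
    """Arrange cells that contain points of correspondence into
    non-overlapping rectangles (aligned blocks), in the spirit of GSA's
    transitive-closure and contiguity steps.  Returns rectangles as
    (x_lo, x_hi, y_lo, y_hi) sentence-index tuples."""
    cells = list(cells)
    parent = list(range(len(cells)))  # union-find over cells

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Cells that share a row or a column belong to the same block.
    for i in range(len(cells)):
        for j in range(i):
            if cells[i][0] == cells[j][0] or cells[i][1] == cells[j][1]:
                parent[find(i)] = find(j)

    groups = {}
    for i, (x, y) in enumerate(cells):
        groups.setdefault(find(i), []).append((x, y))

    # Bounding rectangle per component; this fills any interior gaps.
    rects = [(min(x for x, _ in g), max(x for x, _ in g),
              min(y for _, y in g), max(y for _, y in g))
             for g in groups.values()]

    # Merge rectangles whose row or column ranges overlap, until none do.
    merged = True
    while merged:
        merged = False
        for i in range(len(rects)):
            for j in range(i):
                a, b = rects[i], rects[j]
                x_disjoint = a[0] > b[1] or b[0] > a[1]
                y_disjoint = a[2] > b[3] or b[2] > a[3]
                if not (x_disjoint and y_disjoint):
                    rects[j] = (min(a[0], b[0]), max(a[1], b[1]),
                                min(a[2], b[2]), max(a[3], b[3]))
                    rects.pop(i)
                    merged = True
                    break
            if merged:
                break
    return sorted(rects)

# The paper's example, with G=0, H=1, e=0, f=1: the pairing (G,f) is
# implied, so the three cells collapse into one 2x2 aligned block.
print(blocks_from_cells([(0, 0), (1, 0), (1, 1)]))
```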
Fortunately, the noise in SIMR's output causes alignment errors in very predictable ways.</Paragraph> <Paragraph position="11"> GSA employs a couple of backing-off heuristics to eliminate most of the errors.</Paragraph> <Paragraph position="12"> SIMR makes errors of omission and errors of commission. Typical errors of commission are stray points of correspondence like the one in cell (H, e) in Figure 9. This point indicates that (G, H) and (e, f) should form a 2x2 aligned block, whereas the lengths of the component sentences suggest that a pair of 1x1 blocks is more likely. In a separate development bitext, I have found that SIMR is usually wrong in these cases. To combat such errors, GSA re-aligns any aligned block that is not 1x1, using the Gale &amp; Church length-based alignment algorithm (Gale &amp; Church 1991, Simard 1995). Whenever the component sentence lengths suggest a more fine-grained alignment, SIMR's output is not trusted.</Paragraph> <Paragraph position="13"> Typical errors of omission are illustrated in Figure 9 by the complete absence of correspondence points between sentences (B,C,D) and (b, c). This block of sentences is sandwiched between aligned blocks. It is highly likely that at least some of these sentences are mutual translations, despite SIMR's failure to find any points of correspondence between them.</Paragraph> <Paragraph position="14"> Therefore, GSA treats all empty blocks just like aligned blocks. If an empty block is not 1x1, GSA re-aligns it using a length-based algorithm, just like it would re-align any other many-to-many aligned block.</Paragraph> <Paragraph position="15"> The most difficult problem occurs when an error of omission occurs next to an error of commission, as in blocks ((), (h)) and ((J, K), (i)). 
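The length-based re-alignment step can be sketched as a small dynamic program. Gale and Church use a probabilistic cost derived from a length-difference model; the sketch below (mine, with hypothetical bead penalties) substitutes a simple absolute-difference cost, which is enough to illustrate how a non-1x1 block gets re-aligned:

```python
def length_align(xs, ys):
    """Re-align the sentences inside a non-1x1 block by length alone,
    in the spirit of Gale and Church (1991).  xs, ys: sentence lengths
    (e.g. in characters) in each half of the block.  Returns the bead
    sequence, e.g. [(1, 1), (2, 1)] for a 1-1 bead then a 2-1 bead.
    The bead penalties below are illustrative, not the published model."""
    INF = float("inf")
    beads = {(1, 1): 0.0, (1, 0): 3.0, (0, 1): 3.0,
             (2, 1): 2.0, (1, 2): 2.0, (2, 2): 2.0}
    n, m = len(xs), len(ys)
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            for (dx, dy), pen in beads.items():
                pi, pj = i - dx, j - dy
                if pi >= 0 and pj >= 0 and cost[pi][pj] != INF:
                    # Cost of a bead: mismatch in total length plus the
                    # bead-type penalty (1-1 beads are cheapest).
                    c = cost[pi][pj] + abs(sum(xs[pi:i]) - sum(ys[pj:j])) + pen
                    if cost[i][j] > c:
                        cost[i][j] = c
                        back[i][j] = (dx, dy)
    # Recover the bead sequence by backtracing.
    out, i, j = [], n, m
    while i or j:
        dx, dy = back[i][j]
        out.append((dx, dy))
        i, j = i - dx, j - dy
    return out[::-1]

# Equal lengths pair off 1-1; a long sentence absorbs two short ones 2-1.
print(length_align([10, 30], [10, 30]))
print(length_align([10, 12], [22]))
```

In GSA's scheme, a block that is already 1x1 is kept as-is; only larger (or empty) blocks are handed to a routine like this one.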
If the point in cell (J,i) should really be in cell (J,h), re-alignment inside the erroneous blocks would not solve the problem. A naive solution is to merge these blocks and then to re-align them using a length-based method. Unfortunately, this kind of alignment pattern, i.e. 0x1 followed by 2x1, is surprisingly often correct. Length-based methods assign very low probabilities to such pattern sequences and usually get them wrong. Therefore, GSA also considers the confidence level with which the length-based alignment algorithm reports its re-alignment. If this confidence level is sufficiently high, GSA accepts the length-based re-alignment; otherwise, the alignment indicated by SIMR's points of correspondence is retained.</Paragraph> <Paragraph position="16"> The minimum confidence at which GSA trusts the length-based re-alignment is a GSA parameter, which has been optimized on a separate development bitext.</Paragraph> <Paragraph position="17"> Due to the paucity of development resources at my disposal, GSA's backing-off heuristics are somewhat ad hoc. Even so, GSA performs at least as well as other alignment algorithms, and usually better. Table 2 compares GSA's accuracy on the &quot;easy&quot; and &quot;hard&quot; reference bitexts with the accuracy of two other alignment algorithms, as reported by Simard et al. (1992). The error metric counts one error for each aligned block in the reference alignment that is missing from the test alignment. The hard constraints correspond to paragraph boundaries.</Paragraph> <Paragraph position="18"> More important than GSA's current performance is GSA's potential performance. With a bigger development bitext, more effective backing-off heuristics can be developed. 
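The back-off decision itself reduces to a threshold test. A minimal sketch, where the confidence score and the threshold value are hypothetical stand-ins for whatever the length-based aligner reports and for the tuned GSA parameter:

```python
# Hypothetical value; the paper tunes this parameter on a development bitext.
MIN_CONFIDENCE = 0.9

def choose_alignment(simr_blocks, length_based_blocks, confidence,
                     threshold=MIN_CONFIDENCE):
    """Back-off rule from the text: trust the length-based re-alignment
    only when it is reported with sufficient confidence; otherwise keep
    the alignment indicated by SIMR's points of correspondence."""
    if confidence >= threshold:
        return length_based_blocks
    return simr_blocks
```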
More precise input would also make a big difference: GSA's performance will improve whenever SIMR's performance improves.</Paragraph> <Paragraph position="19"> Although GSA sometimes backs off to a quadratic-time alignment algorithm, in practice its running time is linear in the number of input sentences. The points of correspondence in SIMR's output are sufficiently dense and precise that GSA backs off only for very small aligned blocks. When the translation lexicon was used in SIMR's matching predicate, the largest aligned block that needed to be re-aligned in the &quot;easy&quot; and &quot;hard&quot; test bitexts was 5x5. Without the translation lexicon, the largest re-aligned block was 7x7. So, GSA's running time is O(kn), where n is the number of input sentences and k is a small constant proportional to the size of the largest re-aligned block.</Paragraph> <Paragraph position="20"> Admittedly, GSA is only useful when a good bitext map is available. In such cases, there are three reasons to favor GSA over other options for alignment: One, it is simply more accurate.</Paragraph> <Paragraph position="21"> Two, its running time is linear in the number of sentences, faster than dynamic programming methods. Therefore, three, it is not necessary to manually segment the component texts into smaller units before they are input to GSA. GSA works almost as well without such &quot;hard constraints.&quot; Hard constraints are necessary for alignment algorithms that use dynamic programming, in order to maintain an acceptable running time on longer bitexts (Gale &amp; Church 1991, Simard et al. 1992).</Paragraph> <Paragraph position="22"> SIMR produced bitext maps for 200 megabytes of the Canadian Hansards. GSA converted these maps into alignments. The Linguistic Data Consortium plans to publish both the maps and the alignments in the near future.</Paragraph> </Section> class="xml-element"></Paper>