<?xml version="1.0" standalone="yes"?>
<Paper uid="J93-1006">
  <Title>Text-Translation Alignment</Title>
  <Section position="7" start_page="138" end_page="140" type="relat">
    <SectionTitle>
5. Related Work
</SectionTitle>
    <Paragraph position="0"> Since we addressed the text-translation alignment problem in 1988, a number of researchers, among them Gale and Church (1991) and Brown, Lai, and Mercer (1991), have worked on it. Both of their methods are based on the observation that the length of a text unit is highly correlated with the length of its translation, whether length is measured in number of words or in number of characters (see Figure 6). [Figure 6: paragraphs. Left: length measured in words. Right: length measured in characters.] Consequently, both methods are easier to implement than ours, though not necessarily more efficient. The method of Brown, Lai, and Mercer (1991) is based on a hidden Markov model for the generation of aligned pairs of corpora, whose parameters are estimated from a large text. Good results are reported for an application of this method to the Canadian Hansard. However, the problem was also considerably facilitated by the way the implementation made use of Hansard-specific comments</Paragraph>
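The length correlation that these methods exploit can be illustrated with a short sketch. This is not code from any of the cited papers; the paragraph lengths below are invented placeholders, chosen to mimic the roughly 1.2:1 German-to-English character ratio discussed later in this section.

```python
# Illustrative sketch: the length correlation exploited by
# length-based alignment methods. The paragraph lengths below are
# invented placeholders, not data from the paper.
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical character lengths of aligned paragraph pairs.
english = [120, 340, 215, 480, 95]
german = [150, 410, 260, 590, 118]
r = pearson(english, german)  # close to 1 for near-proportional lengths
```

A correlation near 1 is what makes length alone a usable alignment signal; the 0.952 figure reported below for the Scientific American texts is notably weaker.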
    <Paragraph position="1"> and annotations: these are used in a preprocessing step to find anchors for sentence alignment such that, on average, there are only ten sentences in between. Moreover, this particular corpus is well known for the near literalness of its translations, and it is therefore unclear to what extent the good results are due to the relative ease of the problem. This would be an important consideration when comparing various algorithms; when the algorithms are actually applied, it is clearly very desirable to incorporate as much prior knowledge (say, on potential anchors) as possible. Moreover, long texts can almost always be expected to contain natural anchors, such as chapter and section headings, at which to make an a priori segmentation.</Paragraph>
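The a priori segmentation at natural anchors mentioned above can be sketched as follows. This is a hypothetical illustration, not the Hansard preprocessing step; the anchor test and the sentences are invented.

```python
# Hypothetical sketch: pre-segmenting a text at "natural anchors"
# (e.g. chapter or section headings) so that alignment only has to
# run on short spans between anchors.
def segment_at_anchors(sentences, is_anchor):
    """Split a sentence list into spans, each starting at an anchor."""
    spans, current = [], []
    for s in sentences:
        if is_anchor(s) and current:
            spans.append(current)
            current = []
        current.append(s)
    if current:
        spans.append(current)
    return spans

text = ["1. Introduction", "First sentence.", "Second sentence.",
        "2. Methods", "Third sentence."]
# Invented anchor test: a sentence beginning with a digit is a heading.
spans = segment_at_anchors(text, lambda s: s[0].isdigit())
```

Each resulting span can then be aligned independently, which is what keeps the number of candidate pairs small when anchors are only a few sentences apart.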
    <Paragraph position="2"> Gale and Church (1991) note that their method performed considerably better when sentence lengths were measured in number of characters instead of number of words. Their method is based on a probabilistic model of the distance between two sentences, and a dynamic programming algorithm is used to minimize the total distance between aligned units. Their implementation assumes that each character in one language gives rise to, on average, one character in the other language.[8] In our texts, one character in English gives rise, on average, to somewhat more than 1.2 characters in German, and the correlation between the lengths (in characters) of aligned paragraphs in the two languages was, at 0.952, lower than the 0.991 mentioned in Gale and Church (1991). This supports our impression that the Scientific American texts we used are hard texts to align, but it is not clear to what extent this would deteriorate the results. In applications to economic reports from the Union Bank of Switzerland, the method performs very well on simple alignments (one-to-one, one-to-two), but at the moment has problems with complex matches. The method has the advantage of associating a score with pairs of sentences, so that it is easy to extract a subset for which there is a high likelihood that the alignments are correct. [Footnote 8: Recall that, in a similar way, we assumed in our implementation that one sentence in one language gives rise to, on average, n/m sentences in the other language (see the first footnote in Section 2.3).]</Paragraph>
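The dynamic-programming idea behind the Gale and Church approach can be sketched in a few lines. This is a minimal illustration, not their algorithm: the cost function here is a simple length-ratio penalty rather than their probabilistic distance, the match penalties are arbitrary, and the sentence lengths are invented.

```python
# Minimal sketch of length-based sentence alignment by dynamic
# programming, in the spirit of Gale and Church (1991). The cost
# function is a simple length-ratio penalty (an assumption of this
# sketch, NOT their probabilistic distance measure).
INF = float("inf")

def align(src_lens, tgt_lens, skip_penalty=3.0, merge_penalty=0.5):
    """Minimal total cost of aligning two lists of sentence lengths
    using 1-1, 1-0, 0-1, 2-1, and 1-2 matches."""
    n, m = len(src_lens), len(tgt_lens)
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    def match_cost(s, t):
        # Penalize deviation from the roughly one-to-one character
        # correspondence that length-based methods assume.
        return abs(s - t) / max(s + t, 1)

    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == INF:
                continue
            if i < n and j < m:  # 1-1 match
                c = cost[i][j] + match_cost(src_lens[i], tgt_lens[j])
                cost[i + 1][j + 1] = min(cost[i + 1][j + 1], c)
            if i < n:  # 1-0 (sentence with no counterpart)
                cost[i + 1][j] = min(cost[i + 1][j], cost[i][j] + skip_penalty)
            if j < m:  # 0-1
                cost[i][j + 1] = min(cost[i][j + 1], cost[i][j] + skip_penalty)
            if i + 1 < n and j < m:  # 2-1 match
                c = (cost[i][j] + merge_penalty
                     + match_cost(src_lens[i] + src_lens[i + 1], tgt_lens[j]))
                cost[i + 2][j + 1] = min(cost[i + 2][j + 1], c)
            if i < n and j + 1 < m:  # 1-2 match
                c = (cost[i][j] + merge_penalty
                     + match_cost(src_lens[i], tgt_lens[j] + tgt_lens[j + 1]))
                cost[i + 1][j + 2] = min(cost[i + 1][j + 2], c)
    return cost[n][m]

# Hypothetical character lengths; the last source sentence is
# translated as two shorter sentences (a 1-2 match).
total = align([60, 25, 90], [72, 30, 55, 50])
```

Keeping a score per matched pair, as in the last paragraph above, is what allows a high-confidence subset of alignments to be extracted afterwards.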
    <Paragraph position="3"> Given the simplicity of the methods proposed by Brown, Lai, and Mercer and by Gale and Church, either could be used as a heuristic in the construction of the initial AST in our algorithm. In the current version, the candidate sentence pairs considered in the first pass near the middle of a text contribute disproportionately to the cost of the computation. In fact, as we remarked earlier, the complexity of this step is O(n&#8730;n). The proposed modification would effectively make it linear.</Paragraph>
  </Section>
</Paper>