<?xml version="1.0" standalone="yes"?>
<Paper uid="P94-1012">
  <Title>ALIGNING A PARALLEL ENGLISH-CHINESE CORPUS STATISTICALLY WITH LEXICAL CRITERIA</Title>
  <Section position="4" start_page="0" end_page="80" type="metho">
    <SectionTitle>
THE ENGLISH-CHINESE
CORPUS
</SectionTitle>
    <Paragraph position="0"> The dearth of work on non-Indo-European languages can partly be attributed to a lack of the prerequisite bilingual corpora. As a step toward remedying this, we are in the process of constructing a suitable English-Chinese corpus. To be included, materials must contain primarily tight, literal sentence translations. This rules out most fiction and literary material.</Paragraph>
    <Paragraph position="1"> We have been concentrating on the Hong Kong Hansard, the parliamentary proceedings of the Legislative Council (LegCo). Analogously to the bilingual texts of the Canadian Hansard (Gale &amp; Church 1991), LegCo transcripts are kept in full translation in both English and Cantonese.[2] However, unlike the Canadian Hansard, the Hong Kong Hansard has not previously been available in machine-readable form. We have obtained and converted these materials by special arrangement.</Paragraph>
    <Paragraph position="2"> [1] Some newer methods are also intended to be applied to non-Indo-European languages in the future (Fung &amp; Church 1994).</Paragraph>
    <Paragraph position="3"> The materials contain high-quality literal translation. Statements in LegCo may be made using either English or Cantonese, and are transcribed in the original language. A translation to the other language is made later to yield complete parallel texts, with annotations specifying the source language used by each speaker. Most sentences are translated 1-for-1. A small proportion are 1-for-2 or 2-for-2, and on rare occasions 1-for-3, 3-for-3, or other configurations. Samples of the English and Chinese texts can be seen in figures 3 and 4.[3]
Because of the obscure format of the original data, it has been necessary to employ a substantial amount of automatic conversion and reformatting. Sentences are identified automatically using heuristics that depend on punctuation and spacing. Segmentation errors occur occasionally, due either to typographical errors in the original data, or to inadequacies of our automatic conversion heuristics. This simply results in incorrectly placed delimiters; it does not remove any text from the corpus.</Paragraph>
    <Paragraph position="4"> Although the emphasis is on clean text so that markup is minimal, paragraphs and sentences are marked following TEI-conformant SGML (Sperberg-McQueen &amp; Burnard 1992). We use the term &quot;sentence&quot; in a generalized sense including lines in itemized lists, headings, and other non-sentential segments smaller than a paragraph.</Paragraph>
    <Paragraph position="5"> The corpus currently contains about 60Mb of raw data, of which we have been concentrating on approximately 3.2Mb. Of this, 2.1Mb is English text comprising approximately 0.35 million words, with the corresponding Chinese translation occupying the remaining 1.1Mb.</Paragraph>
  </Section>
  <Section position="5" start_page="80" end_page="80" type="metho">
    <SectionTitle>
STATISTICALLY-BASED
ALIGNMENT
</SectionTitle>
    <Paragraph position="0"> The statistical approach to alignment can be summarized as follows: choose the alignment that maximizes the probability over all possible alignments, given a pair of parallel texts. Formally, choose
(1) $\arg\max_{\mathcal{A}} \Pr(\mathcal{A} \mid \mathcal{T}_1, \mathcal{T}_2)$
where $\mathcal{A}$ is an alignment, and $\mathcal{T}_1$ and $\mathcal{T}_2$ are the English and Chinese texts, respectively. An alignment $\mathcal{A}$ is a set consisting of $L_1 \rightleftharpoons L_2$ pairs where each $L_1$ or $L_2$ is an English or Chinese passage. This formulation is so extremely general that it is difficult to argue against its pure form. More controversial are the approximations that must be made to obtain a tractable version.</Paragraph>
    <Paragraph position="1"> [2] Cantonese is one of the four major Han Chinese languages. Formal written Cantonese employs the same characters as Mandarin, with some additions. Though there are grammatical and usage differences between the Chinese languages, as between German and Swiss German, the written forms can be read by all.</Paragraph>
    <Paragraph position="2"> [3] For further description see also Fung &amp; Wu (1994).</Paragraph>
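    <Paragraph position="3"> The first commonly made approximation is that the probabilities of the individual aligned pairs within an alignment are independent, i.e.,
$\Pr(\mathcal{A} \mid \mathcal{T}_1, \mathcal{T}_2) \approx \prod_{(L_1 \rightleftharpoons L_2) \in \mathcal{A}} \Pr(L_1 \rightleftharpoons L_2 \mid \mathcal{T}_1, \mathcal{T}_2)$</Paragraph>
    <Paragraph position="5"> The other common approximation is that each $\Pr(L_1 \rightleftharpoons L_2 \mid \mathcal{T}_1, \mathcal{T}_2)$ depends not on the entire texts, but only on the contents of the specific passages within the alignment:
$\Pr(L_1 \rightleftharpoons L_2 \mid \mathcal{T}_1, \mathcal{T}_2) \approx \Pr(L_1 \rightleftharpoons L_2 \mid L_1, L_2)$</Paragraph>
    <Paragraph position="7"> Maximization of this approximation to the alignment probabilities is easily converted into a minimum-sum problem:
(2) $\arg\max_{\mathcal{A}} \Pr(\mathcal{A} \mid \mathcal{T}_1, \mathcal{T}_2) \approx \arg\max_{\mathcal{A}} \prod_{(L_1 \rightleftharpoons L_2) \in \mathcal{A}} \Pr(L_1 \rightleftharpoons L_2 \mid L_1, L_2) = \arg\min_{\mathcal{A}} \sum_{(L_1 \rightleftharpoons L_2) \in \mathcal{A}} -\log \Pr(L_1 \rightleftharpoons L_2 \mid L_1, L_2)$</Paragraph>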
    <Paragraph position="9"> The minimization can be implemented using a dynamic programming strategy.</Paragraph>
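    <Paragraph> As an illustration of the dynamic programming strategy, the following Python sketch minimizes a sum of per-pair costs over monotone alignments. It is not taken from the paper: the names align, toy_cost and MOVES, the list of allowed group sizes, and the toy cost itself are assumptions made for illustration. In an actual aligner the cost would be the negative log probability appearing in equation (2).
<![CDATA[
```python
import math

# Assumed set of allowed group sizes: (English passages, Chinese passages).
MOVES = [(1, 1), (1, 0), (0, 1), (2, 1), (1, 2), (2, 2)]


def align(src, tgt, cost):
    """Return (total_cost, pairs) minimizing the sum of cost(src_group, tgt_group)."""
    n, m = len(src), len(tgt)
    best = [[math.inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    # Row-major order guarantees best[i][j] is final before it is extended.
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == math.inf:
                continue
            for di, dj in MOVES:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                candidate = best[i][j] + cost(src[i:ni], tgt[j:nj])
                if candidate < best[ni][nj]:
                    best[ni][nj] = candidate
                    back[ni][nj] = (i, j)
    # Follow back-pointers from the end of both texts to recover the pairs.
    pairs = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        pairs.append((src[pi:i], tgt[pj:j]))
        i, j = pi, pj
    pairs.reverse()
    return best[n][m], pairs


def toy_cost(src_group, tgt_group):
    """Illustrative cost only: length mismatch plus 1 for any non-1-for-1 group."""
    mismatch = abs(sum(map(len, src_group)) - sum(map(len, tgt_group)))
    penalty = 0 if (len(src_group), len(tgt_group)) == (1, 1) else 1
    return mismatch + penalty
```
]]>
    With the toy cost, align(["aaaa", "bb", "cc", "dddd"], ["AAAA", "BBCC", "DDDD"], toy_cost) groups the two short source passages against the single passage BBCC. The table has one cell per pair of positions in the two texts, so running time grows with the product of the two passage counts.</Paragraph>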
    <Paragraph position="10"> Further approximations vary according to the specific method being used. Below, we first discuss a pure length-based approximation, then a method with lexical extensions.</Paragraph>
  </Section>
  <Section position="6" start_page="80" end_page="83" type="metho">
    <SectionTitle>
APPLICABILITY OF LENGTH-
BASED METHODS TO CHINESE
</SectionTitle>
    <Paragraph position="0"> Length-based alignment methods are based on the following approximation to equation (2):
$\Pr(L_1 \rightleftharpoons L_2 \mid L_1, L_2) \approx \Pr(L_1 \rightleftharpoons L_2 \mid l_1, l_2)$</Paragraph>
    <Paragraph position="2"> where $l_1$ and $l_2$ are the lengths of $L_1$ and $L_2$, measured in number of characters. In other words, the only feature of $L_1$ and $L_2$ that affects their alignment probability is their length. Note that there are other length-based alignment methods that measure length in number of words instead of characters (Brown et al. 1991). However, since Chinese text consists of an unsegmented character stream without marked word boundaries, it would not be possible to count the number of words in a sentence without first parsing it.</Paragraph>
    <Paragraph position="3"> Although it has been suggested that length-based methods are language-independent (Gale &amp; Church 1991; Brown et al. 1991), they may in fact rely to some extent on length correlations arising from the historical relationships of the languages being aligned. If translated sentences share cognates, then the character lengths of those cognates are of course correlated. Grammatical similarities between related languages may also produce correlations in sentence lengths.</Paragraph>
    <Paragraph position="4"> Moreover, the combinatorics of non-Indo-European languages can depart greatly from Indo-European languages. In Chinese, the majority of words are just one or two characters long (though collocations up to four characters are also common). At the same time, there are several thousand characters in daily use, as in conversation or newspaper text. Such lexical differences make it even less obvious whether pure sentence-length criteria are adequately discriminating for statistical alignment.</Paragraph>
    <Paragraph position="5"> Our first goal, therefore, is to test whether purely length-based alignment results can be replicated for English and Chinese, languages from unrelated families. However, before length-based methods can be applied to Chinese, it is first necessary to generalize the notion of &quot;number of characters&quot; to Chinese strings, because most Chinese text (including our corpus) includes occasional English proper names and abbreviations, as well as punctuation marks. Our approach is to count each Chinese character as having length 2, and each English or punctuation character as having length 1. This corresponds to the byte count for text stored in the hybrid English-Chinese encoding system known as Big 5.</Paragraph>
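    <Paragraph> The counting rule can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, and a character is treated as Chinese when it falls in the basic CJK Unified Ideographs block (U+4E00 to U+9FFF), which is a simplification.
<![CDATA[
```python
def is_chinese(ch):
    """Assumed test: True for characters in the basic CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def passage_length(text):
    """Count each Chinese character as 2 and every other character as 1."""
    return sum(2 if is_chinese(ch) else 1 for ch in text)
```
]]>
    For strings made of common traditional characters and ASCII, the result should agree with len(text.encode("big5")). Full-width Chinese punctuation occupies two bytes in Big 5 but is counted as 1 by this sketch, so the two measures can differ on such text.</Paragraph>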
    <Paragraph position="6"> Gale &amp; Church's (1991) length-based alignment method is based on the model that each English character in $L_1$ is responsible for generating some number of characters in $L_2$. This model leads to a further approximation which encapsulates the dependence in a single parameter $\delta$ that is a function of $l_1$ and $l_2$:
$\Pr(L_1 \rightleftharpoons L_2 \mid l_1, l_2) \approx \Pr(L_1 \rightleftharpoons L_2 \mid \delta(l_1, l_2))$</Paragraph>
    <Paragraph position="8"> However, it is much easier to estimate the distributions for the inverted form obtained by applying Bayes' Rule:
$\Pr(L_1 \rightleftharpoons L_2 \mid \delta) = \frac{\Pr(\delta \mid L_1 \rightleftharpoons L_2) \Pr(L_1 \rightleftharpoons L_2)}{\Pr(\delta)}$</Paragraph>
    <Paragraph position="10"> where $\Pr(\delta)$ is a normalizing constant that can be ignored during minimization. The other two distributions are estimated as follows.</Paragraph>
    <Paragraph position="11"> First we choose a function for $\delta(l_1, l_2)$. To do this we look at the relation between $l_1$ and $l_2$ under the generative model. Figure 1 shows a plot of English versus Chinese sentence lengths for a hand-aligned sample of 142 sentences. If the sentence lengths were perfectly correlated, the points would lie on a diagonal through the origin.</Paragraph>
    <Paragraph position="12"> We estimate the slope $c$ of this idealized diagonal by averaging the ratio $l_2 / l_1$ over a corpus of hand-aligned $L_1 \rightleftharpoons L_2$ pairs, weighting by the length of $L_1$.</Paragraph>
    <Paragraph position="14"> In fact this plot displays substantially greater scatter than the English-French data of Gale &amp; Church (1991).[4] The mean number of Chinese characters generated by each English character is $c = 0.506$, with a standard deviation $\sigma = 0.166$.</Paragraph>
    <Paragraph position="15"> We now assume that $l_2 - l_1 c$ is normally distributed, following Gale &amp; Church (1991), and transform it into a new gaussian variable of standard form (i.e., with mean 0 and variance 1) by appropriate normalization:
(4) $\frac{l_2 - l_1 c}{\sqrt{l_1 \sigma^2}}$
This is the quantity that we choose to define as $\delta(l_1, l_2)$. Consequently, for any two pairs in a proposed alignment, $\Pr(\delta \mid L_1 \rightleftharpoons L_2)$ can be estimated according to the gaussian assumption.</Paragraph>
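    <Paragraph> To make equation (4) concrete, the Python sketch below computes the standardized length difference with the reported values c = 0.506 and sigma = 0.166. How the probability of delta given a match is read off the gaussian is not spelled out in the text; the two-sided tail probability used here follows the practice of Gale &amp; Church (1991) and should be taken as an assumption, as should all function names.
<![CDATA[
```python
import math

C = 0.506      # mean Chinese characters per English character (from the text)
SIGMA = 0.166  # reported standard deviation (from the text)


def delta(l1, l2, c=C, sigma=SIGMA):
    """Equation (4): (l2 - l1*c) / sqrt(l1 * sigma^2). Requires l1 > 0."""
    if l1 <= 0:
        raise ValueError("l1 must be positive")
    return (l2 - l1 * c) / math.sqrt(l1 * sigma ** 2)


def two_sided_tail(d):
    """Pr(|X| >= |d|) for a standard normal X, via the complementary error function."""
    return math.erfc(abs(d) / math.sqrt(2.0))


def length_cost(l1, l2):
    """Assumed stand-in for -log Pr(delta | match); floored to avoid log(0)."""
    return -math.log(max(two_sided_tail(delta(l1, l2)), 1e-300))
```
]]>
    A pair whose Chinese length sits exactly at the expected value l1 * c receives a cost near zero, and the cost grows as the lengths diverge. Deletions and insertions, where one length is zero, need separate handling that this sketch leaves out.</Paragraph>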
    <Paragraph position="16"> To check how accurate the gaussian assumption is, we can use equation (4) to transform the same training points from figure 1 and produce a histogram. The result is shown in figure 2. Again, the distribution deviates from a gaussian distribution substantially more than Gale &amp; Church (1991) report for French/German/English. Moreover, the distribution does not resemble any smooth distribution at all, including the logarithmic normal used by Brown et al. (1991), raising doubts about the potential performance of pure length-based alignment.</Paragraph>
    <Paragraph position="17"> Continuing nevertheless, to estimate the other term $\Pr(L_1 \rightleftharpoons L_2)$, a prior over six classes is constructed, where the classes are defined by the number of passages included within $L_1$ and $L_2$. Table 1 shows the probabilities used. These probabilities are taken directly from Gale &amp; Church (1991); slightly improved performance might be obtained by estimating these probabilities from our corpus. The aligned results using this model were evaluated by hand for the entire contents of a ran- [...]
[4] The difference is also partly due to the fact that Gale &amp; Church (1991) plot paragraph lengths instead of sentence lengths. We have chosen to plot sentence lengths because that is what the algorithm is based on.</Paragraph>
    <Paragraph position="18"> [Figure excerpt: English side of the alignment output; the Chinese side was lost in extraction.]
1. MR FRED LI (in Cantonese):
2. I would like to talk about public assistance.
3. I notice from your address that under the Public Assistance Scheme, the basic rate of $825 a month for a single adult will be increased by 15% to $950 a month.
4. However, do you know that the revised rate plus all other grants will give each recipient no more than $2000 a month? On average, each recipient will receive $1600 to $1700 a month.
5. In view of Hong Kong's prosperity and high living cost, this figure is very ironical.
6. May I have your views and that of the Government?
7. Do you think that a comprehensive review should be conducted on the method of calculating public assistance?
8. Since the basic rate is so low, it will still be far below the current level of living even if it is further increased by 20% to 30%. If no comprehensive review is carried out in this aspect, this &quot;safety net&quot; cannot provide any assistance at all for those who are really in need.
9. I hope Mr Governor will give this question a serious response.
10. THE GOVERNOR:
11. It is not in any way to belittle the importance of the point that the Honourable Member has made to say that, when at the outset of our discussions I said that I did not think that the Government would be regarded for long as having been extravagant yesterday, I did not realize that the criticisms would begin quite as rapidly as they have.
12. The proposals that we make on public assistance, both the increase in scale rates, and the relaxation of the absence rule, are substantial steps forward in Hong Kong which will, I think, be very widely welcomed.
13. But I know that there will always be those who, I am sure for very good reason, will say you should have gone further, you should have done more.
14. Societies customarily make advances in social welfare because there are members of the community who develop that sort of case very often with eloquence and [...]
[...] of the true 1-for-1 pairs are aligned correctly. In (4), two English sentences are correctly aligned with a single Chinese sentence. However, the English sentences in (6, 7) are incorrectly aligned 1-for-1 instead of 2-for-1. Also, (11, 12) shows an example of a 3-for-1, 1-for-1 sequence that the model has no choice but to align as 2-for-2, 2-for-2.</Paragraph>
    <Paragraph position="19"> Judging relative to a manual alignment of the English and Chinese files, a total of 86.4% of the true $L_1 \rightleftharpoons L_2$ pairs were correctly identified by the length-based method. However, many of the errors occurred within the introductory session header, whose format is domain-specific (discussed below).</Paragraph>
    <Paragraph position="21"> If the introduction is discarded, then the proportion of correctly aligned pairs rises to 95.2%, a respectable rate especially in view of the drastic inaccuracies in the distributions assumed. A detailed breakdown of the results is shown in Table 2. For reference, results reported for English/French generally fall between 96% and 98%. However, all of these numbers should be interpreted as highly domain-dependent, with very small sample size.</Paragraph>
    <Paragraph position="22"> The above rates are for Type I errors. The alternative measure of accuracy on Type II errors is useful for machine translation applications, where the objective is to extract only 1-for-1 sentence pairs, and to discard all others. In this case, we are interested in the proportion of 1-for-1 output pairs that are true 1-for-1 pairs. (In information retrieval terminology, this measures precision whereas the above measures recall.) In the test session, 438 1-for-1 pairs were output, of which 377, or 86.1%, were true matches. Again, however, by discarding the introduction, the accuracy rises to a surprising 96.3%.</Paragraph>
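    <Paragraph> The two accuracy measures can be stated compactly in code. The Python sketch below is illustrative only: the function name and the representation of aligned pairs as hashable tuples are assumptions, not part of the evaluation procedure described here.
<![CDATA[
```python
def precision_recall(output_pairs, true_pairs):
    """Return (precision, recall) for a set of output pairs against the true pairs."""
    output_pairs, true_pairs = set(output_pairs), set(true_pairs)
    correct = len(output_pairs & true_pairs)
    return correct / len(output_pairs), correct / len(true_pairs)


# Precision reported in the text: 377 true matches among 438 output 1-for-1 pairs.
reported_precision = 377 / 438
```
]]>
    Rounded to one decimal place, reported_precision is 86.1%, matching the figure given above. Precision divides the correct pairs by the number of pairs output, whereas recall divides them by the number of true pairs.</Paragraph>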
    <Paragraph position="23">  The introductory session header exemplifies a weakness of the pure length-based strategy, namely, its susceptibility to long stretches of passages with roughly similar lengths. In our data this arises from the list of council members present and absent at each session (figure 4), but similar stretches can arise in many other domains. In such a situation, two slight perturbations may cause the entire stretch of passages between the perturbations to be misaligned. These perturbations can easily arise from a number of causes, including slight omissions or mismatches in the original parallel texts, a 1-for-2 translation pair preceding or following the stretch of passages, or errors in the heuristic segmentation preprocessing. Substantial penalties may occur at the beginning and ending boundaries of the misaligned region, where the perturbations lie, but the misalignment between those boundaries incurs little penalty, because the mismatched passages have apparently matching lengths. This problem is apparently exacerbated by the non-alphabetic nature of Chinese. Because Chinese text contains fewer characters, character length is a less discriminating feature, varying over a range of fewer possible discrete values than the corresponding English. The next section discusses a solution to this problem.</Paragraph>
    <Paragraph position="24"> In summary, we have found that the statistical correlation of sentence lengths has a far greater variance for our English-Chinese materials than with the Indo-European materials used by Gale &amp; Church (1991). Despite this, the pure length-based method performs surprisingly well, except for its weakness in handling long stretches of sentences with close lengths.</Paragraph>
  </Section>
  <Section position="7" start_page="83" end_page="84" type="metho">
    <SectionTitle>
STATISTICAL INCORPORATION
OF LEXICAL CUES
</SectionTitle>
    <Paragraph position="0"> Obtaining further improvement in alignment accuracy requires matching the passages' lexical content, rather than using pure length criteria. This is particularly relevant for the type of long mismatched stretches described above.</Paragraph>
    <Paragraph position="1"> Previous work on alignment has employed ei-</Paragraph>
  </Section>
</Paper>