File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/98/p98-1042_metho.xml
Size: 14,848 bytes
Last Modified: 2025-10-06 14:14:56
<?xml version="1.0" standalone="yes"?> <Paper uid="P98-1042"> <Title>An Experiment in Hybrid Dictionary and Statistical Sentence Alignment</Title> <Section position="4" start_page="0" end_page="268" type="metho"> <SectionTitle> 2 Bilingual Sentence Alignment </SectionTitle> <Paragraph position="0"> The task of sentence alignment is to match corresponding sentences in a text from One language to sentences in a translation of that text in another language. Of particular interest to us is the application to Asian language pairs. Previous studies such as (Fung and Wu, 1994) have commented that methods developed for Indo-European language pairs using alphabetic characters have not addressed important issues which occur with European-Asian language pairs. For example, the language pairs are unlikely to be cognates, and they may place sentence boundaries at different points in the text. It has also been suggested by (Wu, 1994) that sentence length ratio correlations may arise partly out of historic cognate-based relationships between Indo-European languages. Methods which perform well for Indo-European language pairs have therefore been found to be less effective for non-Indo-European language pairs.</Paragraph> <Paragraph position="1"> In our experiments the languages we use are English (source) and Japanese (translation). Although in our corpus (described below) we observe that, in general, sentences correspond one-to-one we must also consider multiple sentence correspondences as well as one-to-zero correspondences. These cases are summarised below.</Paragraph> <Paragraph position="2"> 1. 1:1 The sentences match one-to-one.</Paragraph> <Paragraph position="3"> 2. l:n One English sentence matches to more than one Japanese sentence.</Paragraph> <Paragraph position="4"> 3. m:l More than one English sentence matches ot one Japanese sentence.</Paragraph> <Paragraph position="5"> 4. m:n More than one English sentence matches to more than one Japanese sentence.</Paragraph> <Paragraph position="6"> 5. m:0 The English sentence/s have no corresponding Japanese sentence.</Paragraph> <Paragraph position="7"> 6. 0:n The Japanese sentence/s have no corre- null sponding English sentence.</Paragraph> <Paragraph position="8"> In the case of l:n, m:l and m:n correspondences, translation has involved some reformatting and the meaning correspondence is no longer solely at the sentence level. Ideally we would like smaller units of text to match because it is easier later on to establish word alignment correspondences. In the worst case of multiple correspondence, the translation is spread across multiple non-consecutive sentences.</Paragraph> </Section> <Section position="5" start_page="268" end_page="268" type="metho"> <SectionTitle> 3 Corpus </SectionTitle> <Paragraph position="0"> Our primarily motivation is knowledge acquisition for machine translation and consequently we are interested to acquire vocabulary and other bilingual knowledge which will be useful for users of such systems. Recently there has been a move towards Internet page translation and we consider that one interesting domain for users is international news.</Paragraph> <Paragraph position="1"> The bilingual corpus we use in our experiments is made from Reuter news articles which were translated by the Gakken translation agency from English into Japanese 1 . 
</Section>
<Section position="5" start_page="268" end_page="268" type="metho">
<SectionTitle> 3 Corpus </SectionTitle>
<Paragraph position="0"> Our primary motivation is knowledge acquisition for machine translation and consequently we are interested in acquiring vocabulary and other bilingual knowledge which will be useful for users of such systems. Recently there has been a move towards Internet page translation and we consider that one interesting domain for users is international news. The bilingual corpus we use in our experiments is made from Reuter news articles which were translated by the Gakken translation agency from English into Japanese (the corpus was generously made available to us by special arrangement with Gakken). The translations are quite literal and the contents cover international news for the period February 1995 to December 1996. We currently have over 20,000 articles (approximately 47 Mb). From this corpus we randomly chose 50 article pairs and aligned them by hand using a human bilingual checker to form a judgement set. The judgement set consists of 380 English sentences and 453 Japanese sentences. On average each English article has 8 lines and each Japanese article 9 lines.</Paragraph>
<Paragraph position="1"> The articles themselves form a boundary within which to align constituent sentences. The corpus is quite well behaved: we observe many 1:1 correspondences, but also a large proportion of 1:2 and 1:3 correspondences as well as reorderings. Omissions seem to be quite rare, so we did not see many m:0 or 0:n correspondences.</Paragraph>
<Paragraph position="2"> An example news article is shown in Figure 1, which highlights several interesting points. Although the news article texts are clean and in machine-tractable format, we still found that it was a significant challenge to reliably identify sentence boundaries. A simple illustration of this is shown by the first Japanese line J1, which actually corresponds to the first two English lines E1 and E2. This is a result of our general-purpose sentence segmentation algorithm, which has difficulty separating the Japanese title from the first sentence.</Paragraph>
<Paragraph position="3"> Sentences usually corresponded linearly in our corpus, with few reorderings, so the major challenge was to identify multiple correspondences and zero correspondences. We can see an example of a zero correspondence as E5 has no translation in the Japanese text. A 1:n correspondence is shown by E7 aligning to both J5 and J6.</Paragraph>
</Section>
<Section position="6" start_page="268" end_page="271" type="metho">
<SectionTitle> 4 Alignment Models </SectionTitle>
<Paragraph position="0"> In our investigation we examined the performance of three different matching models (lexical matching, byte-length ratios and offset probabilities). The basic models incorporate dynamic programming to find the least-cost alignment path over the set of English and Japanese sentences, with the cost determined by the model's scores. The alignment space includes all possible combinations of multiple matches up to and including 3:3 alignments. The basic models are outlined below.</Paragraph>
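As a sketch of the dynamic-programming search just described (ours, with illustrative names; the paper does not give the exact formulation), the following Python code finds a least-cost alignment path over the English and Japanese sentences, considering bead shapes up to and including 3:3 together with 1:0 and 0:1 beads for unmatched sentences. The cost function is left abstract because each of the models below supplies its own score.

    import math

    # Bead shapes (number of English sentences, number of Japanese sentences)
    # combined in one alignment unit, up to and including 3:3.
    BEAD_SHAPES = [(a, b) for a in range(4) for b in range(4) if (a, b) != (0, 0)]

    def align(en_sents, ja_sents, cost):
        """Least-cost alignment by dynamic programming.

        cost(en_segment, ja_segment) returns a non-negative cost for aligning
        the given lists of sentences (lower = better match).  Returns a list of
        ((en_start, en_end), (ja_start, ja_end)) index ranges.
        """
        I, J = len(en_sents), len(ja_sents)
        best = [[math.inf] * (J + 1) for _ in range(I + 1)]
        back = [[None] * (J + 1) for _ in range(I + 1)]
        best[0][0] = 0.0
        for i in range(I + 1):
            for j in range(J + 1):
                if best[i][j] == math.inf:
                    continue
                for a, b in BEAD_SHAPES:
                    ni, nj = i + a, j + b
                    if ni > I or nj > J:
                        continue
                    c = best[i][j] + cost(en_sents[i:ni], ja_sents[j:nj])
                    if c < best[ni][nj]:
                        best[ni][nj] = c
                        back[ni][nj] = (i, j)
        # Recover the alignment path by following back-pointers from (I, J).
        path, i, j = [], I, J
        while (i, j) != (0, 0):
            pi, pj = back[i][j]
            path.append(((pi, i), (pj, j)))
            i, j = pi, pj
        return list(reversed(path))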
<Section position="1" start_page="268" end_page="270" type="sub_section">
<SectionTitle> 4.1 Model 1: Lexical vector matching </SectionTitle>
<Paragraph position="0"> The lexical approach is perhaps the most robust for aligning texts in cognate language pairs, or where there is a large amount of reformatting in translation. It has also been shown to be particularly successful within the vector space model in multilingual information retrieval tasks, e.g. (Collier et al., 1998a), (Collier et al., 1998b), for aligning texts in non-cognate languages at the article level.</Paragraph>

Figure 1: An example news article pair from the corpus. The English sentences E1-E10 are shown; the corresponding Japanese sentences J1-J6 are not reproduced here.
E1. Taiwan ruling party sees power struggle in China
E2. TAIPEI, Feb 9 (Reuter) - Taiwan's ruling Nationalist Party said a struggle to succeed Deng Xiaoping as China's most powerful man may have already begun.
E3. &quot;Once Deng Xiaoping dies, a high tier power struggle among the Chinese communists is inevitable,&quot; a Nationalist Party report said.
E4. China and Taiwan have been rivals since the Nationalists lost the Chinese civil war in 1949 and fled to Taiwan.
E5. Both Beijing and Taipei sometimes portray each other in an unfavourable light.
E6. The report said that the position of Deng's chosen successor, President Jiang Zemin, may have been subtly undermined of late.
E7. It based its opinion on the fact that two heavyweight political figures have recently used the phrase the &quot;solid central collective leadership and its core&quot; instead of the accepted &quot;collective leadership centred on Jiang Zemin&quot; to describe the current leadership structure.
E8. &quot;Such a sensitive statement should not be an unintentional mistake ...
E9. Does this mean the power struggle has gradually surfaced while Deng Xiaoping is still alive?,&quot; said the report, distributed to journalists.
E10. &quot;At least the information sends a warning signal that the 'core of Jiang' has encountered some subtle changes,&quot; it added.

<Paragraph position="1"> The major limitation with lexical matching is clearly the assumption of lexical correspondence, which is particularly weak for English and Asian language pairs where structural and semantic differences mean that transfer often occurs at a level above the lexicon. This is a motivation for incorporating statistics into the alignment process, but in the initial stage we wanted to treat pure lexical matching as our baseline performance.</Paragraph>
<Paragraph position="2"> We translated each Japanese sentence into English using dictionary term lookup. Each Japanese content word was assigned a list of possible English translations and these were used to match against the normalised English words in the English sentences. For an English text segment E and the English term list produced from a Japanese text segment J, which we considered to be a possible unit of correspondence, we calculated similarity using Dice's coefficient score shown in Equation 1. This rather simple measure captures frequency, but not positional information. The weights of words are their frequencies inside a sentence.</Paragraph>
<Paragraph position="3"> Dice(E, J) = \frac{2 f_{EJ}}{f_E + f_J}   (1)
where f_{EJ} is the number of lexical items which match in E and J, f_E is the number of lexical items in E and f_J is the number of lexical items in J.</Paragraph>
<Paragraph position="4"> The translation lists for each Japanese word are used disjunctively, so if one word in the list matches then we do not consider the other terms in the list. In this way we maintain term independence.</Paragraph>
<Paragraph position="5"> Our transfer dictionary contained some 79,000 English words in full form together with the list of translations in Japanese. Of these English words some 14,000 were proper nouns which were directly relevant to the vocabulary typically found in international news stories. Additionally, we perform lexical normalisation before calculating the matching score and remove function words with a stop list.</Paragraph>
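A minimal Python sketch of the Model 1 score in Equation 1, assuming a transfer dictionary that maps each Japanese content word to a list of candidate English translations; tokenisation, lexical normalisation and the stop list are represented only as arguments, and the token-frequency weighting shown here is our reading of the description above rather than the authors' exact implementation.

    from collections import Counter

    def dice_score(en_tokens, ja_tokens, transfer_dict, stop_words=frozenset()):
        """Dice's coefficient between an English segment and the English
        translation candidates of a Japanese segment (Equation 1).

        en_tokens     -- normalised English content words of segment E
        ja_tokens     -- Japanese content words of segment J
        transfer_dict -- dict: Japanese word -> list of English translations
        """
        en = [w for w in en_tokens if w not in stop_words]
        ja = [w for w in ja_tokens if w not in stop_words]
        en_counts = Counter(en)

        f_e = len(en)   # lexical items in E
        f_j = len(ja)   # lexical items in J
        f_ej = 0        # matching lexical items

        for word in ja:
            # Disjunctive use of the translation list: one matching candidate
            # is enough; the remaining candidates are not considered.
            for candidate in transfer_dict.get(word, []):
                if en_counts[candidate] > 0:
                    f_ej += 1
                    break

        if f_e + f_j == 0:
            return 0.0
        return 2.0 * f_ej / (f_e + f_j)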
</Section>
<Section position="2" start_page="270" end_page="271" type="sub_section">
<SectionTitle> 4.2 Model 2: Byte-length ratios </SectionTitle>
<Paragraph position="0"> For Asian language pairs we cannot rely entirely on dictionary term matching. Moreover, algorithms which rely on matching cognates cannot be applied easily to English and some Asian languages. We were motivated by statistical alignment models such as (Gale and Church, 1991) to investigate whether byte-length probabilities could improve or replace the lexical matching based method. The underlying assumption is that characters in an English sentence are responsible for generating some fraction of each character in the corresponding Japanese sentence.</Paragraph>
<Paragraph position="1"> We derived a probability density function by making the assumption that English and Japanese sentence length ratios are normally distributed. The parameters required for the model are the mean, μ, and standard deviation, σ, which we calculated from a training set of 450 hand-aligned sentences. These are then entered into Equation 2 to find the probability of any two sentences (or combinations of sentences for multiple alignments) being in an alignment relation given that they have a length ratio of x.</Paragraph>
<Paragraph position="2"> P(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)   (2)</Paragraph>
<Paragraph position="3"> The byte-length ratios were calculated as the length of the Japanese text segment divided by the length of the English text segment, so in this way we can incorporate multiple sentence correspondences into our model. Byte lengths for English sentences are calculated according to the number of non-white-space characters, with a weighting of 1 for each valid character including punctuation. For the Japanese text we counted 2 for each non-white-space character. White spaces were treated as having length 0. The ratios for the training set are shown as a histogram in Figure 2 and seem to support the assumption of a normal distribution.</Paragraph>
<Paragraph position="4"> The resulting normal curve with σ = 0.33 and μ = 0.76 is given in Figure 3, and this can then be used to provide a probability score for any English and Japanese sentence being aligned in the Reuters' corpus.</Paragraph>
<Paragraph position="5"> Clearly it is not enough simply to assume that our sentence pair lengths follow the normal distribution. We tested this assumption using a standard test, by plotting the ordered ratio scores against the values calculated for the normal curve in Figure 3. If the distribution is indeed normal then we would expect the plot in Figure 4 to yield a straight line. We can see that this is the case for most, although not all, of the observed scores.</Paragraph>
<Paragraph position="6"> Although the curve in Figure 4 shows that our training set deviated from the normal distribution at the extremes, we nevertheless proceeded with our simulations using this model, considering that the deviations occurred at the extreme ends of the distribution where relatively few samples were found. The weakness of this assumption, however, does add extra evidence to doubts which have been raised, e.g. (Wu, 1994), about whether the byte-length model by itself can perform well.</Paragraph>
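The byte-length model can be sketched as follows, using the character weighting and the training-set parameters (μ = 0.76, σ = 0.33) reported above; this is an illustrative reimplementation rather than the authors' code, and the same Gaussian scoring, with different parameters, applies to the offset model in Section 4.3.

    import math

    MU, SIGMA = 0.76, 0.33  # ratio mean and standard deviation estimated from
                            # the 450-sentence hand-aligned training set

    def en_length(segment):
        """English byte length: 1 per non-whitespace character, punctuation included."""
        return sum(1 for ch in segment if not ch.isspace())

    def ja_length(segment):
        """Japanese byte length: 2 per non-whitespace character, whitespace counts 0."""
        return sum(2 for ch in segment if not ch.isspace())

    def normal_pdf(x, mu=MU, sigma=SIGMA):
        """Equation 2: the normal density used to score a length ratio x."""
        return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

    def length_ratio_score(en_segment, ja_segment):
        """Probability score for aligning the (possibly multi-sentence) segments."""
        ratio = ja_length(ja_segment) / max(en_length(en_segment), 1)
        return normal_pdf(ratio)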
</Section>
<Section position="3" start_page="271" end_page="271" type="sub_section">
<SectionTitle> 4.3 Model 3: Offset ratios </SectionTitle>
<Paragraph position="0"> We calculated the offsets in the sentence indexes for English and Japanese sentences in an alignment relation in the hand-aligned training set. An offset difference was calculated as the Japanese sentence index minus the English sentence index within a bilingual news article pair. The values are shown as a histogram in Figure 5.</Paragraph>
<Paragraph position="1"> As with the byte-length ratio model, we started from an assumption that sentence correspondence offsets were normally distributed. We then calculated the mean and variance for our sample set shown in Figure 5 and used this to form a normal probability density function (where σ = 0.50 and μ = 1.45) shown in Figure 6.</Paragraph>
<Paragraph position="2"> The test for normality of the distribution is the same as for byte-length ratios and is given in Figure 7. We can see that the assumption of normality is particularly weak for the offset distribution, but we are motivated to see whether such a noisy probability model can improve alignment results.</Paragraph>
</Section>
</Section>
</Paper>