<?xml version="1.0" standalone="yes"?> <Paper uid="P03-1010"> <Title>Reliable Measures for Aligning Japanese-English News Articles and Sentences</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Basic Alignment Methods </SectionTitle> <Paragraph position="0"> We adopt a standard strategy to align articles and sentences. First, we use a method based on CLIR to align Japanese and English articles (Collier et al., 1998; Matsumoto and Tanaka, 2002), and then a method based on DP matching to align Japanese and English sentences (Gale and Church, 1993; Utsuro et al., 1994) in these articles. As each of these methods uses existing NLP techniques, we describe them briefly, focusing on the basic similarity measures, which we will compare with our proposed measures in Section 5.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Article alignment </SectionTitle> <Paragraph position="0"> Translation of words We first convert each of the Japanese articles into a set of English words. We use ChaSen to segment each of the Japanese articles into words. We next extract content words, which are then translated into English words by looking them up in the EDR Japanese-English bilingual dictionary, EDICT, and a third bilingual dictionary with about 180,000 entries. We select two English words for each of the Japanese words using simple heuristic rules based on the frequencies of English words.</Paragraph> <Paragraph position="1"> Article retrieval We use each of the English articles as a query and search for the Japanese article that is most similar to the query article. The similarity between an English article and a (word-based English translation of a) Japanese article is measured by BM25 (Robertson and Walker, 1994). BM25 and its variants have been proven to be quite effective in information retrieval. Readers are referred to papers from the Text REtrieval Conference (TREC), for example.</Paragraph> <Paragraph position="2"> The definition of BM25 is:</Paragraph> <Paragraph position="3"> BM25(J, E) = \sum_{T \in E} w^{(1)} \frac{(k_1 + 1)\,tf}{K + tf} \cdot \frac{(k_3 + 1)\,qtf}{k_3 + qtf}</Paragraph> <Paragraph position="4"> where J is the set of translated English words of a Japanese article and E is the set of words of an English article. The words are stemmed and stop words are removed.</Paragraph> <Paragraph position="5"> T is a word contained in E.</Paragraph> <Paragraph position="6"> w^{(1)} is the weight of T, w^{(1)} = \log \frac{N - n + 0.5}{n + 0.5}. N is the number of Japanese articles to be searched. n is the number of articles containing T.</Paragraph> <Paragraph position="7"> K is k_1 ((1 - b) + b \frac{dl}{avdl}). k_1, b, and k_3 are parameters, set to 1, 1, and 1000, respectively. dl is the document length of J and avdl is the average document length in words.</Paragraph> <Paragraph position="8"> tf is the frequency of occurrence of T in J. qtf is the frequency of T in E.</Paragraph>
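To make the scoring concrete, here is a minimal Python sketch of the BM25 computation defined above, assuming pre-tokenized word lists. The function and variable names (bm25, df, n_docs, avdl) are illustrative, not taken from the authors' implementation.

```python
import math
from collections import Counter

def bm25(j_words, e_words, df, n_docs, avdl, k1=1.0, b=1.0, k3=1000.0):
    """Score a translated Japanese article J (j_words) against an English
    query article E (e_words). df[t] is the number of Japanese articles
    containing t (the paper's n); n_docs is N; avdl is the average
    document length in words."""
    dl = len(j_words)                      # document length of J
    K = k1 * ((1.0 - b) + b * dl / avdl)
    tf = Counter(j_words)                  # term frequencies in J
    qtf = Counter(e_words)                 # term frequencies in E

    score = 0.0
    for t, q in qtf.items():               # sum over T in E
        f = tf.get(t, 0)
        if f == 0:
            continue                       # T does not occur in J
        n = df.get(t, 0)
        w1 = math.log((n_docs - n + 0.5) / (n + 0.5))   # w^(1)
        score += w1 * ((k1 + 1) * f / (K + f)) * ((k3 + 1) * q / (k3 + q))
    return score
```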
<Paragraph position="9"> To summarize, we first translate each of the Japanese articles into a set of English words. We then use each of the English articles as a query, search for the most similar Japanese article in terms of BM25, and assume that it corresponds to the English article.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Sentence alignment </SectionTitle> <Paragraph position="0"> The sentences in the aligned Japanese and English articles are aligned by a method based on DP matching (Gale and Church, 1993; Utsuro et al., 1994).</Paragraph> <Paragraph position="1"> We split the Japanese articles into sentences by using simple heuristics and split the English articles into sentences by using MXTERMINATOR (Reynar and Ratnaparkhi, 1997). We allow 1-to-n or n-to-1 (1 ≤ n ≤ 6) alignments when aligning the sentences. Readers are referred to Utsuro et al. (1994) for a concise description of the algorithm. Here, we only discuss the similarity between Japanese and English sentences used for alignment. Let Ji and Ei be the words of the Japanese and English sentences for the i-th alignment. The similarity between Ji and Ei is:</Paragraph> <Paragraph position="2"> SIM(J_i, E_i) = \frac{co(J_i \leftrightarrow E_i) + 1}{|J_i| + |E_i| - 2\,co(J_i \leftrightarrow E_i) + 2}</Paragraph> <Paragraph position="3"> where co(J_i \leftrightarrow E_i) is the number of word pairs in J_i \leftrightarrow E_i, a one-to-one correspondence between Japanese and English words. (SIM differs from the similarity function used in Utsuro et al. (1994); we use SIM because it performed well in a preliminary experiment.)</Paragraph>
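As an illustration, the following Python sketch computes SIM as reconstructed above, using a greedy stand-in for the dictionary-based word matching; `correspond` is a hypothetical predicate, not the paper's heuristics.

```python
def sim(j_words, e_words, correspond):
    """SIM rewards word pairs in the one-to-one correspondence and
    penalizes words left unmatched on either side. `correspond(j, e)`
    stands in for the dictionary lookup described in Section 3.2."""
    matched = set()
    co = 0                                   # size of the 1-to-1 correspondence
    for j in j_words:
        for idx, e in enumerate(e_words):
            if idx not in matched and correspond(j, e):
                matched.add(idx)             # each English word used at most once
                co += 1
                break
    return (co + 1.0) / (len(j_words) + len(e_words) - 2.0 * co + 2.0)
```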
<Paragraph position="4"> Ji and Ei are obtained as follows. We use ChaSen to morphologically analyze the Japanese sentences and extract content words, which constitute Ji. We use Brill's tagger (Brill, 1992) to POS-tag the English sentences, extract content words, and use WordNet's library to obtain lemmas of the words, which constitute Ei. We use simple heuristics to obtain J_i \leftrightarrow E_i, i.e., a one-to-one correspondence between the words in Ji and Ei, by looking up Japanese-English and English-Japanese dictionaries made by combining entries in the EDR Japanese-English bilingual dictionary and the EDR English-Japanese bilingual dictionary. Each of the constructed dictionaries has over 300,000 entries.</Paragraph> <Paragraph position="5"> We evaluated the implemented program against a corpus consisting of manually aligned Japanese and English sentences. The source texts were Japanese white papers (JEIDA, 2000). The style of translation was generally literal, reflecting the nature of government documents. We used 12 pairs of texts for evaluation. The average number of Japanese sentences per text was 413 and that of English sentences was 495.</Paragraph> <Paragraph position="6"> The recall, R, and precision, P, of the program against this corpus were R = 0.982 and P = 0.986, respectively, where R = \frac{\text{number of correctly aligned pairs}}{\text{number of pairs in the reference corpus}} and P = \frac{\text{number of correctly aligned pairs}}{\text{number of pairs proposed by the program}}.</Paragraph> <Paragraph position="7"> The number of pairs in a one-to-n alignment is n. For example, if sentences {J1} and {E1, E2, E3} are aligned, then three pairs ⟨J1, E1⟩, ⟨J1, E2⟩, and ⟨J1, E3⟩ are obtained.</Paragraph> <Paragraph position="8"> This recall and precision are quite good considering the relatively large differences in language structure between Japanese and English.</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Reliable Measures </SectionTitle> <Paragraph position="0"> We use BM25 and SIM to evaluate the similarity of articles and sentences, respectively. These measures, however, cannot be used to reliably discriminate between correct and incorrect alignments, as will be discussed in Section 5. This motivated us to devise more reliable measures based on the basic similarities.</Paragraph> <Paragraph position="1"> BM25 measures the similarity between two bags of words. It is not sensitive to differences in the order of sentences between two articles. To remedy this, we define a measure that uses the similarities of the sentence alignments within an article alignment. We define AVSIM(J, E) as the similarity between a Japanese article, J, and an English article, E:</Paragraph> <Paragraph position="2"> AVSIM(J, E) = \frac{1}{m} \sum_{k=1}^{m} SIM(J_k, E_k)</Paragraph> <Paragraph position="3"> where (J_1, E_1), (J_2, E_2), ..., (J_m, E_m) are the sentence alignments obtained by the method described in Section 3.2. The sentence alignments in a correctly aligned article alignment should be more similar than those in an incorrectly aligned one. Consequently, article alignments with high AVSIM are likely to be correct.</Paragraph> <Paragraph position="4"> Our sentence alignment program aligns sentences accurately if the English sentences are literal translations of the Japanese, as discussed in Section 3.2. However, English news sentences are generally not literal translations of Japanese news sentences. Thus, the resulting sentence alignments include many incorrect ones. To discriminate between correct and incorrect alignments, we take advantage of the similarity of the article alignment containing a sentence alignment, so that sentence alignments in highly similar article alignments receive high values. We define</Paragraph> <Paragraph position="5"> SntScore(J_i, E_i) = AVSIM(J, E) \times SIM(J_i, E_i)</Paragraph> <Paragraph position="6"> SntScore(J_i, E_i) is the similarity of the i-th alignment, (J_i, E_i), in the article alignment of J and E.</Paragraph>
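A minimal sketch of the two reliability measures just defined, assuming a `sim(j, e)` function such as the earlier sketch with its correspondence predicate bound. Taking SntScore as the product AVSIM × SIM is our reading of the reconstructed definition above.

```python
def avsim(alignments, sim):
    """`alignments` is the list [(J_1, E_1), ..., (J_m, E_m)] of sentence
    alignments that DP matching produced for one article pair."""
    scores = [sim(j, e) for j, e in alignments]
    return sum(scores) / len(scores)        # average SIM over the article pair

def snt_score(j_i, e_i, alignments, sim):
    # Sentence alignments inherit the reliability of their article
    # alignment: a high-AVSIM article pair boosts all of its sentences.
    return avsim(alignments, sim) * sim(j_i, e_i)
```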
<Paragraph position="7"> When we compare the validity of two sentence alignments in the same article alignment, the rank order of sentence alignments obtained by applying SntScore is the same as that of SIM because they share a common AVSIM. However, when we compare the validity of two sentence alignments in different article alignments, SntScore prefers the sentence alignment with the more similar (higher AVSIM) article alignment even if the two alignments have the same SIM, while SIM alone cannot discriminate between two sentence alignments with the same value.</Paragraph> <Paragraph position="8"> Therefore, SntScore is more appropriate than SIM when we want to compare sentence alignments in different article alignments, because, in general, a sentence alignment in a reliable article alignment is more reliable than one in an unreliable article alignment.</Paragraph> <Paragraph position="9"> The next section compares the effectiveness of AVSIM with that of BM25, and that of SntScore with that of SIM.</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Evaluation of Alignment </SectionTitle> <Paragraph position="0"> Here, we discuss the results of evaluating article and sentence alignments.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.1 Evaluation of article alignment </SectionTitle> <Paragraph position="0"> We first estimate the precision of article alignments by using randomly sampled alignments. Next, we sort them in descending order of BM25 and AVSIM to see whether these measures can be used to rank correct alignments highly. Finally, we show that the absolute values of AVSIM correspond well with human judgment.</Paragraph> <Paragraph position="1"> Randomly sampled article alignments Each English article was aligned with the Japanese article that had the highest BM25 score. We sampled 100 article alignments from each of 1996-2001 and 1989-1996. We then classified the samples into four categories: &quot;A&quot;, &quot;B&quot;, &quot;C&quot;, and &quot;D&quot;. &quot;A&quot; means that there was more than 50-60% overlap in the content of the articles. &quot;B&quot; means more than 20-30% and less than 50-60% overlap. &quot;D&quot; means that there was no overlap at all. &quot;C&quot; means that the alignment was not included in &quot;A&quot;, &quot;B&quot;, or &quot;D&quot;. We regard alignments judged to be A or B as suitable for NLP because of their relatively large overlap.</Paragraph> <Paragraph position="2"> The results of the evaluation are in Table 1. Here, &quot;ratio&quot; means the ratio of the number of articles judged to correspond to the respective category against the total number of articles. For example, 0.59 in line &quot;A&quot; of 1996-2001 means that 59 out of 100 samples were evaluated as A. &quot;Lower&quot; and &quot;upper&quot; mean the lower and upper bounds of the 95% confidence interval for the ratio.</Paragraph> <Paragraph position="3"> The table shows that the precision (= the sum of the ratios of A and B) for 1996-2001 was higher than that for 1989-1996: 0.71 for 1996-2001 and 0.44 for 1989-1996. This is because the English articles from 1996-2001 were translations of Japanese articles, while those from 1989-1996 were not necessarily translations, as explained in Section 2. Although the precision for 1996-2001 was higher than that for 1989-1996, it is still too low for these alignments to be used as NLP resources. In other words, the article alignments included many incorrect alignments.</Paragraph> <Paragraph position="4"> We want to extract alignments that will be evaluated as A or B from these noisy alignments. To do this, we have to sort all alignments according to some measure of their validity and extract the highly ranked ones. For this purpose, AVSIM is more reliable than BM25, as explained below.</Paragraph> <Paragraph position="5"> Footnote 8: The evaluations were done by the authors. We double-checked the sample articles from 1996-2001; our second checks are presented in Table 1. The ratios of the categories in the first check were A=0.62, B=0.09, C=0.09, and D=0.20. Comparing these figures with those in Table 1, we concluded that the first and second evaluations were consistent.</Paragraph> <Paragraph position="6"> Sorted alignments: AVSIM vs. BM25 We sorted the same alignments as in Table 1 in decreasing order of AVSIM and of BM25. Alignments judged to be A or B were regarded as correct. The number, N, of correct alignments and the precision, P, up to each rank are shown in Table 2.</Paragraph> <Paragraph position="7"> From the table, we can conclude that AVSIM ranks correct alignments higher than BM25. Its greater accuracy indicates that it is important to take the similarities of sentence alignments into account when estimating the validity of article alignments.</Paragraph>
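The construction behind Table 2 can be sketched as follows, assuming each alignment carries a score (AVSIM or BM25) and a binary human judgment; the names are illustrative only.

```python
def precision_at_ranks(scored, cutoffs):
    """`scored` is a list of (score, is_correct) pairs, one per article
    alignment, where is_correct is True for judgments A or B."""
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    rows = []
    for c in cutoffs:
        n_correct = sum(1 for _, ok in ranked[:c] if ok)
        rows.append((c, n_correct, n_correct / c))
    return rows    # rows of (rank cutoff, N, P), as in Table 2
```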
<Paragraph position="8"> AVSIM and human judgment Table 2 shows that AVSIM is reliable for ranking correct and incorrect alignments. This section shows that not only the rank order but also the absolute values of AVSIM are reliable for discriminating between correct and incorrect alignments; that is, they correspond well with human evaluations. This means that a threshold value can be set for each of 1996-2001 and 1989-1996 so that valid alignments can be extracted by selecting alignments whose AVSIM is larger than the threshold.</Paragraph> <Paragraph position="9"> We used the same data as in Table 1 to calculate statistics on AVSIM. They are shown in Tables 3 and 4 for 1996-2001 and 1989-1996, respectively.</Paragraph> <Paragraph position="10"> [Table 3: statistics on AVSIM for 1996-2001; columns: type, N, lower, av., upper, th., sig.]</Paragraph> <Paragraph position="11"> In these tables, &quot;N&quot; means the number of alignments given the corresponding human judgment.</Paragraph> <Paragraph position="12"> [Table 4: statistics on AVSIM for 1989-1996; columns: type, N, lower, av., upper, th., sig.]</Paragraph> <Paragraph position="13"> &quot;Av.&quot; means the average value of AVSIM. &quot;Lower&quot; and &quot;upper&quot; mean the lower and upper bounds of the 95% confidence interval for the average. &quot;Th.&quot; means the threshold for AVSIM that can be used to discriminate between the alignments of adjacent categories. For example, in Table 3, evaluations A and B are separated by 0.168. These thresholds were identified through linear discriminant analysis. The asterisks &quot;**&quot; and &quot;*&quot; in the &quot;sig.&quot; column mean that the difference in AVSIM averages is statistically significant at the 1% and 5% levels, respectively, based on a one-sided Welch test.</Paragraph> <Paragraph position="14"> In these tables, except for the difference between the averages of B and C in Table 4, all differences in averages are statistically significant. This indicates that AVSIM can discriminate between differences in judgment; in other words, the AVSIM values correspond well with human judgment. We then investigated why B and C in Table 4 were not separated by inspecting the article alignments, and found that the alignments evaluated as C in Table 4 had relatively large overlaps compared with the alignments judged as C in Table 3. It was more difficult to distinguish B from C in Table 4 than in Table 3.</Paragraph>
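For reference, here is a sketch of the one-sided Welch test used above to compare category averages. This is the standard textbook formulation under the assumption of unequal variances, not the authors' code; the inputs are the per-alignment AVSIM values of two categories.

```python
import math
from scipy.stats import t as t_dist

def welch_one_sided(xs, ys):
    """p-value for the alternative hypothesis mean(xs) > mean(ys)."""
    nx, ny = len(xs), len(ys)
    mx, my = sum(xs) / nx, sum(ys) / ny
    vx = sum((x - mx) ** 2 for x in xs) / (nx - 1)   # sample variances
    vy = sum((y - my) ** 2 for y in ys) / (ny - 1)
    se2 = vx / nx + vy / ny
    t_stat = (mx - my) / math.sqrt(se2)
    # Welch-Satterthwaite degrees of freedom
    df = se2 ** 2 / ((vx / nx) ** 2 / (nx - 1) + (vy / ny) ** 2 / (ny - 1))
    return 1.0 - t_dist.cdf(t_stat, df)
```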
<Paragraph position="15"> We next classified all article alignments in 1996-2001 and 1989-1996 based on the thresholds in Tables 3 and 4. The numbers of alignments are in Table 5. It shows that the number of alignments estimated to be A or B was 46,738 (= 31,495 + 15,243). We regard about 47,000 article alignments as sufficiently many to be useful as a resource for NLP tasks such as bilingual lexicon acquisition, and for language education.</Paragraph> <Paragraph position="16"> In summary, AVSIM is more reliable than BM25 and corresponds well with human judgment. By using thresholds, we can extract about 47,000 article alignments that are estimated to be A or B evaluations.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.2 Evaluation of sentence alignment </SectionTitle> <Paragraph position="0"> Sentence alignments within article alignments have many errors, even when they have been obtained from correct article alignments, because of free translation, as discussed in Section 2. To extract only correct alignments, we sorted all sentence alignments in all article alignments in decreasing order of SntScore and selected only the higher-ranked sentence alignments, so that the selected alignments would be sufficiently precise to be useful as NLP resources.</Paragraph> <Paragraph position="1"> The total number of sentence alignments was about 1,300,000. The most important category for sentence alignment is one-to-one. Thus, we want to discard as many errors in this category as possible. In the first step, we classified all one-to-one alignments into two classes: the first consisted of alignments whose Japanese and English sentences ended with periods, question marks, exclamation marks, or other readily identifiable characteristics. We call this class &quot;one-to-one&quot;. The second class consisted of the one-to-one alignments not belonging to the first class. The alignments in this class, together with all one-to-n alignments, are called &quot;one-to-many&quot;. One-to-one had about 640,000 alignments and one-to-many had about 660,000 alignments.</Paragraph> <Paragraph position="2"> We first evaluated the precision of one-to-one alignments by sorting them in decreasing order of SntScore. We randomly extracted 100 samples from each of 10 blocks covering the top 300,000 alignments (each block had 30,000 alignments). We classified these 1,000 samples into two classes: the first was &quot;match&quot; (A), the second was &quot;not match&quot; (D). We judged a sample to be &quot;A&quot; if the Japanese and English sentences of the sample shared a common event (approximately a clause); &quot;D&quot; consisted of the samples not belonging to &quot;A&quot;. The results of the evaluation are in Table 6.</Paragraph> <Paragraph position="3"> Footnote 9: Evaluations were done by the authors. We double-checked all samples. In each set of 100 samples, there were at most two or three for which the first and second evaluations differed. Overall, the evaluations in the first and second checks were consistent.</Paragraph> <Paragraph position="4"> The table shows that the number of A's decreases rapidly as the rank increases. This means that SntScore ranks appropriate one-to-one alignments highly. The table indicates that the top 150,000 one-to-one alignments are sufficiently reliable; the ratio of A's in these alignments was 0.982.</Paragraph> <Paragraph position="5"> We then evaluated the precision of one-to-many alignments by sorting them in decreasing order of SntScore. We classified the one-to-many alignments into three categories, &quot;1-90000&quot;, &quot;90001-180000&quot;, and &quot;180001-270000&quot;, each of which was covered by the corresponding range of SntScore of the one-to-one alignments presented in Table 6. We randomly sampled 100 one-to-many alignments from each of these categories and judged them to be A or D (see Table 7). Table 7 indicates that the 38,090 alignments in the range &quot;1-90000&quot; are sufficiently reliable.</Paragraph> <Paragraph position="6"> Tables 6 and 7 show that we can extract valid alignments by sorting alignments according to SntScore and selecting only the higher-ranked sentence alignments.</Paragraph> <Paragraph position="7"> Footnote 10: The notion of an &quot;appropriate (correct) sentence alignment&quot; depends on the application. Machine translation, for example, may require more precise (literal) alignments. To get alignments that are more literal than merely sharing a common event, we can select a set of alignments from the top of the sorted alignments that satisfies the required literalness. This is because, in general, higher-ranked alignments are more literal translations: such alignments tend to have many one-to-one corresponding words and to be contained in highly similar article alignments.</Paragraph> <Paragraph position="8"> Comparison with SIM We compared SntScore with SIM and found that SntScore is more reliable than SIM in discriminating between correct and incorrect alignments.</Paragraph> <Paragraph position="9"> We first sorted the one-to-one alignments in decreasing order of SIM and randomly sampled 100 alignments from the top 150,000. We classified the samples into A or D. The number of A's was 93 and that of D's was 7, for a precision of 0.93. In contrast, in Table 6, the number of A's was 491 and that of D's was 9 for the 500 samples extracted from the top 150,000 alignments, for a precision of 0.982. Thus, the precision of SntScore was higher than that of SIM, and this difference is statistically significant at the 1% level based on a one-sided proportion test.</Paragraph>
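The significance claim can be checked with a standard one-sided two-proportion z-test. The paper does not specify the exact test it used, so the pooled z-test below is an assumption.

```python
import math
from scipy.stats import norm

def one_sided_proportion_test(k_a, n_a, k_b, n_b):
    """p-value for the alternative hypothesis k_a/n_a > k_b/n_b,
    using the pooled two-proportion z-test."""
    p_a, p_b = k_a / n_a, k_b / n_b
    pooled = (k_a + k_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    return 1.0 - norm.cdf(z)

# SntScore's 491/500 vs. SIM's 93/100 correct one-to-one samples:
print(one_sided_proportion_test(491, 500, 93, 100))   # ~0.002, below 0.01
```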
<Paragraph position="10"> We then sorted the one-to-many alignments by SIM, sampled 100 alignments from the top 38,090, and judged them. There were 89 A's and 11 D's, for a precision of 0.89. In contrast, in Table 7, there were 98 A's and 2 D's for the samples from the top 38,090 alignments, for a precision of 0.98. This difference is also significant at the 1% level based on a one-sided proportion test.</Paragraph> <Paragraph position="11"> Thus, SntScore is more reliable than SIM. This high precision of SntScore indicates that it is important to take the similarities of article alignments into account when estimating the validity of sentence alignments.</Paragraph> </Section> </Section> <Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 6 Related Work </SectionTitle> <Paragraph position="0"> Much work has been done on article alignment. Collier et al. (1998) compared the use of machine translation (MT) with the use of bilingual dictionary term lookup (DTL) for news article alignment in Japanese and English. They showed that DTL is superior to MT at high-recall levels. That is, if we want to obtain many article alignments, then DTL is more appropriate than MT. In a preliminary experiment, we also compared MT and DTL on the data in Table 1 and found that DTL was superior to MT. These experimental results indicate that DTL is more appropriate than MT for article alignment.</Paragraph> <Paragraph position="1"> Footnote 11: We translated the English articles into Japanese with an MT system. We then used the translated English articles as queries and searched the database consisting of Japanese articles. The direction of translation was opposite to the one described in Section 3.1; therefore, this comparison is not as objective as it could be. However, it gives some insight into the comparison of MT and DTL.</Paragraph> <Paragraph position="2"> Matsumoto and Tanaka (2002) attempted to align Japanese and English news articles in the Nikkei Industrial Daily. Their method achieved 97% precision in aligning articles, which is quite high. They also applied their method to NHK broadcast news. However, they obtained a lower precision of 69.8% on the NHK corpus. Thus, the precision of their method depends on the corpus, and it is not clear whether their method would achieve high accuracy on the Yomiuri corpus treated in this paper.</Paragraph> <Paragraph position="3"> There are two significant differences between our work and previous work.</Paragraph> <Paragraph position="4"> (1) We have proposed AVSIM, which uses the similarities of sentences aligned by DP matching, as a reliable measure for article alignment. Previous work, on the other hand, has used measures based on bags of words.</Paragraph> <Paragraph position="5"> (2) A more important difference is that we have actually obtained not only article alignments but also sentence alignments on a large scale. In addition, we are distributing the alignment data for research and educational purposes. This is the first such attempt for a Japanese-English bilingual corpus.</Paragraph> </Section> </Paper>