<?xml version="1.0" standalone="yes"?>
<Paper uid="P03-1010">
  <Title>Reliable Measures for Aligning Japanese-English News Articles and Sentences</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
2 Newspapers Aligned
</SectionTitle>
    <Paragraph position="0"> The Japanese and English newspapers used as source data were the Yomiuri Shimbun and the Daily Yomiuri. They cover the period from September 1989 to December 2001. The number of Japanese articles per year ranges from 100,000 to 350,000, while the number of English articles per year ranges from 4,000 to 13,000.</Paragraph>
    <Paragraph position="1"> The total number of Japanese articles is about 2,000,000 and the total number of English articles is about 110,000. The English articles thus amount to less than 6 percent of the Japanese articles. Therefore, we decided to search for the Japanese article corresponding to each English article. The English articles from mid-July 1996 onward have tags indicating whether they were translated from Japanese articles, though they do not have explicit links to the original Japanese articles. Consequently, we used only the translated English articles from this period for the article alignment. The number of English articles used was 35,318, which is 68 percent of all of the English articles from this period. The English articles before mid-July 1996, on the other hand, do not have such tags, so we used all of the articles from that period; there were 59,086 of them. We call the set of articles before mid-July 1996 "1989-1996" and the set of articles after mid-July 1996 "1996-2001." If an English article is a translation of a Japanese article, then the publication date of the Japanese article will be near that of the English article. So we searched for the original Japanese article within 2 days before and after the publication date of each English article, i.e., the counterpart of each English article was sought among the Japanese articles in 5 days' issues. For 1989-1996, the average number of English articles per day was 24 and the average number of Japanese articles per 5 days was 1,532. For 1996-2001, the corresponding averages were 18 English articles per day and 2,885 Japanese articles per 5 days. Because each English article has so many candidates for alignment, we need a reliable measure of the validity of article alignments in order to pick out the appropriate Japanese articles from these ambiguous matches.</Paragraph>
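The date-window search described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the `japanese_by_date` dictionary layout, and the toy article ids are all assumptions introduced here. It only shows that a window of 2 days before and after a publication date covers 5 days of issues.

```python
from datetime import date, timedelta


def candidate_dates(english_date, window_days=2):
    """Return the publication dates searched for one English article.

    With window_days=2 this yields 5 dates: the 2 days before, the same
    day, and the 2 days after the English article's publication date.
    """
    return [english_date + timedelta(days=offset)
            for offset in range(-window_days, window_days + 1)]


def candidate_articles(english_date, japanese_by_date, window_days=2):
    """Collect the Japanese article ids published inside the window.

    japanese_by_date is a hypothetical dict mapping a date to a list of
    article ids; the real corpus layout is not described in the text.
    """
    candidates = []
    for day in candidate_dates(english_date, window_days):
        candidates.extend(japanese_by_date.get(day, []))
    return candidates


# Toy data, invented for illustration only.
japanese_by_date = {
    date(1996, 7, 27): ["J-a"],
    date(1996, 7, 29): ["J-b", "J-c"],
    date(1996, 8, 1): ["J-d"],  # 3 days after, so outside the window
}

print(candidate_articles(date(1996, 7, 29), japanese_by_date))
```

Every id returned this way is only a candidate. Choosing among them still requires the kind of alignment measure the paragraph above calls for.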
    <Paragraph position="2"> Correct article alignment does not guarantee a one-to-one correspondence between the English and Japanese sentences of an aligned article pair, because literal translations are exceptional. Original Japanese articles may be restructured to conform to the style of English newspapers, additional descriptions may be added to fill cultural gaps, and detailed descriptions may be omitted. A typical example of a restructured English and Japanese article pair is: Part of an English article: &lt;e1&gt; Two bullet holes were found at the home of Kengo Tanaka, 65, president of Bungei Shunju, in Akabane, Tokyo, by his wife Kimiko, 64, at around 9 a.m. Monday. &lt;/e1&gt; &lt;e2&gt; Police suspect right-wing activists, who have mounted criticism against articles about the Imperial family appearing in the Shukan Bunshun, the publisher's weekly magazine, were responsible for the shooting. &lt;/e2&gt; &lt;e3&gt; Police received an anonymous phone call shortly after 1 a.m. Monday by a caller who reported hearing gunfire near Tanaka's residence. &lt;/e3&gt; &lt;e4&gt; Police found nothing after investigating the report, but later found a bullet in the Tanakas' bedroom, where they were sleeping at the time of the shooting. &lt;/e4&gt; Part of a literal translation of a Japanese article: &lt;j1&gt; At about 8:55 a.m. on the 29th, Kimiko Tanaka, 64, the wife of Bungei Shunju's president Kengo Tanaka, 65, found bullet holes on the eastern wall of their two-story house at 4 Akabane Nishi, Kitaku, Tokyo. &lt;/j1&gt; &lt;j2&gt; As a result of an investigation, the officers of the Akabane police station found two holes on the exterior wall of the bedroom and a bullet in the bedroom. &lt;/j2&gt; &lt;j3&gt; After receiving an anonymous phone call shortly after 1 a.m. saying that two or three gunshots were heard near Tanaka's residence, police officers hurried to the scene for investigation, but no bullet holes were found. &lt;/j3&gt; &lt;j4&gt; When gunshots were heard, Mr. and Mrs. Tanaka were sleeping in the bedroom. &lt;/j4&gt; &lt;j5&gt; Since Shukan Bunshun, a weekly magazine published by Bungei Shunju, recently ran an article criticizing the Imperial family, Akabane police suspect right-wing activists who have mounted criticism against the recent article to be responsible for the shooting and have been investigating the incident. &lt;/j5&gt; Here there is a three-to-four correspondence between {e1, e3, e4} and {j1, j2, j3, j4}, together with a one-to-one correspondence between e2 and j5.</Paragraph>
    <Paragraph position="3"> Such sentence matches are of particular interest to researchers studying human translations and/or stylistic differences between English and Japanese newspapers. However, their usefulness as resources for NLP applications such as machine translation is limited for the time being. It is therefore important to extract sentence alignments that are as literal as possible.</Paragraph>
    <Paragraph position="4"> To achieve this, a reliable measure of the validity of sentence alignments is necessary.</Paragraph>
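One way to represent the correspondences in the restructured article pair above is as pairs of sentence-id groups. The sketch below is illustrative only: the tuple layout and the function names are assumptions, and the one-to-one filter is a crude stand-in for literalness, not the measure this paper goes on to develop.

```python
# Each alignment pairs a group of English sentence ids with a group of
# Japanese sentence ids, following the example article pair.
alignments = [
    (("e1", "e3", "e4"), ("j1", "j2", "j3", "j4")),
    (("e2",), ("j5",)),
]


def alignment_type(english_ids, japanese_ids):
    """Return the alignment type as an (n_english, n_japanese) tuple."""
    return (len(english_ids), len(japanese_ids))


def one_to_one(pairs):
    """Keep only alignments that pair one English with one Japanese sentence."""
    return [pair for pair in pairs if alignment_type(*pair) == (1, 1)]


print([alignment_type(*pair) for pair in alignments])
print(one_to_one(alignments))
```

A one-to-one pair is not necessarily a literal translation, which is why a separate measure of alignment validity is still needed.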
  </Section>
</Paper>