<?xml version="1.0" standalone="yes"?> <Paper uid="W03-1503"> <Title>Construction and Analysis of Japanese-English Broadcast News Corpus with Named Entity Tags</Title> <Section position="3" start_page="3" end_page="8" type="metho"> <SectionTitle> 2 Constructing a Japanese-English broadcast news corpus with NE tags </SectionTitle> <Section position="1" start_page="3" end_page="5" type="sub_section"> <SectionTitle> 2.1 Characteristics of the NHK Japanese-English broadcast news corpus </SectionTitle> <Paragraph position="0"> We are annotating an NHK broadcast news corpus with NE tags. The corpus is composed of Japanese news articles for domestic programs and English news articles translated for international broadcasting and domestic bilingual programs.</Paragraph> <Paragraph position="1"> Figure 1 shows an example of a Japanese news article and its English translation. The original Japanese article and the translated English article deal with the same topic, but they differ considerably in their details. The differences arise for the following reasons (Kumano et al., 2002).</Paragraph> <Paragraph position="2"> Audience: Content might be added or deleted according to the audience, especially for international broadcasting.</Paragraph> <Paragraph position="3"> Broadcasting date: The broadcasting of English news is often delayed compared to the original Japanese news. Time expressions are sometimes changed, or new facts are added to the articles.</Paragraph> <Paragraph position="4"> News styles/languages: Comparing news articles in the two languages reveals that they have different presentation styles; for example, facts are sometimes introduced in a different order. The difference is due to language and socio-cultural backgrounds. [Figure 1. Original article in Japanese, shown here by its English glosses:] (There was a strong earthquake at 6:42 this morning in Izu Islands, the site of recent numerous earthquakes. An earthquake of a little less than five in seismic intensity was</Paragraph> <Paragraph position="6"> (In addition, an event of seismic intensity four was observed for Niijima and Kozu Island, events of seismic intensity three for Toshima Island and Miyake Island, and events of seismic intensity two and one for various parts of Kanto Area and Shizuoka Prefecture.)</Paragraph> <Paragraph position="8"> (According to observations by the Meteorological Agency, the earthquake epicenter was located in the sea at a depth of ten kilometers near Niijima and Kozu Island. The magnitude of the earthquake was estimated to be five point one.) (In Izu Islands, where seismic activity has been observed from the end of June, repeated cycles of seismic activity and dormancy have been observed. On the 30th of the previous month, a single strong earthquake having seismic intensity of a little less than six was observed at Miyake Island, while two earthquakes having seismic intensity of five</Paragraph> <Paragraph position="10"> (In a series of seismic events, seventeen earthquakes having seismic intensity over five have been observed up to this point, including strong tremors with a seismic intensity of a little less than six observed four times at Kozu Island, Niijima, and Miyake Island.) Translated article in English: 1: A strong earthquake jolted Shikine Island, one of the Izu islands south of Tokyo, early on Thursday morning.</Paragraph> <Paragraph position="11"> 2: The Meteorological Agency says the quake measured five-minus on the Japanese scale of seven.</Paragraph> <Paragraph position="12"> 3: The quake affected other islands nearby.</Paragraph> <Paragraph position="13"> 4: Seismic activity began in the area in late July, and 17 quakes of similar or stronger intensity have occurred.</Paragraph> <Paragraph position="14"> 5: Officials are warning of more similar or stronger earthquakes around Niijima and Kozu Islands.</Paragraph> <Paragraph position="15"> 6: Tokyo police say there have been no reports of damage from the latest quake.</Paragraph> </Section> <Section position="2" start_page="5" end_page="5" type="sub_section"> <SectionTitle> 2.2 NE tag design </SectionTitle> <Paragraph position="0"> We designed the NE tags with two aims: supporting research on NE translation-pair extraction and keeping manual annotation efficient. The specifications are listed below.</Paragraph> <Paragraph position="1"> * It is desirable that NE recognition guidelines be consistent with the NE tags of existing corpora.</Paragraph> <Paragraph position="2"> Past guidelines of MUC and IREX should be respected because they are the product of extensive discussion. Consistent guidelines enable us to utilize existing annotated corpora and the systems designed for those corpora.</Paragraph> <Paragraph position="3"> * Within each bilingual document pair, coreference between NEs is specified both within each language and across languages. When several NEs exist for the same referent in a document, it is not always possible to determine the actual translation for each instance of the NEs from the counterpart document, because our corpus is not composed of literal translations. Therefore, coreference between NEs within a language should be marked so that coreference across languages can be assigned between NE groups that have the same referent. Coreference between NE groups is sufficient for our purpose.</Paragraph> <Paragraph position="4"> * Coreference within a language is assigned between NEs only. Although NEs may share a referent with pronouns or non-NE expressions, these elements are ignored to avoid complicating the annotation work.</Paragraph> </Section> <Section position="3" start_page="5" end_page="8" type="sub_section"> <SectionTitle> 2.3 Tag specifications </SectionTitle> <Paragraph position="0"> 1. 
The tag specifications conform to the IREX NE tag specifications (IREX Committee, 1999; an English description is given in Sekine and Isahara, 1999) with regard to the markup form, NE classes, and NE recognition guidelines.</Paragraph> <Paragraph position="1"> Eight NE classes were defined for the IREX NE task: the same seven classes as MUC-7 (three types of named entities in the narrow sense, two types of temporal expressions, and two types of number expressions), plus ARTIFACT (concrete objects such as commercial products and abstract objects such as laws or intellectual properties). Table 1 lists these classes.</Paragraph> <Paragraph position="2"> 2. IREX's NE classes and NE recognition guidelines are also applied to English, for consistency between Japanese and English NEs. For English-specific annotation issues, such as prepositions or determiners in NEs, the MUC-7 Named Entity Task Definition (Chinchor, 1997) is consulted. (The tag specifications of IREX NE and those of MUC-7 do not differ radically, because the IREX NE tags were designed on the basis of the MUC discussions.)</Paragraph> <Paragraph position="3"> 3. The SGML markup form of the IREX tag is extended by adding the following two tag attributes, which represent coreference information within a language and across languages. ID=&quot;NE group ID&quot; (mandatory): Each NE is assigned an attribute ID with an ID number as its value. All coreferent NEs in each language document are given the same ID number.</Paragraph> <Paragraph position="4"> The same ID number is assigned to NEs that have different forms, such as the full name and the first name or the official name and the abbreviated form, in addition to NEs with the same form. 
Basically, NEs are assigned the same ID number when they belong to the same NE class and have an identical surface form (there are some exceptions; see Section 2.5.3). ID numbers are not unique across documents.</Paragraph> <Paragraph position="5"> COR=&quot;ID for corresponding NE groups in the other language&quot; (optional): When a corresponding NE (group) belonging to the same NE class exists in the other language, an attribute COR is given to each NE (group) in both languages, with the ID number of the counterpart as its value. Annotations following these specifications are illustrated in Figure 2.</Paragraph> </Section> <Section position="4" start_page="8" end_page="8" type="sub_section"> <SectionTitle> 2.4 Current status of the corpus annotation </SectionTitle> <Paragraph position="0"> Annotators who have experience in translation work and in the production of linguistic data are carrying out the tag annotation. A total of 2,000 article pairs are planned for annotation, and about 1,100 pairs have been completed so far.</Paragraph> </Section> <Section position="5" start_page="8" end_page="8" type="sub_section"> <SectionTitle> 2.5 Problems </SectionTitle> <Paragraph position="0"> Some problems became obvious in the course of discussing the tag specifications and carrying out the annotation work. They confuse annotators and make the results inaccurate. Typical cases are shown below.</Paragraph> </Section> <Section position="6" start_page="8" end_page="8" type="sub_section"> <SectionTitle> 2.5.1 The granularity difference between Japanese and English </SectionTitle> <Paragraph position="0"> In Japanese, a unit smaller than a morpheme may be accepted as an NE according to the IREX guidelines.</Paragraph> <Paragraph position="1"> English, on the other hand, does not accept any unit smaller than a word as an NE under the MUC-7 guidelines. 
Some Japanese NEs cannot have a counterpart English NE, even if they have a corresponding English expression, because of the difference in segmentation granularity. For example, &quot;amerika (America)&quot; within the Japanese morpheme &quot;amerika-jin (America-people)&quot; is treated as an NE, while no NE can be tagged in &quot;American,&quot; the English counterpart of &quot;amerika-jin.&quot; NEs have the same problem that translation in general has: What exactly is the translation of an expression? * Semantically corresponding expressions may not be assigned corresponding NE relations, because they belong to different NE classes or because the expression in one language is not recognized as an NE. For example, the non-NE word &quot;seifu (government),&quot; which means the Japanese government in Japanese articles, is often translated as the English NE &quot;Japan.&quot; * A non-literal translation of an NE may make it difficult to recognize corresponding relations. Correspondences for some expressions cannot be decided from the information represented in the documents: relative temporal expressions in Japanese are often translated as absolute expressions in English, and those correspondences cannot be identified without consulting a calendar; money expressions are generally converted to dollars, and the exchange rate at the relevant time is needed to confirm correspondences. For example, we found in our corpus the translation pair of money expressions &quot;sanzen-oku-en (three hundred billion yen)&quot; and &quot;three billion U-S dollars,&quot; which constitutes a rough conversion from yen into dollars at the time the articles were produced.</Paragraph> <Paragraph position="2"> We defined NEs that have an identical surface form and the same NE class to be coreferent and assigned them the same NE group ID, in order to make coreference judgment easier. 
There are some cases where we cannot apply this rule, especially for temporal expressions and number expressions.</Paragraph> <Paragraph position="3"> The example in Figure 3 shows a Japanese temporal expression glossed as &quot;last Sunday and this Sunday&quot; and its English translation &quot;last Sunday and this Sunday,&quot; annotated with NE tags. The Japanese temporal expressions for &quot;last Sunday&quot; and &quot;this Sunday&quot; are translated into English as &quot;last Sunday&quot; and &quot;this Sunday,&quot; respectively. When annotating NE tags for this translation pair, only the part meaning &quot;Sunday&quot; in each Japanese temporal expression is regarded as an NE according to the IREX NE specifications. As a result, two NEs that have the same surface form and are assigned the same NE class have different referents. Each should correspond to a different NE in the counterpart: the former to &quot;last Sunday&quot; and the latter to &quot;this Sunday.&quot; Tentatively, we allow different NE group IDs to be assigned to NEs with an identical surface form in the same NE class, as shown in Figure 3. It would be better to reexamine the consistency of the NE tag specification between Japanese and English, and the necessity of coreference information for temporal expressions and number expressions.</Paragraph> </Section> </Section> <Section position="4" start_page="8" end_page="8" type="metho"> <SectionTitle> 3 Analysis </SectionTitle> <Paragraph position="0"> We conducted an elementary investigation of 1,096 pairs of annotated Japanese and English articles.</Paragraph> <Section position="1" start_page="8" end_page="8" type="sub_section"> <SectionTitle> 3.1 Corpus size </SectionTitle> <Paragraph position="0"> Table 2 shows the content size of our corpus in terms of the number of sentences and the number of morphemes/words.</Paragraph> <Paragraph position="1"> The content decreases significantly when translating from Japanese to English. 
This indicates that content tends to be lost in the translation process.</Paragraph> </Section> <Section position="2" start_page="8" end_page="8" type="sub_section"> <SectionTitle> 3.2 In-language characteristics of NE occurrences 3.2.1 Frequency </SectionTitle> <Paragraph position="0"> The number of occurrences for each NE class is listed in Table 3. The distribution of NE classes is almost the same as that in the data for MUC-7 or IREX.</Paragraph> <Paragraph position="1"> In line with the decrease in content (cf. Table 2), the number of NE tokens also decreases in translation. However, the degree of the NE decrease is less than that of the morphemes/words. It is also notable that the number of NE types is fairly well preserved. Notice that only a small number of tokens in the NE class TIME appear in English. The reason may be that detailed time information is less important for English articles, which are intended for audiences outside Japan and are broadcast later than the original Japanese articles.</Paragraph> <Paragraph position="2"> 3.2.2 NE characteristics within NE groups To examine the surface form distribution within NE groups, we counted the number of members (freq) and the number of distinct surface forms (sort) for each NE group in each article. The probability that a given member has a unique surface form, in groups that have two or more members (uniq), was also calculated [equation not preserved in the source]. Table 4 shows the values averaged over all the NE groups that appeared in all articles.</Paragraph> <Paragraph position="3"> In English, repetition of the same expression is conventionally undesirable. Therefore, pronouns or paraphrases are used frequently. Japanese, on the other hand, has no such convention. This difference is considered to be the reason for the result shown in Table 4: freq in English is smaller than that in Japanese, and sort in English is larger than that in Japanese. 
As a result, uniq in English is higher than that in Japanese. These tendencies differ slightly according to the NE class.</Paragraph> <Paragraph position="4"> * The sort of English PERSON is notably large. In English, the name of a person is usually first given in full and is thereafter expressed only by the family name. In Japanese, only the family name is generally used from the beginning, especially for well-known persons.</Paragraph> <Paragraph position="5"> * The uniq of English MONEY is quite high. A money expression in Japanese tends to be translated into English as both the original currency (usually yen) and dollars.</Paragraph> <Paragraph position="6"> * The freq of temporal and number expressions is smaller than that of named entities in the narrow sense.</Paragraph> </Section> <Section position="3" start_page="8" end_page="8" type="sub_section"> <SectionTitle> 3.3 Cross-language characteristics of NE occurrences 3.3.1 Correspondence across languages </SectionTitle> <Paragraph position="0"> We calculated the rate at which a given NE in a document has a corresponding NE in the counterpart language. The units of NE correspondence used for these calculations are both the NE token and the NE group (type). The results in Table 5 show that an NE appearing in English has a corresponding Japanese NE at a high rate.</Paragraph> <Paragraph position="1"> We also conducted the same survey as in Table 4 for only those NEs having cross-language coreferences; the results are shown in Table 6. A comparison of the two results shows that the freq for NEs having cross-language coreferences is larger, especially in Japanese. 
An NE that occurs more often in an article may carry more important information and is more likely to appear in the translation.</Paragraph> </Section> <Section position="4" start_page="8" end_page="8" type="sub_section"> <SectionTitle> 3.3.2 Preservation of NE order </SectionTitle> <Paragraph position="0"> We investigated how well the order of NEs occurring in an article is preserved in the counterpart language, as follows: 1. In every article, we eliminated all NEs except the first occurrence of every NE group. 2. We calculated the ratio of NE pairs that appear in the same order of occurrence in the target language to all of the possible NE pairs in the source language.</Paragraph> <Paragraph position="1"> Table 7 lists the average preservation ratios of the NE order for all NEs (&quot;All&quot;) and for NEs having corresponding NEs in the counterpart (&quot;Corr. only&quot;). The scores labeled &quot;All NEs&quot; are the ratios for the order of all NEs; the preservation ratio for each NE class is listed below them in the table. The NE order is preserved so well, even for all NEs, that it can be used for determining cross-language correspondences.</Paragraph> </Section> </Section> </Paper>
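<!--
Editorial sketch for Section 3.3.2. The two-step order-preservation calculation
described there can be illustrated roughly as follows. This is a minimal sketch
under stated assumptions, not the authors' implementation: the function names,
the list-of-IDs input format, and the `cor` dictionary (standing in for the COR
attribute) are all hypothetical. In particular, the text does not say how NEs
without a counterpart are scored in the "All" setting; the sketch assumes that
any pair involving such an NE counts as not preserved.

```python
from itertools import combinations


def first_occurrences(group_ids):
    """Step 1: keep only the first occurrence of each NE group ID, in order."""
    seen = set()
    firsts = []
    for gid in group_ids:
        if gid not in seen:
            seen.add(gid)
            firsts.append(gid)
    return firsts


def order_preservation_ratio(src_ids, tgt_ids, cor, corr_only=False):
    """Step 2: fraction of source NE pairs whose order is kept in the target.

    src_ids, tgt_ids: NE group IDs in order of occurrence in each article.
    cor: dict mapping a source group ID to its counterpart group ID
         (hypothetical stand-in for the COR attribute).
    corr_only: if True, pair up only the groups that have a counterpart.
    Returns None when there are no pairs to compare.
    """
    src = first_occurrences(src_ids)
    tgt_pos = {gid: i for i, gid in enumerate(first_occurrences(tgt_ids))}

    def pos(gid):
        # Position of the counterpart group in the target, or None if absent.
        return tgt_pos.get(cor.get(gid))

    if corr_only:
        src = [g for g in src if pos(g) is not None]

    # combinations() keeps source order, so a precedes b in every pair (a, b).
    pairs = list(combinations(src, 2))
    if not pairs:
        return None

    # Assumption: a pair with a missing counterpart counts as not preserved.
    kept = sum(
        1
        for a, b in pairs
        if pos(a) is not None and pos(b) is not None and pos(a) < pos(b)
    )
    return kept / len(pairs)


if __name__ == "__main__":
    src = [1, 2, 1, 3, 4]          # group 4 has no counterpart
    tgt = [10, 30, 20, 10]
    cor = {1: 10, 2: 20, 3: 30}
    print(order_preservation_ratio(src, tgt, cor))
    print(order_preservation_ratio(src, tgt, cor, corr_only=True))
```

Averaging this per-article ratio over all article pairs would give figures
comparable in kind to the "All" and "Corr. only" columns of Table 7, subject to
the assumptions noted above.
-->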