<?xml version="1.0" standalone="yes"?> <Paper uid="W03-1503"> <Title>Construction and Analysis of Japanese-English Broadcast News Corpus with Named Entity Tags</Title> <Section position="2" start_page="0" end_page="3" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Studies on named entity (NE) extraction are making progress for various languages, such as English and Japanese. A number of evaluation workshops have been held, including the Message Understanding Conference (MUC) for English and other languages, and the Information Retrieval and Extraction Exercise (IREX; http://nlp.cs.nyu.edu/irex/) for Japanese. Extraction accuracy for English has reached a nearly practical level (Marsh and Perzanowski, 1998). In Japanese, NE boundaries are more difficult to find; nevertheless, NE extraction is relatively accurate (Sekine and Isahara, 2000).</Paragraph> <Paragraph position="1"> Most past research on NE extraction has used monolingual corpora, but applying NE extraction techniques to bilingual (or multilingual) corpora is expected to yield NE translation pairs. We are developing a Japanese-English machine translation system for documents that include many NEs, such as news articles or documents about current topics. Translating NEs correctly is indispensable for conveying information correctly. NE translations, however, are not listed in conventional dictionaries. It is therefore necessary to retrieve NE translation knowledge from the latest bilingual documents.</Paragraph> <Paragraph position="2"> When extracting translation knowledge from bilingual corpora, using literally translated parallel corpora, such as official documents written in several languages, makes it easier to get the desired information. However, few such corpora contain the latest NEs, and there are few Japanese-English corpora that are translated literally. 
Therefore, we decided to extract NE translation pairs from content-aligned corpora that are not literally translated, such as multilingual broadcast news articles, which include new NEs daily.</Paragraph> <Paragraph position="3"> Sentential alignment (Brown et al., 1991; Gale and Church, 1993; Kay and Röscheisen, 1993; Utsuro et al., 1994; Haruno and Yamazaki, 1996) is commonly used as a starting point for finding the translations of words or expressions from bilingual corpora. However, it is not always possible to align non-parallel corpora at the sentence level. Past statistical methods for non-parallel corpora (Fung and Yee, 1998) are not effective for finding translations of low-frequency words or expressions. These methods have difficulty covering NEs because many NEs appear only once in a corpus. We therefore need a specialized method for extracting NE translation pairs. Transliteration is used for finding the translations of source-language NEs in target-language texts (Stalls and Knight, 1998; Goto et al., 2001; Al-Onaizan and Knight, 2002). Transliteration is useful for the names of persons and places; however, it is not applicable to all sorts of NEs.</Paragraph> <Paragraph position="4"> Content-aligned documents, such as a bilingual news corpus, are made to convey the same topics. Since NEs are an essential element of document content, content-aligned documents are likely to share NEs pointing to the same objects. Consequently, when all NEs are extracted with NE class information from each of a pair of bilingual documents separately by applying monolingual NE extraction techniques, the distribution of the NEs in each document may be similar enough to recognize correspondences between the NE translation pairs.</Paragraph> <Paragraph position="5"> A technique for finding bilingual NE correspondences will have a wide range of applications other than NE translation-pair extraction. 
For example, * Bilingual NE correspondences provide clues for identifying corresponding parts in a pair of noisy bilingual documents.</Paragraph> <Paragraph position="6"> * The similarity of any two documents in different languages can be estimated by NE translation-pair correspondence.</Paragraph> <Paragraph position="7"> For this research, we obtained a Japanese-English broadcast news corpus (Kumano et al., 2002) by the Japanese broadcasting company NHK (Nippon Hoso Kyokai (Japan Broadcasting Corporation), http://www.nhk.or.jp/englishtop/), and we are manually tagging NEs in the corpus to analyze it and to conduct NE translation-pair extraction experiments. The tag specifications are based on the IREX NE task (Sekine and Isahara, 1999), the evaluation workshop of Japanese NE extraction. We extended the specifications to English NEs. In addition, coreference information between NEs, within the same monolingual document and between the corresponding Japanese-English document pairs (henceforth, we call these in a language and across languages, respectively), is added to each of the tagged NEs, for NE translation-pair extraction studies.</Paragraph> <Paragraph position="8"> In Section 2, we will introduce the bilingual corpus used in this study and describe its characteristics. Then, we will discuss tag design for NE extraction studies, and explain the tag specifications and existing problems. The current status of corpus annotation under these specifications will also be introduced. We analyzed an annotated part of the corpus in terms of NE occurrence and translation. This analysis will be shown in Section 3. In Section 4, we will mention future plans for the extraction of</Paragraph> </Section> </Paper>