<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-1814">
  <Title>Extracting Pronunciation-translated Names from Chinese Texts using Bootstrapping Approach</Title>
  <Section position="2" start_page="1" end_page="1" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Pronunciation-translated names (P-Names) are foreign names that are rendered in Chinese characters according to their pronunciation. A P-Name sometimes forms only part of a named entity rather than the complete entity. For instance, in the place name &quot;g17137g1823g2045g3835g4410&quot; (Berkeley University), only the term &quot;g17137g1823g2045&quot; (Berkeley) is a P-Name, while &quot;g3835g4410&quot; (University) is not, since it is translated semantically.</Paragraph>
    <Paragraph position="1"> The ability to recognize P-Names helps to reduce ambiguities in word segmentation and to improve the performance of Chinese information retrieval, since many unknown words are P-Names, especially in international Chinese news. Unlike English, Chinese has no blanks between words, where a word is a linguistic token consisting of one or more characters. In addition, the same characters may appear in multiple contexts with different meanings (Chua and Liu, 2002). The presence of P-Names brings more ambiguity to Chinese word segmentation, since every character in a P-Name can also be used as a common character.</Paragraph>
    <Paragraph position="2"> Intuitively, we can extract P-Names based on the distinctive character sequences they use, as compared to common words. In addition, we can use the local context around the P-Names to confirm them and classify them as person names or as parts of location and organization names. One way to perform these tasks effectively is to rely on statistics derived from a large corpus in which the P-Names are annotated.</Paragraph>
    <Paragraph position="3"> While some annotated corpora with general named entities are available, such as the PKUC (Yu, 1999) and MET-2 (Chinchor, 2001), there is no such annotated corpus for P-Names. Although annotated data is difficult to obtain, unannotated data is readily available and plentiful, especially on the Internet. To take advantage of that, we need to tackle two major problems. The first is how to gather sufficient distinct P-Names from the Internet, and the second is how to use the available resources to derive reliable statistical information to characterize the P-Names.</Paragraph>
    <Paragraph position="4"> The problem of gathering sufficient reliable information from a small initial set of seed resources has been tackled in bootstrapping research for information extraction (Agichtein and Gravano, 2000; Brin, 1998; Collins and Singer, 1999; Mihalcea and Moldovan, 2001; Riloff and Jones, 1999). The bootstrapping approach aims to perform unsupervised text processing to extract information from open resources such as the Internet using minimal manual labor. Given the lack of annotated training samples for P-Name extraction, this paper introduces a bootstrapping algorithm called PN-Finder. It starts from a small set of seed samples and iteratively locates, extracts and classifies more new P-Names. It works in conjunction with a general Chinese named entity recognizer (Chua and Liu, 2002) to extract general named entities.</Paragraph>
    <Paragraph position="5"> In the remainder of this paper, we describe the details of PN-Finder in Section 2 and its application in locating P-Names in new documents in Section 3. Section 4 presents the experimental results using the MET-2 test corpus.</Paragraph>
    <Paragraph position="6"> Section 5 contains our conclusions and an outline of future work.</Paragraph>
  </Section>
</Paper>