File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/03/w03-0104_intro.xml
Size: 4,398 bytes
Last Modified: 2025-10-06 14:01:48
<?xml version="1.0" standalone="yes"?> <Paper uid="W03-0104"> <Title>GeoName: a system for back-transliterating pinyin place names</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Names referring to entities can be ambiguous because different entities may have been given the same name.</Paragraph> <Paragraph position="1"> When one encounters foreign place names within English texts, a further complication arises because the English alphabet may not represent the native writing uniquely or adequately, and transliteration has to be employed. This is true for Chinese place names. In this information age, documents on Chinese events, such as news stories, commentaries, reviews, and analyses, can originate from various sources and in languages other than Chinese. Authors may reference Chinese place names but not necessarily accompany them with the actual Chinese characters. It is therefore useful to build an automatic algorithm to decode such a place name in English and map it to the original Chinese character representation.</Paragraph> <Paragraph position="2"> The Chinese language is written as a contiguous string of ideographs (characters) without white space.</Paragraph> <Paragraph position="3"> Geographic names of most cities, provinces, mountains, etc. are two to four characters long. Border regions have longer place names. Unlike person names, which draw family names from a preferred closed set of characters, place names have no such set. Any of the over 6K GB-encoded characters is theoretically admissible as part of a place name. When one refers to them in English text, one needs to represent them using the English alphabet, a process called romanization. Two main systems exist for this process: Pinyin, official in Mainland China, and the Wade-Giles convention, popular in Taiwan (see e.g. 
http://www.romanization.com).</Paragraph> <Paragraph position="4"> Their objective is to spell out the pronunciation of the Chinese characters with letters of the alphabet. Unfortunately, although written Chinese is by and large uniform (except for a few hundred characters that have simplified vs. traditional forms), spoken Chinese can vary from region to region with different dialects. The Pinyin system was introduced by the PRC government in the 1950s. It attempts to standardize the representation for the whole country according to the official Beijing Putonghua dialect (Northern China Mandarin). The Wade-Giles system is an older convention designed by the authors of the same names in the late 19th and early 20th centuries, and is popular in Taiwan and some parts of South-East Asia.</Paragraph> <Paragraph position="5"> There are also other haphazard romanization conventions in different regions where Chinese is used. For example, Hong Kong has its own British colonial history and Southern (Guangdong) dialect, and entity names are often spelt differently. The representation 'Hong Kong' itself is neither Pinyin nor Wade-Giles. It would be 'Xiang Gang' in the former, and 'Hsiang Kang' in the latter. The same is true of 'Singapore'. In this investigation, we will mainly focus on the Pinyin convention. It is used by most Chinese (PRC), and there has been discussion in Taiwan about adopting it, even though there are still political obstacles around this issue. There is evidence that this system is gaining popularity in the U.S. as the default choice (Library of Congress 2000).</Paragraph> <Paragraph position="6"> This paper investigates methods of recovering a Chinese place name in GB-encoding (the character codes used for simplified Chinese characters) when its English Pinyin is given. We have previously built a system, PYName (Kwok and Deng 2002), to back-transliterate Chinese person names. This paper extends it to provide similar functionality for place names. 
It is a tool to help reduce ambiguity in cross-language geographic entity reference, and would be useful for cross-language information retrieval. The organization of this paper is as follows: Section 2 discusses some properties of Pinyin place names. Section 3 discusses the use of frequencies to help back-transliteration. Section 4 describes GeoName, our system to map English Pinyin place names to Chinese characters. Section 5 contains some evaluation of GeoName, and Section 6 contains our conclusion and future work.</Paragraph> </Section> </Paper>