<?xml version="1.0" standalone="yes"?> <Paper uid="W03-0104"> <Title>GeoName: a system for back-transliterating pinyin place names</Title> <Section position="3" start_page="0" end_page="3" type="metho"> <SectionTitle> 2 Pinyin Place Names </SectionTitle> <Paragraph position="0"> The mapping from a Chinese character to Pinyin is more or less unique because the majority of Chinese characters have only one sound (with some exceptions).</Paragraph> <Paragraph position="1"> Given a Pinyin, however, there can be many homophonic candidate characters depending on which sound it is. When one encounters such a Pinyin entity, ambiguity can arise. Even if the context specifies the place precisely, there is still uncertainty as to its original character representation. This is true for all entity types rendered into Pinyin unless they are well known. As an example, the capital of China, Beijing, originates from the characters 北 (Bei) and 京 (jing).</Paragraph> <Paragraph position="2"> However, when back-transliterating from the English, the following are some of the possible mappings: place names in addition to the intended one. In fact, these two are highly fertile Pinyin: 'Bei' maps to 23 characters and 'jing' maps to 20, leading to a total of 460 possible pairs. Many of the pairs of course may not be used as place names.</Paragraph> <Paragraph position="3"> It is possible to diminish the above ambiguity by also capturing the tone of a Pinyin character, as is done in most Chinese input systems that accept Pinyin as input. The simplest convention has five tones. One tone can be assigned to each character represented as Pinyin, and this can separate the mapped characters into tonal sets. 
However, most printed or electronic texts such as newspapers or newswires do not have tones assigned.</Paragraph> <Paragraph position="4"> Our system assumes input texts have no tonal indication, and so can be adapted to online text processing.</Paragraph> <Paragraph position="5"> Chinese place names are mostly two to three characters long. Four-character names exist and longer ones are possible. Unlike person names, where the family name character is selected from a fairly closed set, character use is practically unrestricted for places. This means that mapping a Pinyin representation back into its original Chinese form can yield y^x candidates, where y is the average number of possible single-character mappings for each of x syllables. To further complicate the issue, place names in Pinyin may or may not be separated by white space. For example, the representation for 秦皇岛, a place near Beijing, can be written as 'Qin Huang Dao', 'QinHuangDao' or 'Qinhuangdao'. The first style shows the Pinyin of each original character, one by one, separated by white space. The second style is a composite Pinyin denoting that the three individual Pinyin should be treated as a single entity; each individual Pinyin syllable, however, begins with a capital letter. The third style is like the second composite but without capital letters except for the first. (For example, on 3/25/03, the New York Times reported a coalmine explosion at 'Mengnanzhuang' employing this style.) All three styles can be found in texts. The first two indicate a unique segmentation of the Pinyin characters. The third style, however, presents the additional problem of segmentation: how to recover the characters correctly.</Paragraph> <Paragraph position="6"> The string 'Qinhuangdao' may be broken up as 'Qin huang dao', 'Qin huang da o', 'Qin hu ang dao', etc.</Paragraph> <Paragraph position="7"> because it so happens that the listed components (call them syllables) are all legitimate Pinyin. 
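As a minimal sketch of this segmentation ambiguity (an illustration, not the GeoName implementation), the Python fragment below enumerates every split of a lower-cased composite string against a syllable inventory. The function name `segmentations` and the seven-syllable toy inventory are assumptions made for the example; a real system would need the full table of legitimate Pinyin syllables.

```python
def segmentations(text, syllables):
    """Return every way to split `text` into items of `syllables`."""
    if not text:
        return [[]]  # one way to split the empty string: no syllables
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in syllables:
            # Keep this prefix and split the remainder recursively.
            for tail in segmentations(text[end:], syllables):
                results.append([head] + tail)
    return results


# Toy inventory (assumption): only the syllables named in the text.
TOY_SYLLABLES = {"qin", "huang", "hu", "ang", "dao", "da", "o"}

# Fewest syllables first; ties keep their discovery order.
splits = sorted(segmentations("qinhuangdao", TOY_SYLLABLES), key=len)
for split in splits:
    print(" ".join(split))
```

With this toy inventory the string has four segmentations, of three, four, four and five syllables.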
Thus, the 'Qinhuangdao' composite can be a three-, four- or five-character entity. One can imagine the exponential increase in candidates if each Pinyin syllable maps back to ~10 possibilities, for example. There is a fourth style that employs an apostrophe to indicate syllable separation in cases of extreme ambiguity, such as Xian (a single character, 'province') and Xi'an (西安, the city). Like styles one and two, this is very useful. Unfortunately, none of these styles is mandatory.</Paragraph> <Paragraph position="8"> probability: P(C|E) = P(E|C)*P(C)/P(E). P(E) can be ignored, and P(E|C) is reduced to a product of P(e_i|c_i) over the syllables e_i of E and the characters c_i of C.</Paragraph> <Paragraph position="10"> Each P(e_i|c_i) is approximated by a constant, leaving the unknown P(C). If one has sufficient bilingual translations of place names, the neglected probability P(e_i|c_i) could be estimated.</Paragraph> <Paragraph position="12"> Hence P(C|E) is roughly reduced to P(C) up to a constant. The most probable Chinese character sequence corresponding to the input Pinyin E is therefore the one that maximizes P(C), and more generally P(C) can be used to rank candidates C. To estimate P(C), we initially used a</Paragraph> <Paragraph position="14"> which turns out to be less effective than the following heuristic approach. Instead of probability, we work with occurrence frequencies of the string itself, bigrams, and single characters. The function for ranking is</Paragraph> <Paragraph position="16"> where f(.) is frequency, the sums run over all consecutive bigrams and singles composing the string C, and a_i</Paragraph> <Paragraph position="18"> , i=1,..,3 are constants, which are larger for longer strings. A factor is not counted if its f(.) is zero. When string C has been seen before, its effect grows with the length of C. If C has not been seen, its component bigram and single character frequencies determine the ranking value g(C). 
It is generally true that, for a character string matching some dictionary entry or previous use, the longer the string, the more legitimate it is.</Paragraph> <Paragraph position="19"> The issues raised in Section 2 are addressed as follows. Many Pinyin of the third style do lead to a unique segmentation. For those that do not, all possible segmentations are captured, but they are sorted with the longest spelling sequence (and fewest syllables) first: e.g. in the previous example, 'Qin huang dao' is preferred over 'Qin hu ang dao'. The candidates C are limited to all possible combinations of characters that exist in the training data and can be mapped from the segmented Pinyin. Because of hardware limitations, our prototype currently limits the number of Pinyin syllables to four in order to cut down on the number of candidates for certain inputs.</Paragraph> </Section> <Section position="4" start_page="3" end_page="3" type="metho"> <SectionTitle> 4 GeoName </SectionTitle> <Paragraph position="0"> GeoName is designed to accept a Pinyin place name and suggest Chinese GB-encoded candidates for it.</Paragraph> <Paragraph position="1"> Back-transliteration is an ambiguous and inaccurate process. Also, non-standard romanizations exist historically for many common place names. The system does not yet have the capability to extract such names from running text, but requires that each name be entered on a separate line. Each Pinyin name is subjected to segmentation and character mapping, and a set of candidate GB-encoded Chinese names is produced as discussed in Sections 2 and 3. GeoName employs a three-step process to effect back-transliteration: 1) table lookup on a bilingual place name list; 2) name suggestion based on the usage frequency of place characters and character pairs; 3) confirmation via web retrieval or a monolingual geographic list. 
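The character-mapping step mentioned above can be sketched as a Cartesian product over per-syllable character lists. This is only an illustration under stated assumptions: `TOY_TABLE` is a hypothetical three-character excerpt per syllable, and `candidates` is a name invented for the example; the actual system restricts the characters to those seen in its training data.

```python
from itertools import product

# Hypothetical excerpt of a syllable-to-character table (assumption);
# the text reports 23 characters for 'Bei' and 20 for 'jing'.
TOY_TABLE = {
    "bei": ["北", "贝", "背"],
    "jing": ["京", "井", "经"],
}


def candidates(syllables, table):
    """Return every character string obtainable from the syllable list."""
    # An unknown syllable contributes an empty list, so no candidate survives.
    options = [table.get(s, []) for s in syllables]
    return ["".join(chars) for chars in product(*options)]


names = candidates(["bei", "jing"], TOY_TABLE)
print(len(names), names[0])
```

With the full counts reported in Section 2, the same product gives 23 x 20 = 460 pairs for 'Beijing', which is why candidates must be ranked and confirmed.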
The following sub-sections present details of our approach.</Paragraph> <Section position="1" start_page="3" end_page="3" type="sub_section"> <SectionTitle> 4.1 Bilingual Place Name List </SectionTitle> <Paragraph position="0"> Geographic entities tend not to change much over time, and the number of places is relatively fixed, unlike person names, for example. Thus, it is a good strategy to produce a lookup table to map place names between Chinese and English. Such a table gives accurate translations, handles 1:m mappings well when a Chinese name may be represented differently under different systems of romanization, and is very efficient in real-time computation. The disadvantages are that such a bi-list is difficult to locate, will never be complete, is relatively static, and cannot suggest possible new names that are not on the list. We think such a list is an important component of any system that attempts this kind of mapping, as there will always be many well-known places that have non-standard or peculiar romanizations.</Paragraph> <Paragraph position="1"> From ftp://ftpserver.ciesin.columbia.edu/pub/data/China/CITAS/gb_code/ we located such a bi-list that contains about 4K unique Chinese place names. This we call List-A. Using the English Pinyin as key, a direct hit on this list will most probably provide the correct translation for the input. In that case the first bit (A-bit) of a 3-bit tag is set to 1, giving 100. The tag is attached to each candidate.</Paragraph> </Section> <Section position="2" start_page="3" end_page="3" type="sub_section"> <SectionTitle> 4.2 Place Name Suggestion </SectionTitle> <Paragraph position="0"> The total number of GB-encoded characters is about 6,000, but only around 2,500 are in frequent use. Since we limit our domain to geographical names here, we can collect such names from monolingual Chinese text and estimate the probabilities of single and paired Chinese character use in this context. 
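A minimal sketch of this frequency collection follows, assuming a plain Python list of names stands in for the monolingual name list. The function `count_characters` and the four sample names are illustrative assumptions, not part of GeoName.

```python
from collections import Counter


def count_characters(names):
    """Count single characters and consecutive character pairs in names."""
    singles = Counter()
    pairs = Counter()
    for name in names:
        singles.update(name)  # each character of the name counts once
        pairs.update(name[i:i + 2] for i in range(len(name) - 1))
    return singles, pairs


# Illustrative sample (assumption); a real list would hold many thousands.
sample_names = ["北京", "南京", "北海", "秦皇岛"]
singles, pairs = count_characters(sample_names)
print(singles["北"], pairs["秦皇"])
```

Counts of this kind supply the f(.) values for single characters and bigrams; dividing by the totals would turn them into probability estimates.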
We employed similar methods in our PYName system for person names and it worked reasonably well. However, unlike person names, where many people may share the same name characters, geographic names tend to be more distinctive, i.e.</Paragraph> <Paragraph position="1"> not too many places have similar characters in our training data. Thus, the effectiveness of using frequency to suggest GB-encoded place names based on a given Pinyin name in English is more limited. This is compounded by the difficulty of finding a sufficiently large name list. The main advantage of the probabilistic mapping exercise is the ability to suggest new names as candidates by composing them from characters, and to rank them according to how the characters appear in the monolingual name list, as discussed in Section 3.</Paragraph> <Paragraph position="2"> The frequencies used by the ranking formula in Eqn.(1) have to be estimated from some training data. We failed to find sufficient downloadable Chinese place names, so we employed BBN's IdentiFinder (Miller et al., 1999), which brackets location entities in running text. The collections used are from the TREC and NTCIR experiments. Location names were identified and extracted. The result is about 80K &quot;approximate place names&quot;, called List-B. The software is not perfect, and many entries are not place names or contain several names together, but the data can still serve its purpose.</Paragraph> </Section> <Section position="3" start_page="3" end_page="3" type="sub_section"> <SectionTitle> 4.3 Name Confirmation </SectionTitle> <Paragraph position="0"> To improve the accuracy of the candidate ranking obtained in Section 4.2, we further use a process of confirmation.</Paragraph> <Paragraph position="1"> The hypothesis is that if a GB-encoded place name candidate has been seen before, it has a high probability of being correct. Each candidate name is compared to the monolingual Chinese name list consisting of (List-A U List-B). 
If it exists, the second bit (B-bit) of the 3-bit tag is set, giving 010.</Paragraph> <Paragraph position="2"> However, as suggested before, name lists are seldom complete. To mitigate this problem, we also utilize the World Wide Web for confirmation. The basic idea is to treat the WWW as another name collection, but a dynamic one. The English Pinyin name is treated as a query and sent to a search engine (such as Google). Using the advanced search option to return GB-encoded documents, we search the returned documents for each candidate of the Pinyin to confirm whether it has been used as a sub-string. If so, the third bit (C-bit) of the tag is set, giving 001. Another benefit of using the WWW is that it resolves some dialect-based problems. As an example, both 'Hong Kong' and 'Xiang Gang' as Pinyin place names have been found in web documents with the Chinese name 香港 confirmed. However, we do have to pay a price in performance, since web searches are relatively slow.</Paragraph> <Paragraph position="3"> Another drawback is that web confirmation is effective only for popular, well-known names. Otherwise, domain-specific name lists can be used if available.</Paragraph> <Paragraph position="4"> Thus, all candidates are tagged and assigned a rank value. Our current strategy is to rank candidates by the 3-bit tag first, followed by minimum syllable number, and then by g(C) of Eqn.(1). If a candidate is confirmed somewhere, especially on our bi-list, it is likely to be a good translation. Otherwise, shorter names are preferred.</Paragraph> </Section> <Section position="4" start_page="3" end_page="3" type="sub_section"> <SectionTitle> 4.4 System Description </SectionTitle> <Paragraph position="0"> Fig.1 below is a flowchart of GeoName showing how the different functions are tied together. Steps 2, 5 and 6 for bi-list lookup and confirmation can be enabled or disabled. Although our main focus is on Pinyin input, GeoName does have limited support in Step 3 for other romanization systems such as Wade-Giles and Hong Kong Pinyin. The system allows selection if the input romanization convention is known. A table converts Wade-Giles spelling into PRC Pinyin. For Hong Kong style spelling, another table converts it directly into GB characters. Example back-transliterations are shown on the GUI screen of GeoName in Fig.2. The 1st and 3rd names are correct at rank 1, the 2nd at rank 2. [Fig.1. GeoName System Flowchart]</Paragraph> </Section> </Section></Paper>