File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/04/p04-1024_intro.xml
Size: 4,420 bytes
Last Modified: 2025-10-06 14:02:22
<?xml version="1.0" standalone="yes"?> <Paper uid="P04-1024"> <Title>Finding Ideographic Representations of Japanese Names Written in Latin Script via Language Identification and Corpus Validation</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> CMU Speech Pronunciation Dictionary to create a </SectionTitle> <Paragraph position="0"> series of weighted finite-state transducers between English words and katakana that produce and rank transliteration candidates. Using similar methods, Qu et al. (2003) showed that integrating automatically discovered transliterations of unknown katakana sequences, i.e., those not included in a large Japanese-English dictionary such as EDICT, improves CLIR results.</Paragraph> <Paragraph position="1"> Transliteration of names between alphabetic and syllabic scripts has also been studied for language pairs such as Japanese/English (Fujii and Ishikawa, 2001), English/Korean (Jeong et al., 1999), and English/Arabic (Al-Onaizan and Knight, 2002).</Paragraph> <Paragraph position="2"> In the work closest to ours, Meng et al. (2001), working on cross-language retrieval of phonetically transcribed spoken text, studied how to transliterate names into Chinese phonemes (though not into Chinese characters). Given a list of identified names, Meng et al. first separated the names into Chinese names and English names.</Paragraph> <Paragraph position="3"> Romanized Chinese names were detected by a left-to-right longest-match segmentation method, using the Wade-Giles and the pinyin syllable inventories in sequence. If a name could be segmented successfully, it was considered a Chinese name. As their spoken document collection had already been transcribed into pinyin, retrieval was based on pinyin-to-pinyin matching; pinyin-to-Chinese-character conversion was not addressed.
Names other than Chinese names were treated as foreign names and were converted into Chinese phonemes using a language model derived from a list of English-Chinese equivalents, both sides of which were represented by their phonetic equivalents.</Paragraph> <Paragraph position="4"> The above English-to-Japanese and English-to-Chinese transliteration techniques, however, solve only part of the name translation problem. In multilingual applications such as CLIR and machine translation, all types of names must be translated. Techniques for name translation from Latin scripts into CJK scripts often depend on the origin of the name. Some names are not transliterated into a nearly deterministic syllabic script but into ideograms that can be associated with a variety of pronunciations. For example, in Japanese, Chinese, Korean and Japanese names are usually written using Chinese characters (or kanji), while European names are transcribed using katakana characters, with each character mostly representing one syllable.</Paragraph> <Paragraph position="5"> In this paper, we describe a method for converting a Japanese name written in the Latin alphabet (or romaji) back into Japanese kanji. Transcribing into Japanese kanji is harder than transliterating a foreign name into syllabic katakana, since one phoneme can correspond to hundreds of possible kanji characters.
For example, the sound &quot;kou&quot; can be mapped to 670 kanji characters.</Paragraph> <Paragraph position="6"> Our method for back-transliterating Japanese names from English into Japanese consists of the following steps: (1) language identification of the origins of names, in order to know which language-specific transliteration approach to use, and (2) generation of possible transliterations using sound and kanji mappings from the Unihan database (described in section 3.1), followed by validation of the candidates through a three-tier filtering process: first through a set of attested bigrams, then through a set of attested terms, and lastly through the Web.</Paragraph> <Paragraph position="7"> The rest of the paper is organized as follows: section 2 describes and evaluates our name origin identifier; section 3 presents in detail the steps for back-transliterating Japanese names written in Latin script into Japanese kanji representations; section 4 presents the evaluation setup; section 5 discusses the evaluation results; and section 6 concludes the paper.</Paragraph> </Section> </Paper>