<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-1002">
  <Title>The Role of Lexical Resources in CJK Natural Language Processing</Title>
  <Section position="5" start_page="9" end_page="12" type="metho">
    <SectionTitle>
3 Linguistic Issues in Chinese
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="9" end_page="10" type="sub_section">
      <SectionTitle>
3.1 Processing Multiword Units
</SectionTitle>
      <Paragraph position="0"> A major issue for Chinese segmentors is how to treat compound words and multiword lexical units (MWU), which are often decomposed into their components rather than treated as single units. For example, Lu Xiang Dai luxiangdai 'video cassette' and Ji Qi Fan Yi ji qifa nyi 'machine translation' are not tagged as segments in Chinese Gigaword, the largest tagged Chinese corpus in existence, processed by the CKIP morphological analyzer (Ma 2003). Possible reasons for this include:  1. The lexicons used by Chinese segmentors are small-scale or incomplete. Our testing of various Chinese segmentors has shown that coverage of MWUs is often limited.</Paragraph>
      <Paragraph position="1"> 2. Chinese linguists disagree on the concept of wordhood in Chinese. Various theories such as the Lexical Integrity Hypothesis (Huang 1984) have been proposed. Packard's outstanding book (Packard 98) on the subject clears up much of the confusion.</Paragraph>
      <Paragraph position="2"> 3. The &amp;quot;correct&amp;quot; segmentation can depend on the  application, and there are various segmentation standards. For example, a search engine user looking for Lu Xiang Dai is not normally interested in Lu Xiang 'to videotape' and Dai 'belt' per se, unless they are part of Lu Xiang Dai .</Paragraph>
      <Paragraph position="3"> This last point is important enough to merit elaboration. A user searching for Zhong Guo Ren zho ngguoren 'Chinese (person)' is not interested in Zhong Guo 'China', and vice-versa. A search for Zhong Guo should not retrieve Zhong Guo Ren as an instance of Zhong Guo . Exactly the same logic should apply to Ji Qi Fan Yi , so that a search for that keyword should only retrieve documents containing that string in its entirety. Yet performing a Google search on Ji Qi Fan Yi in normal mode gave some 2.3 million hits, hundreds of thousands of which had zero occurrences of Ji Qi Fan Yi but numerous  occurrences of unrelated words like Ji Qi Ren 'robot', which the user is not interested in.</Paragraph>
      <Paragraph position="4"> This is equivalent to saying that headwaiter should not be considered an instance of waiter, which is indeed how Google behaves. More to the point, English space-delimited lexemes like high school are not instances of the adjective high. As shown in Halpern (2000b), &amp;quot;the degree of solidity often has nothing to do with the status of a string as a lexeme. School bus is just as legitimate a lexeme as is headwaiter or wordprocessor. The presence or absence of spaces or hyphens, that is, the orthography, does not determine the lexemic status of a string.&amp;quot; In a similar manner, it is perfectly legitimate to consider Chinese MWUs like those shown below as indivisible units for most applications, especially information retrieval and machine translation.</Paragraph>
      <Paragraph position="5">  start to prepare at the last moment One could argue that Ji Qi Fan Yi is compositional and therefore should be considered &amp;quot;two words.&amp;quot; Whether we count it as one or two &amp;quot;words&amp;quot; is not really relevant - what matters is that it is one lexeme (smallest distinctive units associating meaning with form). On the other extreme, it is clear that idiomatic expressions like Lin Zhen Mo Qiang , literally &amp;quot;sharpen one's spear before going to battle,&amp;quot; meaning 'start to prepare at the last moment,' are indivisible units.</Paragraph>
      <Paragraph position="6"> Predicting compositionality is not trivial and often impossible. For many purposes, the only practical solution is to consider all lexemes as indivisible. Nonetheless, currently even the most advanced segmentors fail to identify such lexemes and missegment them into their constituents, no doubt because they are not registered in the lexicon. This is an area in which expanded lexical resources can significantly improve segmentation accuracy.</Paragraph>
      <Paragraph position="7"> In conclusion, lexical items like Ji Qi Fan Yi 'machine translation' represent stand-alone, well-defined concepts and should be treated as single units. The fact that in English machineless is spelled solid and machine translation is not is an historical accident of orthography unrelated to the fundamental fact that both are full-fledged lexemes each of which represents an indivisible, independent concept. The same logic applies to Ji Qi Fan Yi , which is a full-fledged lexeme that should not be decomposed.</Paragraph>
    </Section>
    <Section position="2" start_page="10" end_page="12" type="sub_section">
      <SectionTitle>
3.2 Multilevel Segmentation
</SectionTitle>
      <Paragraph position="0"> Chinese MWUs can consist of nested components that can be segmented in different ways for different levels to satisfy the requirements of different segmentation standards. The example below shows how Pey Jing Ri Ben Ren Xue Xiao Beijing Ribenren Xuexiao 'Beijing School for Japanese (nationals)' can be segmented on five different levels.</Paragraph>
      <Paragraph position="1">  1. Bei Jing Ri Ben Ren Xue Xiao multiword lexemic 2. Bei Jing +Ri Ben Ren +Xue Xiao lexemic 3. Bei Jing +Ri Ben + Ren +Xue Xiao sublexemic 4. Bei Jing + [Ri Ben + Ren ] [Xue +Xiao ] morphemic 5. [Bei +Jing ] [Ri +Ben +Ren ] [Xue +Xiao ] submorphemic  For some applications, such as MT and NER, the multiword lexemic level is most appropriate (the level most commonly used in CJKI's dictionaries). For others, such as embedded speech technology where dictionary size matters, the lexemic level is best. A more advanced and expensive solution is to store presegmented MWUs in the lexicon, or even to store nesting delimiters as shown above, making it possible to select the desired segmentation level.</Paragraph>
      <Paragraph position="2"> The problem of incorrect segmentation is especially obvious in the case of neologisms. Of course no lexical database can expect to keep up with the latest neologisms, and even the first edition of Chinese Gigaword does not yet have Bo Ke boke 'blog'. Here are some examples of MWU neologisms, some of which are not (at least bilingually), compositional but fully qualify as lexemes.</Paragraph>
      <Paragraph position="3">  simplifications in the postwar period. Chinese written in these simplified forms is called Simplified Chinese (SC). Taiwan, Hong Kong, and most overseas Chinese continue to use the old, complex forms, referred to as Traditional Chinese (TC). Contrary to popular perception, the  process of accurately converting SC to/from TC is full of complexities and pitfalls. The linguistic issues are discussed in Halpern and Kerman (1999), while technical issues are described in Lunde (1999). The conversion can be implemented on three levels in increasing order of sophistication:  1. Code Conversion. The easiest, but most un- null reliable, way to perform C2C is to transcode by using a one-to-one mapping table. Because of the numerous one-to-many ambiguities, as shown below, the rate of conversion failure is unacceptably high.</Paragraph>
      <Paragraph position="4">  2. Orthographic Conversion. The next level of  sophistication is to convert orthographic units, rather than codepoints. That is, meaningful linguistic units, equivalent to lexemes, with the important difference that the TC is the traditional version of the SC on a character form level. While code conversion is ambiguous, orthographic conversion gives much better results because the orthographic mapping tables enable conversion on the lexeme level, as shown below.  As can be seen, the ambiguities inherent in code conversion are resolved by using orthographic mapping tables, which avoids false conversions such as shown in the Incorrect column. Because of segmentation ambiguities, such conversion must be done with a segmentor that can break the text stream into meaningful units (Emerson 2000).</Paragraph>
      <Paragraph position="5"> An extra complication, among various others, is that some lexemes have one-to-many ortho- null graphic mappings, all of which are correct. For example, SC Yin Gan correctly maps to both TC Yin Gan 'dry in the shade' and TC Yin Gan 'the five even numbers'. Well designed orthographic mapping tables must take such anomalies into account. 3. Lexemic Conversion. The most sophisti null cated form of C2C conversion is called lexemic conversion, which maps SC and TC lexemes that are semantically, not orthographically, equivalent. For example, SC Xin Xi xinxi 'information' is converted into the semantically equivalent TC Zi Xun zi xun. This is similar to the difference between British pavement and American sidewalk. Tsou (2000) has demonstrated that there are numerous lexemic differences between SC and TC, especially in technical terms and proper nouns, e.g. there are more than 10 variants for Osama bin Laden.</Paragraph>
      <Paragraph position="6">  acter forms, leading to much confusion. Disambiguating these variants can be done by using mapping tables such as the one shown below. If such a table is carefully constructed by limiting it to cases of 100% semantic interchangeability for polysemes, it is easy to normalize a TC text by trivially replacing variants by their standardized forms. For this to work, all relevant components, such as MT dictionaries, search engine indexes and the related documents should be normalized. An extra complication is that Tai-</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="12" end_page="12" type="metho">
    <SectionTitle>
4 Orthographic Variation in Japanese
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="12" end_page="12" type="sub_section">
      <SectionTitle>
4.1 Highly Irregular Orthography
</SectionTitle>
      <Paragraph position="0"> The Japanese orthography is highly irregular, significantly more so than any other major language, including Chinese. A major factor is the complex interaction of the four scripts used to write Japanese, e.g. kanji, hiragana, katakana, and the Latin alphabet, resulting in countless words that can be written in a variety of often unpredictable ways, and the lack of a standardized orthography. For example, toriatsukai 'handling' can be written in six ways: Qu riXi i , Qu</Paragraph>
      <Paragraph position="2"> An example of how difficult Japanese IR can be is the proverbial 'A hen that lays golden eggs.' The &amp;quot;standard&amp;quot; orthography would be Jin noLuan wo Chan muJi Kin no tamago o umu niwatori. In reality, tamago 'egg' has four variants (Luan , Yu Zi , ta mago , tamago), niwatori 'chicken' three (Ji , ni watori, niwatori ) and umu 'to lay' two (Chan mu , Sheng mu), which expands to 24 permutations like Jin noLuan woSheng muniwatori , Jin noYu Zi woChan muJi etc.</Paragraph>
      <Paragraph position="3"> As can be easily verified by searching the web, these variants occur frequently.</Paragraph>
      <Paragraph position="4"> Linguistic tools that perform segmentation, MT, entity extraction and the like must identify and/or normalize such variants to perform dictionary lookup. Below is a brief discussion of what kind of variation occurs and how such normalization can be achieved.</Paragraph>
    </Section>
    <Section position="2" start_page="12" end_page="12" type="sub_section">
      <SectionTitle>
4.2 Okurigana Variants
</SectionTitle>
      <Paragraph position="0"> One of the most common types of orthographic variation in Japanese occurs in kana endings, called okurigana, that are attached to a kanji stem. For example, okonau 'perform' can be written Xing u or Xing nau , whereas toriatsukai can be written in the six ways shown above. Okurigana variants are numerous and unpredictable.</Paragraph>
      <Paragraph position="1"> Identifying them must play a major role in Japanese orthographic normalization. Although it is possible to create a dictionary of okurigana variants algorithmically, the resulting lexicon would be huge and may create numerous false positives not semantically interchangeable. The most effective solution is to use a lexicon of okurigana variants, such as the one shown below:</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="12" end_page="123" type="metho">
    <SectionTitle>
HEADWORD READING NORMALIZED
</SectionTitle>
    <Paragraph position="0"> Shu kiZhu su kakiarawasu Shu kiZhu su Shu kiZhu wasu kakiarawasu Shu kiZhu su Shu Zhu su kakiarawasu Shu kiZhu su Shu Zhu wasu kakiarawasu Shu kiZhu su Since Japanese is highly agglutinative and verbs can have numerous inflected forms, a lexicon such as the above must be used in conjunction with a morphological analyzer that can do accurate stemming, i.e. be capable of recognizing that Shu kiZhu simasendesita is the polite form of the canonical form Shu kiZhu su .</Paragraph>
    <Section position="1" start_page="12" end_page="123" type="sub_section">
      <SectionTitle>
4.3 Cross-Script Orthographic Variation
</SectionTitle>
      <Paragraph position="0"> Variation across the four scripts in Japanese is common and unpredictable, so that the same word can be written in any of several scripts, or even as a hybrid of multiple scripts, as shown below:  = 191,700. a is a coincidental occurrence factor, such as in '100Ren Can Jia , in which ' Ren Can ' is unrelated to the 'carrot' sense. The formulae for calculating the above are as follows.</Paragraph>
      <Paragraph position="1">  sure that all variants are indexed on a standardized form like Ren Can , recall is only 30%; if it is, there is a dramatic improvement and recall goes up to nearly 100%, without any loss in precision, which hovers at 100%.</Paragraph>
    </Section>
    <Section position="2" start_page="123" end_page="123" type="sub_section">
      <SectionTitle>
4.4 Kana Variants
</SectionTitle>
      <Paragraph position="0"> A sharp increase in the use of katakana in recent years is a major annoyance to NLP applications because katakana orthography is often irregular; it is quite common for the same word to be written in multiple, unpredictable ways. Although hiragana orthography is generally regular, a small number of irregularities persist. Some of the major types of kana variation are shown in the table below.</Paragraph>
      <Paragraph position="1">  The above is only a brief introduction to the most important types of kana variation. Though attempts at algorithmic solutions have been made by some NLP research laboratories (Brill 2001), the most practical solution is to use a katakana normalization table, such as the one shown below, as is being done by Yahoo! Japan and other major portals.</Paragraph>
    </Section>
    <Section position="3" start_page="123" end_page="123" type="sub_section">
      <SectionTitle>
4.5 Miscellaneous Variants
</SectionTitle>
      <Paragraph position="0"> There are various other types of orthographic variants in Japanese, described in Halpern (2000a). To mention some, kanji even in contemporary Japanese sometimes have variants, such as Cai for Sui and Jin for Fu , and traditional forms such as Fa for Fa . In addition, many kun homophones and their variable orthography are often close or even identical in meaning, i.e., noboru means 'go up' when written Shang ru but 'climb' when written Deng ru , so that great care must be taken in the normalization process so as to assure semantic interchangeability for all senses of polysemes; that is, to ensure that such forms are excluded from the normalization table.</Paragraph>
    </Section>
    <Section position="4" start_page="123" end_page="123" type="sub_section">
      <SectionTitle>
4.6 Lexicon-driven Normalization
</SectionTitle>
      <Paragraph position="0"> Leaving statistical methods aside, lexicon-driven normalization of Japanese orthographic variants can be achieved by using an orthographic mapping table such as the one shown below, using various techniques such as:  1. Convert variants to a standardized form for indexing.</Paragraph>
      <Paragraph position="1"> 2. Normalize queries for dictionary lookup. 3. Normalize all source documents.</Paragraph>
      <Paragraph position="2"> 4. Identify forms as members of a variant group.</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="123" end_page="123" type="metho">
    <SectionTitle>
HEADWORD READING NORMALIZED
</SectionTitle>
    <Paragraph position="0"> Kong kiFou akikan Kong kiFou Kong Fou akikan Kong kiFou Ming kiGuan akikan Kong kiFou akiFou akikan Kong kiFou akiGuan akikan Kong kiFou Kong kikan akikan Kong kiFou Kong kikan akikan Kong kiFou Kong kiGuan akikan Kong kiFou Kong Guan akikan Kong kiFou Kong kiGuan akikan Kong kiFou Kong Guan akikan Kong kiFou  Other possibilities for normalization include advanced applications such as domain-specific synonym expansion, requiring Japanese thesauri based on domain ontologies, as is done by a select number of companies like Wand and Convera who build sophisticated Japanese IR systems. null</Paragraph>
  </Section>
  <Section position="9" start_page="123" end_page="123" type="metho">
    <SectionTitle>
5 Orthographic Variation in Korean
</SectionTitle>
    <Paragraph position="0"> Modern Korean has is a significant amount of orthographic variation, though far less than in Japanese. Combined with the morphological complexity of the language, this poses various challenges to developers of NLP tools. The issues are similar to Japanese in principle but differ in detail.</Paragraph>
    <Paragraph position="1"> Briefly, Korean has variant hangul spellings in the writing of loanwords, such as keikeu keikeu and keik keik for 'cake', and in the writing of non-Korean personal names, such as keulrinteon keulrinteon and keulrintonkeulrinton for 'Clinton'. In addition, similar to Japanese but on a smaller scale, Korean is written in a mixture of hangul, Chinese characters and the Latin alphabet. For example, 'shirt' can be written waisyeoceu wai-syeacheu or Ysyeoceu wai-syeacheu, whereas 'one o'clock' hanzi can written as hansi, 1si or [?] Shi . Another issue is the differences between South and North Korea spellings, such as N.K.</Paragraph>
    <Paragraph position="2"> osagga osakka vs. S.K. osaka osaka for 'Osaka', and the old (pre-1988) orthography versus the new, i.e. modern ilgun 'worker' ( ilgun) used to be written ilggun ( ilkkun).</Paragraph>
    <Paragraph position="3"> Lexical databases, such as normalization tables similar to the ones shown above for Japanese, are the only practical solution to identifying such variants, as they are in principle unpredictable. null</Paragraph>
  </Section>
  <Section position="10" start_page="123" end_page="123" type="metho">
    <SectionTitle>
6 The Role of Lexical Databases
</SectionTitle>
    <Paragraph position="0"> Because of the irregular orthography of CJK languages, procedures such as orthographic normalization cannot be based on statistical and probabilistic methods (e.g. bigramming) alone, not to speak of pure algorithmic methods. Many attempts have been made along these lines, as for example Brill (2001) and Goto et al. (2001), with some claiming performance equivalent to lexicon-driven methods, while Kwok (1997) reports good results with only a small lexicon and simple segmentor.</Paragraph>
    <Paragraph position="1"> Emerson (2000) and others have reported that a robust morphological analyzer capable of processing lexemes, rather than bigrams or ngrams, must be supported by a large-scale computational lexicon. This experience is shared by many of the world's major portals and MT developers, who make extensive use of lexical databases. null Unlike in the past, disk storage is no longer a major issue. Many researchers and developers, such as Prof. Franz Guenthner of the University of Munich, have come to realize that &amp;quot;language is in the data,&amp;quot; and &amp;quot;the data is in the dictionary,&amp;quot; even to the point of compiling full-form dictionaries with millions of entries rather than rely on statistical methods, such as Meaningful Machines who use a full form dictionary containing millions of entries in developing a human quality Spanish-to-English MT system.</Paragraph>
    <Paragraph position="2"> CJKI, which specializes in CJK and Arabic computational lexicography, is engaged in an ongoing research and development effort to compile CJK and Arabic lexical databases (currently about seven million entries), with special emphasis on proper nouns, orthographic normalization, and C2C. These resources are being subjected to heavy industrial use under real-world conditions, and the feedback thereof is being used to further expand these databases and to enhance the effectiveness of the NLP tools based on them.</Paragraph>
  </Section>
class="xml-element"></Paper>