<?xml version="1.0" standalone="yes"?> <Paper uid="W06-0102"> <Title>Chinese Lexical Resource</Title> <Section position="3" start_page="0" end_page="9" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Many cities have underground railway systems.</Paragraph> <Paragraph position="1"> Somehow, one takes the tube in London but the subway in New York. In a more recent edition of Roget's Thesaurus (Kirkpatrick, 1987), subway, tube, underground railway and metro are found in the same semicolon-separated group under head 624 Way. Similarly, in WordNet (http://wordnet.princeton.edu; Miller et al., 1990), the synset to which subway belongs also contains the words metro, tube, underground, and subway system, and the entry further notes that &quot;in Paris the subway system is called the 'metro' and in London it is called the 'tube' or the 'underground'&quot;. Such regional lexical variation is also found in Chinese. For instance, the subway system in Hong Kong, known as the Mass Transit Railway or MTR, is called Di Tie in Chinese. The subway systems in Beijing and Shanghai, as well as the one in Singapore, are also known as Di Tie, but the one in Taipei is known as Jie Yun. Their counterpart in Japan is written as Di Xia Tie in Kanji. Such regional variation, as part of lexical knowledge, is important and useful for many natural language applications, including natural language understanding, information retrieval, and machine translation. Unfortunately, existing Chinese lexical resources often lack coverage of such regional variation.</Paragraph> <Paragraph position="2"> To fill this gap, Tsou and Kwong (2006) proposed a comprehensive Pan-Chinese lexical resource, built on a large and unique synchronous Chinese corpus that serves as an authentic basis for lexical acquisition and analysis across various Chinese speech communities. 
For a significant world language like Chinese, a useful lexical resource should have maximum versatility and portability.</Paragraph> <Paragraph position="3"> It is not sufficient to target one particular community speaking the language and thus cover only the language usage observed in that community. Instead, such a lexical resource should document the core and universal substance of the language on the one hand, and the more subtle variations found in different communities on the other. As is evident from the subway example above, regional variation must be captured for the lexical resource to be useful in a wide range of applications. In this study, we investigate and compare the regional variation of lexical items from two specific domains, finance and sports, as an initial and necessary step toward the more important undertaking of building a Pan-Chinese lexical resource. In addition, we use an existing Chinese synonym dictionary, the Tongyici Cilin (Mei et al., 1984), as leverage, and explore its coverage of such variation and thus the potential for enriching it. The lexical items under study were obtained from a synchronous Chinese corpus, LIVAC, which is introduced further in Section 4. Corpus data from four Chinese speech communities were compared with respect to their commonality and uniqueness, and also against Cilin for coverage. Results showed that 20-40% of the words extracted from the corpus are unique to the individual communities, and as many as 70% of such unique items are not yet covered in Cilin. This suggests that the synchronous corpus is a rich source for mining region-specific lexical items, and that there is great potential for building a Pan-Chinese lexical resource for Chinese language processing.</Paragraph> <Paragraph position="4"> In Section 2, we will briefly review existing resources and related work. 
Then, in Section 3, we will outline the design and architecture of the Pan-Chinese lexical resource proposed by Tsou and Kwong (2006). In Section 4, we will further describe the Chinese synonym dictionary and the synchronous Chinese corpus used in this study. The comparison of their lexical items will be discussed in Section 5. Future directions will be presented in Section 6, followed by a conclusion.</Paragraph> </Section> </Paper>