<?xml version="1.0" standalone="yes"?> <Paper uid="W06-0102"> <Title>Chinese Lexical Resource</Title> <Section position="4" start_page="9" end_page="10" type="metho"> <SectionTitle> 2 Existing Resources and Related Work </SectionTitle> <Paragraph position="0"> The construction and development of large lexical resources relies increasingly on corpus-based approaches, not only because large corpora are more readily available, but also because of the authoritativeness and authenticity that the approach allows. The Collins COBUILD English Dictionary (Sinclair, 1987) is among the best-known lexicographic products based on large corpora.</Paragraph> <Paragraph position="1"> For natural language applications, much of the information in conventional dictionaries targeted at human readers must be made explicit. Lexical resources for computer use thus need considerable manipulation, customisation, and supplementation (e.g. Calzolari, 1982). WordNet (Miller et al., 1990), which groups words into synsets and links them with relational pointers, is probably the first broad-coverage general computational lexical database. In view of the intensive time and effort required in resource building, some researchers have taken an alternative route by extracting information from existing machine-readable dictionaries and corpora semi-automatically (e.g. Vossen et al., 1989; Riloff and Shepherd, 1999; Lin et al., 2003).</Paragraph> <Paragraph position="2"> Compared to the development of thesauri and lexical databases, and research into semantic networks for major languages such as English, similar work for the Chinese language is less mature. 
This gap was partly due to the lack of authoritative Chinese corpora as a basis for analysis, but it has gradually narrowed with the recent availability of large Chinese corpora, including the LIVAC synchronous corpus (Tsou and Lai, 2003) used in this work and further described below, the Sinica Corpus (Chen et al., 1996), the Chinese Penn Treebank (Xia et al., 2000), and the like.</Paragraph> <Paragraph position="3"> An important issue which is seldom addressed in the construction of Chinese lexical databases is the problem of versatility and portability. For a language such as Chinese, which is spoken in many different communities, different linguistic norms have emerged as a result of the independent evolution and development of the language within each community and culture. Such variations are seldom adequately reflected in existing lexical resources, which often draw on only one particular source. For instance, Tongyici Cilin (Tong Yi Ci Ci Lin) (Mei et al., 1984) is a thesaurus containing some 70,000 Chinese lexical items in the tradition of Roget's Thesaurus for English, that is, in a hierarchy of broad conceptual categories. First published in the 1980s, it was based exclusively on Chinese as used in post-1949 Mainland China.</Paragraph> <Paragraph position="4"> Thus for the subway example above, the closest word group found contains only Huo Che, Lie Che (train); the subway itself is absent, let alone its regional variations. With the recent availability of large corpora, especially synchronous ones, constructing an authoritative and timely lexical resource for Chinese is more attainable than it was in the past. A large synchronous corpus provides authentic examples of the language as used in a variety of locations. 
It thus enables us to attempt a comprehensive and in-depth analysis of the core common language in constructing a lexical resource, and to incorporate useful information relating to location-sensitive linguistic variations.</Paragraph> </Section> <Section position="5" start_page="10" end_page="10" type="metho"> <SectionTitle> 3 Proposal of a Pan-Chinese Thesaurus </SectionTitle> <Paragraph position="0"> The Pan-Chinese lexicon proposed by Tsou and Kwong (2006) is expected to capture not only the core senses of lexical items but also senses and uses specific to individual Chinese speech communities.</Paragraph> <Paragraph position="1"> The lexical database will be organised into a core database and a supplementary one. The core database will contain the core lexical information for word senses and usages which are common to most Chinese speech communities, whereas the supplementary database will contain the language uses specific to individual communities, including &quot;marginal&quot; and &quot;sublanguage&quot; uses.</Paragraph> <Paragraph position="2"> A network structure will be adopted for the lexical items. The nodes could be sets of near-synonyms or single lexical items (in which case synonymy will be one type of link). The links will represent not only paradigmatic semantic relations but also syntagmatic ones (such as selectional restrictions).</Paragraph> <Paragraph position="3"> We thus begin by investigating in depth the regional variation of lexical items, especially domain-specific words, among several Chinese speech communities. In addition, we explore the potential of enriching existing resources as a start. 
In the following section, we discuss the Tongyici Cilin and the synchronous Chinese corpus used in this study in greater detail.</Paragraph> </Section> <Section position="6" start_page="10" end_page="11" type="metho"> <SectionTitle> 4 Materials and Method </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="10" end_page="10" type="sub_section"> <SectionTitle> 4.1 The Tongyici Cilin </SectionTitle> <Paragraph position="0"> The Tongyici Cilin (Tong Yi Ci Ci Lin) (Mei et al., 1984) is a Chinese synonym dictionary, more often described as a Chinese thesaurus in the tradition of Roget's Thesaurus for English. Roget's Thesaurus has about 1,000 numbered semantic heads, more generally grouped under higher level semantic classes and subclasses, and more specifically differentiated into paragraphs and semicolon-separated word groups. Similarly, some 70,000 Chinese lexical items are organised into a hierarchy of broad conceptual categories in the Tongyici Cilin. Its classification consists of</Paragraph> </Section> <Section position="2" start_page="10" end_page="11" type="sub_section"> <SectionTitle> 4.2 The LIVAC Synchronous Corpus </SectionTitle> <Paragraph position="0"> LIVAC (http://www.livac.org) stands for Linguistic Variation in Chinese Speech Communities. It is a synchronous corpus that has been under development at the Language Information Sciences Research Centre of the City University of Hong Kong since 1995 (Tsou and Lai, 2003). The corpus consists of newspaper articles collected regularly and synchronously from six Chinese speech communities, namely Hong Kong, Beijing, Taipei, Singapore, Shanghai, and Macau. Texts collected cover a variety of domains, including front page news stories, local news, international news, editorials, sports news, entertainment news, and financial news. 
As of December 2005, the corpus had accumulated about 180 million character tokens which, upon automatic word segmentation and manual verification, amount to over 900K word types.</Paragraph> <Paragraph position="1"> For the present study, we make use of the subcorpora collected over the 9-year period 1995-2004 from Hong Kong (HK), Beijing (BJ), Taipei (TW), and Singapore (SG). In particular, we focus on the financial news and sports news to investigate the commonality and uniqueness of the lexical items used in these specific domains in the various communities. We also evaluate the adequacy of the Tongyici Cilin in terms of its coverage of such domain-specific terms, especially from the Pan-Chinese perspective, and thus assess the room for its enrichment with the synchronous corpus. Table 1 shows the sizes of the subcorpora used for this study.</Paragraph> </Section> <Section position="3" start_page="11" end_page="11" type="sub_section"> <SectionTitle> 4.3 Procedures </SectionTitle> <Paragraph position="0"> Word-frequency lists were generated from the financial and sports subcorpora of each individual community. For each resulting list, the steps below were followed to remove irrelevant items and retain only the potentially useful content words: (a) Remove all numbers and non-Chinese words. (b) Remove all proper names, including those annotated as personal names, geographical names, and organisation names. Proper names were annotated in the corpora during word segmentation.</Paragraph> <Paragraph position="1"> (c) Remove function words.</Paragraph> <Paragraph position="2"> (d) Remove lexical items with frequency 5 or below.</Paragraph> <Paragraph position="3"> The numbers of items remaining in each subcorpus after the above steps are listed in Tables 2 and 3 for the two domains respectively. The lexical items retained, which are expected to contain a substantial number of content words, are potentially useful for the current study. 
The lists in each domain (from the various subcorpora) were compared in terms of the items they share and those unique to individual communities. The unique items were also compared against the Tongyici Cilin to investigate its adequacy and explore how it might be enriched with the synchronous corpus.</Paragraph> </Section> </Section> <Section position="7" start_page="11" end_page="14" type="metho"> <SectionTitle> 5 Results and Discussion </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="11" end_page="11" type="sub_section"> <SectionTitle> 5.1 Lexical Items from LIVAC </SectionTitle> <Paragraph position="0"> The four subcorpora differ considerably in size in the financial domain, and slightly less so in the sports domain. Despite this, Tables 2 and 3 show that, in both domains, about 40-50% of all word types are numbers, non-Chinese words, proper names, and function words. Of the remaining items, about 20-30% have frequency greater than 5. These several thousand word types from each subcorpus are expected to be amongst the more interesting items and form the &quot;candidate sets&quot; for further investigation.</Paragraph> </Section> <Section position="2" start_page="11" end_page="12" type="sub_section"> <SectionTitle> 5.2 Commonality among Various Regions </SectionTitle> <Paragraph position="0"> Tables 4 and 5 compare the candidate sets from the various subcorpora, which reflect the use of Chinese in different Chinese speech communities, and show the sizes of the intersection sets among different places for the two domains respectively.</Paragraph> <Paragraph position="1"> The intersection set for all four places contains slightly more than 1,000 lexical items in the financial domain. A quick skim through these common lexical items suggests that they contain, on the one hand, the many general concepts in the financial domain (e.g. 
Gong Si company, Shi Chang market, Yin Xing bank, Tou Zi invest / investment, Ye Wu business, Fa Zhan develop / development, Ji Tuan corporation, Gu Fen stock shares, Gu Dong shareholder, Zi Jin capital, etc.); and on the other hand, many reportage and cognitive verbs often used in news articles (e.g. Biao Shi express, Ren Wei reckon, Chu Xian appear, Fan Ying reflect, etc.).</Paragraph> <Paragraph position="2"> In the sports domain, more than 1,700 lexical items were found in all of the four subcorpora.</Paragraph> <Paragraph position="3"> As in the financial domain, we found many general concepts at the top of the list (e.g. Qiu Yuan player, Qiu Dui team, Sai Shi match, Bi Sai competition, Lian Sai league, Jiao Lian coach, Dui Shou opponent, Guan Jun champion, etc.).</Paragraph> <Paragraph position="4"> The overlap counts in Table 4 suggest that lexical items used in Mainland China (as evident from BJ data) have the least in common with the rest. For instance, compared to the overlap amongst all four regions (i.e. 1,039), the overlap increases most when BJ is excluded from the comparison; and when we compare any two regions, the overlap between BJ and TW is the smallest. Nevertheless, such uniqueness of BJ data is less apparent in the sports domain. 
In particular, the difference between HK/BJ and BJ/TW is even slightly less than that in the financial domain.</Paragraph> <Paragraph position="5"> If we look at the individual regions, HK apparently shares most (about 50%) with SG, and vice versa (about 68%), in the financial domain.</Paragraph> <Paragraph position="6"> At the same time, BJ also shares more with HK than with the other two regions, and so does TW.</Paragraph> <Paragraph position="7"> But surprisingly, BJ has over 60% overlap with SG and about 55% with TW in the sports domain.</Paragraph> <Paragraph position="8"> The overlaps of TW with HK and with BJ differ by more than 20% in the financial domain, but only by about 10% in the sports domain. All these patterns might suggest that lexical items in the financial domain are more versatile and have a more varied focus in different communities, whereas those in the sports domain reflect the more common interests of different places.</Paragraph> <Paragraph position="9"> Regions Overlap Proportion to individual lists (%)</Paragraph> <Paragraph position="11"/> <Paragraph position="13"/> </Section> <Section position="3" start_page="12" end_page="13" type="sub_section"> <SectionTitle> 5.3 Uniqueness of Various Regions </SectionTitle> <Paragraph position="0"> Next we compared the lists with respect to the items unique to each. Table 6 shows the numbers of unique items found in each list, together with examples from the most frequent 20 unique items in each case.</Paragraph> <Paragraph position="1"> Again, taking the size difference among the candidate sets into account, about 40% of the lexical items found in HK data are unique to the region, which echoes the versatility and wide coverage of interests of HK data. 
This is especially evident given that only about 20% of the candidate sets for SG are unique to Singapore (see footnote 1). Looking at the unique lexical items found in individual regions, it is not difficult to see the region-specific lexicalisation of certain concepts. For instance, in terms of housing, Ju Wu (housing under the Home Ownership Scheme) is a specific kind of housing in Hong Kong, Zu Wu is a specific term in Singapore (as seen in SG data), whereas housing is generally expressed as Zhu Fang in Mainland China (as seen in BJ data). Similarly, Cao Lian (HK) and Dong Xun (BJ) both refer to training, but may relate to different practices in the two communities. Such regional variation lends strong support to the importance of a Pan-Chinese lexical resource.</Paragraph> <Paragraph position="2"> The lists of unique items also suggest the differing focus and orientation of the various Chinese speech communities. For example, while Hong Kong pays much attention to the real estate market and stock market, Mainland China may be focusing more on basic needs like water, farming, poverty alleviation, etc., and Singapore is relatively more concerned with local affairs like port management. The passion for baseball, among other more popular sports like soccer, is most obvious from the unique lexical items found in TW data.</Paragraph> </Section> <Section position="4" start_page="13" end_page="14" type="sub_section"> <SectionTitle> 5.4 Comparison with Tongyici Cilin </SectionTitle> <Paragraph position="0"> As mentioned earlier, the Tongyici Cilin contains some 70,000 lexical items under 12 broad semantic classes, 94 subclasses, and 1,428 heads.</Paragraph> <Paragraph position="1"> It was first published in the 1980s and was based on lexical usages mostly of post-1949 Mainland China. 
In this section, we discuss the results obtained from comparing the unique lexical items found in individual subcorpora with Cilin, which are shown in Table 7.</Paragraph> <Paragraph position="2"> On the one hand, Cilin's collection of words may be considerably dated and obviously will not include new concepts and neologisms arising in the last two decades. On the other hand, the data in LIVAC come from newspaper materials in the 1990s. Overall, therefore, well under 50% of each unique word list is covered in Cilin.</Paragraph> <Paragraph position="3"> [Footnote 1: Upon further analysis, on average about 60% of these &quot;unique&quot; items were actually found in one or more of the other regions, but with frequency 5 or below. Since the difference in frequency is quite large for most items, we can reasonably treat them as unique to a particular community.] Nevertheless, there is still an apparent disparity in Cilin's coverage of the unique items from different places. About 40% of the unique items found in BJ for both domains are covered; but for other places, the coverage is more often less than 30% in either or both domains. Again, this could be considered a result of Cilin's bias toward lexical usages in Mainland China.</Paragraph> <Paragraph position="4"> In addition, while almost 40% of the unique items in BJ data are found in Cilin, many of the covered items are amongst the most frequent ones. In contrast, even though about 560 unique items in HK data are also found in Cilin, only 3 of the 20 most frequent items are amongst them. Moreover, apparent coverage does not necessarily imply a correct match of word senses. For instance, Ju Wu is found under head Bn1 together with other items like Zhu Fang, Zhu Zhai, etc., all of which only refer to the general concept of housing, instead of the housing specifically under the Home Ownership Scheme as known in Hong Kong. 
Also, coverage of words like Chen Xi, Di Wang and Shui Shou in the sports domain does not match their actual usages, which refer to team names. A more interesting example might be Huo Guo, which is used in the basketball context in TW data, and in no way refers to the literal &quot;hot pot&quot; sense.</Paragraph> <Paragraph position="5"> Results from the above comparisons thus support the conclusions that (1) different Chinese speech communities have their distinct usage of Chinese lexical items, in terms of both form and sense; (2) such variation is found in different domains, such as the financial and sports domains; (3) existing lexical resources, the Tongyici Cilin in particular as in our current study, should be enriched and enhanced by capturing lexical usages from a variety of Chinese speech communities, to represent the lexical items from a Pan-Chinese perspective; and (4) lexical items obtained from the synchronous Chinese corpus can supplement the existing content of the Tongyici Cilin with more contemporarily lexicalised concepts, as well as variant expressions of similar and related concepts from various Chinese speech communities.</Paragraph> <Paragraph position="6"> Hence it remains for us to further investigate how the related lexical items obtained from the synchronous corpus should be grouped and incorporated into the semantic classification of existing lexical resources, and to further explore how they might be extracted on a large scale by automatic means. These are amongst the most important future directions, as discussed in the next section.</Paragraph> </Section> </Section> <Section position="8" start_page="14" end_page="14" type="metho"> <SectionTitle> 6 Future Work </SectionTitle> <Paragraph position="0"> In the current study, we have investigated the regional variation of lexical items from the financial and sports domains, and the coverage of the Tongyici Cilin for such variation. 
The results suggest great potential for building a Pan-Chinese lexical resource for Chinese language processing. Our next step is thus to investigate more automatic means for extracting the near-synonymous or closely related items from the various subcorpora. To this end, we will explore algorithms like those used in Lin et al. (2003). Of similar importance is the mechanism for grouping the related lexical items and incorporating them into the semantic classifications of existing lexical resources. In this regard we will proceed with further in-depth analysis of the classificatory structures of individual resources and fit them into our Pan-Chinese architecture. Apart from the Tongyici Cilin, there are other existing Chinese lexical resources such as HowNet (Dong and Dong, 2000), SUMO and Chinese WordNet (Huang et al., 2004), as well as other synonym dictionaries from which we might draw reference to build up our Pan-Chinese lexical resource.</Paragraph> </Section> </Paper>