File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/00/c00-1026_metho.xml
Size: 21,550 bytes
Last Modified: 2025-10-06 14:07:08
<?xml version="1.0" standalone="yes"?> <Paper uid="C00-1026"> <Title>Automatic Semantic Classification for Chinese Unknown Compound Nouns</Title> <Section position="3" start_page="173" end_page="175" type="metho"> <SectionTitle> 2. Representation Models </SectionTitle> <Paragraph position="0"> Compounds are a very productive type of unknown word. Nominal and verbal compounds are easily coined by combining two or more words or characters.</Paragraph> <Paragraph position="1"> Since there are more than 5000 commonly used Chinese characters, each with idiosyncratic syntactic behavior, it is hard to derive a set of morphological rules that generates the set of Chinese compounds without over-generation or under-generation. The set of general compounds is an open class. The strategy for automatic identification therefore relies not only on morpho-syntactic structures but also on morpho-semantic relations. In general, certain interpretable semantic relationships between morphemic components must hold. However, there is no absolute means to judge whether the semantic relations between morphemic components are acceptable, i.e. the acceptability of such compounds is not simply 'yes' or 'no'. The degree of properness of a compound should depend on the logical relation between its morphemic components, and their logicalness should be judged by common sense knowledge.</Paragraph> <Paragraph position="2"> It is almost impossible to implement a system with common sense knowledge. Chen & Chen (1998) proposed an example-based measurement to evaluate the properness of a newly coined compound instead. 
They postulate that, for a newly coined compound, if the semantic relation of its morphemic components is similar to that of existing compounds, then it is more likely that this newly coined compound is proper.</Paragraph> <Section position="1" start_page="173" end_page="175" type="sub_section"> <SectionTitle> 2.1 Example-based similarity measure </SectionTitle> <Paragraph position="0"> Suppose that a compound has the structure XY, where X and Y are morphemes, and suppose without loss of generality that Y is the head. For instance, 'learn-word-machine' is a noun compound whose head morpheme Y is 機 'machine' and whose modifier X is 'learn-word'. In fact the morpheme 機 has four different meanings: 'machine', 'airplane', 'secret' and 'opportunity'. How do computers judge which one is the right meaning, and whether the compound construction is well-formed or logically meaningful? First of all, the examples with the head morpheme 機 are extracted from corpora and dictionaries. The examples are classified according to their meaning as shown in Table 1. The meaning of 'learn-word-machine' is then determined by comparing the similarity between it and each class of examples. The input unknown word is assigned the meaning of the class whose morpho-semantic structures are most similar to it. The similarity measure is based on the following formula.</Paragraph> <Paragraph position="1"> Suppose that each class of examples forms the following semantic relation rules. The rules show the possible semantic relations between prefix X and suffix Y, and their weights in terms of the frequency distribution of each semantic category of the prefixes in the class.</Paragraph> <Paragraph position="2"> Take the suffix 機 with the meaning 'machine' as an example. 
For the morpheme 機 'machine', the extracted compounds of the form X+機 'machine' and the semantic categories of the modifiers are shown in Table 2, and the morphological rule derived from them is in Table 3. The semantic types and their hierarchical structure are adopted from the Chilin (Mei et al. 1984). The similarity is measured between the semantic class of the prefix X of the unknown compound and the prefix semantic types shown in the rule. One of the measurements proposed is:</Paragraph> <Paragraph position="4"> where Sem is the semantic class of X. Max-value is the maximal value of {Σ_i Information-Load(S ∩ Sem_i) * Freq_i} over all semantic classes S. The Max-value normalizes the SIMILAR value to the range 0-1. S ∩ Sem_i denotes the least common ancestor of S and Sem_i. For instance, (Hh03 ∩ Hb06) = H. The Information-Load(S) of a semantic class S is defined as Entropy(semantic system) - Entropy(S). Simply speaking, it is the amount of entropy reduced after S is seen. Entropy(S) is defined as:</Paragraph> <Paragraph position="6"> where {Sem_1, Sem_2, ..., Sem_k} is the set of the bottom-level semantic classes contained in S.</Paragraph> <Paragraph position="8"> (Table caption fragment: the compounds of X-機 'machine'.) Take the word 'learning-word-machine' as an example. In Table 3, the results show the estimated similarity between the</Paragraph> <Paragraph position="10"> compound 'learning-word-machine' and the extracted examples. The similarity value is also considered as the logical properness value of this compound. In this case it is 0.67, which can be interpreted as 67% confidence that 'learning-word-machine' is a well-formed compound.</Paragraph> <Paragraph position="11"> The above representation model serves many functions. First of all, it serves as the morphological rules of the compounds. Second, it serves as a means to implement the evaluation function. 
Third, it serves as a means to disambiguate the semantic ambiguity of the morphological head of a compound noun. For instance, there are four different 機.</Paragraph> <Paragraph position="12"> They denote 'machine', 'airplane', 'opportunity' and 'secret' respectively and are considered as four different morphemes. The example shows that 'learn-word-machine' denotes a machine rather than the other senses, since matching the rules of 機 'machine' yields the highest evaluation score among them.</Paragraph> <Paragraph position="13"> The above discussion shows the basic concept and the baseline model of the example-based approach. The above similarity measure is called the over-all-similarity measure, since it gives equal weight to the similarity values of the input compound with every member in the class. Another similarity measure is called maximal-similarity, which is defined as follows. It takes the maximal value of the similarity between the input compound and every member in the class as the output.</Paragraph> <Paragraph position="15"> Both similarity measures are reasonable and have their own advantages. The experimental results showed that the combination of these two measures achieved the best performance with the weights w1 = 0.3 and w2 = 0.7 (Chen & Chen 1998), i.e. SIM = SIM1 * w1 + SIM2 * w2, where w1 + w2 = 1. We adopt this measure in our experiments.</Paragraph> <Paragraph position="16"> The experiments also showed a strong correlation between the similarity scores and the human evaluation scores on the properness of the testing compounds. Compounds that humans considered bad also received low similarity scores from the computer.</Paragraph> </Section> </Section> <Section position="4" start_page="175" end_page="177" type="metho"> <SectionTitle> 3. 
System Implementation </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="175" end_page="175" type="sub_section"> <SectionTitle> 3.1 Knowledge sources </SectionTitle> <Paragraph position="0"> To categorize unknown words, the computer system has to be equipped with linguistic and semantic knowledge about words, morphemes, and word formation rules. The knowledge is used to identify words, to categorize their semantic and syntactic classes, and to evaluate the properness of word formation and the confidence level of categorization. In our experiments, the available knowledge sources include: 1) CKIP lexicon: an 80,000 entry Chinese lexicon with syntactic categories for each entry (CKIP 1993).</Paragraph> <Paragraph position="1"> 2) Chilin: a thesaurus of synonym classes, which contains about 70,000 words distributed under 1428 semantic classes (Mei 1984).</Paragraph> <Paragraph position="2"> 3) Sinica Corpus: a 5 million word balanced Chinese corpus, word segmented and part-of-speech tagged (Chen 1996).</Paragraph> <Paragraph position="3"> 4) The Collocation Dictionary of Noun and Measure Words (CDNM): the CDNM lists collocating measure words for nouns. The nouns in this dictionary are arranged by their ending morpheme, i.e. head morpheme. There are 1910 noun ending morphemes and 12,352 example nouns grouped according to their different senses.</Paragraph> <Paragraph position="4"> Each knowledge source provides partial data for representing morphological rules, which include lists of sample compounds, high frequency morphemes, and their syntactic and semantic information. Unknown words and their frequencies can be extracted from the Sinica corpus. The extracted unknown words produce the testing data and the morpheme-category association strength, which are used in the algorithm for syntactic category prediction for unknown words (Chen et al. 1997). The CKIP dictionary provides the syntactic categories for morphemes and words. 
The Chilin provides the semantic categories for morphemes and words.</Paragraph> <Paragraph position="5"> The CDNM provides the set of high frequency noun morphemes and the example compounds grouped according to each different sense. The semantic categories for each sense are extracted from the Chilin and disambiguated manually.</Paragraph> <Paragraph position="6"> The sample compounds for each sense-differentiated morpheme extracted from the CDNM form the base samples for the morphological rules. Additional samples are supplemented from the Chilin.</Paragraph> </Section> <Section position="2" start_page="175" end_page="176" type="sub_section"> <SectionTitle> 3.2 The algorithm for morphological analysis </SectionTitle> <Paragraph position="0"> The process of morphological analysis for compound words is very similar to the Chinese word segmentation process. It requires dictionary look-up for matching morphemes and resolution methods for the inherently ambiguous segmentations, such as the examples in 1). However, conventional word segmentation algorithms cannot be applied to morphological analysis without modification, since morpho-syntactic behavior is different from syntactic behavior. Since the structure of Chinese compound nouns is head-final and the most productive morphemes are monosyllabic, there is a simple and effective algorithm which agrees with these facts. This algorithm segments input compounds from left to right by the longest matching criterion (Chen & Liu 1992). It is clear that the left-to-right longest-matching algorithm prefers structures with a shorter head and a longer modifier.</Paragraph> </Section> <Section position="3" start_page="176" end_page="176" type="sub_section"> <SectionTitle> 3.3 Semantic categories of morphemes </SectionTitle> <Paragraph position="0"> The semantic categories of morphemes follow the thesaurus Chilin. This thesaurus is a lattice structure of concept taxonomy. 
Morphemes/words may have multiple classifications due to either ambiguous classification or inherent semantic ambiguities. For the ambiguous semantic categories of a morpheme, the lower ranking semantic categories are eliminated, leaving the higher-ranking semantic categories to compete during the identification process. For instance, in Table 2 only the major categories of each example are shown. Since the majority of morphemes are unambiguous, they compensate for the uncertainty caused by the semantically ambiguous morphemes. The rank of a semantic category of a morpheme depends on the order in which this morpheme occurs in its synonym group, since the Chilin arranges its entries in this way. In addition, due to the limited coverage of the Chilin, many of the morphemes are not listed. For the unlisted morphemes, we recursively apply the current algorithm to predict their semantic categories.</Paragraph> <Paragraph position="1"> 4. Semantic Classification and Ambiguity</Paragraph> </Section> <Section position="4" start_page="176" end_page="176" type="sub_section"> <SectionTitle> Resolution for Compound Nouns </SectionTitle> <Paragraph position="0"> The demand for a semantic classification system for compound nouns was first raised when the task of semantic tagging for a Chinese corpus was attempted. The Sinica corpus is a 5 million-word Chinese corpus with part-of-speech tagging. In this corpus there are 47,777 word types tagged as common nouns and only 12,536 of them are listed in the Chilin. They account for only 26.23%. In other words, the semantic categories of most of the common nouns are unknown. 
They will be the target for automatic semantic classification.</Paragraph> </Section> <Section position="5" start_page="176" end_page="176" type="sub_section"> <SectionTitle> 4.1 Derivation of morphological rules </SectionTitle> <Paragraph position="0"> A list of the most productive morphemes is first generated from the unknown words extracted from the Sinica corpus. The morphological rules of the set of the most productive head morphemes are derived from their examples. Both the CDNM and the Chilin provide some examples.</Paragraph> <Paragraph position="1"> So far there are 1910 head morphemes for compound nouns with examples in the system, and the number is increasing. They are all monosyllabic morphemes. Among the top 200 most productive morphemes, 51.5% are polysemous, and on average each has 3.5 different meanings. The current 1910 morphemes cover about 71% of the unknown noun compounds in the testing data. The remaining 29% uncovered noun morphemes are either polysyllabic morphemes or low frequency morphemes.</Paragraph> </Section> <Section position="6" start_page="176" end_page="177" type="sub_section"> <SectionTitle> 4.2 Semantic classification algorithm </SectionTitle> <Paragraph position="0"> The unknown compound nouns extracted from the Sinica corpus were classified according to the morphological representation by the similarity-based algorithm. The problems of semantic ambiguities and out-of-coverage morphemes were two major difficulties to be solved during the classification stage. The complete semantic classification algorithm is as follows: 1) For each input noun compound, apply the morphological analysis algorithm to derive the morphemic components of the input compound. 2) Determine the head morpheme and modifiers. The default head morpheme is the last morpheme of a compound.</Paragraph> <Paragraph position="1"> 3) Get the syntactic and semantic categories of the modifiers. 
If a modifier is also an unknown word, then apply this algorithm recursively to identify its semantic category. 4) For a head morpheme with representational rules, apply the similarity measure for each possible semantic class and output the semantic class with the highest similarity value. 5) If the head morpheme is not covered by the morphological rules, search for its semantic class in the Chilin. If its semantic class is not listed in the Chilin, then no answer can be found; if it is polysemous, then the top ranked classes will be the output.</Paragraph> <Paragraph position="2"> In step 1, the algorithm resolves the possible ambiguities of the morphological structures of the input compound. In step 3, the semantic categories of the modifier are determined. There are some complications. The first complication is that the modifier may have multiple semantic categories. In our current process, the categories of lower ranking order are eliminated. The remaining categories are processed independently. One of the semantic categories of the modifier, paired with one of the rules of the head morpheme, will achieve the maximal similarity value. Step 4 thus resolves the semantic ambiguities of both the head and the modifier. However, only the category of the head is our target of resolution. The second complication is that the modifier may also be unknown. If it is not listed in the Chilin, there is no way of knowing its semantic categories from the currently available resources. At step 4, the prediction of the semantic category of the input compound will then depend solely on the information about its head morpheme. If the head morpheme is unambiguous, then output the category of the head morpheme as the prediction.</Paragraph> <Paragraph position="3"> Otherwise, output the semantic category of the top ranked sense of the head morpheme. Step 5 handles the exceptional cases, i.e. 
no representational rule for head morphemes.</Paragraph> </Section> <Section position="7" start_page="177" end_page="177" type="sub_section"> <SectionTitle> 4.3 Experimental results </SectionTitle> <Paragraph position="0"> The system classifies the set of unknown common nouns extracted from the Sinica corpus. We randomly picked two hundred samples from the output for performance evaluation by examining the semantic classification manually. The accuracy of semantic classification is 84% and 81% for the first hundred samples and the second hundred samples respectively. We further classify the errors into different types. The first type is caused by selection errors while disambiguating polysemous head morphemes. The second type is caused by the fact that the meanings of some compounds are not the semantic composition of the meanings of their morphological components. The third type is caused by the fact that a few compounds are conjunctive structures rather than the head-modifier structure assumed by the system. The fourth type is caused by head-initial constructions. Other than the classification errors, there are 10 unidentifiable compounds, 4 and 6 in each set, because their head morphemes are listed neither in the system nor in the Chilin. Among the 190 identifiable head morphemes, 142 are covered by the morphological rules encoded in the system and 80 of them have multiple semantic categories.</Paragraph> <Paragraph position="1"> The semantic categories of the remaining 48 head morphemes were found in the Chilin. If the type 1 selection errors are all caused by the 80 morphemes with multiple semantic categories, then the accuracy of semantic disambiguation by our similarity-based measure is (80-</Paragraph> </Section> </Section> <Section position="5" start_page="177" end_page="178" type="metho"> <SectionTitle> 5. 
Further Remarks and Conclusions </SectionTitle> <Paragraph position="0"> In general, if an unknown word is extracted from corpora, neither its syntactic nor its semantic category is known. The syntactic categories are predicted first, according to its prefix-category and suffix-category associations as described in (Chen et al. 1997). For the top ranked syntactic predictions, the respective semantic representational rules or models are applied to produce the morpho-semantic plausibility of the unknown word under each syntactic categorization. For instance, if the predicted syntactic categories are either a common noun or a verb, the algorithm presented in this paper is carried out to classify its semantic category and produce its plausibility value for the noun category. A similar process should deal with the case of verbs and produce the plausibility of being a verb. The final syntactic and semantic prediction will be based on these plausibility values and the word's contextual environments (Bai et al. 1998, Ide 1998).</Paragraph> <Paragraph position="1"> The advantages of the current representational model are: 1) It is declarative. New examples and new morphemes can be added into the system without changing the processing algorithm, and the performance of the system might improve due to the increase in knowledge.</Paragraph> <Paragraph position="2"> 2) The representational model not only provides the semantic classification of the unknown words but also gives the plausibility value of a compound construction. 
This value could be utilized to resolve the ambiguous matching between competing compound rules.</Paragraph> <Paragraph position="3"> 3) The representational model can be extended to represent compound verbs.</Paragraph> <Paragraph position="4"> 4) It acts as one of the major building blocks of a self-learning system for linguistic and world knowledge acquisition in the Internet environment. The classification errors have three causes: a) some of the testing examples have no semantic composition property; b) some semantic classifications are too fine-grained, so there is no clear-cut difference between some classes and even human judges cannot make the right classification; c) there are not enough samples, so the similarity-based model does not work on suffixes with few or no sample data. The above classification errors can be resolved by collecting the new words that are semantically non-compositional into the lexicon and by adding new examples for each morpheme.</Paragraph> <Paragraph position="5"> The current semantic categorization system only roughly classifies the unknown compound nouns according to their semantic heads. In the future, deeper analysis of the semantic relations between modifier and head should also be carried out.</Paragraph> </Section> </Paper>
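The combined similarity measure of Section 2.1 and the sense selection in step 4 of Section 4.2 can be sketched as follows. This is a minimal illustration and not the authors' implementation. The equations for SIMILAR, Entropy and maximal-similarity did not survive in this text, so the forms used here are assumptions: the overall similarity is taken to be the frequency-weighted sum of information loads divided by Max-value, the entropy is the usual Shannon entropy over bottom-level classes, and the maximal similarity is normalized by the largest attainable information load. The class counts, the rules, and all function names are hypothetical, and the code layout (one letter, one letter, two digits) is inferred from the example codes Hh03 and Hb06.

```python
import math

# Hypothetical word counts per bottom-level semantic class (toy data).
COUNTS = {"Hh03": 4, "Hb06": 4, "Bo01": 4, "Bo02": 4}

# Hypothetical rules: head sense -> {modifier class: relative frequency}.
RULES = {
    "machine": {"Hh03": 0.5, "Hb06": 0.5},
    "airplane": {"Bo01": 1.0},
}

def lca(a, b):
    """Least common ancestor of two class codes; "" is the taxonomy root."""
    common = ""
    for n in (1, 2, 4):  # assumed level boundaries: letter, letter, two digits
        if a[:n] != b[:n]:
            break
        common = a[:n]
    return common

def entropy(node):
    """Shannon entropy over the bottom-level classes under node (assumed form)."""
    counts = [c for code, c in COUNTS.items() if code.startswith(node)]
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts)

def information_load(node):
    """Entropy(semantic system) - Entropy(node), as defined in Section 2.1."""
    return entropy("") - entropy(node)

def weighted_sum(sem, rule):
    """Sum of Information-Load(sem ∩ Sem_i) * Freq_i over the rule's classes."""
    return sum(information_load(lca(sem, s)) * f for s, f in rule.items())

def sim_overall(sem, rule):
    """Over-all similarity, normalized by Max-value over all classes."""
    max_value = max(weighted_sum(s, rule) for s in COUNTS)
    return weighted_sum(sem, rule) / max_value if max_value else 0.0

def sim_maximal(sem, rule):
    """Maximal similarity with any rule member (assumed normalization)."""
    best = max(information_load(lca(sem, s)) for s in rule)
    top = max(information_load(lca(c, s)) for c in COUNTS for s in rule)
    return best / top if top else 0.0

def combined(sem, rule, w1=0.3, w2=0.7):
    """SIM = SIM1 * w1 + SIM2 * w2, with the weights reported in the text."""
    return w1 * sim_overall(sem, rule) + w2 * sim_maximal(sem, rule)

def classify_head(sem, rules):
    """Pick the head sense whose rule is most similar to the modifier class."""
    return max(rules, key=lambda sense: combined(sem, rules[sense]))
```

With this toy data, `classify_head("Hh03", RULES)` selects the 'machine' sense, because a modifier of class Hh03 shares an ancestor with the modifiers recorded for 'machine' but only the taxonomy root with those recorded for 'airplane'. In a real system the counts and rules would come from the Chilin and the CDNM examples rather than from hand-written dictionaries.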