<?xml version="1.0" standalone="yes"?>
<Paper uid="I05-3008">
  <Title>Word Meaning Inducing via Character Ontology: A Survey on the Semantic Prediction of Chinese Two-Character Words. Shu-Kai Hsieh, Seminar für Sprachwissenschaft</Title>
  <Section position="3" start_page="0" end_page="56" type="metho">
    <SectionTitle>
2 Word Meaning Inducing via Character Meaning
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="56" type="sub_section">
      <SectionTitle>
2.1 Morpho-Semantic Description
</SectionTitle>
      <Paragraph position="0"> As is well known, &quot;bound roots&quot; form the largest class of morpheme types in Chinese morphology; they are very productive and carry lexical rather than grammatical information (Packard 2000).</Paragraph>
      <Paragraph position="1"> This morphological phenomenon leads many Chinese linguists to view word components (i.e., characters) as building blocks in the semantic composition of di- or multisyllabic words. Many empirical studies (Tseng and Chen 2002; Tseng 2003; Lua 1993; Chen 2004) have repeatedly confirmed this view.</Paragraph>
      <Paragraph position="2"> In semantic studies of Chinese word formation, many descriptive and cognitive semantic approaches have been proposed, such as argument structure analysis (Chang 1998) and frame-based semantic analysis (Chu 2004). However, these qualitative explanatory models often suffer from a lack of predictability at one end of the spectrum, or from over-generation at the other. [Footnote 1: ... argument structure and theta-grid in Chinese V-V compounds, Chang (1998) found some examples which may satisfy the semantic and syntactic constraints but may not be acceptable to native speakers.] Empirical data have also shown that in many cases (e.g., the abundance of phrasal lexical units in any natural language) the principle of compositionality in a strict sense is highly questionable. This principle, that &quot;the meaning of a complex expression can be fully derivable from the meanings of its component parts, and from the schemas which sanction their combination&quot; (Taylor 2002), is taken to be a fundamental proposition in some morpho-semantically motivated analyses.</Paragraph>
      <Paragraph position="3"> This has given rise to the consideration of the embeddedness of linguistic meanings within broader conceptual structures. In what follows, we argue that an ontology-based approach provides an interesting and efficient perspective on the character-triggered morpho-semantic analysis of Chinese words.</Paragraph>
    </Section>
    <Section position="2" start_page="56" end_page="56" type="sub_section">
      <SectionTitle>
2.2 Conceptual Aggregate in Compounding: A Shift Toward Character Ontology
</SectionTitle>
      <Paragraph position="0"> In prior studies, it is widely presumed that the category (be it syntactic or semantic) of a word is strongly associated with that of its component characters. The semantic compositionality underlying two-character words appears under different terms in the literature (see footnote 2). Word semantic similarity calculation techniques have commonly been used to retrieve similar compositional patterns on the basis of a semantic taxonomic thesaurus. However, one weak point of these studies is that they are unable to separate the conceptual and semantic levels. A problem arises when words that are conceptually correlated are not necessarily semantically correlated, viz., they might or might not be physically close in the CILIN thesaurus (Mei et al. 1998).</Paragraph>
      <Paragraph position="1"> On closer observation, we found that most synonymic words (i.e., those with the same CILIN semantic class) have characters which carry similar conceptual information. This is best illustrated by example. Table 1 shows the conceptual distribution of the modifiers of an example VV compound, presuming the second character | to be the head.</Paragraph>
      <Paragraph position="2"> [Footnote 2: Using statistical techniques, Lua (1993) found that each Chinese two-character word is a result of 16 types of semantic transformation patterns, which are extracted from the meanings of its constituent characters. In Chen (2004), the combination pattern is referred to as a compounding semantic template.]</Paragraph>
      <Paragraph position="3"> The first column is the CILIN semantic class (middle level), the second column lists the instances with their lower-level classification numbers, and the third column lists their conceptual types, adopted from a character ontology that we discuss later. As we can see, though there are 12 resulting semantic classes for the * | compounds, the modifier components of these compounds involve only 4 concept types as follows:</Paragraph>
      <Paragraph position="5"> We define these patterns as conceptual aggregate patterns in compounding. Unlike statistical measures of co-occurrence restrictions or association strength, a conceptual aggregate pattern provides a more knowledge-rich scenario that represents the specific manner in which concepts are aggregated in the ontological background, and how they affect the compound words. We propose that the semantic class prediction of Chinese two-character words can be improved by making use of the conceptual aggregate patterns of their head/modifier components.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="56" end_page="57" type="metho">
    <SectionTitle>
3 Semantic Prediction of Unknown Two-Character Words
</SectionTitle>
    <Paragraph position="0"> The practical task addressed here is the automatic classification of Chinese two-character words into a predetermined number of semantic classes. The difficulties encountered in previous research can be summarized as follows. First, many models (Chen and Chen 1998; 2000) cannot deal with the &quot;incompleteness&quot; of characters in the lexicon, because they depend heavily on CILIN, a Chinese thesaurus containing only about 4,133 monosyllabic morphemic components (characters). As a result, if an unknown word contains characters that are not listed in CILIN, the prediction task cannot be performed automatically. Second, the ambiguity of characters is often sidestepped by manual pre-selection of character meanings in the training step, which makes a fully automatic procedure very difficult. Third, it has long been assumed (Lua 1997; Chen and Chen 2000) that the overwhelming majority of Chinese compounds are more or less endocentric, i.e., the compound denotes a hyponym of its head component. E.g., Us (&quot;electric-mail&quot;; e-mail) is a kind of mail. So the process of identifying the semantic class of a compound boils down to finding and determining the semantic class of its head morpheme. However, there are also a number of exocentric and appositional compounds [Footnote 3: Lua reports a result of 14.14% (Z3 type).] for which no straightforward criteria can be given to determine the head component. For example, in the case of the VV compound o1/2 (&quot;denounce-scold&quot;, drop-on), it is difficult (and subjective) to say which character is the head that can assign a semantic class to the compound.</Paragraph>
    <Paragraph position="1"> To solve the above-mentioned problems, Chen (2004) proposed a non-head-oriented character-sense association model that retrieves the latent senses of characters and the latent synonymous compounds among characters by measuring the similarity of compounding semantic templates using an MRD (machine-readable dictionary). However, as the author remarked in the final discussion of classification errors, the performance of this model relies heavily on the productivity of the compounding semantic templates of the target compounds. Correctly predicting the semantic category of a compound with an unproductive semantic template is no doubt very difficult, owing to the sparseness of template-similar compounds. In addition, the statistical measure of sense association tells us nothing more about the constraints and knowledge of conceptual combination.</Paragraph>
    <Paragraph position="2"> In the following, we propose that a knowledge resource at the morpheme (character) level could be a straightforward remedy for these problems. By treating characters as instances of conceptual primitives, a character ontology can provide an interpretation of the conceptual grounding of word senses. At a coarse grain, the character ontological model has the advantage of efficiently defining the conceptual space within which character-grounded concept primitives and their relations are implicitly located.</Paragraph>
  </Section>
  <Section position="5" start_page="57" end_page="61" type="metho">
    <SectionTitle>
4 A Proposed Character Ontology-based Approach
</SectionTitle>
    <Paragraph position="0"> In carrying out the semantic prediction task, we adopt the context-freeness hypothesis, i.e., we do not resort to any contextual information.</Paragraph>
    <Paragraph position="1"> This choice is based on the observation that native speakers seem to reconstruct a new conceptual structure locally when processing unknown compound words. It also has an advantage for unknown words that occur only once and hence have limited context.</Paragraph>
    <Paragraph position="2"> In general, the approach proposed here differs in some ways from previous research based on the following presuppositions:</Paragraph>
    <Section position="1" start_page="58" end_page="58" type="sub_section">
      <SectionTitle>
4.1 Character Ontology as a Knowledge Resource
</SectionTitle>
      <Paragraph position="0"> The new model presented below relies on a coarse-grained upper-level ontology of characters. [footnote 4] This character ontology is a tree-structured conceptual taxonomy in which only two kinds of relations are allowed: INSTANCE-OF (i.e., certain characters are instances of certain concept types) and IS-A (i.e., a certain concept type is a kind of another concept type).</Paragraph>
      <Paragraph position="1"> In the character ontology, monosyllabic characters [footnote 5] are assigned to at least one [footnote 6] of 309 consets (concept sets), a new term defined as a type of concept sharing a given putatively primitive meaning. For instance, the characters z (speak), [?] (chatter), x (say), ; (say), u (tell), s (inform), f (explain), A (narrate), , (be called) and H (state) are assigned to the same conset.</Paragraph>
      <Paragraph position="2"> Following the basic line of the OntoClean methodology (Guarino and Welty 2002), we use simple monotonic inheritance, which means that each node inherits properties from a single ancestor only, and an inherited value cannot be overwritten at any point in the ontology. The decision to keep the relations to one single parent was made in order to guarantee that the structure can grow indefinitely and still be manageable, i.e., that the transitive quality of the relations between the nodes does not degenerate with size.</Paragraph>
      <Paragraph position="3"> Figure 1 shows a snapshot of the character ontology.</Paragraph>
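      <Paragraph> The two relations and the single-parent constraint described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the ontology's actual implementation: the class name Conset, the helper consets_of, the concept names ENTITY, EVENT and SPEAK, and the placeholder strings c1 to c3 (standing in for characters) are all hypothetical.</Paragraph>

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Conset:
    """A concept type with at most one parent, so the taxonomy is a tree."""
    name: str
    parent: Optional["Conset"] = None  # the single IS-A ancestor
    characters: List[str] = field(default_factory=list)  # INSTANCE-OF members

    def ancestors(self) -> List[str]:
        """Follow the IS-A chain from this node up to the root."""
        chain = []
        node = self.parent
        while node is not None:
            chain.append(node.name)
            node = node.parent
        return chain


def consets_of(char: str, consets: Dict[str, Conset]) -> List[str]:
    """Names of all consets that a character is an instance of."""
    return sorted(name for name, c in consets.items() if char in c.characters)


# Hypothetical toy fragment; names and members are placeholders.
root = Conset("ENTITY")
event = Conset("EVENT", parent=root)
speak = Conset("SPEAK", parent=event, characters=["c1", "c2", "c3"])
consets = {c.name: c for c in (root, event, speak)}
```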
    </Section>
    <Section position="2" start_page="58" end_page="59" type="sub_section">
      <SectionTitle>
4.2 Character-triggered Latent Near-synonyms
</SectionTitle>
      <Paragraph position="0"> The rationale behind this approach is that similar conceptual primitives, in terms of characters, probably participate in similar contexts or have similar meaning-inducing functions. This can be rephrased as the following presumptions:</Paragraph>
      <Paragraph position="1"> (1) Near-synonymic words often overlap in senses, i.e., they have the same or close semantic classes. (2) Words with characters which share similar conceptual information tend to form a latent cluster of synonyms.</Paragraph>
      <Paragraph position="2"> (3) This similar conceptual information can be formalized as conceptual aggregate patterns extracted from a character ontology. (4) Identifying such conceptual aggregate patterns might thus greatly benefit the automatic acquisition of near-synonyms, which provide a set of good candidates for predicting the semantic class of previously unknown words.</Paragraph>
      <Paragraph position="3"> The proposed semantic classification system first retrieves a set of near-synonym candidates using conceptual aggregate patterns. Lexicographic considerations then winnow the overgenerated candidates: the final list of near-synonym candidates is formed on the basis of CILIN's verdict as to which latent near-synonyms exist. The target unknown two-character word is then assigned the semantic class of the top-ranked near-synonym, as calculated by the similarity measure between them. This method has the advantage of avoiding the snag of the apparent multiplicity of semantic usages (ambiguity) of a character.</Paragraph>
      <Paragraph position="4"> Consider an example. Suppose that the semantic class of the two-character word \^ (protect; Hi37) is unknown. Presuming the leftmost character \ to be the head of the word and the rightmost character ^ to be the modifier, we first identify the conset to which the modifier ^ belongs. Other instances in this conset are \, o, {, &quot;, 7, G, e, ., 1, e, o, etc. So we can retrieve a set of possible near-synonym candidates by substitution, namely NS1: {\\, \o, \{, \&quot;, \7, \G, \e, \., \1, \e, \o}; in the same way, presuming ^ to be the head, we obtain a second set of possible near-synonym candidates, NS2: {^^, o^, {^, &quot;^, 7^, G^, e^, .^, 1^, e^, o^} [footnote 7]. Aligned with CILIN, those candidates which are also listed in CILIN are adopted as the final two lists of near-synonym candidates for the unknown word \^: NSprime1: {o^(Hi41), &quot;^(Hb04;Hi37), 7^(Hi47), e^(Hi37), o^(Hd01)} and NSprime2: {\G(Hl33), \o(Hj33), \e(Ee39)}.</Paragraph>
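      <Paragraph> The substitution-and-filter step in this example can be sketched as follows. This is a hedged illustration only: the function name near_synonym_candidates, the two dictionary arguments and the Latin placeholder characters are assumptions introduced here, and the toy thesaurus entries are invented for the sketch rather than taken from CILIN.</Paragraph>

```python
from typing import Dict, List, Set, Tuple


def near_synonym_candidates(
    word: str,
    conset_members: Dict[str, Set[str]],
    thesaurus: Dict[str, List[str]],
) -> Tuple[Dict[str, List[str]], Dict[str, List[str]]]:
    """Generate candidates by conset substitution, keep those in the thesaurus."""
    first, second = word[0], word[1]

    def conset_mates(char: str) -> List[str]:
        # All characters sharing at least one conset with `char`.
        mates: Set[str] = set()
        for members in conset_members.values():
            if char in members:
                mates |= members
        mates.discard(char)  # substituting a character by itself adds nothing
        return sorted(mates)

    # Keep the first character and substitute the second, then the reverse.
    keep_first = [first + c for c in conset_mates(second)]
    keep_second = [c + second for c in conset_mates(first)]

    # Only candidates attested in the thesaurus survive, with their classes.
    attested_first = {w: thesaurus[w] for w in keep_first if w in thesaurus}
    attested_second = {w: thesaurus[w] for w in keep_second if w in thesaurus}
    return attested_first, attested_second


# Hypothetical toy data: "AB" is the unknown word.
toy_consets = {"K1": {"B", "C", "D"}, "K2": {"A", "E"}}
toy_thesaurus = {"AC": ["Hi37"], "EB": ["Hd01"]}
list_a, list_b = near_synonym_candidates("AB", toy_consets, toy_thesaurus)
```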
    </Section>
    <Section position="3" start_page="59" end_page="59" type="sub_section">
      <SectionTitle>
4.3 Semantic Similarity Measure of Unknown Word and its Near-Synonyms
</SectionTitle>
      <Paragraph position="0"> Given two sets of character-triggered near-synonym candidates, the next step is to calculate the semantic similarity between the unknown word (UW) and these near-synonym candidates.</Paragraph>
      <Paragraph position="1"> The CILIN thesaurus is a tree-structured taxonomic semantic structure of Chinese words, which can be seen as a special case of a semantic network. The calculation of semantic similarity between nodes in the network can thus make use of the structural information represented in the network. Following the information content-based model, in measuring the semantic similarity between an unknown word and its candidate near-synonyms we use a metric modelled on that of Chen and Chen (2000), which is a simplification of the Resnik algorithm obtained by assuming that the occurrence probability of each leaf node is equal. Given two sets (NSprime1, NSprime2) of candidate near-synonyms, with m and n near-synonyms respectively, the similarity is calculated as in equations (1) and (2), where sc_uwc1 and sc_uwc2 are the semantic class(es) of the first and second morphemic components (i.e., characters) of a given unknown word, respectively, and sc_i and sc_j are the semantic classes of the first and second morphemic components on the lists of candidate near-synonyms NSprime1 and NSprime2. [Footnote 7: Note that in this case, \ and ^ happen to be in the same conset.]</Paragraph>
      <Paragraph position="2"> f is the frequency of the semantic classes, and the denominator is the total value of the numerator, for the purpose of normalization. b and 1 - b are the weights, which are discussed later. The Information Load (IL) of a semantic class sc is defined in Chen and Chen (2004):</Paragraph>
      <Paragraph position="4"> where q is the number of minimal semantic classes in the system [footnote 8] and p is the number of semantic classes subordinate to sc.</Paragraph>
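      <Paragraph> The Information Load equation itself is not reproduced in this text. As a rough sketch only, the following assumes the standard consequence of the equal-leaf-probability simplification stated above: a class sc covering p of the q minimal classes has probability p/q, so its information content is log2(q) - log2(p). The function name information_load is hypothetical, and the formula should be checked against Chen and Chen before being relied on.</Paragraph>

```python
import math


def information_load(p: int, q: int) -> float:
    """Information content of a class covering p of q equiprobable leaf classes.

    Assumption: IL(sc) = log2(q) - log2(p), i.e. minus log2 of p / q.
    """
    if 1 > p or p > q:
        raise ValueError("need 1 to q subordinate classes")
    return math.log2(q) - math.log2(p)


# A class spanning every minimal class carries no information; a single
# minimal class carries the most.
whole_system = information_load(16, 16)
quarter = information_load(4, 16)
```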
    </Section>
    <Section position="4" start_page="59" end_page="59" type="sub_section">
      <SectionTitle>
4.4 Circumventing &quot;Head-oriented&quot; Presupposition
</SectionTitle>
      <Paragraph position="0"> As remarked in Chen (2004), previous research on the automatic semantic classification of Chinese compounds (Lua 1997; Chen and Chen 2000) presupposes that compounds are endocentric. That is, on the supposition that a compound is composed of a head and a modifier, determining the semantic category of the target boils down to determining the semantic category of the head component.</Paragraph>
      <Paragraph position="1"> To circumvent the strict &quot;head-determination&quot; presumption, which may run into problems in some borderline cases of V-V compounds, the weight values (b and 1 - b) are proposed. The idea of weighting comes from the discussion of morphological productivity in Baayen (2001). We presume that, within a given two-character word, the more productive a character is, that is, the more characters it can combine with, the more likely it is to be the head, and the more weight should be given to it.</Paragraph>
      <Paragraph position="2"> The weight is defined as b = C(n,1)/N, viz., the number of candidate morphemic components divided by the total number N. For instance, in the above-mentioned example, NS1 should gain more weight than NS2, for ^ can combine with more characters (5 near-synonym candidates) than \ can (3 near-synonym candidates).</Paragraph>
      <Paragraph position="4"> In this case, b = 5/8 = 0.625. It is noted that the weight assignment should be character and position independent.</Paragraph>
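      <Paragraph> The weight computation in this example is simple enough to sketch directly. The function name head_weights and its argument names are assumptions for illustration; the arithmetic follows the 5 and 3 candidate counts given above.</Paragraph>

```python
from typing import Tuple


def head_weights(n_list_1: int, n_list_2: int) -> Tuple[float, float]:
    """Return (b, 1 - b) from the sizes of the two candidate lists."""
    total = n_list_1 + n_list_2
    if total == 0:
        raise ValueError("no near-synonym candidates on either list")
    b = n_list_1 / total
    return b, 1.0 - b


# Five candidates on one list and three on the other, as in the example.
b, one_minus_b = head_weights(5, 3)
```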
    </Section>
    <Section position="5" start_page="59" end_page="59" type="sub_section">
      <SectionTitle>
4.5 Experimental Settings
</SectionTitle>
      <Paragraph position="0"> The following resources are used in the experiments: (1) the Sinica Corpus [footnote 9], (2) the CILIN thesaurus (Mei et al. 1998) and (3) a Chinese character upper-level ontology [footnote 10]. (1) is a well-known balanced corpus of modern Chinese as used in Taiwan. (2) The CILIN thesaurus is widely accepted as a semantic categorization standard for Chinese words in Chinese NLP. In CILIN, a collection of about 52,206 Chinese words is grouped in a Roget's Thesaurus-like structure based on categories within which there are 3 levels of finer clustering (12 major, 95 middle and 1428 minor semantic classes). (3) is an ongoing project, the Hanzi-grounded Ontology and Lexicon, as introduced above.</Paragraph>
      <Paragraph position="1">  We conducted an open test experiment, which meant that the training data was different from the testing data. 800 two-character words in CILIN were chosen at random to serve as test data, and all the words in the test set were assumed to be unknown. The distribution of the grammatical categories of these data is: NN (200, 25%), VN (100, 12.5%) and VV (500, 62.5%).</Paragraph>
      <Paragraph position="2">  The baseline method assigns the semantic class of a randomly picked head component to the unknown word in question. Note that most morphemic components (characters) are ambiguous; in such cases, the semantic class is chosen at random as well.</Paragraph>
      <Paragraph position="3">  Briefly, the strategy for predicting the semantic class of an unknown two-character word is to measure the semantic similarity between the unknown word and its candidate near-synonyms, which are retrieved on the basis of the character ontology. For any unknown word UW with the character sequence C1C2, RANK(simu(b), simn(1 - b)) is computed. The semantic category sc of the candidate synonym with the value MAX(simu(b), simn(1 - b)) is the top-ranked guess for the target unknown word.</Paragraph>
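      <Paragraph> The final selection step can be sketched as below. Because equations (1) and (2) are not reproduced in this text, the sketch takes already-computed similarity scores as input and simply multiplies them by b and 1 - b; that multiplication, the function name top_ranked_class and the example scores are all assumptions made for illustration, not the paper's exact formulation.</Paragraph>

```python
from typing import List, Tuple

# (candidate word, semantic class, similarity to the unknown word)
Scored = Tuple[str, str, float]


def top_ranked_class(list_1: List[Scored], list_2: List[Scored], b: float) -> str:
    """Pick the semantic class of the highest weighted candidate."""
    weighted = [(sim * b, sc) for _, sc, sim in list_1]
    weighted += [(sim * (1.0 - b), sc) for _, sc, sim in list_2]
    if not weighted:
        raise ValueError("no near-synonym candidates")
    best_score, best_class = max(weighted)
    return best_class


# Hypothetical scores; the class codes follow the CILIN-style labels above.
guess = top_ranked_class(
    [("w1", "Hi37", 0.8), ("w2", "Hi41", 0.5)],
    [("w3", "Ee39", 0.9)],
    b=0.625,
)
```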
    </Section>
    <Section position="6" start_page="59" end_page="61" type="sub_section">
      <SectionTitle>
4.6 Results and Error Analysis
</SectionTitle>
      <Paragraph position="0"> The correctly predicted semantic class is the semantic class listed in CILIN. In the case of ambiguity, when the unknown word in question belongs to more than one semantic class, any one of its classes is considered correct in the evaluation.</Paragraph>
      <Paragraph position="1"> The SC prediction algorithm was run on the test data as an outside test at level-3 classification. The resulting accuracy is shown in Table 2. For comparison, Table 3 shows the shallower semantic classification (the 2nd level in CILIN).</Paragraph>
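      <Paragraph> The scoring rule just described (any gold class of an ambiguous word counts as correct, evaluated at level 3 or level 2) can be sketched as follows. The prefix lengths assume that class codes look like the examples in this paper, such as Hi37, with one letter for level 1, two letters for level 2 and the full four-character code for level 3; the function names and toy data are hypothetical.</Paragraph>

```python
from typing import Dict, List

# Assumed code layout: "H" (level 1), "Hi" (level 2), "Hi37" (level 3).
PREFIX_LEN = {1: 1, 2: 2, 3: 4}


def is_correct(predicted: str, gold_classes: List[str], level: int = 3) -> bool:
    """A prediction is correct if it matches any gold class at this level."""
    k = PREFIX_LEN[level]
    return any(predicted[:k] == gold[:k] for gold in gold_classes)


def accuracy(
    predictions: Dict[str, str], gold: Dict[str, List[str]], level: int = 3
) -> float:
    """Share of test words whose predicted class is correct at this level."""
    hits = sum(is_correct(predictions[w], gold[w], level) for w in gold)
    return hits / len(gold)


# Hypothetical two-word test set.
toy_gold = {"w1": ["Hb04", "Hi37"], "w2": ["Hi41"]}
toy_pred = {"w1": "Hi37", "w2": "Hi47"}
```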
      <Paragraph position="2"> Generally, without contextual information, the classifier is able to predict the meaning of Chinese two-character words with satisfactory accuracy against the baseline. A further examination of the bad cases indicates that the errors can be grouped into the following sources:
* Words with no semantic transparency: Like &quot;proper names&quot;, these types have no semantic transparency, i.e., the word meanings cannot be derived from their morphemic components. Loan words such as c144 e (/shāfā/; &quot;sofa&quot;) are typical examples.
* Words with weak semantic transparency: These can be further classified into four types:
- Appositional compounds: words whose two characters stand in a coordinate relationship, e.g. %0a (&quot;east-west&quot;, thing).
- Lexicalized idiomatic usage: For such usage, each word is an indivisible construct whose meaning can hardly be computed by adding up the separate meanings of its components. The sources of these idiomatic words might lie in the etymological past and are at best meaningless to the modern native speaker, e.g., &lt;(r) (&quot;salary-water&quot;, salary).</Paragraph>
      <Paragraph position="3"> - Metaphorical usage: The meaning of such words differs from the literal meaning. Some test data are not semantically transparent because of their metaphorical uses. For instance, , I (Aj) is assigned to the ,a (Bk).</Paragraph>
      <Paragraph position="4"> * Derived words: Such as a (enter). These could be filtered out using syntactic information.</Paragraph>
      <Paragraph position="5"> * The quality and coverage of CILIN and the character ontology: Since our SC system's test and training data are gleaned from CILIN and the character ontology, their quality and coverage play a crucial role. For example, for the unknown compound word AA (/sāosāo/; &quot;be in tumult&quot;), there is not even one example which has A as the first or as the second character. The same problems of limited coverage and data sparseness apply to the character ontology, too. For instance, there are some disyllabic morphemes which are not listed in the ontology, such as .V (/jìyú/; &quot;covet&quot;). [Table 4: a comparison. Columns: Compound types, Our model, Current best model.]</Paragraph>
    </Section>
    <Section position="7" start_page="61" end_page="61" type="sub_section">
      <SectionTitle>
4.7 Evaluation
</SectionTitle>
      <Paragraph position="0"> As far as we know, no evaluation was done in previous work. This might be due to several reasons: (1) the different scale of the experiments (how many words are in the test data?), (2) the selection of the syntactic category (VV, VN or NN?) of morphemic components, and (3) the number of morphemic components involved (two- or three-character words?), etc. Hence it is difficult to compare our results with other models. Among current similar work, Table 4 shows that our system outperforms Chen (2004) on VV compounds, and approximates Chen and Chen (2000) on NN compounds.</Paragraph>
    </Section>
  </Section>
</Paper>