<?xml version="1.0" standalone="yes"?>
<Paper uid="P04-1038">
  <Title>Chinese Verb Sense Discrimination Using an EM Clustering Model with Rich Linguistic Features</Title>
  <Section position="3" start_page="2" end_page="21" type="metho">
    <SectionTitle>
2 EM Clustering Model
</SectionTitle>
    <Paragraph position="0"> The basic idea of our EM clustering approach is similar to the probabilistic model of co-occurrence described in detail in (Hofmann and Puzicha 1998). In our model, we treat a set of features</Paragraph>
    <Paragraph position="2"> , which are extracted from the parsed sentences that contain a target verb, as observed variables. These variables are assumed to be independent given a hidden variable c, the sense of the target verb. Therefore the joint probability of the observed variables (features) for each verb instance, i.e., each parsed sentence containing the target verb, is defined in equation (1),</Paragraph>
    <Paragraph position="4"> f s are discrete-valued features that can take multiple values. A typical feature used in our model is shown in (2),</Paragraph>
    <Paragraph position="6"> (2) At the beginning of training (i.e., clustering), the models parameters )(cp and )|( cfp i are randomly initialized.</Paragraph>
    <Paragraph position="7">  Then, the probability of c conditioned on the observed features is computed in the expectation step (E-step), using equation (3),</Paragraph>
    <Paragraph position="9"> In our experiments, for verbs with more than 3 senses, syntactic and semantic restrictions derived from dictionary entries are used to constrain the random initialization.</Paragraph>
    <Paragraph position="10"> In the maximization step (M-step), )(cp and</Paragraph>
    <Paragraph position="12"> are re-computed by maximizing the log-likelihood of all the observed data which is calculated by using ),...,,|(</Paragraph>
    <Paragraph position="14"> fffcp estimated in the E-step. The E-step and M-step are repeated for a fixed number of rounds, which is set to 20 in our experiments,  or till the amount of change of )(cp and )|( cfp i is under the threshold 0.001. When doing classification, for each verb instance, the model calculates the same conditional probability as in equation (3) and assigns the instance to the cluster with the maximal</Paragraph>
    <Paragraph position="16"/>
  </Section>
  <Section position="4" start_page="21" end_page="21" type="metho">
    <SectionTitle>
3 Features Used in the Model
</SectionTitle>
    <Paragraph position="0"> The EM clustering model uses a set of linguistic features to capture the predicate-argument structure information of the target verbs. These features are usually more indicative of verb sense distinctions than simple features such as words next to the target verb or their POS tags. For example, the Chinese verb Chu  |chu1 has a sense of produce, the distinction between this sense and the verbs other senses, such as happen and go out, largely depends on the semantic category of the verbs direct object. Typical examples are shown in (1),  In their county, you can see mountains as soon as you step out of the doors.</Paragraph>
    <Paragraph position="1"> The verb has the sense produce in (1a) and its object should be something producible, such as Xiang Jiao /banana. While in (1b), with the sense happen, the verb typically takes an event or eventlike Da Shi object, such as /big event, Shi Gu /accident or Wen Ti /problem etc. In (1c), Men the verbs object /door is closely related to location, consistent with the sense go out. In contrast, simple lexical or POS tag features sometimes fail to capture such information, which can be seen clearly in (2),  In our experiments, we set 20 as the maximal number of rounds after trying different numbers of rounds (20, 40, 60, 80, 100) in a preliminary  experiment.</Paragraph>
    <Paragraph position="2"> 0 iff the target verb has no sentential complement 1 iff the target verb has a nonfinite sentential complement 2 iff the target verb has a finite sentential complement Qu Nian Chu (2) a. /last year /produce Xiang Jiao /banana 3000 Gong Jin / kilogram 3000 kilograms of bananas were produced last year.</Paragraph>
    <Paragraph position="3"> Yao Chu b. /in order to /produce Hai Nan /Hainan Zui Hao De Xiang Jiao /best /DE /banana  In order to produce the best bananas in Hainan, The verbs object Xiang Jiao /banana, which is next to the verb in (2a), is far away from the verb in (2b). For (2b), a classifier only looking at the adjacent positions of the target verb tends to be misled by the NP right after the verb, i.e., Hai Nan /Hainan, which is a Province in China and a typical object of the verb with the sense go out. Five types of features are used in our model:  1. Semantic category of the subject of the target verb 2. Semantic category of the object of the target verb 3. Transitivity of the target verb 4. Whether the target verb takes a sentential complement and which type of sentential complement (finite or nonfinite) it takes 5. Whether the target verb occurs in a verb compound  We obtain the values for the first two types of features (1) and (2) from a semantic taxonomy for Chinese nouns, which we will introduce in detail in the next section.</Paragraph>
    <Paragraph position="4"> In our implementation, the model uses different features for different verbs. The criteria for feature selection are from the electronic CETA dictionary file  and a hard copy English-Chinese dictionary, The Warmth Modern Chinese-English Dictionary.  For example, the verb Chu |chu1 never takes sentential complements, thus the fourth type of feature is not used for it. It could be supposed that we can still have a uniform model, i.e., a model using the same set of features for all the target verbs, and just let the EM clustering algorithm find useful features for different verbs automatically. The problem here is that unsupervised learning models (i.e., models trained on unlabeled data) are more likely to be affected by noisy data than supervised ones. Since all the features used in our model are extracted from automatically parsed sentences that inevitably have preprocessing errors such as segmentation, POS tagging and parsing errors, using verb-specific sets of features can alleviate the problem caused by noisy data to some extent. For example, if the model already knows  Licensed from the Department of Defense  The Warmth Modern Chinese-English Dictionary, Wang-Wen Books Ltd, 1997.</Paragraph>
    <Paragraph position="5"> that a verb like Chu |chu1 can never take sentential complements (i.e., it does not use the fourth type of feature for that verb), it will not be misled by erroneous parsing information saying that the verb takes sentential complements in certain sentences. Since the corresponding feature is not included, the noisy data is filtered out. In our EM clustering model, all the features selected for a target verb are treated in the same way, as described in Section 2.</Paragraph>
  </Section>
  <Section position="5" start_page="21" end_page="21" type="metho">
    <SectionTitle>
4 A Semantic Taxonomy Built Semi-automatically
</SectionTitle>
    <Paragraph position="0"> Examples in (1) have shown that the semantic category of the object of a verb is sometimes crucial in distinguishing certain Chinese verb senses. In addition, our previous work on information extraction in Chinese (Chen et al., 2004) has shown that semantic features, which are more general than lexical features but still contain rich information about words, can be used to improve a model's ability to handle unknown words, thus alleviating potential sparse-data problems. We have two Chinese electronic semantic dictionaries: the Hownet dictionary, which assigns 26,106 nouns to 346 semantic categories, and the Rocling dictionary, which assigns 4,474 nouns to 110 semantic categories.</Paragraph>
    <Paragraph position="1">  A preliminary experimental result suggests that these semantic categories might be too fine-grained for the EM clustering model (see Section 5.2 for greater details). An analysis of the sense distinctions of several Chinese verbs also suggests that more general categories on top of the Hownet and Rocling categories could still be informative and most importantly, could enable the model to generate meaningful clusters more easily. We therefore built a three-level semantic taxonomy based on the two semantic dictionaries using both automatic methods and manual effort.</Paragraph>
    <Paragraph position="2"> The taxonomy was built in three steps. First, a simple mapping algorithm was used to map semantic categories defined in Hownet and Rocling into 27 top-level WordNet categories.</Paragraph>
    <Paragraph position="3">  The Hownet or Rocling semantic categories have English glosses. For each category gloss, the algorithm looks through the hypernyms of its first sense in WordNet and chooses the first WordNet top-level category it finds.</Paragraph>
    <Paragraph position="4">  Hownet assigns multiple entries (could be different semantic categories) to polysemous words. The Rocling dictionary we used only assigns one entry (i.e., one semantic category) to each noun.</Paragraph>
    <Paragraph position="5">  The 27 categories contain 25 unique beginners for noun source files in WordNet, as defined in (Fellbaum, 1998) and two higher level categories Entity and Abstraction.</Paragraph>
    <Paragraph position="6"> The mapping obtained from step 1 needs further modification for two reasons. First, the glosses of Hownet or Rocling semantic categories usually have multiple senses in WordNet. Sometimes, the first sense in WordNet for a category gloss is not its intended meaning in Hownet or Rocling. In this case, the simple algorithm cannot get the correct mapping. Second, Hownet and Rocling sometimes use adjectives or non-words as category glosses, such as animate and LandVehicle etc., which have no WordNet nominal hypernyms at all. However, those adjectives or non-words usually have straightforward meanings and can be easily reassigned to an appropriate WordNet category. Although not accurate, the automatic mapping in step 1 provides a basic framework or skeleton for the semantic taxonomy we want to build and makes subsequent work easier.</Paragraph>
    <Paragraph position="7"> In step 2, hand correction, we found that we could make judgments and necessary adjustments on about 80% of the mappings by only looking at the category glosses used by Hownet or Rocling, such as livestock, money, building and so on. For the other 20%, we could make quick decisions by looking them up in an electronic table we created. For each Hownet or Rocling category, our table lists all the nouns assigned to it by the two dictionaries. We merged two WordNet categories into others and subdivided three categories that seemed more coarse-grained than others into 2~5 subcategories. Step 2 took three days and 35 intermediate-level categories were generated.</Paragraph>
    <Paragraph position="8"> In step 3, we manually clustered the 35 intermediate-level categories into 7 top-level semantic categories. Figure 1 shows part of the taxonomy.</Paragraph>
    <Paragraph position="9"> The EM clustering model uses the 7 top-level categories to define the first two types of features that were introduced in Section 3. For example, the value of a feature k f is 1 if and only if the object NP of the target verb belongs to the semantic category Event and is otherwise 0.</Paragraph>
  </Section>
  <Section position="6" start_page="21" end_page="21" type="metho">
    <SectionTitle>
5 Clustering Experiments
</SectionTitle>
    <Paragraph position="0"> Since we need labeled data to evaluate the clustering performance but have limited sense-tagged corpora, we applied the clustering model to 12 Chinese verbs in our experiments. The verbs are chosen from 28 annotated verbs in Penn Chinese Treebank so that they have at least two verb meanings in the corpus and for each of them, the number of instances for a single verb sense does not exceed 90% of the total number of instances.</Paragraph>
    <Paragraph position="1"> In our task, we generally do not include senses for other parts of speech of the selected words, such as noun, preposition, conjunction and particle etc., since the parser we used has a very high accuracy in distinguishing different parts of speech of these words (&gt;98% for most of them).</Paragraph>
    <Paragraph position="2"> However, we do include senses for conjunctional and/or prepositional usage of two words, Dao |dao4 and Wei |wei4, since our parser cannot distinguish the verb usage from the conjunctional or prepositional usage for the two words very well.</Paragraph>
    <Paragraph position="3"> Five verbs, the first five listed in Table 1, are both highly polysemous and difficult for a supervised word sense classifier (Dang et al., 2002).</Paragraph>
    <Paragraph position="4">  In our experiments, we manually grouped the verb senses for the five verbs. The criteria for the grouping are similar to Palmer et al.s (to appear) work on English verbs, which considers both sense coherence and predicate-argument structure distinctions. Figure 2 gives an example of  In the supervised task, their accuracies are lower than 85%, and four of them are even lower than the baselines.</Paragraph>
    <Paragraph position="5">  are Time, Human, Animal and State) the definition of sense groups. The manually defined sense groups are used to evaluate the models performance on the five verbs.</Paragraph>
    <Paragraph position="6"> The model was trained on an unannotated corpus, Peoples Daily News (PDN), and tested on the manually sense-tagged Chinese Treebank (with some additional sense-tagged PDN data).</Paragraph>
    <Paragraph position="7">  We parsed the training and test data using a Maximum Entropy parser and extracted the features from the parsed data automatically. The number of clusters used by the model is set to the number of the defined senses or sense groups of each target verb. For each verb, we ran the EM clustering algorithm ten times. Table 2 shows the average performance and the standard deviation for each verb. Table 1 summarizes the data used in the experiments, where we also give the normalized sense perplexity  of each verb in the test data.</Paragraph>
    <Section position="1" start_page="21" end_page="21" type="sub_section">
      <SectionTitle>
5.1 Evaluation Methods
</SectionTitle>
      <Paragraph position="0"> We use two external quality measures, purity and normalized mutual information (NMI) (Strehl.</Paragraph>
      <Paragraph position="1"> 2002) to evaluate the clustering performance.</Paragraph>
      <Paragraph position="2"> Assuming a verb has l senses, the clustering model assigns n instances of the verb into k clusters,</Paragraph>
      <Paragraph position="4"> the size of the ith cluster, j n is the number of instances hand-tagged with the jth sense, and</Paragraph>
      <Paragraph position="6"> the number of instances with the jth sense in the ith cluster, purity is defined in equation (4):</Paragraph>
      <Paragraph position="8"> The sense-tagged PDN data we used here are the same as in (Dang et al., 2002).</Paragraph>
      <Paragraph position="9">  It is calculated as the entropy of the sense distribution of a verb in the test data divided by the largest possible entropy, i.e., log  (the number of senses of the verb in the test data).</Paragraph>
      <Paragraph position="10"> It can be interpreted as classification accuracy when for each cluster we treat the majority of instances that have the same sense as correctly classified. The baseline purity is calculated by treating all instances for a target verb in a single cluster. The purity measure is very intuitive. In our case, since the number of clusters is preset to the number of senses, purity for verbs with two senses is equal to classification accuracy defined in supervised WSD. However, for verbs with more than 2 senses, purity is less informative in that a clustering model could achieve high purity by making the instances of 2 or 3 dominant senses the majority instances of all the clusters.</Paragraph>
      <Paragraph position="11"> Mutual information (MI) is more theoretically well-founded than purity. Treating the verb sense and the cluster as random variables S and C, the MI between them is defined in equation (5):</Paragraph>
      <Paragraph position="13"> uncertainty of one random variable S (or C) due to knowing the other variable C (or S). A single cluster with all instances for a target verb has a zero MI. Random clustering also has a zero MI in the limit. In our experiments, we used [0,1]normalized mutual information (NMI) (Strehl.</Paragraph>
      <Paragraph position="14"> 2002). A shortcoming of this measure, however, is that the best possible clustering (upper bound) evaluates to less than 1, unless classes are balanced. Unfortunately, unbalanced sense distribution is the usual case in WSD tasks, which makes NMI itself hard to interpret. Therefore, in addition to NMI, we also give its upper bound (upper-NMI) and the ratio of NMI and its upper bound (NMI-ratio) for each verb, as shown in columns 6 to 8 in Table 2.</Paragraph>
      <Paragraph position="15"> Senses for Dao |dao4 Sense groups for Dao |dao4  1. to go to, leave for 2. to come 3. to arrive 4. to reach a particular stage, condition, or level 5. marker for completion of activities (after a verb) 6. marker for direction of activities (after a verb) 7. to reach a time point 8. up to, until (prepositional usage) 9. up to, until, (from ) to (conjunctional usage)</Paragraph>
    </Section>
    <Section position="2" start_page="21" end_page="21" type="sub_section">
      <SectionTitle>
5.2 Experimental Results
</SectionTitle>
      <Paragraph position="0"> Table 2 summarizes the experimental results for the 12 Chinese verbs. As we see, the EM clustering model performs well on most of them, except the verb Yao |yao4.</Paragraph>
      <Paragraph position="1">  The NMI measure NMI-ratio turns out to be more stringent than purity. A high purity does not necessarily mean a high NMI-ratio. Although intuitively, NMI-ratio should be related to sense perplexity and purity, it is hard to formalize the relationships between them from the results. In fact, the NMI-ratio for a particular verb is eventually determined by its concrete sense distribution in the test data and the models clustering behavior for that verb. For example, the verbs Chu |chu1 and Jian |jian4 have the same sense perplexity and Jian |jian4 has a higher purity than Chu |chu1 (72.20% vs. 63.31%), but the NMI-ratio for Jian |jian4 is much lower than Chu |chu1 (22.41% vs. 43.24%). An analysis of the  For all the verbs except Yao |yao4, the models purities outperformed the baseline purities significantly (p&lt;0.05, and p&lt;0.001 for 8 of them).</Paragraph>
      <Paragraph position="2"> classification results for Jian |jian4 shows that the clustering model made the instances of the verbs most dominant sense the majority instances of three clusters (of total 5 clusters), which is penalized heavily by the NMI measure.</Paragraph>
      <Paragraph position="3"> Rich linguistic features turn out to be very effective in learning Chinese verb sense distinctions. Except for the two verbs, Fa Xian |fa1xian4 and Biao Shi |biao3shi4, the sense distinctions of which can usually be made only by syntactic alternations,  features such as semantic features or combinations of semantic features and syntactic alternations are very beneficial and sometimes even necessary for learning sense distinctions of other verbs. For example, the verb Jian |jian4 has one sense see, in which the verb typically takes a Human subject and a sentential complement, while in another sense show, the verb typically takes an Entity subject and a State object. An inspection of the classification results shows  For example, the verb Fa Xian |fa1xian4 takes an object in one sense discover and a sentential complement in the other sense realize.</Paragraph>
      <Paragraph position="4">  by purity and normalized mutual information (NMI) that the EM clustering model has indeed learned such combinatory patterns from the training data. The experimental results also indicate that the semantic taxonomy we built is beneficial for the task. For example, the verb Tou Ru |tou1ru4 has two senses, input and plunge into. It typically takes an Event object for the second sense but not for the first one. A single feature obtained from our semantic taxonomy, which tests whether the verb takes an Event object, captures this property neatly (achieves purity 95.65% and NMI-ratio 78.38% when using 2 clusters). Without the taxonomy, the top-level category Event is split into many fine-grained Hownet or Rocling categories, which makes it very difficult for the EM clustering model to learn sense distinctions for this verb. In fact, in a preliminary experiment only using the Hownet and Rocling categories, the model had the same purity as the baseline (52.17%) and a low NMI-ratio (4.22%) when using 2 clusters. The purity improved when using more clusters (70.43% with 4 clusters and 76.09% with 6), but it was still much lower than the purity achieved by using the semantic taxonomy and the NMI-ratio dropped further (1.19% and 1.20% for the two cases).</Paragraph>
      <Paragraph position="5"> By looking at the classification results, we identified three major types of errors. First, preprocessing errors create noisy data for the model. Second, certain sense distinctions depend heavily on global contextual information (crosssentence information) that is not captured by our model. This problem is especially serious for the verb Yao |yao4. For example, without global contextual information, the verb can have at least three meanings want, need or should in the same clause, as shown in (3).</Paragraph>
      <Paragraph position="6"> (3) Ta Yao Ma Shang /he /want/need/should /at once Du Wan Zhe Ben Shu /finish reading /this /book.</Paragraph>
      <Paragraph position="7"> He wants to/needs to/should finish reading this book at once.</Paragraph>
      <Paragraph position="8"> Third, a target verb sometimes has specific types of NP arguments or co-occurs with specific types of verbs in verb compounds in certain senses. Such information is crucial for distinguishing these senses from others, but is not captured by the general semantic taxonomy used here. We did further experiments to investigate how much improvement the model could gain by capturing such information, as discussed in Section 5.3.</Paragraph>
    </Section>
    <Section position="3" start_page="21" end_page="21" type="sub_section">
      <SectionTitle>
5.3 Experiments with Lexical Sets
</SectionTitle>
      <Paragraph position="0"> As discussed by Patrick Hanks (1996), certain senses of a verb are often distinguished by very narrowly defined semantic classes (called lexical sets) that are specific to the meaning of that verb sense. For example, in our case, the verb Hui Fu |hui1fu4 has a sense recover in which its direct object should be something that can be recovered naturally. A typical set of object NPs of the verb for this particular sense is partially listed  Most words in this lexical set belong to the Hownet category attribute and the top-level category State in our taxonomy. However, even the lower-level category attribute still contains many other words irrelevant to the lexical set, some of which are even typical objects of the verb for two other senses, resume and regain, such as Bang Jiao /diplomatic relations in Hui Fu /resume Bang Jiao /diplomatic relations and Ming Yu /reputation in Hui Fu /regainMing Yu /reputation. Therefore, a lexical set like (4) is necessary for distinguishing the recover sense from other senses of the verb.</Paragraph>
      <Paragraph position="1"> It has been argued that the extensional definition of lexical sets can only be done using corpus evidence and it cannot be done fully automatically (Hanks, 1997). In our experiments, we use a bootstrapping approach to obtain five lexical sets semi-automatically for three verbs Chu |chu1, Jian |jian4 and Hui Fu |hui1fu4 that have both low purity and low NMI-ratio in the first set of experiments.</Paragraph>
      <Paragraph position="2">  We first extracted candidates for the lexical sets from the training data. For example, we extracted all the direct objects of the verb Hui Fu |hui1fu4 and all the verbs that combined with the verb Chu |chu1 to form verb compounds from the automatically parsed training data. From the candidates, we manually selected words to form five initial seed sets, each of which contains no more than ten words. A simple algorithm was used to search for all the words that have the same detailed Hownet semantic definitions (semantic category plus certain supplementary information) as the seed words. We did not use Rocling because its semantic definitions are so general that a seed word tends to extend to a huge set of irrelevant words. Highly relevant words were manually selected from all the words found by the searching algorithm and added to the initial seed sets. The enlarged sets were used as lexical sets.</Paragraph>
      <Paragraph position="3"> The enhanced model first uses the lexical sets to obtain the semantic category of the NP arguments  We did not include Yao |yao4, since its meaning rarely depends on local predicate-argument structure information.</Paragraph>
      <Paragraph position="4"> of the three verbs. Only when the search fails does the model resort to the general semantic taxonomy. The model also uses the lexical sets to determine the types of the compound verbs that contain the target verb Chu |chu1 and uses them as new features.</Paragraph>
      <Paragraph position="5"> Table 3 shows the models performance on the three verbs with or without using lexical sets. As we see, lexical sets improves the models performance on all of them, especially on the verb Chu |chu1. Although the results are still preliminary, they nevertheless provide us hints of how much a WSD model for Chinese verbs could gain from lexical sets.</Paragraph>
      <Paragraph position="6"> w/o lexical sets (%) with lexical sets (%)</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="21" end_page="21" type="metho">
    <SectionTitle>
6 Conclusion
</SectionTitle>
    <Paragraph position="0"> We have shown that an EM clustering model that uses rich linguistic features and a general semantic taxonomy for Chinese nouns generally performs well in learning sense distinctions for 12 Chinese verbs. In addition, using lexical sets improves the models performance on three of the most challenging verbs.</Paragraph>
    <Paragraph position="1"> Future work is to extend our coverage and to apply the semantic taxonomy and the same types of features to supervised WSD in Chinese. Since the experimental results suggest that a general semantic taxonomy and more constrained lexical sets are both beneficial for WSD tasks, we will develop automatic methods to build large-scale semantic taxonomies and lexical sets for Chinese, which reduce human effort as much as possible but still ensure high quality of the obtained taxonomies or lexical sets.</Paragraph>
  </Section>
  <Section position="8" start_page="21" end_page="21" type="metho">
    <SectionTitle>
7 Acknowledgements
</SectionTitle>
    <Paragraph position="0"> This work has been supported by an ITIC supplement to a National Science Foundation Grant, NSF-ITR-EIA-0205448. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the</Paragraph>
  </Section>
class="xml-element"></Paper>