File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/04/p04-1038_intro.xml
Size: 3,867 bytes
Last Modified: 2025-10-06 14:02:22
<?xml version="1.0" standalone="yes"?>
<Paper uid="P04-1038">
<Title>Chinese Verb Sense Discrimination Using an EM Clustering Model with Rich Linguistic Features</Title>
<Section position="2" start_page="0" end_page="2" type="intro">
<SectionTitle> 1 Introduction </SectionTitle>
<Paragraph position="0"> Highly ambiguous words may lead to irrelevant document retrieval and inaccurate lexical choice in machine translation (Palmer et al., 2000), which suggests that word sense disambiguation (WSD) is beneficial, and sometimes even necessary, in such NLP tasks. This paper addresses WSD in Chinese by developing an Expectation-Maximization (EM) clustering model to learn Chinese verb sense distinctions. The major goal is sense discrimination rather than sense labeling, similar to (Schütze, 1998). The basic idea is to divide the instances of a word into several clusters that have no sense labels; instances in the same cluster are regarded as having the same meaning. Word sense discrimination can be applied to document retrieval and similar tasks in information access, and can facilitate the building of large annotated corpora. In addition, since the clustering model can be trained on large unannotated corpora and evaluated on a relatively small sense-tagged corpus, it can be used to find features indicative of sense distinctions by exploring the huge amounts of unannotated text data available.</Paragraph>
<Paragraph position="1"> The EM clustering algorithm (Hofmann and Puzicha, 1998) used here is an unsupervised machine learning algorithm that has been applied to many NLP tasks, such as inducing a semantically labeled lexicon and determining lexical choice in machine translation (Rooth et al., 1998), automatic acquisition of verb semantic classes (Schulte im Walde, 2000), and automatic semantic labeling (Gildea and Jurafsky, 2002). In our task, we equipped the EM clustering model with rich linguistic features that capture the predicate-argument structure information of verbs and restricted the feature set for each verb using knowledge from dictionaries. We also semi-automatically built a semantic taxonomy for Chinese nouns based on two Chinese electronic semantic dictionaries, the Hownet dictionary and the Rocling dictionary.</Paragraph>
<Paragraph position="2"> The 7 top-level categories of this taxonomy were used as semantic features for the model. Since external knowledge is used to obtain the semantic features and to guide feature selection, the model is not completely unsupervised from this perspective; however, it does not make use of any annotated training data.</Paragraph>
<Paragraph position="3"> Two external quality measures, purity and normalized mutual information (NMI) (Strehl, 2002), were used to evaluate the model's performance on 12 Chinese verbs. The experimental results show that the rich linguistic features and the semantic taxonomy are both very useful for sense discrimination. The model generally performs well in learning sense group distinctions for difficult, highly polysemous verbs and sense distinctions for other verbs. Enhanced with certain fine-grained semantic categories called lexical sets (Hanks, 1996), the model's performance improved in a preliminary experiment on the three most difficult verbs chosen from the first set of experiments.</Paragraph>
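To make the two evaluation measures concrete, the following is a minimal Python sketch, not taken from the paper, of how purity and normalized mutual information can be computed for an induced clustering against gold sense labels. It follows the standard definitions, with NMI normalizing the mutual information by the geometric mean of the two entropies (Strehl, 2002); all function and variable names are our own illustrative choices.

```python
# Minimal sketch (not from the paper): purity and NMI for evaluating
# an unlabeled clustering against gold sense labels.
import math
from collections import Counter, defaultdict

def purity(cluster_ids, gold_labels):
    """Fraction of instances covered by the majority gold label of each cluster."""
    by_cluster = defaultdict(Counter)
    for c, g in zip(cluster_ids, gold_labels):
        by_cluster[c][g] += 1
    n = len(cluster_ids)
    return sum(counts.most_common(1)[0][1] for counts in by_cluster.values()) / n

def nmi(cluster_ids, gold_labels):
    """Mutual information normalized by the geometric mean of the two entropies."""
    n = len(cluster_ids)
    cluster_counts = Counter(cluster_ids)                  # cluster marginals
    label_counts = Counter(gold_labels)                    # gold-label marginals
    joint_counts = Counter(zip(cluster_ids, gold_labels))  # joint counts

    def entropy(counts):
        return -sum((v / n) * math.log(v / n) for v in counts.values())

    mi = 0.0
    for (c, g), v in joint_counts.items():
        # p(c,g) * log( p(c,g) / (p(c) p(g)) ), written with raw counts
        mi += (v / n) * math.log(v * n / (cluster_counts[c] * label_counts[g]))
    denom = math.sqrt(entropy(cluster_counts) * entropy(label_counts))
    return mi / denom if denom > 0 else 0.0

# Toy usage: 6 instances of one verb, 2 induced clusters, 2 gold senses.
clusters = [0, 0, 0, 1, 1, 1]
senses   = ["s1", "s1", "s2", "s2", "s2", "s2"]
print(purity(clusters, senses))  # 0.833...
print(nmi(clusters, senses))
```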
<Paragraph position="4"> The paper is organized as follows: we briefly introduce the EM clustering model in Section 2 and describe the features used by the model in Section 3. In Section 4, we introduce a semantic taxonomy for Chinese nouns, which is built semi-automatically for our task but can also be used in other NLP tasks such as coreference resolution and relation detection in information extraction. We report our experimental results in Section 5 and conclude our discussion in Section 6.</Paragraph>
</Section>
</Paper>