<?xml version="1.0" standalone="yes"?> <Paper uid="I05-3012"> <Title>Integrating Collocation Features in Chinese Word Sense Disambiguation</Title> <Section position="3" start_page="87" end_page="88" type="relat"> <SectionTitle> 2 Related Work </SectionTitle> <Paragraph position="0"> Approaches to automating word sense disambiguation based on annotated corpora have been proposed.</Paragraph> <Paragraph position="1"> Examples of supervised learning methods for WSD appear in [2-4] and [7-8]. The learning algorithms applied include decision trees, decision lists [15], neural networks [7], naive Bayesian learning ([5], [11]), and maximum entropy [10].</Paragraph> <Paragraph position="2"> Among these learning methods, the most important issue is which features to use in constructing the classifier. It is common in WSD to use contextual information found in the neighborhood of the ambiguous word in the training data ([6], [16-18]). It is generally true that when words are used in the same sense, they have similar context and co-occurrence information [13], and that the context words near an ambiguous word yield more effective patterns and feature values than those far from it [12]. Existing methods select features for context representation from both local and topical features, where local features refer to information pertaining only to the given context and topical features are statistically obtained from a training corpus. Most recent work on English corpora, including [7] and [8], combines both local and topical information to improve performance. An interesting study on feature selection for Chinese [10] considered topical features as well as local collocational, syntactic, and semantic features using the maximum entropy model. In Dang's [10] work, collocational features refer to the local PoS information and bi-gram co-occurrences of words within 2 positions of the ambiguous word. 
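Local features of this kind can be illustrated with a minimal sketch; the function and feature-key names below are hypothetical, not taken from [10], and the sketch simply keys each word and PoS tag by its relative offset within a +/-2 window:

```python
def local_features(tokens, pos_tags, i, window=2):
    """Collect word and PoS features within +/-window positions
    of the ambiguous token at index i, keyed by relative offset."""
    feats = {}
    for off in range(-window, window + 1):
        j = i + off
        if off == 0 or not 0 <= j < len(tokens):
            continue  # skip the target itself and out-of-range positions
        feats["w%+d" % off] = tokens[j]    # surface word at relative offset
        feats["p%+d" % off] = pos_tags[j]  # PoS tag at relative offset
    return feats
```

Such a dictionary of position-keyed features is the usual input to a maximum entropy or naive Bayes WSD classifier.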
A useful result from this work, based on the tagged People's Daily News (about one million words), shows that adding features from richer levels of linguistic information such as PoS tagging yielded no significant improvement (less than 1%) over using only the bi-gram co-occurrence information. Another similar study for Chinese [11] is based on the Naive Bayes classifier model and takes into consideration PoS with position information and bi-gram templates in the local context. The system reported 60.40% in both precision and recall on the SENSEVAL-3 Chinese training data. Even though both approaches use statistically significant bi-gram co-occurrence information, these bi-grams are not necessarily true collocations. For example, in the expression &quot;`</Paragraph> <Paragraph position="4"> # |may have a higher frequency but may introduce noise when used as a feature for disambiguating the senses &quot;human |&quot; and &quot;symbol|0 &quot;, as in the example case of &quot;&quot;$# |&quot;. In our system, we do not rely on co-occurrence information. Instead, we utilize true collocation information (42 , $ ) that falls within a window of (-5, +5) as features, and the sense of &quot;human |&quot; can be decided clearly using these features. The collocation information is a pre-prepared collocation list obtained from a collocation extraction system and verified with syntactic and semantic methods ([21], [24]).</Paragraph> <Paragraph position="5"> Yarowsky [9] used the one-sense-per-collocation property as an essential ingredient of an unsupervised word sense disambiguation algorithm, bootstrapping from it to a more general, high-recall disambiguation. A few recent research works have begun to pay attention to collocation features for WSD. 
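The contrast between raw co-occurrence and list-verified collocation can be sketched as follows; this is a hypothetical illustration, assuming the pre-prepared list is a set of unordered word pairs, which is not specified in the paper:

```python
def collocation_features(tokens, i, colloc_list, window=5):
    """Return only the list-verified collocates of tokens[i] found
    within a (-window, +window) span; other co-occurring words in
    the same span are ignored rather than used as features."""
    target = tokens[i]
    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
    return [tokens[j] for j in range(lo, hi)
            if j != i and frozenset((target, tokens[j])) in colloc_list]
```

Filtering through the verified list is what suppresses frequent but noisy co-occurrences of the kind discussed above.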
Domminic [19] used three different methods, called the bilingual method, the collocation method, and the UMLS (Unified Medical Language System) relation-based method, for unsupervised disambiguation of English and German medical documents. As expected, the collocation method achieved good precision, around 79% in English and 82% in German, but very low recall: 3% in English and 1% in German. The low recall is due to the nature of UMLS, where many collocations would almost never occur in natural text. To avoid this problem, we combine the contextual features in the target context with the pre-prepared collocation list to build our classifier.</Paragraph> </Section> </Paper>