<?xml version="1.0" standalone="yes"?> <Paper uid="W93-0107"> <Title>HIERARCHICAL CLUSTERING OF VERBS</Title> <Section position="2" start_page="0" end_page="70" type="intro"> <SectionTitle> 1. Introduction </SectionTitle> <Paragraph position="0"> The design of word-sense taxonomies is acknowledged as one of the most difficult (and frustrating) tasks in NLP systems.</Paragraph> <Paragraph position="1"> The decision to assign a word to a category is far from straightforward (Nirenburg and Raskin (1987)), and lexicon builders often do not use consistent classification principia.</Paragraph> <Paragraph position="2"> Automatic approaches to the acquisition of word taxonomies have generally made use of machine-readable dictionaries (MRDs), given the typically definitional nature of MRD texts. For example, in Byrd et al. (1987) and other similar studies the category of a word is acquired from the first few words of a dictionary definition. Besides the well-known problems of inconsistency and circularity of definitions, an inherent difficulty with this approach is that verbs can hardly be defined in terms of genus and differentiae. Verb semantics resides in the nature of the event a verb describes, which is better expressed by the roles played by its arguments in a sentence. Psycholinguistic studies on verb semantics underline the relevance of thematic roles, especially in categorisation activities (Keil (1989), Jackendoff (1983)), and indicate the argument structure of verbs as playing a central role in language acquisition (Pinker (1989)). In NLP, representing verb semantics with thematic roles is a consolidated practice, even though theoretical research (Pustejovsky (1991)) proposes richer and more formal representation frameworks.</Paragraph> <Paragraph position="3"> More recent papers (Hindle (1990), Pereira and Tishby (1992)) proposed to cluster nouns on the basis of a metric derived from the distribution of subject, verb and object in the texts.
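To make this kind of distributional metric concrete, the following is a minimal sketch in which nouns are compared by the verbs they co-occur with as subject or object. The (subject, verb, object) triples and the cosine measure are illustrative assumptions of ours, not the actual statistics used in the cited papers (Hindle (1990) uses a mutual-information-based measure):

```python
from collections import Counter
from math import sqrt

# Hypothetical (subject, verb, object) triples standing in for parsed corpus data.
triples = [
    ("company", "buy", "stock"), ("firm", "buy", "share"),
    ("company", "sell", "stock"), ("firm", "sell", "bond"),
    ("court", "reject", "appeal"), ("judge", "reject", "claim"),
]

def verb_profile(noun, triples):
    """Count the verbs a noun co-occurs with, as subject or object."""
    profile = Counter()
    for subj, verb, obj in triples:
        if noun in (subj, obj):
            profile[verb] += 1
    return profile

def cosine(p, q):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(p[k] * q[k] for k in p)  # Counter returns 0 for missing keys
    norm_p = sqrt(sum(x * x for x in p.values()))
    norm_q = sqrt(sum(x * x for x in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0

# "company" and "firm" share verb contexts; "court" does not.
sim = cosine(verb_profile("company", triples), verb_profile("firm", triples))
```

On these toy triples, "company" and "firm" receive identical verb profiles and thus maximal similarity, while "court" shares no verbs with either, which is exactly the behaviour a distributional clustering method exploits.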
Both papers use large corpora as their source of information, but differ in the type of statistical approach used to determine word similarity. These studies, though valuable, leave several open problems: 1) A metric of conceptual closeness based on mere syntactic similarity is questionable, particularly when applied to verbs. In fact, the argument structure of verbs is variegated and poorly overlapping. Furthermore, subject and object relations do not fully characterize many verbs.</Paragraph> <Paragraph position="4"> 2) Many events accumulate statistical evidence only in very large corpora, even though the notion of distributional similarity adopted in Pereira and Tishby (1992) in part avoids this problem.</Paragraph> <Paragraph position="5"> 3) The description of a word is an &quot;agglomerate&quot; of its occurrences in the corpus, and it is not possible to discriminate among its different senses.</Paragraph> <Paragraph position="6"> 4) None of the aforementioned studies provides a method to describe and evaluate the derived categories.</Paragraph> <Paragraph position="7"> As a result, the acquired classifications seem of little use for a large-scale NLP system, or even for a linguist in charge of deriving a taxonomy.</Paragraph> <Paragraph position="8"> Our research is an attempt to overcome these limitations in part. We present a corpus-driven unsupervised learning algorithm based on a modified version of COBWEB (Fisher (1987), Gennari et al. (1989)). The algorithm learns verb classifications through systematic observation of verb usages in sentences. It has been tested on two domains with very different linguistic styles: a commercial and a legal corpus of about 500,000 words each.</Paragraph> <Paragraph position="9"> In section 2 we highlight the advantages that concept formation algorithms like COBWEB have over &quot;agglomerate&quot; statistical approaches.
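The heuristic COBWEB greedily maximizes when placing an instance into a class hierarchy is Fisher's (1987) category utility, which rewards partitions whose clusters make attribute values predictable. A minimal sketch, with invented nominal feature vectors (the paper's actual features come from its NL processor, not from this toy data):

```python
from collections import Counter

# Toy feature vectors for verb usages (attribute -> nominal value); hypothetical data.
instances = [
    {"subj": "human", "obj": "goods"},
    {"subj": "human", "obj": "goods"},
    {"subj": "org", "obj": "document"},
    {"subj": "org", "obj": "document"},
]

def attr_value_sq(insts):
    """Sum over attributes i and values j of P(A_i = V_ij)^2 within a set of instances."""
    n = len(insts)
    total = 0.0
    for attr in set(k for inst in insts for k in inst):
        counts = Counter(inst[attr] for inst in insts if attr in inst)
        total += sum((c / n) ** 2 for c in counts.values())
    return total

def category_utility(partition, all_insts):
    """Fisher's category utility: expected gain in attribute predictability per cluster."""
    base = attr_value_sq(all_insts)
    n, k = len(all_insts), len(partition)
    return sum(len(c) / n * (attr_value_sq(c) - base) for c in partition) / k

# Splitting the data into its two homogeneous groups scores higher
# than keeping everything in a single cluster.
good = category_utility([instances[:2], instances[2:]], instances)
bad = category_utility([instances], instances)
```

The single-cluster partition scores zero (no predictability is gained over the base rates), while the split into homogeneous groups scores positively, which is why COBWEB's hill-climbing prefers it.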
However, using a Machine Learning methodology for a Natural Language Processing problem required adjustments on both sides. Raw texts representing instances of verb usages have been processed to fit the feature-vector-like representation needed by concept formation algorithms. The NL processor used for this task is briefly summarized in section 2.1. Similarly, it was necessary to adapt COBWEB to the linguistic nature of the classification activity, since, for example, the algorithm discriminates neither different instances of the same entity (i.e. polysemic verbs) nor identical instances of different entities (i.e. verbs with the same pattern of use). These modifications are discussed in sections 2.1 through 2.3. Finally, in section 3 we present a method to identify the basic-level categories of a classification, i.e. those that are the repository of most of the lexical information about their members.</Paragraph> <Paragraph position="10"> Class descriptions and basic-level categories, as derived by our clustering algorithm, are in our view of great help in directing a linguist's intuition towards the relevant taxonomic relations in a given language domain.</Paragraph> </Section> </Paper>