<?xml version="1.0" standalone="yes"?> <Paper uid="W96-0103"> <Title>Hierarchical Clustering of Words and Application to NLP Tasks</Title> <Section position="3" start_page="0" end_page="28" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> One of the fundamental issues in corpus-based NLP is that we can never expect to learn from the training data all the quantitative information needed for the words that may occur in the test data, if the vocabulary is large enough to cover a real-world domain. Given the effectiveness of class-based n-gram language models against the data-sparseness problem (Kneser and Ney 1993, Ueberla 1995), classes of words should be similarly useful for other NLP tasks: statistics on classes can be used whenever statistics on individual words are unavailable or unreliable. The ideal clusters for NLP are those that guarantee mutual substitutability, in terms of both syntactic and semantic soundness, among words in the same class (Harris 1951, Brill and Marcus 1992). Take, for example, the following sentences.</Paragraph> <Paragraph position="1"> (a) He went to the house by car.</Paragraph> <Paragraph position="2"> (b) He went to the apartment by bus.</Paragraph> <Paragraph position="3"> (c) He went to the ? by ? .</Paragraph> <Paragraph position="4"> (d) He went to the house by the sea.</Paragraph> <Paragraph position="5"> Suppose that we want to parse sentences with a statistical parser and that sentences (a) and (b) appear in the training and test data, respectively. Since (a) is in the training data, we know that the prepositional phrase by car attaches to the main verb went, not to the noun phrase the house. Sentence (b) is quite similar to (a) in meaning, and identical to (a) in sentence structure. 
Now if the words apartment and bus are unknown to the parsing system (* A part of this work was done while the author was at ATR Interpreting Telecommunications Research Laboratories, Kyoto, Japan.)</Paragraph> <Paragraph position="6"> (i.e. they never occurred in the training data), then sentence (b) must look to the system very much like (c), and it will be very hard for the parser to tell the difference in sentence structure between (c) and (d). However, if the system has access to a predefined set of word classes, and if car and bus are in the same class and house and apartment are in another, it will not be hard for the system to detect the similarity between (a) and (b) and assign the correct sentence structure to (b) without confusing it with (d). The same argument holds for an example-based machine translation system: an appropriate translation of (b) can be derived from an example translation of (a) if the system has access to the word classes. It is therefore desirable to build a clustering of the vocabulary in terms of mutual substitutability.</Paragraph> <Paragraph position="7"> Furthermore, clustering is much more useful if the clusters are of variable granularity. Suppose, for example, that we have two sets of clusters, one finer than the other, and that word-1 and word-2 are in different finer classes. With the finer clusters alone, the system can obtain only minimal information about the association of the two words. However, if the system is able to fall back and check whether the two words belong to the same coarser class, and if they do, then it can take advantage of that class information. When we extend this notion of two-level word clustering to many levels, we obtain a tree representation of the whole vocabulary in which the root node represents the vocabulary as a whole and each leaf node represents a single word. 
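The variable-granularity fallback described above can be sketched as follows. This is an illustrative sketch, not the paper's implementation: it assumes each word is tagged with a hypothetical binary path from the root of the cluster tree, so that two words share a class at a given granularity exactly when their paths share a prefix of that length.

```python
# Hypothetical binary paths from the root of a cluster tree (illustrative data).
cluster_path = {
    "car": "0100", "bus": "0101",          # distinct fine classes, shared prefix "010"
    "house": "1100", "apartment": "1101",  # distinct fine classes, shared prefix "110"
}

def shared_class(w1, w2, depth):
    """True if the two words fall in the same class when the tree is cut at `depth`."""
    p1, p2 = cluster_path.get(w1), cluster_path.get(w2)
    if p1 is None or p2 is None:
        return False
    return p1[:depth] == p2[:depth]

# Fine classes separate car and bus; falling back one level groups them.
print(shared_class("car", "bus", 4))   # False at the finest cut
print(shared_class("car", "bus", 3))   # True after falling back to a coarser cut
```

Cutting all paths at a single depth is only the simplest case; the partition condition discussed next allows cuts at different depths in different subtrees.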
Also, any set of nodes in the tree constitutes a partition (or clustering) of the vocabulary if there is exactly one node of the set on the path from the root node to each leaf node. In the following sections, we will first describe a method of creating a binary tree representation of the vocabulary and present results of evaluating and comparing the quality of the clusters obtained from texts of very different sizes. Then we will extend the paradigm of clustering from word-based clustering to compound-based clustering. In the above examples we looked only at the mutual substitutability of single words; however, much information can also be gained by looking at the substitutability of word compounds for either other word compounds or single words. We will introduce the notion of compound-classes, propose a method for constructing them, and present results of our approach.</Paragraph> </Section></Paper>
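The partition condition stated above can be checked mechanically. The sketch below is illustrative (the tree and node names are invented, not from the paper): a node set is a partition of the vocabulary exactly when every root-to-leaf path meets the set once and only once, which permits cuts at different depths in different subtrees.

```python
# A hypothetical binary cluster tree: internal node -> (left child, right child).
# Leaves (words) are the keys that never appear as internal nodes.
tree = {
    "root": ("A", "B"),
    "A": ("car", "bus"),
    "B": ("house", "apartment"),
}

def is_partition(nodes):
    """True if every root-to-leaf path contains exactly one node of `nodes`."""
    def check(n, hits):
        hits = hits + (1 if n in nodes else 0)
        if n not in tree:          # leaf: the path must have met the set exactly once
            return hits == 1
        left, right = tree[n]
        return check(left, hits) and check(right, hits)
    return check("root", 0)

print(is_partition({"A", "B"}))                    # True: uniform cut at depth 1
print(is_partition({"A", "house", "apartment"}))   # True: mixed granularity
print(is_partition({"A"}))                         # False: B's leaves are uncovered
```

The second call illustrates the point of variable granularity: one subtree is kept coarse while the other is split down to single words, and the result is still a valid clustering of the vocabulary.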