File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/03/w03-0416_intro.xml
Size: 4,402 bytes
Last Modified: 2025-10-06 14:01:54
<?xml version="1.0" standalone="yes"?> <Paper uid="W03-0416"> <Title>An Efficient Clustering Algorithm for Class-based Language Models</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle>
<Paragraph position="0"> Clustering algorithms have been extensively studied in natural language processing because many researchers have shown that "classes" obtained by clustering can improve the performance of various NLP tasks. Examples include class-based n-gram models (Brown et al., 1992; Kneser and Ney, 1993), smoothing techniques for structural disambiguation (Li and Abe, 1998), and word sense disambiguation (Schütze, 1998).</Paragraph>
<Paragraph position="1"> In this paper, we define a general form for class-based probabilistic language models and propose an efficient, model-theoretic clustering algorithm based on this form. The algorithm involves three operations, CLASSIFY, MERGE, and SPLIT, each of which decreases an objective function based on the MDL principle (Rissanen, 1984), so the algorithm can efficiently find a point near a local optimum. The algorithm is applicable to more general tasks than existing studies (Li and Abe, 1998; Berkhin and Becher, 2002), and its computational cost is low enough to allow application to very large corpora.</Paragraph>
<Paragraph position="2"> Clustering algorithms may be classified into three types. The first type uses various heuristic measures of similarity between the elements to be clustered and has no interpretation as a probability model (Widdow, 2002). The clusters produced by this type of method are not guaranteed to work effectively as a component of a statistical language model, because the similarity used in clustering is not derived from the criterion used in learning the statistical model, e.g., likelihood.</Paragraph>
<Paragraph position="3"> The second type has a clear interpretation as a probability model, but offers no criterion for determining the number of clusters (Brown et al., 1992; Kneser and Ney, 1993). The performance of methods of this type depends on the number of clusters, which must be specified before clustering; determining the proper number of clusters can be rather troublesome. The third type has an interpretation as a probability model and uses a statistically motivated model selection criterion to determine the proper number of clusters. This type has a clear advantage over the second. AutoClass (Cheeseman and Stutz, 1996), the Bayesian model merging method (Stolcke and Omohundro, 1996), and Li's method (Li, 2002) are examples of this type. AutoClass and Bayesian model merging are based on soft clustering models, whereas Li's method is based on a hard clustering model. In general, the computational cost of hard clustering models is lower than that of soft clustering models. However, the time complexity of Li's method is of cubic order in the size of the vocabulary. Therefore, it is not practical to apply it to large corpora.</Paragraph>
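To make the MDL-based criterion concrete, the sketch below computes a description length for a hard clustering of words under a class-based bigram decomposition P(w2 | w1) = P(c2 | c1) P(w2 | c2). The function name description_length, the bigram decomposition, and the (k/2) log n parameter penalty are illustrative assumptions, not the paper's exact model or code-length definitions, which are given in Section 2.

```python
import math
from collections import Counter

def description_length(bigrams, word2class):
    """Illustrative MDL-style objective for a hard clustering of words.
    (A sketch under assumed definitions, not the paper's formulation.)

    bigrams    : list of (w1, w2) pairs observed in the corpus
    word2class : dict mapping every word in the bigrams to a class id
    """
    n = len(bigrams)
    # Counts for the class-based decomposition P(w2 | w1) = P(c2 | c1) * P(w2 | c2).
    class_bigram  = Counter((word2class[w1], word2class[w2]) for w1, w2 in bigrams)
    context_count = Counter(word2class[w1] for w1, _ in bigrams)
    class_count   = Counter(word2class[w2] for _, w2 in bigrams)
    word_count    = Counter(w2 for _, w2 in bigrams)

    # Data code length: negative log-likelihood of the corpus (in nats).
    data_cost = 0.0
    for w1, w2 in bigrams:
        c1, c2 = word2class[w1], word2class[w2]
        p_class = class_bigram[(c1, c2)] / context_count[c1]
        p_word  = word_count[w2] / class_count[c2]
        data_cost += -math.log(p_class * p_word)

    # Model code length: (free parameters / 2) * log(data size).
    num_classes = len(set(word2class.values()))
    vocab_size  = len(word2class)
    k = num_classes * (num_classes - 1) + (vocab_size - num_classes)
    model_cost = 0.5 * k * math.log(n)

    return data_cost + model_cost
```

Under an objective of this kind, CLASSIFY, MERGE, and SPLIT steps can be accepted only when they reduce the total description length, which is what lets a model selection criterion double as a stopping rule for the number of clusters.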
<Paragraph position="4"> Our model and clustering algorithm provide a solution to these problems with existing clustering algorithms.</Paragraph>
<Paragraph position="5"> Since the model has a clear interpretation as a probability model, the clustering algorithm can use MDL as its clustering criterion; by combining top-down clustering, bottom-up clustering, and a K-means-style exchange algorithm (sketched below), the proposed method performs the clustering efficiently.</Paragraph>
<Paragraph position="6"> We evaluated the algorithm through experiments on a disambiguation task in Japanese dependency analysis.</Paragraph>
<Paragraph position="7"> In the experiments, we observed that the proposed algorithm's computation time is roughly linear in the size of the vocabulary and that it performed slightly better than the existing method. Our main intention in the experiments was to demonstrate improvements in computational cost, not in performance on the test task. We will show, in Sections 2 and 3, that the proposed method can be applied to a broader range of tasks than the test task evaluated in the experiments in Section 4. Further experiments are needed to determine the performance of the proposed method on more general tasks.</Paragraph> </Section> </Paper>
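As a rough illustration of how the three operations can be combined, the following sketch alternates a K-means-style exchange pass (CLASSIFY), a top-down split (SPLIT), and a bottom-up merge (MERGE), accepting each step only when it lowers a description-length objective dl, for instance the description_length sketch above. The function cluster, its arguments, and the random split heuristic are assumptions made for illustration; the paper's actual operations are defined in Sections 2 and 3.

```python
import random

def cluster(words, bigrams, num_init_classes, dl):
    """Illustrative MDL-guided clustering loop (a sketch, not the paper's
    exact procedure). dl(bigrams, word2class) is the objective to minimize,
    e.g. the description_length sketch above."""
    # Start from a random hard assignment to the initial classes.
    word2class = {w: random.randrange(num_init_classes) for w in words}
    best = dl(bigrams, word2class)

    improved = True
    while improved:
        improved = False

        # CLASSIFY: K-means-style exchange; move each word to the class
        # that most reduces the objective.
        for w in words:
            current = word2class[w]
            for c in set(word2class.values()):
                if c == current:
                    continue
                word2class[w] = c
                score = dl(bigrams, word2class)
                if score < best:
                    best, current, improved = score, c, True
            word2class[w] = current

        # SPLIT: top-down; try splitting each class into two halves.
        for c in sorted(set(word2class.values())):
            members = [w for w, cc in word2class.items() if cc == c]
            if len(members) < 2:
                continue
            random.shuffle(members)
            new_class = max(word2class.values()) + 1
            trial = dict(word2class)
            for w in members[: len(members) // 2]:
                trial[w] = new_class
            score = dl(bigrams, trial)
            if score < best:
                word2class, best, improved = trial, score, True

        # MERGE: bottom-up; try collapsing each pair of classes.
        classes = sorted(set(word2class.values()))
        for i, c1 in enumerate(classes):
            for c2 in classes[i + 1:]:
                trial = {w: (c1 if c == c2 else c) for w, c in word2class.items()}
                score = dl(bigrams, trial)
                if score < best:
                    word2class, best, improved = trial, score, True

    return word2class, best
```

For example, cluster(words, bigrams, 2, description_length) starts from two random classes and grows or shrinks the class set only when doing so pays for itself under the objective; because every accepted step strictly decreases the objective, the loop terminates near a local optimum.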