<?xml version="1.0" standalone="yes"?> <Paper uid="W03-0416"> <Title>An Efficient Clustering Algorithm for Class-based Language Models</Title> <Section position="10" start_page="0" end_page="0" type="evalu"> <SectionTitle> 4 Evaluation </SectionTitle> <Paragraph position="0"> This section discusses the results of the evaluation experiment where we compared three clustering methods: i.e., our method, Li's agglomerative method described in Li (2002), and a restricted version of our method that only uses CLASSIFY.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Evaluation task </SectionTitle> <Paragraph position="0"> We used a simplified version of the dependency analysis task for Japanese for the evaluation experiment.</Paragraph> <Paragraph position="1"> In Japanese, a sentence can be thought of as an array of phrasal units called 'bunsetsu' and the dependency structure of a sentence can be represented by the relationships between these bunsetsus. A bunsetsu consists of one or more content words and zero or more function words that follow these.</Paragraph> <Paragraph position="2"> For example, the Japanese sentence Ryoushi-ga kawa-de oyogu nezumi-wo utta.</Paragraph> <Paragraph position="3"> hunter-SUBJ river-in swim mouse-OBJ shot (A hunter shot a mouse which swam in the river.) contains five bunsetsus CU Ryoushi-ga, kawa-de, oyogu, nezumi-wo, utta CV and their dependency relations are as follows: Ryoushi-ga AX utta kawa-de AX oyogu oyogu AX nezumi-wo nezumi-wo AX utta Our task is, given an input bunsetsu, to output the correct bunsetsu on which the input bunsetsu depends. In this task, we considered the dependency relations of limited types. That is the dependency of types: noun-pp AX pred , where noun is a noun, or the head of a compound noun, pp is one of 9 postpositions CUga, wo, ni, de, to, he, made, kara, yoriCV and pred is a bunsetsu which contains a verb or an adjective as its content word part. We restricted possible dependee bunsetsus to be those to the right of the input bunsetsus because in Japanese, basically all dependency relations are from left to right. Thus, our test data is in the form Our training data is in the form BOD6, noun, pp, predBQ. A sample of this form represents two bunsetsus, noun-pp and pred within a sentence, in this order, and D6 BE CUB7BNA0CVdenotes whether they are in a dependency relation (D6 BPB7), or not (D6 BP A0). From these types of samples, we want to estimate probabilityC8B4D6BNnounBNppBNpredB5 and use these to approximate probability D4 We approximated the probability of occurrence for sample type D6 BP A0 expressed as C8B4A0BNnounBNppBNpredB5BPC8B4A0BNnounB5C8B4A0BNppBNpredB5BN and estimated these from the raw frequencies. For the probability of type D6 BP B7, we treated a pair of pp and pred as one variable, pp:pred, expressed as We extracted the training samples and the test data from the EDR Japanese corpus (EDR, 1994). We extracted all the positive (i.e., D6 BPB7) and negative (D6 BP A0) relation samples and divided them into 10 disjunctive sets for 10-fold cross validation. When we divided the samples, all the relations extracted from one sentence were put together in one of 10 sets. When a set was used as the test data, these relations from one sentence were used as the test data of the form (Eq.8). Of course, we did not use samples with only one pred. 
<Section position="11" start_page="0" end_page="0" type="evalu"> <SectionTitle> 4.2 Results </SectionTitle>
<Paragraph position="0"> In these experiments, we compared three methods: ours, Li's method described in Li (2002), and a restricted version of our method that uses only CLASSIFY operations. The last method is simply called 'the CLASSIFY method' in this subsection. In our method, the parameter that specifies the number of trials in the initialization and in each SPLIT operation was set to 10. Li's method (2002) uses the MDL principle as its clustering criterion and creates word classes in a bottom-up fashion. The two parameters in his method, which specify the maximum numbers of successive merges in each dimension, were both set to 100. The CLASSIFY method performs K-means style iterative clustering and requires that the numbers of clusters be specified beforehand. We set these to the numbers of clusters created by our method on each training set. By comparing the performance of our method with that of the CLASSIFY method, we can see the advantages of our top-down approach guided by the MDL principle over a K-means style approach that uses a fixed number of clusters. We expect these advantages to hold as well against other previously reported K-means style methods (Kneser and Ney, 1993; Berkhin and Becher, 2002; Dhillon et al., 2002).</Paragraph>
<Paragraph position="3"> In the results, precision refers to the ratio c/(c + w) and coverage to the ratio c/t, where c and w denote the numbers of correct and wrong predictions, respectively, and t denotes the total number of test items. All 'tie cases' were counted as wrong answers (w), where a 'tie case' is a situation in which two or more predictions are made with the same maximum probability.</Paragraph>
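The scoring just defined can be read off directly from a minimal sketch (assumed, not the authors' code): precision = c/(c + w) and coverage = c/t, with tie cases counted as wrong.

    def evaluate(test_items, probability):
        # test_items: list of (noun, pp, candidate_preds, gold_pred) tuples.
        # probability(noun, pp, pred) returns the estimated probability that
        # noun-pp depends on pred, or None when the model makes no prediction.
        c = w = 0
        t = len(test_items)
        for noun, pp, candidates, gold in test_items:
            scored = [(probability(noun, pp, p), p) for p in candidates]
            scored = [(s, p) for s, p in scored if s is not None]
            if not scored:
                continue                      # no prediction: lowers coverage only
            best = max(s for s, _ in scored)
            winners = [p for s, p in scored if s == best]
            if len(winners) == 1 and winners[0] == gold:
                c += 1                        # unique, correct prediction
            else:
                w += 1                        # wrong prediction, or a tie case
        precision = c / (c + w) if c + w else 0.0
        coverage = c / t if t else 0.0
        return precision, coverage

Under this convention precision is never smaller than coverage; the gap between the two reflects test items for which no prediction was made.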
<Paragraph position="4"> All figures are averages of the results over the ten training-test pairs, except for Li's method on training sets of 8k or more: its results on the 8k training set are averages over two training-test pairs, and we could not run more trials with Li's method due to time constraints. All experiments were run on Pentium III 1.2-GHz computers, and the reported computation times are wall-clock times.</Paragraph>
<Paragraph position="5"> Figure 1 shows the computation time as a function of the size of the vocabulary, i.e., the number of nouns plus the number of case frame slots (pp:pred pairs) in the training data. The plot clearly shows the efficiency of our method compared to Li's. The log-log plot reveals that the time complexity of our method is roughly linear in the vocabulary size on these data sets, which is about two orders of magnitude lower than that of Li's method.</Paragraph>
<Paragraph position="6"> There is little point in comparing the speed of the CLASSIFY method with that of the other two methods, because its computation time does not include the time required to decide the proper number of classes. More interesting is its apparent speed-up on the largest data sets. This implies that, on large and sparse training data, the CLASSIFY method was caught early in poor local optima, before reaching better ones.</Paragraph>
<Paragraph position="7"> Figure 2 plots the computation time against the coverage achieved within that time. From this, we would expect our method to reach higher coverage within a realistic time if larger quantities of training data were used. Confirming this requires further experiments on larger corpora, which we intend to carry out in the future.</Paragraph>
<Paragraph position="8"> Table 1 lists the description lengths for training data of sizes 1k to 32k, and Table 2 shows the precision and coverage achieved by each method on the same data. These tables show that our method works slightly better than Li's method, both as an optimization method that minimizes the description length and on the evaluation task. Therefore, we can say that our method reduces the computational cost without losing accuracy. Our method also always outperforms the CLASSIFY method. Both our method and the CLASSIFY method use random initialization, but the results suggest that our top-down, divisive strategy, combined with K-means like swapping and merging operations, avoids the poor local optima in which the CLASSIFY method was caught.</Paragraph>
<Paragraph position="9"> Figure 3 also presents the results in terms of the coverage-precision trade-off. Our method always selected better points on this trade-off than either Li's method or the CLASSIFY method.</Paragraph>
<Paragraph position="10"> From these results, we can conclude that our clustering algorithm is more efficient than, and yields slightly better results than, Li's method, which uses the same clustering criterion. We can also expect that our approach combined with the MDL principle will have advantages on large and sparse data over existing K-means style approaches in which the number of clusters is fixed.</Paragraph>
</Section> </Paper>