Inducing Syntactic Categories by Context Distribution Clustering

5 Results

I used 12 million words of the British National Corpus (BNC) as training data, and ran the algorithm with various numbers of clusters (77, 100 and 150). All of the results in this paper are produced with 77 clusters, corresponding to the number of tags in the CLAWS tagset used to tag the BNC plus a distinguished sentence-boundary token. In each case, the induced clusters contained accurate classes corresponding to the major syntactic categories, together with various subgroups of them, such as prepositional verbs, first names and last names. Appendix A shows the five most frequent words in each cluster of a clustering with 77 clusters. In general, as can be seen, the clusters correspond to traditional syntactic classes. There are a few errors; notably, the right bracket is classified with adverbial particles like "UP".

For each word w, I then calculated the optimal coefficients α_i(w). Table 1 shows some sample ambiguous words, together with the clusters with the largest values of α_i; each cluster is represented by its most frequent member. Note that "US" is a proper-noun cluster. As there is more than one common-noun cluster, for many unambiguous nouns the optimum is a mixture of the various classes.

[Table 1 appears here; the surviving caption fragment reads "... with tags NN1 (common noun) and AJ0 (adjective)".]

Table 2 shows the accuracy of cluster assignment for rare words. For two CLAWS tags that occur frequently among rare words in the corpus, AJ0 (adjective) and NN1 (singular common noun), I selected all of the words that occurred n times in the corpus and carried that CLAWS tag at least half the time. I then tested the accuracy of my assignment algorithm, counting an assignment as correct if it placed the word in a 'plausible' cluster: for AJ0, either of the clusters "NEW" or "IMPORTANT", and for NN1, one of the clusters "TIME", "PEOPLE", "WORLD", "GROUP" or "FACT". I did this for n in {1, 2, 3, 5, 10, 20}. I proceeded similarly for the Brown clustering algorithm, selecting two clusters for NN1 and four for AJ0. This evaluation can only be approximate, since the choice of acceptable clusters is rather arbitrary and the BNC tags are not perfectly accurate, but the results are quite clear: for words that occur five times or fewer, the CDC algorithm is clearly more accurate.

Evaluation is in general difficult with unsupervised learning algorithms. Previous authors have relied both on informal evaluations of the plausibility of the classes produced and on more formal statistical methods. Comparison against existing tag sets is not meaningful: one set of tags chosen by linguists would score very badly against another without this implying any fault, since there is no gold standard. I therefore chose an objective statistical measure, the perplexity of a very simple finite-state model, to compare the tags generated by this clustering technique against the BNC tags, which use the CLAWS-4 tag set (Leech et al., 1994), comprising 76 tags.
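To make the comparison concrete, the tagging decision used below can be sketched as follows. This is a minimal sketch under my own assumptions, not code from the paper: contexts are reduced to (previous word, next word) pairs, the mixture coefficients α_i(w) serve as the word's prior cluster distribution, and all names are hypothetical.

```python
def assign_cluster(word, prev_word, next_word, alpha, context_prob, eps=1e-9):
    """Assign a token to the cluster c maximising P(c | word, context),
    scored as alpha[word][c] * P((prev_word, next_word) | c).

    alpha[word] maps cluster ids to mixture coefficients (the prior);
    context_prob[c] maps (prev, next) word pairs to their probability under c.
    """
    best_c, best_score = None, 0.0
    for c, prior in alpha[word].items():
        score = prior * context_prob[c].get((prev_word, next_word), eps)
        if best_c is None or score > best_score:
            best_c, best_score = c, score
    return best_c

# Toy example: "rose" is ambiguous between a noun-like and a verb-like cluster.
alpha = {"rose": {0: 0.7, 1: 0.3}}
context_prob = {0: {("the", "garden"): 0.02}, 1: {("price", "sharply"): 0.03}}
assert assign_cluster("rose", "the", "garden", alpha, context_prob) == 0
```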
I tagged 12 million words of BNC text with the 77 induced tags, assigning each word to the cluster with the highest a posteriori probability given its prior cluster distribution and its context.

I then trained second-order Markov models (equivalently, class trigram models) on the original BNC tags, on the output of my algorithm (CDC) and, for comparison, on the output of the Brown algorithm. The perplexities on held-out data are shown in Table 3. As can be seen, the perplexity is lower with the model trained on data tagged with the new algorithm. This does not imply that the new tag set is better; it merely shows that it captures statistically significant generalisations. In absolute terms the perplexities are rather high: I deliberately chose a rather crude model, without backoff and with only a minimal amount of smoothing, which I felt might sharpen the contrast.
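For reference, the perplexity computation for such a class trigram model can be sketched as below. This is a hypothetical reconstruction in the spirit of the deliberately crude model just described (additive smoothing with a small constant, no backoff); the paper does not specify its exact counts or smoothing constant, and all names are mine.

```python
import math
from collections import Counter

def train_class_trigram(tag_seqs, tagset_size, k=0.001):
    """Class trigram model: additive smoothing with constant k, no backoff.
    tagset_size should count the distinct tags plus the end-of-sentence marker."""
    tri, bi = Counter(), Counter()
    for tags in tag_seqs:
        padded = ["<s>", "<s>"] + list(tags) + ["</s>"]
        for i in range(2, len(padded)):
            tri[tuple(padded[i - 2:i + 1])] += 1
            bi[tuple(padded[i - 2:i])] += 1

    def prob(t1, t2, t3):
        return (tri[(t1, t2, t3)] + k) / (bi[(t1, t2)] + k * tagset_size)

    return prob

def perplexity(prob, held_out_seqs):
    """Per-tag perplexity: 2 to the power of the negative mean log2 probability."""
    log_prob, n = 0.0, 0
    for tags in held_out_seqs:
        padded = ["<s>", "<s>"] + list(tags) + ["</s>"]
        for i in range(2, len(padded)):
            log_prob += math.log2(prob(padded[i - 2], padded[i - 1], padded[i]))
            n += 1
    return 2.0 ** (-log_prob / n)
```

Under this setup the same perplexity function is applied to held-out text tagged with the BNC tags, with the CDC output and with the Brown output, so the three tag streams are compared on equal terms.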