<?xml version="1.0" standalone="yes"?> <Paper uid="I05-3004"> <Title>Chinese Classifier Assignment Using SVMs</Title> <Section position="7" start_page="27" end_page="29" type="evalu"> <SectionTitle> 6 Results and Discussion </SectionTitle> <Paragraph position="0"> We built SVMs using all the feature sets described in Section 5 and tested using 10-fold cross validation. We tried the four types of kernel function in LIBSVM: linear, polynomial, radial basis function (RBF) and sigmoid, then selected the RBF kernal K(x,y) = e[?]g||x[?]y||2, which gives the (1) noun only 57.81% (c = 4, g = 0.5) 59.34% (c = 16, g = 0.125) (2) ontology only 58.69% (c = 4, g = 0.5) 60.68% (c = 256, g = 0.125) (3) noun and ontology 57.81% (c = 16, g = 0.5) 59.46% (c = 16, g = 0.125) (4) noun or ontology 58.71% 60.55% (5) noun, syntactic and lexical features 52.14% (c = 1024, g = 0.5) 53.51% (c = 16, g = 0.5) (6) all features 52.06% (c = 1024, g = 0.075) 53.55% (c = 16, g = 0.5) value misclassified as column value and (percentage of total misclassifications of row value misclassified as column value) highest accuracy. For each feature set, we systematically varied the values for the parameters C (range from 2[?]5 to 215) and g (range from 23 to 2[?]15); we report the best results with corresponding values for C and g. Finally, for each feature set, we ran once on all nouns and once only on nouns occurring twice or more in the corpus.</Paragraph> <Paragraph position="1"> Classifier assignment accuracy is reported in nificantly better than baseline (paired t-test, p < 0.005). There is no significant difference between the performance with the 1st, 2nd, 3rd and 4th feature sets. But the performance of the SVMs using lexical and syntactic features (experiments 5 and 6) is significantly worse than the performance on feature sets 1-4 (df = 17.426, p < 0.05).</Paragraph> <Paragraph position="2"> These results show that lexical and syntactic contextual features do not have a positive effect on the assignment of classifiers. They confirm the intuition that the noun is the single most important predictor of the classifier; however, the semantic class of the noun works as well as the noun itself.</Paragraph> <Paragraph position="3"> In addition, a combination approach that uses semantic class information when the noun is previously unseen does not perform better.</Paragraph> <Paragraph position="4"> We also computed the confusion matrix for the most commonly misclassified classifiers. The results are reported in Table 4.</Paragraph> <Paragraph position="5"> For these experiments we used automatic evaluation (cf. (Paul et al., 2002)). A classifier is only judged to be correct if it is exactly the same as that in the original test set. For some noun phrases, there are multiple valid classifiers. For example, we can say</Paragraph> <Paragraph position="7"> (a golden medal).</Paragraph> <Paragraph position="8"> We did a subjective evaluation on part of our data to evaluate how many automatically generated classifiers are acceptable to human readers. We randomly selected 241 noun-classifier pairs from our data. We presented the sentence containing each pair to a human judge who is a native speaker of Mandarin Chinese. 
Table 3: Classifier assignment accuracy, with the best C and g for each run
Feature set | All nouns | Nouns occurring twice or more
(1) noun only | 57.81% (C = 4, g = 0.5) | 59.34% (C = 16, g = 0.125)
(2) ontology only | 58.69% (C = 4, g = 0.5) | 60.68% (C = 256, g = 0.125)
(3) noun and ontology | 57.81% (C = 16, g = 0.5) | 59.46% (C = 16, g = 0.125)
(4) noun or ontology | 58.71% | 60.55%
(5) noun, syntactic and lexical features | 52.14% (C = 1024, g = 0.5) | 53.51% (C = 16, g = 0.5)
(6) all features | 52.06% (C = 1024, g = 0.075) | 53.55% (C = 16, g = 0.5)

<Paragraph position="1"> Classifier assignment accuracy is reported in Table 3. The performance of the SVMs with the 1st, 2nd, 3rd and 4th feature sets is significantly better than baseline (paired t-test, p < 0.005). There is no significant difference in performance among the 1st, 2nd, 3rd and 4th feature sets. But the performance of the SVMs using lexical and syntactic features (experiments 5 and 6) is significantly worse than the performance on feature sets 1-4 (df = 17.426, p < 0.05).</Paragraph>
<Paragraph position="2"> These results show that lexical and syntactic contextual features do not have a positive effect on the assignment of classifiers. They confirm the intuition that the noun is the single most important predictor of the classifier; however, the semantic class of the noun works as well as the noun itself.</Paragraph>
<Paragraph position="3"> In addition, a combination approach that uses semantic class information when the noun is previously unseen does not perform better.</Paragraph>
<Paragraph position="4"> We also computed the confusion matrix for the most commonly misclassified classifiers. The results are reported in Table 4, which gives the number of times each row value is misclassified as each column value, with the percentage of total misclassifications of the row value in parentheses.</Paragraph>
<Paragraph position="5"> For these experiments we used automatic evaluation (cf. Paul et al., 2002): a classifier is judged correct only if it is exactly the same as the one in the original test set. For some noun phrases, however, there are multiple valid classifiers; for example, several different classifiers are acceptable for the noun phrase meaning "a golden medal".</Paragraph>
<Paragraph position="8"> We did a subjective evaluation on part of our data to evaluate how many automatically generated classifiers are acceptable to human readers. We randomly selected 241 noun-classifier pairs from our data and presented the sentence containing each pair to a human judge, a native speaker of Mandarin Chinese. We asked the judge to rate all the classifiers generated by our algorithms, as well as the original classifier, by indicating whether each is good (2), acceptable (1) or bad (0) in that sentence context. The classifiers were presented in random order; the judge was blind to the source of the classifiers.</Paragraph>

Table 5: Human evaluation on 241 randomly selected noun-classifier pairs
Feature set | Rated acceptable or good (of 241) | Percentage | Mean rating
(1) noun only | 224 | 92.9% | 1.76
(2) ontology only | 226 | 93.8% | 1.78
(3) noun and ontology | 226 | 93.8% | 1.77
(4) noun or ontology | 227 | 94.2% | 1.80
(5) noun, syntactic and lexical features | (values lost in extraction)

<Paragraph position="9"> The results of our human evaluation are reported in Table 5. Although our automatic evaluation indicates relatively poor accuracy, 94.2% of the generated classifiers (using feature set 4) are rated acceptable or good in our subjective evaluation. Also, the performance of the SVMs with the 1st, 2nd, 3rd and 4th feature sets is significantly better than baseline (paired t-test, p < 0.005). There is no significant difference among the 1st, 2nd, 3rd and 4th feature sets, but the performance of the SVMs using lexical and syntactic features (experiments 5 and 6) is significantly worse than that of those without (p < 0.05). The ratings of the classifiers generated by all our algorithms are significantly worse than those of the original classifiers in the corpus. In future work, we plan to extend this evaluation using more judges. Which classifier to select also depends on the emotional background of the discourse (Fang, 2003); for example, we can use different classifiers to express different affect for the same noun (e.g., whether a government official is in favor or in disgrace). However, we cannot get this kind of information from our corpus.</Paragraph>
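As a minimal sketch of the evaluation measures used in this section (again, not the authors' code), the Python fragment below computes exact-match accuracy for the automatic evaluation and the acceptable-or-good rate and mean rating for the human evaluation; ttest_rel from SciPy is a paired t-test of the kind behind the significance claims. All variable and function names are illustrative assumptions.

from scipy.stats import ttest_rel

def exact_match_accuracy(predicted, gold):
    # Automatic evaluation: a generated classifier counts as correct
    # only if it is identical to the classifier in the original test set.
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)

def human_eval_summary(ratings):
    # Human evaluation: ratings are 0 (bad), 1 (acceptable) or 2 (good)
    # for one system over the 241 sampled noun-classifier pairs.
    n_ok = sum(1 for r in ratings if r >= 1)
    return n_ok, n_ok / len(ratings), sum(ratings) / len(ratings)

# Paired t-test over matched scores of two systems, e.g. per-fold
# accuracies of feature set (4) versus feature set (5):
#   t_stat, p_value = ttest_rel(scores_system_a, scores_system_b)

</Section>
</Paper>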