<?xml version="1.0" standalone="yes"?> <Paper uid="W04-0806"> <Title>Senseval-3: The Spanish Lexical Sample Task</Title> <Section position="6" start_page="0" end_page="0" type="evalu"> <SectionTitle> 6 Results and System Comparison </SectionTitle> <Paragraph position="0"> Table 2 presents the global results of all participant systems, including the MFC baseline (most frequent sense classifier), sorted by the combined F1 measure. The COMB row stands for a voted combination of the best systems (see the last part of this section). As can be seen, IRST and UA-SRT are the best-performing systems, with no significant differences between them.</Paragraph> <Paragraph position="1"> All supervised systems outperformed the MFC baseline, with a best overall improvement of 16.48 points (51.05% relative error reduction). Both unsupervised systems performed below MFC.</Paragraph> <Paragraph position="2"> It can also be observed that the POS and lemma information used by most supervised systems is relevant, since Duluth-SLSS (based solely on raw lexical information) performed significantly worse than the rest of the supervised systems.</Paragraph> <Paragraph position="3"> Detailed results by groups of words are shown in Table 3. Word groups include part of speech, intervals of the proportion of the most frequent sense (%MFS), intervals of the ratio of examples per sense (ExS), and the words in the retraining set used by UA-SRT (those with an MFC accuracy lower than 70% in the training set). Each cell contains precision and recall. Boldface results correspond to the best system in terms of the F1 score. The last column, Δ-error, contains the best F1 improvement over the baseline: absolute difference and error reduction (%).</Paragraph> <Paragraph position="4"> As in many previous WSD works, verbs are the most difficult words (13.07 improvement and 46.7% error reduction), followed by adjectives (19.64, 52.1%) and nouns (20.78, 59.4%). 
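The scores and error reductions discussed throughout this section follow the standard definitions; as a minimal illustrative sketch (not the task scorer itself, and with example figures chosen only so the arithmetic reproduces the 16.48-point / 51.05% improvement reported above):

```python
def f1(precision, recall):
    """Combined F measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def error_reduction(baseline, system):
    """Relative error reduction (%) of a system's accuracy over a baseline's.

    A 16.48-point absolute gain corresponds to a 51.05% relative error
    reduction when the baseline leaves an error of about 32.28 points.
    """
    return 100.0 * (system - baseline) / (100.0 - baseline)
```

For example, a hypothetical baseline accuracy of 67.72 and a system accuracy of 84.20 (values assumed here purely for illustration) yield an absolute gain of 16.48 points and a relative error reduction of 51.05%, matching the figures quoted in the text.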
The gain obtained by all methods on words with a high MFC (more than 90%) is very low, indicating the difficulty supervised ML algorithms have in acquiring information about non-frequent senses. By contrast, the gain obtained on the lowest-MFC words is very good (44.3 points and 62.5% error reduction). This is a very good property of the Spanish dataset and the participant systems, and one not always observed in empirical studies on other WSD corpora (e.g., in the Senseval-2 Spanish task, values of 29.9 and 43.1% were observed). The two unsupervised systems failed to achieve a performance on nouns comparable to the baseline classifier. UA-NP has the best precision, but at the cost of an extremely low recall (below 5%).</Paragraph> <Paragraph position="5"> It can also be observed that participant systems behave quite differently across word groups, with the best performances shared among the IRST, UA-SRT, UMD, and UNED systems. Interestingly, IRST is the best system on the words with the fewest examples per sense, suggesting that SVM is a good learning algorithm for training on small datasets, but it loses this advantage on the words with more examples.</Paragraph> <Paragraph position="6"> These facts open the avenue for further improvements on the Spanish dataset by combining the outputs of the best-performing systems. As a first approach, we conducted some simple experiments on system combination using a voting scheme, in which each system votes and the majority sense is selected (ties are decided by favoring the best method's prediction). Of all possible sets, the best combination includes the five systems with the best precision figures: UA-NP, IRST, UMD, UNED, and SWAT. The resulting F1 measure is 85.98, 1.78 points higher than the best single system (see table 2). 
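The voting scheme just described can be sketched as follows. This is a minimal illustration, not the organisers' actual combination code; the system names appear only as dictionary keys, and the tie-breaking rule (favoring the prediction of the best-ranked system) follows the description in the text:

```python
from collections import Counter

def combine(predictions, systems_best_first):
    """Majority vote over per-system sense predictions for one instance.

    predictions: dict mapping system name -> predicted sense.
    systems_best_first: system names ordered best-first; on a tie, the
    prediction of the highest-ranked system among the tied senses wins.
    """
    counts = Counter(predictions.values())
    top = max(counts.values())
    tied = {sense for sense, c in counts.items() if c == top}
    if len(tied) == 1:
        return tied.pop()
    # Tie: fall back to the best-ranked system whose prediction is tied.
    for system in systems_best_first:
        if predictions[system] in tied:
            return predictions[system]
```

For instance, if two senses each receive two votes, the combination returns the sense predicted by the highest-ranked of the tied systems.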
This improvement comes mainly from the better F1 performance on nouns: from 83.89 to 87.28.</Paragraph> <Paragraph position="7"> We also calculated the agreement rate and the Kappa statistic between each pair of systems. The agreement ratios ranged from 40.93% to 88.10%, and the Kappa values from 0.40 to 0.87. It is worth noting that the system relying on the simplest feature set (Duluth-SLSS) produced the output most similar to that of the most frequent sense classifier.</Paragraph> </Section> </Paper>