<?xml version="1.0" standalone="yes"?> <Paper uid="W04-0836"> <Title>Senseval-3: The Catalan Lexical Sample Task</Title> <Section position="6" start_page="0" end_page="0" type="evalu"> <SectionTitle> 6 Results and System Comparison </SectionTitle>
<Paragraph position="0"> Table 2 presents the global results of all participant systems, including the MFC baseline (most frequent sense classifier), sorted by the combined F1 measure. The COMB row stands for a voted combination of the best systems (see the last part of this section for a description). As in the Spanish lexical sample task, the IRST system is the best performing one; in this case it achieves a substantial improvement over the second system (SWAT-AB).</Paragraph>
<Paragraph position="1"> All systems obtained better results than the baseline MFC classifier, with a best overall improvement of 18.87 points (a 56.09% relative error reduction; a worked computation of these figures is sketched below).</Paragraph>
<Paragraph position="2"> Regarding the multiple systems presented by SWAT, the combination of learning algorithms in SWAT-CP and SWAT-CA did not improve the accuracy of the basic AdaBoost-based system SWAT-AB. It is also observed that the POS and lemma information used by most systems is relevant, since the system relying only on raw lexical information (Duluth-CLSS) obtained comparatively lower results. Detailed results by groups of words are shown in table 3. Word groups include part of speech, intervals of the proportion of the most frequent sense (%MFS), and intervals of the number of examples per sense (ExS). Each cell contains precision and recall. Boldface results correspond to the best system in terms of the F1 score. The last column contains the best F1 improvement over the baseline: absolute difference and error reduction (%). As in many previous WSD works, verbs are significantly more difficult (16.67-point improvement and 49.3% error reduction) than nouns (23.46 points and 65.6%).</Paragraph>
<Paragraph position="3"> The improvement obtained by all methods on words with a high MFC (more than 90%) is generally low. This is not surprising, since statistically-based supervised ML algorithms have difficulty acquiring information about non-frequent senses.</Paragraph>
<Paragraph position="4"> Notice, however, the remarkable 44.9% error reduction obtained by SWAT-AB, the best system on this subset. In contrast, the gain obtained on the lowest-MFC words is substantial (34.2 points and 55.3% error reduction). This is a desirable property of the Catalan dataset and the participant systems, which is not always observed in empirical studies on other WSD corpora. It is worth noting that even better results were observed in the Spanish lexical sample task.</Paragraph>
<Paragraph position="5"> Systems differ considerably across word groups: IRST is globally the best, but not on the words with the highest (between 80% and 100%) and lowest (less than 50%) MFC, on which SWAT-AB is better. UNED and UMD are also very competitive on nouns, but their overall results are penalized by lower performance on adjectives (especially UNED) and verbs (especially UMD). Interestingly, IRST is the best system on the words with few examples per sense, suggesting that SVM is a good algorithm for training on small datasets, but it loses this advantage on the words with more examples.</Paragraph>
<Paragraph position="6"> All these facts open the avenue for further improvements on the Catalan dataset, by combining the outputs of the best performing systems or by selecting the best system at the word level.</Paragraph>
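The improvement and error-reduction figures quoted above pair an absolute F1 gain with a relative error reduction over the MFC baseline. The minimal Python sketch below shows how such figures are computed; the concrete baseline and best-single-system scores are not stated in this section and are back-derived here from the quoted numbers (an 18.87-point gain, a 56.09% error reduction, and a combined F1 of 86.86 that is 1.63 points above the best single system), purely for illustration.

```python
def error_reduction(f1_system: float, f1_baseline: float) -> tuple[float, float]:
    """Return (absolute F1 improvement, relative error reduction in %).

    Assumes F1 scores are expressed on a 0-100 scale.
    """
    improvement = f1_system - f1_baseline
    reduction = 100.0 * improvement / (100.0 - f1_baseline)
    return improvement, reduction


if __name__ == "__main__":
    # Illustrative values back-derived from the figures quoted in the text,
    # not taken from Table 2.
    f1_mfc = 66.36    # implied MFC baseline
    f1_best = 85.23   # implied best single system (86.86 - 1.63)
    gain, err_red = error_reduction(f1_best, f1_mfc)
    print(f"improvement = {gain:.2f} points, error reduction = {err_red:.2f}%")
    # -> improvement = 18.87 points, error reduction = 56.09%
```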
<Paragraph position="7"> As a first approach, we conducted some simple experiments on system combination using a voting scheme in which each system votes and the majority sense is selected (ties are decided in favor of the best method's prediction; a minimal sketch of this scheme is given below). From all possible sets, the best combination of systems turned out to be IRST, SWAT-AB, and UNED. The resulting F1 measure is 86.86, 1.63 points higher than the best single system (see table 2). This improvement comes mainly from the better F1 performance on the noun and verb categories: from 87.63 to 90.11 and from 82.63 to 85.47, respectively.</Paragraph>
<Paragraph position="8"> Finally, table 4 shows the agreement rates and the Kappa statistic between each pair of systems. Due to space restrictions we have indexed the systems by numbers: 1=MFC, 2=UMD, 3=IRST, 4=UNED, 5=D-CLSS, 6=SWAT-AB, 7=SWAT-CP, and 8=SWAT-CA. The upper diagonal contains the agreement ratios, varying from 70.13% to 96.01%, and the lower diagonal contains the corresponding Kappa values, ranging from 0.67 to 0.95. It is worth noting that the system relying on the simplest feature set (Duluth-CLSS) produces the output most similar to that of the most frequent sense classifier, and that the combination-based systems SWAT-CP and SWAT-CA generate almost the same output.</Paragraph>
</Section> </Paper>
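The following is a minimal sketch of the majority-voting combination described above, assuming each system's prediction for a test instance is available as a sense label. The function name, the tie-breaking argument, and the example labels are illustrative and not part of the task data; only the scheme itself (majority vote, ties resolved in favor of the best single system, e.g. IRST) follows the text.

```python
from collections import Counter


def vote(predictions: list[str], best_system_index: int = 0) -> str:
    """Majority vote over per-system sense predictions for one test instance.

    Ties are broken in favor of the prediction of the best single system,
    assumed here to be the first entry of `predictions` (e.g. IRST).
    """
    counts = Counter(predictions)
    top = max(counts.values())
    tied = [sense for sense, count in counts.items() if count == top]
    if len(tied) == 1:
        return tied[0]
    return predictions[best_system_index]


# Example with three systems (e.g. IRST, SWAT-AB, UNED):
print(vote(["sense_2", "sense_1", "sense_1"]))  # majority -> "sense_1"
print(vote(["sense_2", "sense_1", "sense_3"]))  # tie -> "sense_2" (best system)
```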