<?xml version="1.0" standalone="yes"?> <Paper uid="W06-3814"> <Title>Evaluating and optimizing the parameters of an unsupervised graph-based WSD algorithm</Title> <Section position="8" start_page="93" end_page="95" type="evalu"> <SectionTitle> 6 Discussion and related work </SectionTitle> <Paragraph position="0"> We first comment on the results with some analysis, and then compare our results to those of Véronis. Finally, we review some relevant work and the results of unsupervised systems on the S3LS benchmark.</Paragraph> <Section position="1" start_page="93" end_page="93" type="sub_section"> <SectionTitle> 6.1 Comments on the results </SectionTitle> <Paragraph position="0"> The results show clearly that our exploration of the parameter space was successful, with the widest parameter space showing the best results.</Paragraph> <Paragraph position="1"> In order to analyze whether the search in the parameter space was making any sense, we drew a dispersion plot (see Figure 2). In the top right-hand corner we have the point corresponding to the best performing parameter set. If the parameters were not conditioning the good results, we would have expected a random cloud of points. On the contrary, we can see a clear tendency for the parameter sets most similar to the best one to obtain better results, and in fact the best-fitting line shows a clearly ascending slope.</Paragraph> [Table 3 caption (fragment): "... standard deviation) for three parameter settings. Defined means the number of hubs induced, and used means the ones actually returned by HyperLex when disambiguating the test set. The same applies for senses, that is, defined means the total number of senses (equal for all columns), and used means the senses that were actually used by HyperLex in the test set. The last row shows the actual number of senses used by the hand-annotators in the test set."] <Paragraph position="3"> Regarding efficiency, our implementation of HyperLex is extremely fast. Running the 1800 combinations takes 2 hours on a machine with two 2 GHz AMD Opteron processors and 3 GB of RAM. A single run (building the MST, mapping and tagging the test sentences) takes only 16 seconds. For this reason, even if an on-line version would in principle be desirable, we think that this batch version is readily usable.</Paragraph> </Section> <Section position="2" start_page="93" end_page="94" type="sub_section"> <SectionTitle> 6.2 Comparison to (Véronis, 2004) </SectionTitle> <Paragraph position="0"> Compared to Véronis, we induce larger numbers of hubs (with different parameters), use fewer examples to build the graphs, and obtain more modest results (far from the 90s). Regarding the latter, our results are in the range of other S3LS WSD systems (see below), and the discrepancy can be explained by the way Véronis performed his evaluation (see Section 3).</Paragraph> <Paragraph position="1"> Table 3 shows the average number of hubs for the four parameter settings. The average number of hubs for the default setting is larger than that of Véronis (which ranges between 4 and 9 per word), but quite close to the average number of senses. The exploration of the parameter space prefers settings with even larger numbers of hubs, and the figures show that most of them are actually used for disambiguation. The table also shows that, after the mapping, less than half of the senses are actually used, which seems to indicate that the mapping tends to favor the most frequent senses.</Paragraph>
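To make this behavior concrete, the following minimal sketch shows a majority-vote hub-to-sense mapping of the kind described above. It is our own illustration, not the paper's code, and all names (map_hubs_to_senses, tagged_train) are hypothetical. Since several hubs can share the same majority sense while rare senses may never win a vote, such a mapping naturally gravitates toward the most frequent senses.

```python
from collections import Counter, defaultdict

def map_hubs_to_senses(tagged_train):
    """Majority-vote mapping from induced hubs to gold senses (a sketch).

    tagged_train: iterable of (hub, gold_sense) pairs, one per
    hand-tagged training occurrence: the hub HyperLex assigns to the
    occurrence, and the sense the annotators gave it. Several hubs may
    map to the same sense; senses that never win a majority for any hub
    are simply never returned at test time.
    """
    counts = defaultdict(Counter)
    for hub, sense in tagged_train:
        counts[hub][sense] += 1
    return {hub: c.most_common(1)[0][0] for hub, c in counts.items()}

# Tagging a test occurrence then reduces to a dictionary lookup:
#   predicted = mapping[hub_assigned_to(test_occurrence)]
```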
<Paragraph position="2"> Regarding the actual values of the parameters used (cf. Table 1), we had to reduce the values of some parameters (e.g. the minimum frequency of vertices) due to the smaller number of examples (Véronis used from 1900 to 8700 examples per word). In theory, we could explore larger parameter spaces, but Table 1 shows that the best setting over the 6700 combinations has no parameter at a range boundary (except p5, which cannot be reduced further). All in all, the best results are attained with smaller and more numerous hubs, a kind of micro-senses.</Paragraph> <Paragraph position="3"> A possible explanation for this discrepancy with Véronis could be that he inspected the hubs he obtained by hand, and was perhaps biased toward hubs that look more like standard senses. At first we were uncomfortable with this behavior, so we checked whether HyperLex was degenerating into a trivial solution. We simulated a clustering algorithm returning one hub per example, and its precision was 40.1, well below the MFS baseline. We also realized that our results are in accordance with some theories of word meaning, e.g. the &quot;indefinitely large set of prototypes-within-prototypes&quot; envisioned in (Cruse, 2000). We now think that the idea of having many micro-senses is very attractive for further exploration, especially if we are able to organize them into coarser hubs.</Paragraph> </Section> <Section position="3" start_page="94" end_page="95" type="sub_section"> <SectionTitle> 6.3 Comparison to related work </SectionTitle> <Paragraph position="0"> Table 4 shows the performance of different systems on the nouns of the S3LS benchmark. When not reported separately, we obtained the results for nouns by running the official scorer program on the filtered results, as available on the S3LS web page. The second column shows the type of system (supervised, unsupervised).</Paragraph> <Paragraph position="1"> We include three supervised systems: the winner of S3LS (Mihalcea et al., 2004), an in-house system (kNN-all, CITATION OMITTED) which uses optimized kNN, and the same in-house system restricted to bag-of-words features only (kNN-bow), i.e. discarding other local features like bigrams or trigrams (which is what most unsupervised systems do). The table shows that we are one point below the bag-of-words classifier kNN-bow, which is an impressive result if we take into account the information loss of the mapping step and the fact that we tuned our parameters on a different set of words. The full kNN system is state-of-the-art, only 4 points below the S3LS winner.</Paragraph> [Table 4 caption (fragment): "... all of which except Cymfony and (Purandare and Pedersen, 2004) participated in S3LS (see (Mihalcea et al., 2004) for further details on the systems). We classify them according to the amount of &quot;supervision&quot; they have: some have access to most-frequent-sense information (MFS-S3 if counted over S3LS, MFS-Sc if counted over SemCor), some use 10% of the S3LS training part for mapping (10%-S3LS), and some use the full amount of S3LS training for mapping (S3LS). Only one system (Duluth) did not use hand-tagged corpora in any way."] <Paragraph position="2"> Given the different typology of unsupervised systems, it is unfair to draw definitive conclusions from a raw comparison of results. The system closest to ours is that described in (Niu et al., 2005). They use a hand-tagged corpus, which need not include the target word, to tune the parameters of a rather complex clustering method that does use local information (an exception to the rule among unsupervised systems). They do use the S3LS training corpus for mapping. For every sense of the target word, three of its contexts in the training corpus are gathered (around 10% of the training data) and tagged. Each cluster is then related to its most frequent sense.</Paragraph> <Paragraph position="3"> Only one cluster may be related to a specific sense, so if two or more clusters map to the same sense, only the largest of them is retained. The mapping method is similar to ours, but we use all the available training data and allow different hubs to be assigned to the same sense.</Paragraph>
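For contrast with the many-to-one mapping sketched in Section 6.2, here is a minimal sketch of the one-cluster-per-sense constraint as we read the description of (Niu et al., 2005); it is our own hypothetical reconstruction, not their code, and all names are invented. Visiting clusters from largest to smallest ensures that when two or more clusters claim the same sense, only the largest is retained.

```python
from collections import Counter, defaultdict

def map_one_cluster_per_sense(tagged_train):
    """One-to-one variant of the cluster-to-sense mapping (a sketch):
    each sense may be claimed by at most one cluster, and on collision
    only the largest cluster keeps its sense; smaller colliding clusters
    are left unmapped and thus discarded at tagging time."""
    counts = defaultdict(Counter)  # cluster -> sense frequencies
    sizes = Counter()              # cluster -> number of tagged contexts
    for cluster, sense in tagged_train:
        counts[cluster][sense] += 1
        sizes[cluster] += 1
    mapping, taken = {}, set()
    # Largest clusters choose first, so they win any collisions.
    for cluster in sorted(sizes, key=sizes.get, reverse=True):
        sense = counts[cluster].most_common(1)[0][0]
        if sense not in taken:
            mapping[cluster] = sense
            taken.add(sense)
    return mapping
```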
<Paragraph position="4"> Another system similar to ours is (Purandare and Pedersen, 2004), which unfortunately was evaluated on Senseval-2 data. The authors use first- and second-order bag-of-words context features to represent each instance of the corpus. They apply several clustering algorithms based on the vector space model, limiting the number of clusters to 7. They also use all available training data for mapping, but given their small number of clusters they opt for a one-to-one mapping which maximizes the assignment and discards the less frequent clusters. They also discard some difficult cases, such as senses and words with low frequencies (10% of total occurrences and 90, respectively). The different test set and mapping system make the comparison difficult, but the fact that the best of their combinations beats MFS by 1 point on average (47.6% vs. 46.4%) for the selected nouns and senses makes us think that our results are more robust (nearly 10% over MFS).</Paragraph> </Section> </Section> </Paper>