<?xml version="1.0" standalone="yes"?> <Paper uid="W06-3814"> <Title>Evaluating and optimizing the parameters of an unsupervised graph-based WSD algorithm</Title> <Section position="5" start_page="90" end_page="91" type="metho"> <SectionTitle> 3 Evaluating unsupervised WSD systems </SectionTitle> <Paragraph position="0"> All unsupervised WSD algorithms need some addition in order to be evaluated. One alternative, as in (V'eronis, 2004), is to manually decide the correctness of the hubs assigned to each occurrence of the words. This approach has two main disadvantages.</Paragraph> <Paragraph position="1"> First, it is expensive to manually verify each occurrence of the word, and different runs of the algorithm need to be evaluated in turn. Second, it is not an easy task to manually decide if an occurrence of a word effectively corresponds with the use of the word the assigned hub refers to, especially considering that the person is given a short list of words linked to the hub. We also think that instead of judging whether the hub returned by the algorithm is correct, the person should have independently tagged the occurrence with hubs, which should have been then compared to the hub returned by the system.</Paragraph> <Paragraph position="2"> A second alternative is to evaluate the system according to some performance in an application, e.g.</Paragraph> <Paragraph position="3"> information retrieval (Sch&quot;utze, 1998). This is a very attractive idea, but requires expensive system development and it is sometimes difficult to separate the reasons for the good (or bad) performance.</Paragraph> <Paragraph position="4"> A third alternative would be to devise a method to map the hubs (clusters) returned by the system to the senses in a lexicon. Pantel and Lin (2002) automatically map the senses to WordNet, and then measure the quality of the mapping. More recently, the mapping has been used to test the system on publicly available benchmarks (Purandare and Ped- null more details on these systems.</Paragraph> <Paragraph position="5"> Yet another possibility is to evaluate the induced senses against a gold standard as a clustering task.</Paragraph> <Paragraph position="6"> Induced senses are clusters, gold standard senses are classes, and measures from the clustering literature like entropy or purity can be used. As we wanted to focus on the comparison against a standard data-set, we decided to leave aside this otherwise interesting option.</Paragraph> <Paragraph position="7"> In this section we present a framework for automatically evaluating unsupervised WSD systems against publicly available hand-tagged corpora. The framework uses three data sets, called Base corpus, Mapping corpus and Test corpus: * The Base Corpus: a collection of examples of the target word. The corpus is not annotated.</Paragraph> <Paragraph position="8"> * The Mapping Corpus: a collection of examples of the target word, where each corpus has been manually annotated with its sense.</Paragraph> <Paragraph position="9"> * The Test Corpus: a separate collection, also annotated with senses.</Paragraph> <Paragraph position="10"> The evaluation framework is depicted in Figure 1.</Paragraph> <Paragraph position="11"> The first step is to execute the HyperLex algorithm over the Base corpus in order to obtain the hubs of a target word, and the generated MST is stored. 
As stated before, the Base Corpus is not tagged, so the building of the MST is completely unsupervised.</Paragraph>
<Paragraph position="12"> In a second step (left part in Figure 1), we assign a hub score vector to each of the occurrences of the target word in the Mapping corpus, using the MST calculated in the previous step (following the WSD algorithm in Section 2.2). Using the hand-annotated sense information, we can compute a mapping matrix M that relates hubs and senses in the following way. Suppose there are m hubs and n senses for the target word. Then M = {m_ij}, 1 ≤ i ≤ m, 1 ≤ j ≤ n, and each m_ij = P(s_j | h_i), that is, m_ij is the probability of a word having sense j given that it has been assigned hub i. This probability can be computed by counting the times an occurrence with sense s_j has been assigned hub h_i.</Paragraph>
<Paragraph position="13"> This mapping matrix will be used to transform any hub score vector h = (h_1, ..., h_m) returned by the WSD algorithm into a sense score vector s = (s_1, ..., s_n). It suffices to multiply the score vector by M, i.e., s = hM.</Paragraph>
<Paragraph position="14"> In the last step (right part in Figure 1), we apply the WSD algorithm over the Test corpus, using again the MST generated in the first step, and returning a hub score vector for each occurrence of the target word in the test corpus. We then run the Evaluator, which uses the mapping matrix M to convert the hub score vector into a sense score vector. The Evaluator then compares the sense with the highest weight in the sense score vector to the sense that was manually assigned, and outputs the precision figures.</Paragraph>
<Paragraph position="15"> Preliminary experiments showed that, like other unsupervised systems, HyperLex performs better if it sees the test examples when building the graph. We therefore decided to include a copy of the training and test corpora in the base corpus (discarding all hand-tagged sense information, of course). Given the high efficiency of the algorithm this poses no practical problem (see efficiency figures in Section 6).</Paragraph> </Section>
<Section position="6" start_page="91" end_page="91" type="metho">
<SectionTitle> 4 Tuning the parameters </SectionTitle>
<Paragraph position="0"> As stated before, the behavior of the HyperLex algorithm is influenced by a set of heuristic parameters that affect the way the cooccurrence graph is built, the number of induced hubs, and the way they are extracted from the graph. There are 7 parameters in total.</Paragraph>
[Spilled caption of Table 2 (beginning lost): "... in the train and test splits. The MFS column corresponds to the most frequent sense. The rest of the columns correspond to different parameter settings: default for the default setting, p180 for the best combination over 180, etc. The last rows show the micro-average over the S3LS run, and we also add the results on the S2LS dataset (different sets of nouns) to confirm that the same trends hold in both datasets."]
[Fragment of Table 1 (rows p5-p7): p5 = minimum number of adjacent vertices a hub must have; p6 = maximum mean weight of the adjacent vertices of a hub; p7 = minimum frequency of hubs.]
<Paragraph position="1"> Table 1 lists the parameters of the HyperLex algorithm, and the default values proposed for them in the original work (second column).</Paragraph>
<Paragraph position="2"> Given that we have devised a method to efficiently evaluate the performance of HyperLex, we are able to tune the parameters against the gold standard, as sketched below.
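To make the mapping and evaluation machinery of Section 3 concrete, the following is a minimal Python sketch, assuming hub assignments and hub score vectors have already been produced by the WSD algorithm of Section 2.2; the data structures and function names are illustrative, not the implementation used in this paper.

    import numpy as np

    def estimate_mapping_matrix(hub_assignments, gold_senses, m, n):
        """Estimate M = {m_ij} with m_ij = P(s_j | h_i) on the Mapping corpus,
        by counting how often an occurrence assigned hub i carries gold sense j.
        hub_assignments and gold_senses are parallel lists of integer indices."""
        counts = np.zeros((m, n))
        for h, s in zip(hub_assignments, gold_senses):
            counts[h, s] += 1
        totals = counts.sum(axis=1, keepdims=True)
        totals[totals == 0] = 1.0        # hubs never seen in the Mapping corpus
        return counts / totals           # row i sums to 1: P(. | h_i)

    def hub_to_sense_scores(hub_scores, M):
        """Map a hub score vector h = (h_1, ..., h_m) to a sense score vector
        s = (s_1, ..., s_n) by s = hM."""
        return np.asarray(hub_scores) @ M

    def precision(test_hub_scores, test_gold_senses, M):
        """For each test occurrence, pick the sense with the highest weight and
        compare it with the hand-assigned sense (coverage is always 100%)."""
        correct = 0
        for h, gold in zip(test_hub_scores, test_gold_senses):
            correct += int(np.argmax(hub_to_sense_scores(h, M)) == gold)
        return correct / len(test_gold_senses)

With such a routine in place, tuning reduces to running the whole induce-map-evaluate pipeline once per parameter combination and keeping the combination with the best precision on the tuning data (here S2LS), as described next.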
We first set a range for each of the parameters, and evaluated the algorithm for each combination of the parameters on a collection of examples of different words (Senseval 2 English lexical-sample, S2LS).</Paragraph>
<Paragraph position="3"> This ensures that the chosen parameter set is valid for any noun, and is not overfitted to a small set of nouns.6 The set of parameters that obtained the best results in the S2LS run is then selected to be run against the S3LS dataset.</Paragraph>
<Paragraph position="4"> We first devised ranges for the parameters amounting to 180 possible combinations (p180 column in Table 2), and then extended the ranges to amount to 1800 and 6700 possible combinations. Tuning the parameters separately for each word did not yield better results.</Paragraph> </Section>
<Section position="7" start_page="91" end_page="93" type="metho">
<SectionTitle> 5 Experiment setting and results </SectionTitle>
<Paragraph position="0"> To evaluate the HyperLex algorithm on a standard benchmark, we applied it to the 20 nouns in S3LS.</Paragraph>
<Paragraph position="1"> We use the standard training-test split. Following the design in Section 3, we used both the training and test sets as the Base Corpus (ignoring the sense tags, of course). The Mapping Corpus comprised the training split only, and the Test corpus the test split only. The parameter tuning was done in a similar fashion, but on the S2LS dataset.</Paragraph>
<Paragraph position="2"> In Table 2 we can see the number of examples of each word in the different corpora and the results of the algorithm. We indicate only precision, as the coverage is 100% in all cases. The left column, named MFS, shows the precision when always assigning the most frequent sense (relative to the train split). This is the baseline of our algorithm, as our algorithm does see the tags in the mapping step (see Section 6 for further comments on this issue).</Paragraph>
<Paragraph position="3"> The default column shows the results for the HyperLex algorithm with the default parameters as set by Véronis, except for the minimum frequency of the vertices (p2 in Table 1), which according to some preliminary experiments we set to 3. As we can see, the algorithm with the default settings outperforms the MFS baseline by 5.4 points on average, and in almost all words (except plan, sort and source).</Paragraph>
[Spilled figure caption (beginning lost): "... combinations. The horizontal axis shows the similarity of a parameter set w.r.t. the best parameter set using the cosine. The vertical axis shows the precision in S2LS. The best fitting line is also depicted."]
<Paragraph position="5"> The results for the best of the 180 combinations of the parameters improve on the default setting (0.4 overall). Extending the parameter space to 1800 and 6700 combinations improves the precision up to 63.0 and 64.6 respectively, 10.1 points over the MFS (the MFS only outperforms HyperLex in the best setting for two words). The same trend can be seen on the S2LS dataset, where the gain was more modest (note that the parameters were optimized for S2LS).</Paragraph> </Section>
</Paper>