<?xml version="1.0" standalone="yes"?> <Paper uid="W02-1029"> <Title>Ensemble Methods for Automatic Thesaurus Extraction</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Individual Methods </SectionTitle> <Paragraph position="0"> The individual methods in these ensemble experiments are based on different extractors of contextual information. All the systems use the JACCARD similarity metric and TTEST weighting function that were found to be most effective for thesaurus extraction by Curran and Moens (2002a).</Paragraph> <Paragraph position="1"> The simplest and fastest contexts to extract are the word(s) surrounding each thesaurus term up to some fixed distance. These window methods are labelled W(L1R1), where L1R1 indicates that the window extends one word on either side of the target term.</Paragraph> <Paragraph position="2"> Methods marked with an asterisk, e.g. W(L1R1*), do not record the word's position in the relation type. The more complex methods extract grammatical relations using shallow statistical tools or a broad-coverage parser. We use the grammatical relations extracted from the parse trees of Lin's broad-coverage principle-based parser, MINIPAR (Lin, 1998a), and Abney's cascaded finite-state parser, CASS (Abney, 1996). Finally, we have implemented our own relation extractor, based on Grefenstette's SEXTANT (Grefenstette, 1994), which we describe below as an example of the NLP systems used to extract relations from the raw text.</Paragraph> <Paragraph position="3"> Processing begins with POS tagging and NP/VP chunking using a Naïve Bayes classifier trained on the Penn Treebank. Noun phrases separated by prepositions and conjunctions are then concatenated, and the relation attaching algorithm is run on the sentence. This involves four passes over the sentence, associating each noun with the modifiers and verbs from the syntactic contexts they appear in:
1. nouns with pre-modifiers (left to right)
2. nouns with post-modifiers (right to left)
3. verbs with subjects/objects (right to left)
4. verbs with subjects/objects (left to right)
This results in relations representing the contexts:
1. term is the subject of a verb
2. term is the (direct/indirect) object of a verb
3. term is modified by a noun or adjective
4. term is modified by a prepositional phrase
The relation tuple is then converted to root form using the Sussex morphological analyser (Minnen et al., 2000) and the POS tags are stripped. The relations for each term are collected together, producing a context vector of attributes and their frequencies in the corpus. Figure 1 shows the most strongly weighted attributes and their frequencies for idea.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Experiments </SectionTitle> <Paragraph position="0"> Our experiments use a large quantity of text which we have grouped into a range of corpus sizes. The approximately 300 million word corpus is a random conflation of the BNC and the Reuters corpus (respective sizes in Table 1). We then create corpus subsets down to 1/128th (2.3 million words) of the original corpus by random sentence selection.</Paragraph> <Paragraph position="1"> Ensemble voting methods for this task are quite interesting because the result consists of an ordered set of extracted synonyms rather than a single class label. To test for subtle ranking effects we implemented three different methods of combination: MEAN, the mean rank of each term over the ensemble; HARMONIC, the harmonic mean of each term's ranks; and MIXTURE, ranking based on the mean similarity score for each term. The individual extractor scores are not normalised because each extractor uses the same similarity measure and weight function.</Paragraph> <Paragraph position="2"> We assigned a rank of 201 and a similarity score of zero to terms that did not appear in the 200 synonyms returned by an individual extractor. Finally, we build ensembles from all the available extractor methods (e.g. MEAN(*)) and from the top three performing extractors (e.g. MEAN(3)). The three combination schemes are sketched below.</Paragraph>
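The paper gives no code for these schemes, so the following is a minimal Python sketch of the three combination methods under the conventions just stated (rank 201 and score 0.0 for terms absent from an extractor's 200 synonyms). All function and variable names are our own, not the paper's.

# Minimal sketch of the MEAN, HARMONIC and MIXTURE combination
# schemes. Each extractor output is an ordered list of
# (synonym, similarity_score) pairs, best first.

DEFAULT_RANK = 201   # rank for terms missing from an extractor's top 200
DEFAULT_SCORE = 0.0  # similarity score for missing terms

def combine(entries, method="mean"):
    """Combine ranked synonym lists from several extractors into one."""
    indexed = []        # per-extractor lookup: synonym -> (rank, score)
    candidates = set()  # every synonym proposed by any extractor
    for entry in entries:
        table = {syn: (rank, score)
                 for rank, (syn, score) in enumerate(entry, start=1)}
        indexed.append(table)
        candidates.update(table)

    def key(syn):
        pairs = [t.get(syn, (DEFAULT_RANK, DEFAULT_SCORE)) for t in indexed]
        ranks = [r for r, _ in pairs]
        scores = [s for _, s in pairs]
        if method == "mean":      # MEAN: mean rank (lower is better)
            return sum(ranks) / len(ranks)
        if method == "harmonic":  # HARMONIC: harmonic mean of the ranks
            return len(ranks) / sum(1.0 / r for r in ranks)
        if method == "mixture":   # MIXTURE: mean score, negated so that
            return -sum(scores) / len(scores)  # ascending sort puts best first
        raise ValueError(method)

    return sorted(candidates, key=key)

A call such as combine([sextant, minipar, w_l1r1], method="mixture") would then produce a MIXTURE(3)-style ranking; here sextant, minipar and w_l1r1 are hypothetical variables holding each extractor's 200-synonym output.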
<Paragraph position="3"> To measure the complementary disagreement between ensemble constituents we calculated both the complementarity C and the Spearman rank-order correlation Rs.</Paragraph> <Paragraph position="5">
R_s = \frac{\sum_x \left(r_A(x) - \bar{r}_A\right)\left(r_B(x) - \bar{r}_B\right)}{\sqrt{\sum_x \left(r_A(x) - \bar{r}_A\right)^2 \, \sum_x \left(r_B(x) - \bar{r}_B\right)^2}} \quad (2)
where r(x) is the rank of synonym x. The Spearman rank-order correlation coefficient is the linear correlation coefficient between the rankings of elements of A and B. Rs is a useful non-parametric comparison when the rank order is more relevant than the actual values in the distribution.</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Evaluation </SectionTitle> <Paragraph position="0"> The evaluation is performed on thesaurus entries extracted for 70 single-word noun terms. To avoid sample bias, the words were randomly selected from WordNet such that they covered a range of values for the following word properties: frequency: Penn Treebank and BNC frequencies; number of senses: WordNet and Macquarie senses; specificity: depth in the WordNet hierarchy; concreteness: distribution across WordNet subtrees.</Paragraph> <Paragraph position="1"> Table 2 shows some of the selected terms with frequency and synonym set information. For each term we extracted a thesaurus entry with 200 potential synonyms and their similarity scores.</Paragraph> <Paragraph position="2"> The simplest evaluation measure is direct comparison of the extracted thesaurus with a manually-created gold standard (Grefenstette, 1994). However, on smaller corpora direct matching is often too coarse-grained and thesaurus coverage is a problem. To help overcome limited coverage, our evaluation uses a combination of three electronic thesauri: the topic-ordered Macquarie (Bernard, 1990) and Roget's (Roget, 1911) thesauri and the head-ordered Moby (Ward, 1996) thesaurus. Since the extracted thesaurus does not separate senses, we transform Roget's and Macquarie into head-ordered format by collapsing the sense sets containing the term. For the 70 terms we create a gold standard from the union of the synonym lists of the three thesauri, resulting in a total of 23,207 synonyms.</Paragraph> <Paragraph position="3"> With this gold standard resource in place, it is possible to use precision and recall measures to evaluate the quality of the extracted thesaurus. To help overcome the problems of coarse-grained direct comparisons we use several measures of system performance: direct matches (DIRECT), inverse rank (INVR), and top n synonyms precision (P(n)).</Paragraph> <Paragraph position="4"> INVR is the sum of the inverse rank of each matching synonym, e.g. gold standard matches at ranks 3, 5 and 28 give an inverse rank score of 1/3 + 1/5 + 1/28 ≈ 0.569; since each extracted entry contains 200 synonyms, the maximum INVR score is 5.878. Top n precision is the percentage of matching synonyms in the top n extracted synonyms. We use n = 1, 5 and 10. These measures are sketched in code below.</Paragraph> </Section>
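A minimal Python sketch of the three measures, assuming an extracted entry is an ordered list of synonyms (best first) and the gold standard is a set of synonyms; the function names are illustrative, not the paper's.

# Sketch of the evaluation measures: direct matches (DIRECT),
# inverse rank (INVR) and top-n precision (P(n)).

def direct(extracted, gold):
    """DIRECT: number of extracted synonyms found in the gold standard."""
    return sum(1 for syn in extracted if syn in gold)

def invr(extracted, gold):
    """INVR: sum of the inverse ranks of the matching synonyms."""
    return sum(1.0 / rank
               for rank, syn in enumerate(extracted, start=1)
               if syn in gold)

def precision_at(extracted, gold, n):
    """P(n): percentage of the top n extracted synonyms that match."""
    return 100.0 * sum(1 for syn in extracted[:n] if syn in gold) / n

# The worked example from the text: matches at ranks 3, 5 and 28 give
# INVR = 1/3 + 1/5 + 1/28 ~ 0.569. If all 200 extracted synonyms
# matched, INVR would reach the 200th harmonic number,
# sum(1/i for i in 1..200) ~ 5.878.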
<Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 6 Results </SectionTitle> <Paragraph position="0"> Figure 2 shows the performance trends for the individual extractors on corpora ranging from 2.3 million up to 300 million words. The best individual context extractors are SEXTANT, MINIPAR and W(L1R1); these three extractors are combined to form the top-three ensemble. CASS and the other window methods perform significantly worse than SEXTANT and MINIPAR. Interestingly, W(L1R1*) performs almost as well as W(L1R1) on larger corpora, suggesting that position information is not as useful with large corpora, perhaps because the left and right sets of words for each term become relatively disjoint.</Paragraph> <Paragraph position="1"> Table 3 presents the evaluation results for all the individual extractors and the six ensembles on the full corpus. At 300 million words all of the ensemble methods outperform the individual extractors. These results disagree with those Banko and Brill (2001) obtained for confusion set disambiguation. The best performing ensembles, MIXTURE(*) and MEAN(*), combine the results from all of the individual extractors. MIXTURE(*) performs approximately 5% better than SEXTANT, the best individual extractor.</Paragraph> <Paragraph position="2"> Figure 3 compares the performance behaviour over the range of corpus sizes for the best three individual methods and the full ensembles. SEXTANT is the only competitive individual method as the corpus size increases. Figure 3 shows that ensemble methods are of more value (at least in percentage terms) for smaller training sets. The trend in the graph suggests that the individual extractors will not outperform the ensemble methods, unless the behaviour changes as corpus size is increased further.</Paragraph> <Paragraph position="3"> From Table 3 we can also see that full ensembles, combining all the individual extractors, outperform ensembles combining only the top three extractors.</Paragraph> <Paragraph position="4"> This seems rather surprising at first, given that the other individual extractors seem to perform significantly worse than the top three. It is interesting to see how the weaker methods still contribute to the ensembles' performance.</Paragraph> <Paragraph position="5"> Firstly, for thesaurus extraction, there is no clear concept of accuracy greater than 50%, since it is not a simple classification task. So, although most of the evaluation results are significantly less than 50%, this does not represent a failure of a necessary condition of ensemble improvement. If we constrain thesaurus extraction to selecting a single synonym classification using the P(1) scores, then all of the methods achieve 50% or greater accuracy. Considering the complementarity and rank-order correlation coefficients for the constituents of the different ensembles proves to be more informative. Table 4 shows these values for the smallest and largest corpora and Table 5 shows the pairwise complementarity for the ensemble constituents.</Paragraph> <Paragraph position="6"> It turns out that the average Spearman rank-order correlation is not sensitive enough to errors for the purposes of comparing favourable disagreement within ensembles. However, the average complementarity clearly shows the convergence of the ensemble constituents, which partially explains the reduced efficacy of ensemble methods for large corpora. Since the top-three ensembles suffer this convergence to a greater degree, they perform significantly worse at 300 million words. Further, the full ensembles can amortise the individual biases better since they average over a larger number of ensemble methods with different biases.</Paragraph> </Section>
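The convergence analysis above rests on averaging pairwise comparisons between the constituents' rankings, as reported in Tables 4 and 5. The Python sketch below computes both statistics; since the exact formula for C is not reproduced in this section, the complementarity function assumes a simple set-overlap definition of disagreement, and all names are our own.

from itertools import combinations
from scipy.stats import spearmanr

def complementarity(a, b):
    # Assumed definition: the percentage of synonyms returned by A
    # that B does not also return. An illustrative stand-in, not
    # necessarily the paper's exact formula for C.
    a, b = set(a), set(b)
    return 100.0 * (1.0 - len(a & b) / len(a))

def spearman(a, b, default_rank=201):
    # Spearman Rs between two ranked synonym lists, assigning the
    # default rank of 201 (Section 4) to terms missing from a list.
    candidates = sorted(set(a) | set(b))
    rank_a = {syn: r for r, syn in enumerate(a, start=1)}
    rank_b = {syn: r for r, syn in enumerate(b, start=1)}
    xs = [rank_a.get(syn, default_rank) for syn in candidates]
    ys = [rank_b.get(syn, default_rank) for syn in candidates]
    rho, _pvalue = spearmanr(xs, ys)
    return rho

def average_pairwise(rankings, measure):
    # Average a pairwise measure over all constituent pairs.
    pairs = list(combinations(rankings, 2))
    return sum(measure(a, b) for a, b in pairs) / len(pairs)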
<Section position="8" start_page="0" end_page="0" type="metho"> <SectionTitle> 7 Analysis </SectionTitle> <Paragraph position="0"> Understanding ensemble behaviour on very large corpora is important because ensemble classifiers are state of the art for many NLP tasks. This section explores possible explanations for why our results disagree with Banko and Brill (2001).</Paragraph> <Paragraph position="1"> Thesaurus extraction and confusion set disambiguation are quite different tasks. In thesaurus extraction, contextual information is collected from the entire corpus into a single description of the environments that each term appears in, and classification, as such, involves comparing these collections of data.</Paragraph> <Paragraph position="2"> In confusion set disambiguation, on the other hand, each instance must be classified individually with only a limited amount of context. The disambiguator has far less information available to determine each classification. This has implications for representation sparseness and noise that a larger corpus helps to overcome, which, in turn, affects the performance of ensemble methods against individual classifiers.</Paragraph> <Paragraph position="3"> The complexity of the contextual representation and the strength of the correlation between target term and context also play a significant role.</Paragraph> <Paragraph position="4"> Curran and Moens (2002b) have demonstrated that more complex and constrained contexts can yield superior performance, since the correlation between context and target term is stronger than for simple window methods. Further, structural and grammatical relation methods can encode extra syntactic and semantic information in the relation type. Although the contextual representation is less susceptible to noise, it is often sparse because fewer context relations are extracted from each sentence.</Paragraph> <Paragraph position="5"> The less complex window methods exhibit the opposite behaviour. Depending on the window parameters, the context relations can be poorly correlated with the target term, and so we find a very large number of irrelevant relations with low and unstable frequency counts, that is, a noisy contextual representation. Since confusion set disambiguation uses limited contexts from single occurrences, it is likely to suffer the same problems as the window thesaurus extractors. The sketch below illustrates how such window contexts are extracted.</Paragraph>
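For concreteness, here is a minimal Python sketch of a window context extractor in the style of the W(LnRm) methods of Section 3; the function name and the tuple representation are our own. With positions=True each relation records the side and distance of the context word, as in W(L1R1); with positions=False it does not, as in the starred variants such as W(L1R1*).

from collections import Counter

def window_contexts(tokens, left=1, right=1, positions=True):
    """Count (term, relation) context pairs over a token sequence."""
    contexts = Counter()
    for i, term in enumerate(tokens):
        for offset in range(-left, right + 1):
            j = i + offset
            if offset == 0 or j < 0 or j >= len(tokens):
                continue  # skip the term itself and sequence edges
            if positions:
                side = "L" if offset < 0 else "R"
                contexts[(term, (side, abs(offset), tokens[j]))] += 1
            else:
                contexts[(term, tokens[j])] += 1
    return contexts

Widening the window (e.g. W(L2R2)) adds more attributes per occurrence, but each extra word is more weakly correlated with the target term, which is exactly the source of noise described above.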
<Paragraph position="6"> To evaluate an ensemble's ability to reduce the data sparseness and noise problems suffered by different context models, we constructed ensembles based on context extractors with different levels of complexity and constraints.</Paragraph> <Paragraph position="7"> Table 6 shows the performance on the full corpus for the three syntactic extractors, the top three performing extractors and their corresponding mean rank ensembles. For these more complex and constrained context extractors, the ensembles continue to outperform individual learners, since the context representations are still reasonably sparse. The average complementarity is greater than 50%.</Paragraph> <Paragraph position="8"> Table 7 shows the performance on the full corpus for a wide range of window-based extractors and corresponding mean rank ensembles. Most of the individual learners perform poorly because the extracted contexts are only weakly correlated with the target terms. Although the ensembles perform better than most individuals, they fail to outperform the best individual on direct match evaluation. Since the average complementarity for these ensembles is similar to the methods above, we must conclude that this is a result of the individual methods themselves. In this case, the most correlated context extractor, e.g. W(L1R1), extracts a relatively noise-free representation which performs better than amortising the bias of the other noisy ensemble constituents.</Paragraph> <Paragraph position="9"> Finally, confusion set disambiguation yields a single classification from a small set of classes, whereas thesaurus extraction yields an ordered set containing every potential synonym. The more flexible set of ranked results allows ensemble methods to exhibit more subtle variations in rank than simply selecting a single class.</Paragraph> <Paragraph position="10"> We can contrast the two tasks using the single synonym, P(1), and rank-sensitive, INVR, evaluation measures. The results for P(1) do not appear to form any trend, although they show that ensemble methods do not always improve single class selection. However, if we consider the INVR measure, all of the ensemble methods outperform their constituent methods, and we see a significant improvement of approximately 10% with the MEAN(3) ensemble.</Paragraph> </Section> </Paper>