<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-0827">
  <Title>GAMBL, Genetic Algorithm Optimization of Memory-Based WSD</Title>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Information sources
</SectionTitle>
    <Paragraph position="0"> Preprocessing. The training corpus is a concatenation of various sense-tagged English texts: it contains SemCor (included with WordNet 1.7.1), training and test data from the English lexical sample (LS) and all words (AW) tasks from previous SENSEVAL workshops, the line-, hard- and servecorpora, and the example sentences in WordNet 1.7.1. This corpus contains 4.494.909 tokens of which 555.269 are sense-tagged words.</Paragraph>
    <Paragraph position="1"> To this corpus, we add the training data from the SENSEVAL-3 English LS task, containing 7860 sense-tagged words. For the AW task, we simply append the LS training data after conversion of the verb's WordSmyth senses to WordNet 1.7.1 senses. For the LS task, however, we slightly change the design of the word expert module because (i) WordSmyth senses are used for the verbs, and (ii) for some words in the LS task, the sense distribution in our own training corpus is very different from the distribution in the LS training data - we did not want this difference to (heavily) influence the results.</Paragraph>
    <Paragraph position="2"> Figure 2 shows the word expert module used in the LS task: we first generate a sense prediction using classifier 1A, trained on our own training data using context keywords as features. This prediction becomes an extra feature in classifier 1B, also trained on our own training data but using local context as information source. Finally, the predictions of classifiers 1A and 1B become extra features for classifier 2: this classifier is trained on the LS training data, and uses local context for disambiguating senses.</Paragraph>
    <Paragraph position="3"> The test data in the English LS task contains 3944 words to be sense-tagged (57 unique word-lemma-POS-tag combinations), and in the English AW task 2041 words (1020 combinations). Training and test data are linguistically analyzed: first, we tokenize, POS-tag, and find chunks and grammatical relations in the data with a shallow parser, and then we lemmatize the data. These tools were developed locally.</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
[Figure 2: the word expert module for the lexical sample task. Classifier 1A (non-LS data) works on a binary keyword representation of the context; its prediction is sent as a feature to classifier 1B (non-LS data), which uses local context; keywords above threshold and the predictions of classifiers 1A and 1B are sent as features to classifier 2 (LS data), which uses local context; parameters are optimized heuristically.]
</SectionTitle>
    <Paragraph position="0"> In our training data we find 3433 word-lemma-POS-tag combinations that fulfil the word expert criteria: in the LS test data, these word experts cover all 57 word-lemma-POS-tag combinations, and in the AW test data they cover 596 combinations, or 1448 instances (70.95%).</Paragraph>
    <Paragraph position="1"> We will continue with a description of how we create local context feature vectors, and extract key-words to create binary feature vectors.</Paragraph>
    <Paragraph position="2"> Local context. The second classifier uses the immediate local context of a focus word-lemma-POS-tag combination to disambiguate its senses: the focus word itself, and the three words before and after it. For each of these seven words, we include in the feature vector the POS-tag and the chunk+relation-tag assigned to the word by the shallow parser. The chunk+relation-tag contains information on the basic phrase type of the word (nominal, verbal, prepositional), and for nominal phrases also information on the grammatical function (subject or object) of the phrase.</Paragraph>
    <Paragraph position="3"> We set the context window size to SS 3 for practical reasons: in the optimization step, we use a genetic algorithm for feature selection. This algorithm will determine which features from the context window will eventually be used in the classification step. Increasing the initial context window size, however, also increases the amount of computer time needed for the optimization step. Using a larger context window was computationally not feasible. null Finally, to these local context features, we add the prediction of the keywords-in-context classifier as an extra feature. We will now explain how we extract the keywords and how we generate predictions for our training items.</Paragraph>
    <Paragraph position="4"> Keywords in context. The first classifier of each word expert is trained on information about possible disambiguating keywords in a context of three sentences: the sentence in which the ambiguous word occurs, the previous sentence, and the following sentence. The method we use to extract the key-words for each sense is based on the work of Ng and Lee (1996). They determine the probability of a sense s of a focus lemma f given keyword k by dividing Ns;kloc (the number of occurrences of a possible local context keyword k with a particular focus word-lemma-POS-tag combination w with a particular sense s) by Nkloc (the number of occurrences of a possible local context keyword kloc with a particular focus word-lemma-POS-tag combination w regardless of its sense). In addition, we also take into account the frequency of a possible keyword in the complete training corpus Nkcorp:</Paragraph>
    <Paragraph position="6"> Words were selected as keywords for a sense if (i) they appeared at least three times in the context of that sense, and (ii) p(sjk) was higher than or equal to 0.001.</Paragraph>
    <Paragraph position="7"> To this collection of local context keywords we add possible disambiguating content words extracted from the WordNet sense definitions for each focus word-lemma-POS-tag combination. All the keywords are represented as binary features, of which the value is 1 if the keyword is present in the three-sentence-context, and 0 if not.</Paragraph>
    <Paragraph position="8"> For each training item in the word experts, we generate a keyword-based prediction. First, we split the complete set of training items for each word expert in ten folds of equal size. We then use nine folds to predict the sense of the remaining fold, after having found an optimal parameter setting for TIMBL with heuristic optimization on the nine folds.</Paragraph>
    <Paragraph position="9"> We repeat this procedure for each fold. Finally, for each training item, we append its keyword-based prediction to the local context feature vector.</Paragraph>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Training and optimization
</SectionTitle>
    <Paragraph position="0"> In previous work on memory-based WSD (Veenstra et al., 2000; Hoste et al., 2002) we showed that optimization of features and algorithm parameters for each word expert independently contributes considerably to accuracy. For classifier 1 in the AW task, and for classifiers 1A and 1B in the LS task, we heuristically determine the optimal algorithm parameter settings: we exhaustively try out all possible combinations of (a selection of) distance metrics, feature-weightings, number of nearest neighbors and nearest neighbor voting schemes, and retain the best result. The testing of one setting is done with ten-fold cross-validation.</Paragraph>
    <Paragraph position="1"> For classifier 2, we use a genetic algorithm (GA, e.g. (Goldberg, 1989)) to do joint parameter optimization and feature selection. We refer to (Daelemans et al., 2003a) for a discussion of the effect of joint parameter optimization and feature selection on accuracy of classifiers for NLP tasks. Joint feature selection and parameter optimization is an optimization problem which involves searching the space of all possible feature subsets and parameter settings to identify the combination that is optimal or near-optimal. Since exhaustive search in large search spaces is computationally not feasible in practice, a GA is a more realistic approach to search the space. Contrary to traditional hill-climbing approaches, such as backward selection, the GA explores different areas of the search space in parallel.</Paragraph>
    <Paragraph position="2"> For the experiments we use a generational GA implemented in the DeGA (Distributed Evaluation Genetic Algorithm) framework 2. We use the GA in its default settings. The GA optimization is performed using 10-fold cross-validation on the available training data. The resulting optimal settings are then applied to the test data. In the experiments, the individuals are represented as bit strings (Figure 3). Each individual contains particular values for all algorithm settings and for the selection of the features. For TIMBL, the large majority of these features control the use of a feature (ignore, or a distance metric) and are encoded in the chromosome as ternary alleles. At the end of the chromosome, the 5-valued weighting parameter and the 4-valued neighbor weighting parameter are encoded, together with the k parameter which controls the number of neighbors. The latter is encoded as a real value which represents the logarithm of the number of neighbors.</Paragraph>
    <Paragraph position="3"> We will now present the results of our WSD architecture on the LS and AW test sets.</Paragraph>
  </Section>
  <Section position="8" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Experimental results
</SectionTitle>
    <Paragraph position="0"> English lexical sample task. Table 1 presents the results of our WSD system for each word in the LS task, and our overall score (the opt column).</Paragraph>
    <Paragraph position="1"> We included the results of TIMBL with default settings (the def column) and the score of a statistical baseline (the maj column), which assigns the sense</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Parameters
</SectionTitle>
      <Paragraph position="0"> with the highest frequency in the training set to the test instances. For comparison, we also list ten-fold cross-validation results (with default and optimized settings) of the second classifier on the training set.</Paragraph>
      <Paragraph position="1"> Looking at the overall score, we see that TIMBL with default settings already outperforms the base-line with 5%, and that the TIMBL classifier optimized with the GA, improves our score even more with another 7%.</Paragraph>
      <Paragraph position="2"> For most words, the improvement after optimization with the genetic algorithm on the training set, also holds on the test set, though for 15 words, the optimal setting from the GA does not result in a better score than the default score. For four words, TIMBL and the GA cannot outperform the majority sense baseline. We do not yet know what causes TIMBL and the GA to perform badly, but a difference between the sense distributions in the training and test set might be a factor. The distribution of the majority sense in the training set of source is 48.4%, while in the test set this distribution increases to 62.6%. For important there is a similar increase: from 38.9% to 47.4%. However, sense distribution differences in training and test set cannot be the only cause, because for activate and lose there is no such difference between the sense distributions.</Paragraph>
      <Paragraph position="3"> Finally, Table 2 depicts the fine-grained classification accuracies of our system per POS in the LS task, again compared with the accuracies of the majority sense baseline and TIMBL with default settings. The classification accuracy for nouns and verbs is more or less the same as the overall score.</Paragraph>
      <Paragraph position="4"> Adjectives, however, seem to be the harder to classify for our system: the classification accuracy is 13% lower than the overall score. This could be related to the on average higher number of senses for the adjectives.</Paragraph>
      <Paragraph position="5"> English all words task. The last column of Table 3 presents our results on the AW test set: the results of the classifier optimized with the GA are compared with the results of TIMBL with default settings, and with a majority sense baseline, which</Paragraph>
    </Section>
  </Section>
  <Section position="9" start_page="0" end_page="0" type="metho">
    <SectionTitle>
TRAINING TEST
WORD EXPERT WORDS
</SectionTitle>
    <Paragraph position="0"> predicts for each word to be sense-tagged the sense that is listed in WordNet as the most frequent one.</Paragraph>
    <Paragraph position="1"> The first half of the table lists the results when we only take into account words for which a word expert is built. TIMBL with default settings cannot outperform the already strong baseline, but after optimization with the GA, we see a 4% improvement. Unfortunately, this increase is not as high as the performance boost we see in the ten-fold cross-validation results on the training set, listed in the first column of Table 3: there is a large increase of 12% after the optimization step.</Paragraph>
    <Paragraph position="2"> Words for which no word expert is built are tagged with their majority sense from WordNet.</Paragraph>
    <Paragraph position="3"> When we also take these words into account, we see similar results: again, default TIMBL cannot outperform the baseline, but GA optimization gives a 3% increase.</Paragraph>
  </Section>
class="xml-element"></Paper>
Download Original XML