<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-0863">
  <Title>Joining forces to resolve lexical ambiguity: East meets West in Barcelona</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Experimental Features
</SectionTitle>
    <Paragraph position="0"> A full description of the experimental features for all four tasks can be found in the report submitted by the Swarthmore College Senseval team (Wicentowski et al., 2004). Briefly, the systems used lexical and syntactic features in the context of the target word: The &amp;quot;bag of words (and lemmas)&amp;quot; in the context of the ambiguous word.</Paragraph>
    <Paragraph position="1"> Bigrams and trigrams of words (and lemmas,  part-of-speech tags, and, for Basque, case information) surrounding the ambiguous word.</Paragraph>
    <Paragraph position="2"> The topic (or code) of the document containing the current instance of the word was extracted. (Basque and Catalan only.) These features have been shown to be effective in previous WSD research. Since our systems were all supervised, all the data used was provided by the Senseval organizers; no additional (unlabeled) data was included.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Methodology
</SectionTitle>
    <Paragraph position="0"> The systems that were constructed by this team included two component models: a boosting model and a maximum entropy model as well as a combination system. The component models were also used in other Senseval-3 tasks: Semantic Role Labeling (Ngai et al., 2004) and the lexical sample tasks for Chinese and English, as well as the Multi-lingual task (Carpuat et al., 2004).</Paragraph>
    <Paragraph position="1"> To perform parameter tuning for the two component models, 20% of the samples from the training set were held out into a validation set. Since we did not expect the senses of different words to share any information, the training data was partitioned by the ambiguous word in question. A model was then trained for each ambiguous word type. In total, we had 40 models for Basque, 27 models for Catalan, 45 models for Italian and 39 models for Romanian.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Boosting
</SectionTitle>
      <Paragraph position="0"> Boosting is a powerful machine learning algorithm which has been shown to achieve good results on a variety of NLP problems. One known property of boosting is its ability to handle large numbers of features. For this reason, we felt that it would be well suited to the WSD task, which is known to be highly lexicalized with a large number of possible word types.</Paragraph>
      <Paragraph position="1"> Our system was constructed around the Boostexter software (Schapire and Singer, 2000), which implements boosting on top of decision stumps (deci-Association for Computational Linguistics for the Semantic Analysis of Text, Barcelona, Spain, July 2004 SENSEVAL-3: Third International Workshop on the Evaluation of Systems sion trees of one level), and was originally designed for text classification.</Paragraph>
      <Paragraph position="2"> Tuning a boosting system mainly lies in modifying the number of iterations, or the number of base models it would learn. Larger number of iterations contribute to the boosting model's power. However, they also make it more prone to overfitting and increase the training time. The latter, a simple disadvantage in another problem, becomes a real issue for Senseval, since large numbers of models (one for each word type) need to be trained in a short period of time.</Paragraph>
      <Paragraph position="3"> Since the available features differed from language to language, the optimal number of iterations also varied. Table 1 shows the performance of the model on the validation set with respect to the number of iterations per language.</Paragraph>
      <Paragraph position="4">  The final systems for the languages used 2000 iterations for Basque and Catalan and 500 iterations for Italian and Romanian. The test set results are shown in Table 4</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Maximum Entropy
</SectionTitle>
      <Paragraph position="0"> The other individual system was based on the maximum entropy model, another machine learning algorithm which has been successfully applied to many NLP problems. Our system was implemented on top of the YASMET package (Och, 2002).</Paragraph>
      <Paragraph position="1"> Due to lack of time, we did not manage to fine-tune the maximum entropy model. The YASMET package does provide a number of easily variable parameters, but we were only able to try varying the feature selection count threshold and the smoothing parameter, and only on the Basque data.</Paragraph>
      <Paragraph position="2"> Experimentally, however, smoothing did not seem to make a difference. The only change in performance was caused by varying the feature selection count threshold, which controls the number of times a feature has to be seen in the training set in order to be considered. Table 2 shows the performances of the system on the Basque validation set, with count thresholds of 0, 1 and 2.</Paragraph>
      <Paragraph position="3"> Since word sense disambiguation is known to be  idation set.</Paragraph>
      <Paragraph position="4"> a highly lexicalized task involving many feature values and sparse data, it is not too surprising that setting a low threshold of 1 proves to be the most effective. The final system kept this threshold, smoothing was not done and the GIS iterations allowed to proceed until it converged on its own. These parameters were used for all four languages.</Paragraph>
      <Paragraph position="5"> The maximum entropy model was not entered into the competition as an official contestant; however, it did participate in the combined system.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Combined System
</SectionTitle>
      <Paragraph position="0"> Ensemble methods have been widely studied in NLP research, and it is well-known that a set of systems will often combine to produce better results than those achieved by the best individual system alone. The final system contributed by the Swarthmore-Hong Kong team was such an ensemble. In addition to the boosting and maximum entropy models mentioned earlier, three other models were included: a nearest-neighbor clustering model, a decision list, and a Na&amp;quot;ive Bayes model. The five models were then combined by a simple weighted majority vote, with an ad-hoc weight of 1.1 given to the boosting and decision lists systems, and 1.0 otherwise, with ties broken arbitrarily.</Paragraph>
      <Paragraph position="1"> Due to an unfortunate error with the input data of the voting algorithm (Wicentowski et al., 2004), the official submitted results for the combined system were poorer than they should have been. Table 3 compares the official (submitted) results to the corrected results on the test set. The decrease in performance caused by the error ranged from 0.9% to 3.3%.</Paragraph>
      <Paragraph position="2">  cial contestants are in bold; corrected voting results are in parentheses. Key: NB: Na&amp;quot;ive Bayes, NNC: Nearest-Neighbor Clustering, DL: Decision List</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Test Set Results
</SectionTitle>
    <Paragraph position="0"> Final results from all the systems are shown in Table 4. As a reference, the results of a simple base-line system which assigns the most frequent sense as seen in the training set is also provided.</Paragraph>
    <Paragraph position="1"> Due to the error in the voting system, the official results for the combination system were lower than they should have been -- as a result, boosting was officially the top ranked system for 3 of the 4 languages. With the corrected results, however, the combined system outperforms the individual models, as expected. The only exception is Basque, where the booster had an exceptionally strong performance. This is probably due to the fact that Basque has a much richer feature set than the other languages, which boosting was better able to take advantage of.</Paragraph>
    <Paragraph position="2"> The poor performance of the maximum entropy model was also unexpected at first; however, it is perhaps not too surprising, given the lack of time spent on fine-tuning the model. As a result, most of the parameters were left at their default values.</Paragraph>
    <Paragraph position="3"> One thing worth noting is the fact that the systems were combined as &amp;quot;closed systems&amp;quot; -- i.e. all that was known about them was the output result, and nothing else. The result was that no confidence measures from the boosting and maximum entropy could be used in the combined system. It is likely that the performance could have been further improved if more information had been available.</Paragraph>
  </Section>
class="xml-element"></Paper>