<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-0862">
  <Title>The Swarthmore College SENSEVAL3 System</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Systems
</SectionTitle>
    <Paragraph position="0"> The following systems were used to complete the Basque, Catalan, Italian and Romanian lexical sample tasks. The Spanish lexical sample task was completed before the other four tasks were begun and used only a subset of the systems presented below.</Paragraph>
    <Paragraph position="1"> Full details on the systems and methods used for the Spanish lexical sample task can be found in Section 7.3.</Paragraph>
    <Paragraph position="2"> See Section 4 for details on the classifier combination, and Section 5.2 for information about our use of bagging.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Cosine-based Clustering
</SectionTitle>
      <Paragraph position="0"> The first system developed was a nearest-neighbor clustering method which used the cosine similarity of feature vectors as the distance metric. A centroid was created for each attested sense in the training data, and each test sample was assigned to a cluster based on its similarity to the existing centroid. Centroids were not recalculated after each added test instance. null</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Naïve Bayes
</SectionTitle>
      <Paragraph position="0"> The second system used was a na&amp;quot;ive Bayes classifier where the similarity between an instance, I, and a sense class, Sj, is defined as:</Paragraph>
      <Paragraph position="2"> We then choose the sense class, Sj, which maximized the similarity function above, making standard independence assumptions.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Decision List
</SectionTitle>
      <Paragraph position="0"> The final system was a decision list classifier that found the log-likelihoods of the correspondence be-</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Association for Computational Linguistics
</SectionTitle>
      <Paragraph position="0"> for the Semantic Analysis of Text, Barcelona, Spain, July 2004 SENSEVAL-3: Third International Workshop on the Evaluation of Systems tween features and senses, using plus-one smoothing (Yarowsky, 1994). The features were ordered from most to least indicative to form the decision list. A separate decision list was constructed for each set of lexical samples in the training data. For each test instance, the first matching feature found in the associated decision list was used to determine the classification of the instance. Instances which failed to match any rule in the decision list were labeled with the most frequent sense, as calculated from the training data.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="1" type="metho">
    <SectionTitle>
4 Classifier Combination
</SectionTitle>
    <Paragraph position="0"> Due to time constraints, we were unable to get cross-validation results for all of the systems we created, and therefore all of the final classifier combination was done using simple majority vote, breaking ties arbitrarily. To reach a consensus vote, we combined the multiple decision list systems, which had been run on each of the different subsets of extracted features, into a single system. We then did the same for the clustering system and the na&amp;quot;ive Bayes system, yielding a total of three new systems. These three systems were then voted together to form the final system. The two-tiered voting was performed to ensure equal voting in the case of our joint work (Wicentowski et al., 2004) where the five systems that needed to be combined were run on different numbers of feature subsets.</Paragraph>
    <Section position="1" start_page="0" end_page="1" type="sub_section">
      <SectionTitle>
4.1 Combination Errors
</SectionTitle>
      <Paragraph position="0"> There were two mistakes we made when voting our systems together. We caught one mistake after the submission deadline but before notification of results; the other we realized only while evaluating our systems after receiving our results. For this reason, there are three sets of results that we will report here: * Official results: The results from our submission to SENSEVAL3.</Paragraph>
      <Paragraph position="1"> * Unofficial results: Includes a bug-fix found before notification of competition results.</Paragraph>
      <Paragraph position="2"> * Tie-breaking results: Includes a bug-fix found after notification of results.</Paragraph>
      <Paragraph position="3"> In doing the evaluation of our system for this paper, we will use the unofficial results1. Because of the nature of the bug-fix, evaluating our system based on the official results will yield less informative results than an evaluation of results after fixing 1As mentioned previously, Spanish is a special case and we will report only our official results.</Paragraph>
      <Paragraph position="4"> the error. Since these unofficial results were obtained before notification of results from the competition organizers, we believe this to be a fair comparison. null 4.1.1 Over-weighting part-of-speech n-grams The bug which yielded our unofficial results occurred when we combined the multiple decision list systems into a single decision list system (and similarly for the multiple clustering and na&amp;quot;ive Bayes systems). As discussed in Section 2, we experimented with forming partial labels for the part-of-speech tags to reduce the sparse-data issues: using the full part-of-speech tag, using only the first letter of the tag, and using the first two letters of the tag. However, in the final combination, we ended up including all three methods in the voting, instead of including only one. Obviously, these three classifiers, based solely on part-of-speech n-grams around the target word, had a high rate of agreement and were therefore over-weighted in the final voting. Our systems underperformed where they should have, with the surprising exception of Catalan, which performed better with the mistake than without it. Table 1 compares our official results with our unofficial results.</Paragraph>
      <Paragraph position="5">  from making a bug-fix before notification of results, but after the submission deadline.</Paragraph>
      <Paragraph position="6">  Our classifier combination used a non-informed method for breaking ties: whichever sense had the first hash code (as determined by Perl's hash function) was chosen. Our inability to complete cross-validation experiments led us to not favor any one classifier over another. Performance would have been improved by using an ad-hoc weighting scheme which took into account the following intuitions: null * Initial experiments indicated that the instances of the classifiers with access to the full set of features would outperform the instances running on limited subsets of the features.</Paragraph>
      <Paragraph position="7"> * Empirical evidence suggested that the decision list classifier was the best, the clustering method a strong second, and the na&amp;quot;ive Bayes method a distant third.</Paragraph>
      <Paragraph position="8"> In fairness, we did not discover this mistake until we were preparing this paper, only after receiving notification of our results. While we report our revised results, we make no further comparisons based on these results. In addition, we ran no extra experiments to determine the weighting scheme listed below, we simply used our intuition based on our earlier experimentation as noted above. These intuitions were not always correct, as indicated in Table 5 and Table 6.</Paragraph>
      <Paragraph position="9"> Using very simple ad-hoc weights which back up these intuitions, we changed our classifier combination system to break ties according to the following scheme: In the first tier of voting, we fractionally increased the weight given to the classifiers run on the full-feature set: instead of each system receiving exactly one vote, we gave those systems an extra  10th of a vote. In the second tier of voting, we madethe same fractional increase to the weight given to the decision list classifier. Use of this tie breaking scheme increases our results impressively, as shown  second column also includes the bug fix described in SS4.1.1. Note that the tie-breaking error was found after notification of our final results.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="1" end_page="1" type="metho">
    <SectionTitle>
5 Additional features
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
5.1 Collocational Senses
</SectionTitle>
      <Paragraph position="0"> In the Basque and Romanian tasks, senses could be labeled either as numbered alternatives or as a collocational sense. For example, the Basque wordastuncould be labeled with the collocational sense pisu astun.</Paragraph>
      <Paragraph position="1"> From the SENSEVAL2 English lexical-sample task, we found there were 175 words labeled with a collocational sense. A lemmatized form of the collocation was found in 96.6% of these when considering a +-2-word window around the target. To take advantage of this expected behavior in Basque and Romanian, we labeled a target word with a collocational sense if we found the lemmatized collocation in a +-2-word window. In Romanian, many collocations contained prepositions or other stop-words; therefore, we labeled a target word with the collocational sense only if a non-stop-word from the collocation was found in the +-2-word window. Overall, as shown in Table 3, this decision proved to be reasonably effective.</Paragraph>
      <Paragraph position="2">  Complementary to this issue, a sampling of the same English data indicated that if a target word was part of a previously seen collocation, it was highly unlikely that this word would not be tagged with the collocational sense. Therefore, we expected it would be advantageous if we could remove the collocational senses from the training data to prevent target words which were not part of collocations from being tagged as such. Based on cross-validated results, we found that this was worthwhile for Basque, but not for Romanian, where there were many examples of a target word being tagged as collocational sense without the collocation being present.</Paragraph>
    </Section>
    <Section position="2" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
5.2 Bagging
</SectionTitle>
      <Paragraph position="0"> For the decision list and clustering systems, we used bagging (Breiman, 1996) to train on five randomly sampled instances of the training data which were combined using a simple majority vote. We limited ourselves to five samples due to time limitations imposed by the competition. We found a consistent, but minimal, improvement for each of the four tasks due to our use of bagging, as shown below in Table 4.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>