<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-0812">
  <Title>Evaluating the Effectiveness of Ensembles of Decision Trees in Disambiguating Senseval Lexical Samples</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
2 Lexical Features
</SectionTitle>
    <Paragraph position="0"> Unigram features represent words that occur five or more times in the training examples associated with a given target word. A stop-list is used to eliminate high frequency function words as features.</Paragraph>
    <Paragraph position="1"> For example, if the target word is water and the training example is I water the flowering flowers, the unigrams water, flowering and flowers are evaluated as possible unigram features. No stemming or other morphological processing is performed, so flowering and flowers are considered as distinct unigrams. I and the are not considered as possible features since they are included in the stop-list.</Paragraph>
    <Paragraph position="2"> Bigram features represent two word sequences that occur two or more times in the training examples associated with a target word, and have a log-likelihood value greater than or equal to 6.635. This corresponds to a p-value of 0.01, which indicates that according to the log-likelihood ratio there is a 99% probability that the words that make up this bi-gram are not independent.</Paragraph>
    <Paragraph position="3"> If we are disambiguating channel and have the training example Go to the channel quickly, then the three bigrams Go to, the channel, and channel quickly will be considered as possible features. to the is not included since both words are in the stoplist. null Co-occurrence features are defined to be a pair of words that include the target word and another word within one or two positions. To be selected as a feature, a co-occurrence must occur two or more times in the lexical sample training data, and have a log-likelihood value of 2.706, which corresponds to a p-value of 0.10. A slightly higher p-value is used for the co-occurrence features, since the volume of data is much smaller than is available for the bigram features.</Paragraph>
    <Paragraph position="4"> If we are disambiguating art and have the training example He and I like art of a certain period,we evaluate I art, like art, art of, and art a as possible co-occurrence features.</Paragraph>
    <Paragraph position="5"> All of these features are binary, and indicate if the designated unigram, bigram, or co-occurrence appears in the context with the ambiguous word. Once the features are identified from the training examples using the methods described above, the decision tree learner selects from among those features to determine which are most indicative of the sense of the ambiguous word. Decision tree learning is carried out with the Weka J48 algorithm (Witten and Frank, 2000), which is a Java implementation of the classic C4.5 decision tree learner (Quinlan, 1986).</Paragraph>
  </Section>
class="xml-element"></Paper>