<?xml version="1.0" standalone="yes"?>
<Paper uid="P04-1080">
  <Title>Learning Word Senses With Feature Selection and Order Identification Capabilities</Title>
  <Section position="5" start_page="0" end_page="0" type="evalu">
    <SectionTitle>
3 Experiments and Evaluation
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Test data
</SectionTitle>
      <Paragraph position="0"> We constructed four datasets from the hand-tagged corpus by randomly selecting 500 instances for each ambiguous word: "hard", "interest", "line", and "serve". The details of these datasets are given in Table 2. Our preprocessing included lowercasing, discarding all words that contain digits or non-alphanumeric characters, removing words on a stop word list, and filtering out low-frequency words that appeared only once in the entire set. We did not use a stemming procedure.</Paragraph>
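      <Paragraph> As a rough illustration (not the authors' code), the preprocessing steps above can be sketched in Python; the stop word list here is a hypothetical input supplied by the caller:

```python
# Rough sketch of the preprocessing described above (not the authors' code).
# `stop_words` is a hypothetical stop list supplied by the caller.
from collections import Counter

def preprocess(tokens, stop_words):
    toks = [t.lower() for t in tokens]               # lowercase
    toks = [t for t in toks if t.isalpha()]          # drop words with digits / non-alphanumerics
    toks = [t for t in toks if t not in stop_words]  # remove stop words
    freq = Counter(toks)
    return [t for t in toks if freq[t] != 1]         # filter words appearing only once
```
</Paragraph>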
      <Paragraph position="1"> The sense tags were removed when the data were used by FSGMM and CGD. In the evaluation procedure, these sense tags were used as ground truth classes.</Paragraph>
      <Paragraph position="2"> A second-order co-occurrence matrix for English words was constructed using the English version of Xinhua News (Jan. 1998 - Dec. 1999). The window size for counting second-order co-occurrences was 50 words.</Paragraph>
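      <Paragraph> A minimal sketch of the second-order representation, assuming a plain token stream: count first-order co-occurrences within a fixed window, then represent a context as the average of the co-occurrence vectors of its words. This is an illustration, not the authors' implementation.

```python
# Sketch: first-order co-occurrence counts within a fixed window,
# then second-order context vectors as averages of those rows.
from collections import defaultdict

def cooccurrence_counts(tokens, window=50):
    """Count how often each pair of words co-occurs within `window` tokens."""
    counts = defaultdict(lambda: defaultdict(int))
    for i, w in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[w][tokens[j]] += 1
    return counts

def second_order_vector(context_words, counts, vocab):
    """Average the co-occurrence vectors of the words in a context."""
    vec = [0.0] * len(vocab)
    for w in context_words:
        row = counts.get(w, {})
        for k, v in enumerate(vocab):
            vec[k] += row.get(v, 0)
    n = max(1, len(context_words))
    return [x / n for x in vec]
```
</Paragraph>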
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Evaluation method for feature selection
</SectionTitle>
      <Paragraph position="0"> To evaluate feature selection, we used the mutual information between a feature subset and the class label set to assess the importance of the selected subset. Our assessment measure is defined as M(T) = (1/|T|) Σ_{w∈T} Σ_{l∈L} p(w,l) log ( p(w,l) / (p(w) p(l)) ),</Paragraph>
      <Paragraph position="2"> where T ⊆ W is the feature subset to be evaluated, L is the class label set, p(w,l) is the joint distribution of the two variables w and l, and p(w) and p(l) are the marginal probabilities. p(w,l) is estimated from the co-occurrence of contextual words and labels from the class label set L. Intuitively, if M(T1) &gt; M(T2), then T1 is more important than T2, since T1 contains more information about L.</Paragraph>
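      <Paragraph> A minimal sketch of this measure, assuming the probability tables `joint`, `p_w`, and `p_l` (hypothetical inputs) have already been estimated from the labeled data:

```python
# Sketch of the mutual-information score M(T) described above.
# `joint[w][l]`, `p_w[w]`, `p_l[l]` are assumed pre-estimated probabilities.
import math

def mi_score(joint, p_w, p_l, subset):
    """Average mutual information between the features in T and the labels in L."""
    total = 0.0
    for w in subset:
        for l, p_wl in joint.get(w, {}).items():
            if p_wl:  # skip zero-probability cells
                total += p_wl * math.log(p_wl / (p_w[w] * p_l[l]))
    return total / max(1, len(subset))
```
</Paragraph>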
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Evaluation method for clustering result
</SectionTitle>
      <Paragraph position="0"> When assessing the agreement between the clustering result and the hand-tagged senses (ground truth classes) in the benchmark data, we encountered the difficulty that there was no sense tag for each cluster.</Paragraph>
      <Paragraph position="1"> Lange et al. (2002) defined a permutation procedure for calculating the agreement between two cluster memberships assigned by different unsupervised learners. In this paper, we applied their method to assign sense tags to only min(|U|, |C|) clusters by maximizing the accuracy, where |U| is the number of clusters and |C| is the number of ground truth classes. The underlying assumption is that each cluster is treated as a class, and no two clusters share the same class label. At most |C| clusters are assigned sense tags, since there are only |C| classes in the benchmark data.</Paragraph>
      <Paragraph position="2"> Given the contingency table Q between clusters and ground truth classes, each entry Q_{i,j} gives the number of occurrences that fall into both the i-th cluster and the j-th ground truth class. If |U| &lt; |C|, we constructed empty clusters so that |U| = |C|. Let Ω represent a one-to-one mapping function from C to U; that is, Ω(j1) ≠ Ω(j2) if j1 ≠ j2 and vice versa, for 1 ≤ j1, j2 ≤ |C|. Then Ω(j) is the index of the cluster associated with the j-th class. Searching for a mapping function that maximizes the accuracy of U can be formulated as Accuracy(U) = max_Ω Σ_{j=1}^{|C|} Q_{Ω(j),j} / Σ_{i,j} Q_{i,j}.</Paragraph>
      <Paragraph position="4"> In fact, Σ_{i,j} Q_{i,j} is equal to N, the number of occurrences of the target word in the test set.</Paragraph>
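      <Paragraph> Under the one-to-one mapping assumption above, the maximizing assignment can be found by brute force over permutations; this sketch is practical only because the sense inventories here are small (a Hungarian-algorithm solver would scale better):

```python
# Sketch: maximize accuracy over all one-to-one mappings from classes
# to clusters, given a square contingency table Q (pad empty clusters first).
from itertools import permutations

def best_accuracy(Q):
    """Q[i][j]: number of occurrences in cluster i and ground-truth class j."""
    n = len(Q)
    N = sum(sum(row) for row in Q)
    best = 0
    for perm in permutations(range(n)):  # perm[j] = cluster index for class j
        hits = sum(Q[perm[j]][j] for j in range(n))
        best = max(best, hits)
    return best / N
```
</Paragraph>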
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.4 Experiments and results
</SectionTitle>
      <Paragraph position="0"> For each dataset, we tested the following procedures: CGDterm: We implemented the context group discrimination algorithm. The top max(|W| × 20%, 100) words in the contextual word list were selected as features, using frequency-based or χ²-based ranking. Then k-means clustering was performed on the context vector matrix using normalized Euclidean distance. K-means clustering was repeated 5 times, and the best-quality partition was chosen as the final result. The number of clusters used by k-means was set to be identical to the number of ground truth classes. We tested CGDterm using various word vector weighting methods when deriving context vectors, e.g., binary, idf, and tf·idf.</Paragraph>
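      <Paragraph> The frequency-based ranking with the max(|W| × 20%, 100) cutoff can be sketched as follows; the χ²-based variant would rank by association with the sense labels instead of raw frequency. This is an illustrative sketch, not the authors' implementation.

```python
# Sketch of frequency-based feature ranking with the cutoff max(|W| * 20%, 100).
def select_top_features(word_counts):
    """word_counts: dict mapping a contextual word to its frequency."""
    ranked = sorted(word_counts, key=word_counts.get, reverse=True)
    k = max(int(len(ranked) * 0.2), 100)
    return ranked[:k]
```
</Paragraph>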
      <Paragraph position="1"> CGDSVD: The context vector matrix was derived using the same method as in CGDterm. Then k-means clustering was conducted on the latent semantic space transformed from the context vector matrix, using normalized Euclidean distance. Specifically, context vectors were reduced to 100 dimensions using SVD. If the dimension of a context vector was less than 100, all latent semantic vectors with non-zero eigenvalues were used for subsequent clustering. We also tested CGDSVD using different weighting methods, e.g., binary, idf, and tf·idf.</Paragraph>
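      <Paragraph> The SVD step can be sketched as below, assuming NumPy is available; dimensions with zero singular values are discarded, as described above.

```python
# Sketch of the SVD reduction: project context vectors onto at most `dim`
# latent dimensions, keeping only dimensions with non-zero singular values.
import numpy as np

def reduce_context_vectors(X, dim=100):
    """X: (n_contexts, n_features) matrix; returns the reduced representation."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    keep = min(dim, int(np.count_nonzero(s)))
    return U[:, :keep] * s[:keep]  # scale each latent axis by its singular value
```
</Paragraph>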
      <Paragraph position="2"> FSGMM: We performed cluster validation based feature selection in feature set used by CGD. Then Cluster algorithm was used to group target word's instances using Euclidean distance measure.</Paragraph>
      <Paragraph position="3"> The threshold in the feature subset search procedure was set to 0.90. The random splitting frequency was set to 10 for estimating the score of a feature subset. The initial subclass number was 20, and a full covariance matrix was used for parameter estimation of each subclass.</Paragraph>
      <Paragraph position="4"> To investigate the effect of context window size on the performance of the three procedures, we tested them using various window sizes: ±1, ±5, ±15, ±25, and all contextual words. The average length of the sentences in the 4 datasets is 32 words before preprocessing. Performance on each dataset was assessed by equation 19.</Paragraph>
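      <Paragraph> Extracting a symmetric context window of ±n words around a target occurrence can be sketched as:

```python
# Sketch: symmetric context window of n words on each side of the target.
def context_window(tokens, target_index, n):
    """Return up to n tokens on each side of tokens[target_index]."""
    lo = max(0, target_index - n)
    hi = min(len(tokens), target_index + n + 1)
    return tokens[lo:target_index] + tokens[target_index + 1:hi]
```
</Paragraph>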
      <Paragraph position="5"> The scores of the feature subsets selected by FSGMM and CGD are listed in Tables 3 and 4. The average accuracy of the three procedures with different feature ranking and weighting methods is given in Table 5. Each value is the average over 5 context window sizes and 4 datasets. Detailed results of the three procedures are shown in Figure 1. Several results should be noted specifically: from Tables 3 and 4, we can see that FSGMM achieved a better score on the mutual information (MI) measure than CGD in 35 out of 40 cases.</Paragraph>
      <Paragraph position="6"> This is evidence that our feature selection procedure can remove noise and retain important features. As shown in Table 5, with both χ²-based and frequency-based feature ranking, the FSGMM algorithm outperformed CGDterm and CGDSVD in average accuracy. Specifically, with χ²-based feature ranking, FSGMM attained 55.4% average accuracy, while the best average accuracies of CGDterm and CGDSVD were 40.9% and 51.3% respectively. With frequency-based feature ranking, FSGMM achieved 51.2% average accuracy, while the best average accuracies of CGDterm and CGDSVD were 45.1% and 50.2% respectively.</Paragraph>
      <Paragraph position="7"> The cluster numbers automatically estimated by FSGMM over the 4 datasets are given in Table 6.</Paragraph>
      <Paragraph position="8"> The estimated cluster number was 2 ~ 4 for "hard", 3 ~ 6 for "interest", 3 ~ 6 for "line", and 2 ~ 4 for "serve". Note that the estimated cluster number was less than the number of ground truth classes in most cases. There are several reasons for this phenomenon. First, the data are not balanced, which may prevent some important features from being retrieved; for example, the features corresponding to the fourth sense of "serve" and the sixth sense of "line" do not meet the selection criterion. Second, some senses cannot be distinguished using only bag-of-words information, because their difference lies in the syntactic information carried by features. For example, the third and sixth senses of "interest" may be distinguishable by the syntactic relations of feature words, while the bags of feature words occurring in their contexts are similar. Third, some senses are determined by global topics rather than local contexts; for example, it may be easier to distinguish the first and second senses of "interest" by global topics.</Paragraph>
      <Paragraph position="9"> Figure 2 shows the accuracy averaged over the three procedures in Figure 1 as a function of context window size for the 4 datasets. For "hard", performance dropped as the window size increased, and the best accuracy (77.0%) was achieved at window size 1. For "interest", sense discrimination did not benefit from a large window size, and the best accuracy (40.1%) was achieved at window size 5. For "line", accuracy dropped as the window size increased, and the best accuracy (50.2%) was achieved at window size 1. For "serve", performance benefited from a large window size, and the best accuracy (46.8%) was achieved at window size 15.</Paragraph>
      <Paragraph position="10"> Leacock et al. (1998) used a Bayesian approach for sense disambiguation of three ambiguous words, "hard", "line", and "serve", based on cues from topical and local context. They observed that local context was more reliable than topical context as an indicator of sense for the verb and the adjective, but slightly less reliable for the noun. Our result is consistent with theirs for "hard", but there are some differences for the verb "serve" and the noun "line". For "serve", the possible reason is that we do not use the positions of local words or part-of-speech information, which may deteriorate performance when local context (≤ 5 words) is used. For "line", the reason might lie in the feature subset, which is not good enough to provide an improvement when the context window size is no less than 5.</Paragraph>
    </Section>
  </Section>
</Paper>