<?xml version="1.0" standalone="yes"?>
<Paper uid="P05-1049">
  <Title>Word Sense Disambiguation Using Label Propagation Based Semi-Supervised Learning</Title>
  <Section position="6" start_page="397" end_page="401" type="evalu">
    <SectionTitle>
4 Experiments and Results
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="397" end_page="398" type="sub_section">
      <SectionTitle>
4.1 Experiment Design
</SectionTitle>
      <Paragraph position="0"> For empirical comparison with SVM and bootstrapping, we evaluated LP on widely used benchmark corpora - &amp;quot;interest&amp;quot;, &amp;quot;line&amp;quot; 1 and the data in English lexical sample task of SENSEVAL-3 (including all  aged over 20 trials) and paired t-test results of SVM and LP on SENSEVAL-3 corpus with percentage of training set increasing from 1% to 100%. The lower table lists the official result of baseline (using most frequent sense heuristics) and top 3 systems in ELS task of SENSEVAL-3.</Paragraph>
      <Paragraph position="1">  We used three types of features to capture contextual information: part-of-speech of neighboring words with position information, unordered single words in topical context, and local collocations (as same as the feature set used in (Lee and Ng, 2002) except that we did not use syntactic relations). For SVM, we did not perform feature selection on SENSEVAL-3 data since feature selection deteriorates its performance (Lee and Ng, 2002). When running LP on the three datasets, we removed the features with occurrence frequency (counted in both training set and test set) less than 3 times.</Paragraph>
      <Paragraph position="2"> We investigated two distance measures for LP: cosine similarity and Jensen-Shannon (JS) divergence (Lin, 1991).</Paragraph>
      <Paragraph position="3"> For the three datasets, we constructed connected graphs following (Zhu et al., 2003): two instances u,v will be connected by an edge if u is among v's k nearest neighbors, or if v is among u's k nearest neighbors as measured by cosine or JS distance measure. For &amp;quot;interest&amp;quot; and &amp;quot;line&amp;quot; corpora, k is 10 (following (Zhu et al., 2003)), while for SENSEVAL-3 data, k is 5 since the size of dataset for each word in SENSEVAL-3 is much less than that of &amp;quot;interest&amp;quot; and &amp;quot;line&amp;quot; datasets.</Paragraph>
    </Section>
    <Section position="2" start_page="398" end_page="398" type="sub_section">
      <SectionTitle>
4.2 Experiment 1: LP vs. SVM
</SectionTitle>
      <Paragraph position="0"> In this experiment, we evaluated LP and SVM 3 on the data of English lexical sample task in SENSEVAL-3. We used l examples from training set as labeled data, and the remaining training examples and all the test examples as unlabeled data. For each labeled set size l, we performed 20 trials.</Paragraph>
      <Paragraph position="1"> In each trial, we randomly sampled l labeled examples for each word from training set. If any sense was absent from the sampled labeled set, we redid the sampling. We conducted experiments with different values of l, including 1%xNw,train, 10%x Nw,train, 25%xNw,train, 50%xNw,train, 75%x Nw,train, 100%xNw,train (Nw,train is the number of examples in training set of word w). SVM and LP were evaluated using accuracy 4 (fine-grained score) on test set of SENSEVAL-3.</Paragraph>
      <Paragraph position="2"> We conducted paired t-test on the accuracy figures for each value of l. Paired t-test is not run when percentage= 100%, since there is only one paired accuracy figure. Paired t-test is usually used to estimate the difference in means between normal populations based on a set of random paired observations. {[?], [?]}, {&lt;, &gt;}, and [?] correspond to p-value [?] 0.01, (0.01,0.05], and &gt; 0.05 respectively. [?] (or [?]) means that the performance of LP is significantly better (or significantly worse) than SVM. &lt; (or &gt;) means that the performance of LP is better (or worse) than SVM.[?]means that the performance of LP is almost as same as SVM.</Paragraph>
      <Paragraph position="3"> Table 1 reports the average accuracies and paired t-test results of SVM and LP with different sizes of labled data. It also lists the official results of baseline method and top 3 systems in ELS task of SENSEVAL-3.</Paragraph>
      <Paragraph position="4"> From Table 1, we see that with small labeled dataset (percentage of labeled data [?] 10%), LP performs significantly better than SVM. When the percentage of labeled data increases from 50% to 75%, the performance of LPJS and SVM become almost same, while LPcosine performs significantly worse  curacies of LP with c x b labeled examples on &amp;quot;interest&amp;quot; and &amp;quot;line&amp;quot; corpora. Major is a baseline method in which they always choose the most frequent sense. MB-D denotes monolingual bootstrapping with decision list as base classifier, MB-B represents monolingual bootstrapping with ensemble of Naive Bayes as base classifier, and BB is bilingual bootstrapping with</Paragraph>
    </Section>
    <Section position="3" start_page="398" end_page="398" type="sub_section">
      <SectionTitle>
4.3 Experiment 2: LP vs. Bootstrapping
</SectionTitle>
      <Paragraph position="0"> Li and Li (2004) used &amp;quot;interest&amp;quot; and &amp;quot;line&amp;quot; corpora as test data. For the word &amp;quot;interest&amp;quot;, they used its four major senses. For comparison with their results, we took reduced &amp;quot;interest&amp;quot; corpus (constructed by retaining four major senses) and complete &amp;quot;line&amp;quot; corpus as evaluation data. In their algorithm, c is the number of senses of ambiguous word, and b (b = 15) is the number of examples added into classified data for each class in each iteration of bootstrapping. c x b can be considered as the size of initial labeled data in their bootstrapping algorithm.</Paragraph>
      <Paragraph position="1"> We ran LP with 20 trials on reduced &amp;quot;interest&amp;quot; corpus and complete &amp;quot;line&amp;quot; corpus. In each trial, we randomly sampled b labeled examples for each sense of &amp;quot;interest&amp;quot; or &amp;quot;line&amp;quot; as labeled data. The rest served as both unlabeled data and test data.</Paragraph>
      <Paragraph position="2"> Table 2 summarizes the average accuracies of LP on the two corpora. It also lists the accuracies of monolingual bootstrapping algorithm (MB), bilingual bootstrapping algorithm (BB) on &amp;quot;interest&amp;quot; and &amp;quot;line&amp;quot; corpora. We can see that LP performs much better than MB-D and MB-B on both &amp;quot;interest&amp;quot; and &amp;quot;line&amp;quot; corpora, while the performance of LP is comparable to BB on these two corpora.</Paragraph>
    </Section>
    <Section position="4" start_page="398" end_page="399" type="sub_section">
      <SectionTitle>
4.4 An Example: Word &amp;quot;use&amp;quot;
</SectionTitle>
      <Paragraph position="0"> For investigating the reason for LP to outperform SVM and monolingual bootstrapping, we used the data of word &amp;quot;use&amp;quot; in English lexical sample task of SENSEVAL-3 as an example (totally 26 examples in training set and 14 examples in test set). For data  (a) only one labeled example for each sense of word &amp;quot;use&amp;quot; as training data before sense disambiguation (* and [?] denote the unlabeled examples in SENSEVAL-3 training set and test set respectively, and other five symbols (+, x, ^, [?], and [?]) represent the labeled examples with different sense tags sampled from SENSEVAL-3 training set.), (b) ground-truth result, (c) classification result on SENSEVAL-3 test set by SVM (accuracy= 314 = 21.4%), (d) classified data after bootstrapping, (e) classification result on SENSEVAL-3 training set and test set by 1NN (accuracy= 614 = 42.9% ), (f) classification result on SENSEVAL-3 training set and test set by LP (accuracy= 1014 = 71.4% ).</Paragraph>
      <Paragraph position="1"> visualization, we conducted unsupervised nonlinear dimensionality reduction5 on these 40 feature vectors with 210 dimensions. Figure 3 (a) shows the dimensionality reduced vectors in two-dimensional space. We randomly sampled only one labeled example for each sense of word &amp;quot;use&amp;quot; as labeled data. The remaining data in training set and test set served as unlabeled data for bootstrapping and LP. All of these three algorithms are evaluated using accuracy on test set.</Paragraph>
      <Paragraph position="2"> From Figure 3 (c) we can see that SVM misclassi- null computing two-dimensional, 39-nearest-neighbor-preserving embedding of 210-dimensional input. Isomap is available at http://isomap.stanford.edu/.</Paragraph>
      <Paragraph position="3"> fied many examples from class + into class x since using only features occurring in training set can not reveal the intrinsic structure in full dataset.</Paragraph>
      <Paragraph position="4"> For comparison, we implemented monolingual bootstrapping with kNN (k=1) as base classifier.</Paragraph>
      <Paragraph position="5"> The parameter b is set as 1. Only b unlabeled examples nearest to labeled examples and with the distance less than dinter[?]class (the minimum distance between labeled examples with different sense tags) will be added into classified data in each iteration till no such unlabeled examples can be found.</Paragraph>
      <Paragraph position="6"> Firstly we ran this monolingual bootstrapping on this dataset to augment initial labeled data. The resulting classified data is shown in Figure 3 (d). Then a 1NN model was learned on this classified data and we used this model to perform classification on the remaining unlabeled data. Figure 3 (e) reports the final classification result by this 1NN model. We can see that bootstrapping does not perform well since it is susceptible to small noise in dataset. For example, in Figure 3 (d), the unlabeled example B 6 happened to be closest to labeled example A, then 1NN model tagged example B with label [?]. But the correct label of B should be + as shown in Figure 3 (b). This error caused misclassification of other unlabeled examples that should have label +.</Paragraph>
      <Paragraph position="7"> In LP, the label information of example C can travel to B through unlabeled data. Then example A will compete with C and other unlabeled examples around B when determining the label of B. In other words, the labels of unlabeled examples are determined not only by nearby labeled examples, but also by nearby unlabeled examples. Using this classification strategy achieves better performance than the local consistency based strategy adopted by SVM and bootstrapping.</Paragraph>
    </Section>
    <Section position="5" start_page="399" end_page="401" type="sub_section">
      <SectionTitle>
4.5 Experiment 3: LPcosine vs. LPJS
</SectionTitle>
      <Paragraph position="0"> Table 3 summarizes the performance comparison between LPcosine and LPJS on three datasets. We can see that on SENSEVAL-3 corpus, LPJS per6In the two-dimensional space, example B is not the closest example to A. The reason is that: (1) A is not close to most of nearby examples around B, and B is not close to most of nearby examples around A; (2) we used Isomap to maximally preserve the neighborhood information between any example and all other examples, which caused the loss of neighborhood information between a few example pairs for obtaining a globally optimal solution.</Paragraph>
      <Paragraph position="1">  LPJS and the results of three model selection criteria are reported in following two tables. In the lower table, &lt; (or &gt;) means that the average value of function H(Qcosine) is lower (or higher) than H(QJS), and it will result in selecting cosine (or JS) as distance measure. Qcosine (or QJS) represents a matrix using cosine similarity (or JS divergence). [?] and x denote correct and wrong prediction results respectively, while*means that any prediction is acceptable.</Paragraph>
      <Paragraph position="2">  forms significantly better than LPcosine, but their performance is almost comparable on &amp;quot;interest&amp;quot; and &amp;quot;line&amp;quot; corpora. This observation motivates us to automatically select a distance measure that will boost the performance of LP on a given dataset.</Paragraph>
      <Paragraph position="3"> Cross-validation on labeled data is not feasible due to the setting of semi-supervised learning (l [?] u). In (Zhu and Ghahramani, 2002; Zhu et al., 2003), they suggested a label entropy criterion H(YU) for model selection, where Y is the label matrix learned by their semi-supervised algorithms.</Paragraph>
      <Paragraph position="4"> The intuition behind their method is that good parameters should result in confident labeling. Entropy on matrix W (H(W)) is a commonly used measure for unsupervised feature selection (Dash and Liu, 2000), which can be considered here. Another possible criterion for model selection is to measure the entropy of c x c inter-class distance matrix D calculated on labeled data (denoted as H(D)), where Di,j represents the average distance between the i-th class and the j-th class. We will investigate three criteria, H(D), H(W) and H(YU), for model selection. The distance measure can be automatically selected by minimizing the average value of function H(D), H(W) or H(YU) over 20 trials.</Paragraph>
      <Paragraph position="5"> Let Q be the M xN matrix. Function H(Q) can measure the entropy of matrix Q, which is defined as (Dash and Liu, 2000):</Paragraph>
      <Paragraph position="7"> (2) where a is positive constant. The possible value of a is[?]ln0.5-I , where -I = 1MN summationtexti,j Qi,j. S is introduced for normalization of matrix Q. For SENSEVAL-3 data, we calculated an overall average score of H(Q) by summationtextw Nw,testsummationtext w Nw,test H(Qw). Nw,test is the number of examples in test set of word w. H(D), H(W) and H(YU) can be obtained by replacing Q with D, W and YU respectively.</Paragraph>
      <Paragraph position="8"> Table 3 reports the automatic prediction results of these three criteria.</Paragraph>
      <Paragraph position="9"> From Table 3, we can see that using H(W) can consistently select the optimal distance measure when the performance gap between LPcosine and LPJS is very large (denoted by[?]or[?]). But H(D) and H(YU) fail to find the optimal distance measure when only very few labeled examples are available (percentage of labeled data [?] 10%).</Paragraph>
      <Paragraph position="10"> H(W) measures the separability of matrix W.</Paragraph>
      <Paragraph position="11"> Higher value of H(W) means that distance measure decreases the separability of examples in full dataset. Then the boundary between clusters is obscured, which makes it difficult for LP to locate this boundary. Therefore higher value of H(W) results in worse performance of LP.</Paragraph>
      <Paragraph position="12"> When labeled dataset is small, the distances between classes can not be reliably estimated, which results in unreliable indication of the separability of examples in full dataset. This is the reason that H(D) performs poorly on SENSEVAL-3 corpus when the percentage of labeled data is less than 25%. For H(YU), small labeled dataset can not reveal intrinsic structure in data, which may bias the estimation of YU. Then labeling confidence (H(YU)) can not properly indicate the performance of LP.</Paragraph>
      <Paragraph position="13"> This may interpret the poor performance of H(YU) on SENSEVAL-3 data when percentage [?] 25%.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>