<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-3808">
  <Title>Seeing stars when there aren't many stars: Graph-based semi-supervised learning for sentiment categorization</Title>
  <Section position="4" start_page="46" end_page="47" type="metho">
    <SectionTitle>
3 Graph-Based Semi-Supervised Learning
</SectionTitle>
    <Paragraph position="0"> With the graph defined, there are several algorithms one can use to carry out semi-supervised learning (Zhu et al., 2003; Delalleau et al., 2005; Joachims, 2003; Blum and Chawla, 2001; Belkin et al., 2005).</Paragraph>
    <Paragraph position="1"> The basic idea is the same and is what we use in this paper. That is, our rating function f(x) should be smooth with respect to the graph. f(x) is not smooth if there is an edge with large weight w between nodes xi and xj, and the difference between f(xi) and f(xj) is large. The (un)smoothness over the particular edge can be defined as wparenleftbigf(xi) [?] f(xj)parenrightbig2. Summing over all edges in the graph, we obtain the (un)smoothness L(f) over the whole graph. We call L(f) the energy or loss, which should be minimized.</Paragraph>
    <Paragraph position="2"> Let L = 1...l and U = l + 1...n be labeled and unlabeled review indices, respectively. With the graph in Figure 1, the loss L(f) can be written as</Paragraph>
    <Paragraph position="4"> A small loss implies that the rating of an unlabeled review is close to its labeled peers as well as its unlabeled peers. This is how unlabeled data can participate in learning. The optimization problem is minf L(f). To understand the role of the parameters, we define a = ak + bkprime and b = ba, so that</Paragraph>
    <Paragraph position="6"> Thus b controls the relative weight between labeled neighbors and unlabeled neighbors; a is roughly the relative weight given to semi-supervised (nondongle) edges.</Paragraph>
    <Paragraph position="7"> We can find the closed-form solution to the optimization problem. Defining an nxn matrix -W,</Paragraph>
    <Paragraph position="9"> Let W = max( -W, -Wlatticetop) be a symmetrized version of this matrix. Let D be a diagonal degree matrix</Paragraph>
    <Paragraph position="11"> Note that we define a node's degree to be the sum of its edge weights. Let [?] = D [?]W be the combinatorial Laplacian matrix. Let C be a diagonal dongle</Paragraph>
    <Paragraph position="13"> This is a quadratic function in f. Setting the gradient to zero, [?]L(f)/[?]f = 0 , we find the minimum loss function</Paragraph>
    <Paragraph position="15"> Because C has strictly positive eigenvalues, the inverse is well defined. All our semi-supervised learning experiments use (7) in what follows.</Paragraph>
    <Paragraph position="16"> Before moving on to experiments, we note an interesting connection to the supervised learning method in (Pang and Lee, 2005), which formulates rating inference as a metric labeling problem (Kleinberg and Tardos, 2002). Consider a special case of our loss function (1) when b = 0 and M - [?]. It is easy to show for labeled nodes j [?] L, the optimal value is the given label: f(xj) = yj. Then the optimization problem decouples into a set of one-dimensional problems, one for each unlabeled node</Paragraph>
    <Paragraph position="18"> The above problem is easy to solve. It corresponds exactly to the supervised, non-transductive version of metric labeling, except we use squared difference while (Pang and Lee, 2005) used absolute difference. Indeed in experiments comparing the two (not reported here), their differences are not statistically significant. From this perspective, our semi-supervised learning method is an extension with interacting terms among unlabeled data.</Paragraph>
  </Section>
  <Section position="5" start_page="47" end_page="49" type="metho">
    <SectionTitle>
4 Experiments
</SectionTitle>
    <Paragraph position="0"> We performed experiments using the movie review documents and accompanying 4-class (C = {0,1,2,3}) labels found in the &amp;quot;scale dataset v1.0&amp;quot; available at http://www.cs.cornell.edu/people/pabo/ movie-review-data/ and first used in (Pang and Lee, 2005). We chose 4-class instead of 3-class labeling because it is harder. The dataset is divided into four author-specific corpora, containing 1770, 902, 1307, and 1027 documents. We ran experiments individually for each author. Each document is represented as a {0,1} word-presence vector, normalized to sum to 1.</Paragraph>
    <Paragraph position="1"> We systematically vary labeled set size |L |[?] {0.9n,800,400,200,100,50,25,12,6} to observe the effect of semi-supervised learning. |L |= 0.9n is included to match 10-fold cross validation used by (Pang and Lee, 2005). For each |L |we run 20 trials where we randomly split the corpus into labeled and test (unlabeled) sets. We ensure that all four classes are represented in each labeled set. The same random splits are used for all methods, allowing paired t-tests for statistical significance. All reported results are average test set accuracy.</Paragraph>
    <Paragraph position="2"> We compare our graph-based semi-supervised method with two previously studied methods: regression and metric labeling as in (Pang and Lee, 2005).</Paragraph>
    <Section position="1" start_page="47" end_page="47" type="sub_section">
      <SectionTitle>
4.1 Regression
</SectionTitle>
      <Paragraph position="0"> We ran linear epsilon1-insensitive support vector regression using Joachims' SVMlight package (1999) with all default parameters. The continuous prediction on a test document is discretized for classification. Regression results are reported under the heading 'reg.' Note this method does not use unlabeled data for training.</Paragraph>
    </Section>
    <Section position="2" start_page="47" end_page="48" type="sub_section">
      <SectionTitle>
4.2 Metric labeling
</SectionTitle>
      <Paragraph position="0"> We ran Pang and Lee's method based on metric labeling, using SVM regression as the initial label preference function. The method requires an itemsimilarity function, which is equivalent to our similarity measure wij. Among others, we experimented with PSP-based similarity. For consistency with (Pang and Lee, 2005), supervised metric labeling results with this measure are reported under 'reg+PSP.' Note this method does not use unlabeled data for training either.</Paragraph>
      <Paragraph position="1"> PSPi is defined in (Pang and Lee, 2005) as the percentage of positive sentences in review xi. The similarity between reviews xi,xj is the cosine angle  rating. We identified positive sentences using SVM instead of Na&amp;quot;ive Bayes, but the trend is qualitatively the same as in (Pang and Lee, 2005).</Paragraph>
      <Paragraph position="2"> between the vectors (PSPi,1[?]PSPi) and (PSPj,1[?] PSPj). Positive sentences are identified using a binary classifier trained on a separate &amp;quot;snippet data set&amp;quot; located at the same URL as above. The snippet data set contains 10662 short quotations taken from movie reviews appearing on the rottentomatoes.com Web site. Each snippet is labeled positive or negative based on the rating of the originating review. Pang and Lee (2005) trained a Na&amp;quot;ive Bayes classifier. They showed that PSP is a (noisy) measure for comparing reviews--reviews with low ratings tend to receive low PSP scores, and those with higher ratings tend to get high PSP scores. Thus, two reviews with a high PSP-based similarity are expected to have similar ratings. For our experiments we derived PSP measurements in a similar manner, but using a linear SVM classifier. We observed the same relationship between PSP and ratings (Figure 2).</Paragraph>
      <Paragraph position="3"> The metric labeling method has parameters (the equivalent of k,a in our model). Pang and Lee tuned them on a per-author basis using cross validation but did not report the optimal parameters.</Paragraph>
      <Paragraph position="4"> We were interested in learning a single set of parameters for use with all authors. In addition, since we varied labeled set size, it is convenient to tune c = k/|L|, the fraction of labeled reviews used as neighbors, instead of k. We then used the same c,a for all authors at all labeled set sizes in experiments involving PSP. Because c is fixed, k varies directly with |L |(i.e., when less labeled data is available, our algorithm considers fewer nearby labeled examples). In an attempt to reproduce the findings in (Pang and Lee, 2005), we tuned c,a with cross validation. Tuning ranges are c [?] {0.05,0.1,0.15,0.2,0.25,0.3} and a [?] {0.01,0.1,0.5,1.0,1.5,2.0,2.5,3.0,3.5,4.0,5.0}.</Paragraph>
      <Paragraph position="5"> The optimal parameters we found are c = 0.2 and a = 1.5. (In section 4.4, we discuss an alternative similarity measure, for which we re-tuned these parameters.) Note that we learned a single set of shared parameters for all authors, whereas (Pang and Lee, 2005) tuned k and a on a per-author basis. To demonstrate that our implementation of metric labeling produces comparable results, we also determined the optimal author-specific parameters. Table 1 shows the accuracy obtained over 20 trials with |L |= 0.9n for each author, using SVM regression, reg+PSP using shared c,a parameters, and reg+PSP using author-specific c,a parameters (listed in parentheses). The best result in each row of the table is highlighted in bold. We also show in bold any results that cannot be distinguished from the best result using a paired t-test at the 0.05 level.</Paragraph>
      <Paragraph position="6"> (Pang and Lee, 2005) found that their metric labeling method, when applied to the 4-class data we are using, was not statistically better than regression, though they observed some improvement for authors (c) and (d). Using author-specific parameters, we obtained the same qualitative result, but the improvement for (c) and (d) appears even less significant in our results. Possible explanations for this difference are the fact that we derived our PSP measurements using an SVM classifier instead of an NB classifier, and that we did not use the same range of parameters for tuning. The optimal shared parameters produced almost the same results as the optimal author-specific parameters, and were used in subsequent experiments.</Paragraph>
    </Section>
    <Section position="3" start_page="48" end_page="49" type="sub_section">
      <SectionTitle>
4.3 Semi-Supervised Learning
</SectionTitle>
      <Paragraph position="0"> We used the same PSP-based similarity measure and the same shared parameters c = 0.2,a = 1.5 from our metric labeling experiments to perform graph-based semi-supervised learning. The results are reported as 'SSL+PSP.' SSL has three  vs. author-specific parameters, with |L |= 0.9n. additional parameters kprime, b, and M. Again we tuned kprime,b with cross validation. Tuning ranges are kprime [?] {2,3,5,10,20} and b [?] {0.001,0.01,0.1,1.0,10.0}. The optimal parameters are kprime = 5 and b = 1.0. These were used for all authors and for all labeled set sizes. Note that unlike k = c|L|, which decreases as the labeled set size decreases, we let kprime remain fixed for all |L|. We set M arbitrarily to a large number 108 to ensure that the ratings of labeled reviews are respected.</Paragraph>
    </Section>
    <Section position="4" start_page="49" end_page="49" type="sub_section">
      <SectionTitle>
4.4 Alternate Similarity Measures
</SectionTitle>
      <Paragraph position="0"> In addition to using PSP as a similarity measure between reviews, we investigated several alternative similarity measures based on the cosine of word vectors. Among these options were the cosine between the word vectors used to train the SVM regressor, and the cosine between word vectors containing only words with high (top 1000 or top 5000) mutual information values. The mutual information is computed with respect to the positive and negative classes in the 10662-document &amp;quot;snippet data set.&amp;quot; Finally, we experimented with using as a similarity measure the cosine between word vectors containing all words, each weighted by its mutual information.</Paragraph>
      <Paragraph position="1"> We found this measure to be the best among the options tested in pilot trial runs using the metric labeling algorithm. Specifically, we scaled the mutual information values such that the maximum value was one. Then, we used these values as weights for the corresponding words in the word vectors. For words in the movie review data set that did not appear in the snippet data set, we used a default weight of zero (i.e., we excluded them. We experimented with setting the default weight to one, but found this led to inferior performance.) We repeated the experiments described in sections 4.2 and 4.3 with the only difference being that we used the mutual-information weighted word vector similarity instead of PSP whenever a similarity measure was required. We repeated the tuning procedures described in the previous sections.</Paragraph>
      <Paragraph position="2"> Using this new similarity measure led to the optimal parameters c = 0.1, a = 1.5, kprime = 5, and b = 10.0. The results are reported under 'reg+WV' and 'SSL+WV,' respectively.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="49" end_page="51" type="metho">
    <SectionTitle>
4.5 Results
</SectionTitle>
    <Paragraph position="0"> We tested the five algorithms for all four authors using each of the nine labeled set sizes. The results are presented in table 2. Each entry in the table represents the average accuracy across 20 trials for an author, a labeled set size, and an algorithm. The best result in each row is highlighted in bold. Any results on the same row that cannot be distinguished from the best result using a paired t-test at the 0.05 level are also bold.</Paragraph>
    <Paragraph position="1"> The results indicate that the graph-based semi-supervised learning algorithm based on PSP similarity (SSL+PSP) achieved better performance than all other methods in all four author corpora when only 200, 100, 50, 25, or 12 labeled documents were available. In 19 out of these 20 learning scenarios, the unlabeled set accuracy by the SSL+PSP algorithm was significantly higher than all other methods. While accuracy generally degraded as we trained on less labeled data, the decrease for the SSL approach was less severe through the mid-range labeled set sizes. SSL+PSP remains among the best methods with only 6 labeled examples.</Paragraph>
    <Paragraph position="2"> Note that the SSL algorithm appears to be quite sensitive to the similarity measure used to form the graph on which it is based. In the experiments where we used mutual-information weighted word vector similarity (reg+WV and SSL+WV), we notice that reg+WV remained on par with reg+PSP at high labeled set sizes, whereas SSL+WV appears significantly worse in most of these cases. It is clear that PSP is the more reliable similarity measure.</Paragraph>
    <Paragraph position="3"> SSL uses the similarity measure in more ways than the metric labeling approaches (i.e., SSL's graph is denser), so it is not surprising that SSL's accuracy would suffer more with an inferior similarity measure. null Unfortunately, our SSL approach did not do as well with large labeled set sizes. We believe this  ods. In each row, we list in bold the best result and any results that cannot be distinguished from it with a paired t-test at the 0.05 level.</Paragraph>
    <Paragraph position="4">  is due to two factors: a) the baseline SVM regressor trained on a large labeled set can achieve fairly high accuracy for this difficult task without considering pairwise relationships between examples; b) PSP similarity is not accurate enough. Gain in variance reduction achieved by the SSL graph is offset by its bias when labeled data is abundant.</Paragraph>
  </Section>
class="xml-element"></Paper>