<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-0408">
  <Title>Automatic identification of sentiment vocabulary: exploiting low association with known sentiment terms</Title>
  <Section position="6" start_page="59" end_page="62" type="evalu">
    <SectionTitle>
4 Results
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="59" end_page="62" type="sub_section">
      <SectionTitle>
4.1. Comparing SO and SM+SO
</SectionTitle>
      <Paragraph position="0"> In our first set of experiments we manipulated the following parameters:  1. the choice of SO or SM+SO method 2. the choice of n when selecting the n% se- null mantic terms with lowest PMI score in the SM method The tables below show the results of classifying sentence vectors using the unigram features and associated scores produced by SO and SO+SM. We used the 2,600-sentence manually-annotated test set described previously to establish these numbers. Since the data exhibit a strong skew in favor of the positive class label, we measure performance not in terms of accuracy but in terms of average precision and recall across the three class labels, as suggested in (Manning and Schutze 2002).</Paragraph>
      <Paragraph position="1">  on the data. Table 2 presents the results of combining the SM and SO methods for different values of n. The best results are shown in boldface. As a comparison between Table 1 and Table 2 shows, the highest average precision and recall scores were obtained by combining the SM and SO methods. Using SM as a feature selection mechanism also reduces the number of features significantly. While the SO method employed on sentence-level vectors uses 13,000 features, the best-performing SM+SO combination uses only 20% of this feature set, indicating that SM is indeed effective in selecting the most important sentiment-bearing terms.</Paragraph>
      <Paragraph position="2">  We also determined that the positive impact of SM is not just a matter of reducing the number of features. If SO - without the SM feature selection step - is reduced to a comparable number of features by taking the top features according to absolute score, average precision is at 0.4445 and average recall at 0.4464.</Paragraph>
      <Paragraph position="3">  Sentiment terms in top 100 SM terms Sentiment terms in top 100 SO terms excellent, terrible, broke, junk, alright, bargain, grin, highest, exceptional, exceeded, horrible, loved, waste, ok, death, leaking, outstanding, cracked, rebate, warped, hooked, sorry, refuses, excellant, satisfying, died, biggest, competitive, delight, avoid, awful, garbage, loud, okay, competent, upscale, dated, mistake, sucks, superior, high, kill, neither excellent, happy, stylish, sporty, smooth, love, quiet, overall, pleased, plenty, dependable, solid, roomy, safe, good, easy, smaller, luxury, comfortable, style, loaded, space, classy, handling, joy, small, comfort, size, perfect, performance, room, choice, recommended, package, compliments, awesome, unique, fun, holds, comfortably, extremely, value, free, satisfied, little, recommend, limited, great, pleasure Non sentiment terms in top 100 SM terms Non sentiment terms in top 100 SO terms alternative, wont, below, surprisingly, maintained, choosing, comparing, legal, vibration, seemed, claim, demands, assistance, knew, engineering, accelleration, ended, salesperson, performed, started, midsize, site, gonna, lets, plugs, industry, alternator, month, told, vette, 180, powertrain, write, mos, walk, causing, lift, es, segment, $250, 300m, wanna, february, mod, $50, nhtsa, suburbans, manufactured, tiburon, $10, f150, 5000, posted, tt, him, saw, jan, condition, very, handles, milage, definitely, definately, far, drives, shape, color, price, provides, options, driving, rides, sports, heated, ride, sport, forward, expected, fairly, anyone, test, fits, storage, range, family, sedan, trunk, young, weve, black, college, suv, midsize, coupe, 30, shopping, kids, player, saturn, bose, truck, town, am, leather, stereo, car, husband Table 3: the top 100 terms identified by SM and SO Table 3 shows the top 100 terms that were identified by each SM and SO methods. The terms are categorized into sentiment-bearing and nonsentiment bearing terms by human judgment. The two sets seem to differ in both strength and orientation of the identified terms. The SM-identified words have a higher density of negative terms (22 out of 43 versus 2 out of 49 for the SO-identified terms). The SM-identified terms also express sentiment more strongly, but this conclusion is more tentative since it may be a consequence of the higher density of negative terms.</Paragraph>
      <Paragraph position="4"> 4.2. Multiple iterations: increasing the number of seed features by SM+SO In a second set of experiments, we assessed the question of whether it is possible to use multiple iterations of the SM+SO method to gradually build the list of seed words. We do this by adding the top n% of features selected by SM, along with their orientation as determined by SO, to the initial set of seed words. The procedure for this round of experiments is as follows: * take the top n% of features identified by SM (we used n=1 for the reported re- null sults, since preliminary experiments with other values for n did not improve results) * perform SO for these features to determine their orientation * take the top 15.5% negative and top 63% positive (according to class label distribution in the development test set) of the features and add them as negative/positive seed features respectively This iteration increases the number of seed features from the original 10 manually-selected features to a total of 111 seed features.</Paragraph>
      <Paragraph position="5"> With this enhanced set of seed features we then re-ran a subset of the experiments in Table 2. Results are shown in Table 4. Increasing the number of seed features through the SM feature selection method increases precision and recall by several percentage points. In particular, precision and recall for negative sentences are boosted.</Paragraph>
      <Paragraph position="6">  We also confirmed that these results are truly attributable to the use of the SM method for the first iteration. If we take an equivalent number of features with strongest semantic orientation according to the SO method and add them to the list of seed features, our results degrade significantly (the resulting classifier performance is significantly different at the 99.9% level as established by the McNemar test). This is further evidence that SM is indeed an effective method for selecting sentiment terms.</Paragraph>
      <Paragraph position="7"> 4.3. Using the SO classifier to bootstrap a Naive Bayes classifier In a third set of experiments, we tried to improve on the results of the SO classifier by combining it with the bootstrapping approach described in (Nigam et al. 2000). The basic idea here is to use the SO classifier to label a subset of the data DL. This labeled subset of the data is then used to bootstrap  parameters th is trained on the documents in DL.</Paragraph>
      <Paragraph position="8"> (2) This initial classifier is used to estimate a probability distribution over all classes for each of the documents in DU. (EStep) null (3) The labeled and unlabeled data are then used to estimate parameters for a new classifier. (M-Step) Steps 2 and 3 are repeated until convergence is achieved when the difference in the joint probability of the data and the parameters falls below the configurable threshold e between iterations. Another free parameter, l , can be used to control how much weight is given to the unlabeled data.</Paragraph>
      <Paragraph position="9"> For our experiments we used classifiers from the best SM+SO combination (2 iterations at n=30) from Table 4 above to label 30% of the total data. Table 5 shows the average precision and recall numbers for the converged NB classifier.</Paragraph>
      <Paragraph position="10">  In addition to improving average precision and recall, the resulting classifier also has the advantage of producing class probabilities instead of simple scores.  using small sets of labeled data Given infinite resources, we can always annotate enough data to train a classifier using a supervised algorithm that will outperform unsupervised or weakly-supervised methods. Which approach to take depends entirely on how much time and money are available and on the accuracy requirements for the task at hand.</Paragraph>
      <Paragraph position="11">  In this experiment, l was set to 0.1 and e was set to 0.05.  We also experimented with labeling the whole data set with the best of our SO score classifiers, and then training a linear Support Vector Machine classifier on the data. The results were considerably worse than any of the reported numbers, so they are not included in this paper.</Paragraph>
      <Paragraph position="12">  To help situate the precision and recall numbers presented in the tables above, we trained Support Vector Machines (SVMs) using small amounts of labeled data. SVMs were trained with 500, 1000, 2000, and 2500 labeled sentences. Annotating 2500 sentences represents approximately eight person-hours of work. The results can be found in Table 5. We were pleasantly surprised at how well the unsupervised classifiers described above perform in comparison to state-of-the-art supervised methods (albeit trained on small amounts of data).  small numbers of labeled examples 4.5. Results on the movie domain We also performed a small set of experiments on the movie domain using Pang and Lee's 2004 data set. This set consists of 2000 reviews, 1000 each of very positive and very negative reviews. Since this data set is balanced and the task is only a two-way classification between positive and negative reviews, we only report accuracy numbers here.</Paragraph>
      <Paragraph position="13"> accuracy Training data  domain Turney (2002) achieves 66% accuracy on the movie review domain using the PMI-IR algorithm to gather association scores from the web. Pang and Lee (2004) report 87.15% accuracy using a unigram-based SVM classifier combined with subjectivity detection. Aue and Gamon (2005) use a simple linear SVM classifier based on unigrams, combined with LLR-based feature reduction, to achieve 91.4% accuracy. Using the Turney SO method on in-domain data instead of web data achieves 73.95% accuracy (using the same two seed words that Turney does). Using one iteration of SM+SO to increase the number of seed words, followed by finding SO scores for all words with respect to the enhanced seed word set, yields a slightly higher accuracy of 74.85%. With additional parameter tuning, this number can be pushed to 76.4%, at which point we achieve statistical significance at the 0.95 level according to the McNemar test, indicating that there is more room here for improvement. Any reduction of the number of overall features in this domain leads to decreased accuracy, contrary to what we observed in the car review domain. We attribute this observation to the smaller data set.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>