<?xml version="1.0" standalone="yes"?> <Paper uid="P06-2079"> <Title>Examining the Role of Linguistic Knowledge Sources in the Automatic Identification and Classification of Reviews</Title> <Section position="7" start_page="613" end_page="617" type="evalu"> <SectionTitle> 4.2 Results </SectionTitle> <Paragraph position="0"> The baseline classifier. We can now train our baseline polarity classifier on each of the two when training polarity classifiers, but neither yields better results than linear kernels.</Paragraph> <Paragraph position="1"> 14The guidelines come with their polarity dataset. Brie y, a positive review has a rating of [?] 3.5 (out of 5) or [?] 3 (out of 4), whereas a negative review has a rating of [?]2 (out of 5) or [?] 1.5 (out of 4).</Paragraph> <Paragraph position="2"> datasets. Our baseline classifier employs as features the k highest-ranking unigrams according to WLLR, with k/2 features selected from each class. Results with k = 10000 are shown in row 1 of Table 1.15 As we can see, the baseline achieves an accuracy of 87.1% and 82.7% on Datasets A and B, respectively. Note that our result on Dataset A is as strong as that obtained by Pang and Lee (2004) via their subjectivity summarization algorithm, which retains only the subjective portions of a document.</Paragraph> <Paragraph position="3"> As a sanity check, we duplicated Pang et al.'s (2002) baseline in which all unigrams that appear four or more times in the training documents are used as features. The resulting classifier achieves an accuracy of 87.2% and 82.7% for Datasets A and B, respectively. Neither of these results are significantly different from our baseline results.16 Adding higher-order n-grams. The negative results that Pang et al. (2002) obtained when using bigrams as features for their polarity classifier seem to suggest that high-order n-grams are not useful for polarity classification. However, recent research in the related (but arguably simpler) task of text classification shows that a bigram-based text classifier outperforms its unigram-based counterpart (Peng et al., 2003). This prompts us to re-examine the utility of high-order n-grams in polarity classification.</Paragraph> <Paragraph position="4"> In our experiments we consider adding bigrams and trigrams to our baseline feature set. However, since these higher-order n-grams significantly outnumber the unigrams, adding all of them to the feature set will dramatically increase the dimen15We experimented with several values of k and obtained the best result with k = 10000.</Paragraph> <Paragraph position="5"> 16We use two-tailed paired t-tests when performing significance testing, with p set to 0.05 unless otherwise stated. sionality of the feature space and may undermine the impact of the unigrams in the resulting classifier. To avoid this potential problem, we keep the number of unigrams and higher-order n-grams equal. Specifically, we augment the baseline feature set (consisting of 10000 unigrams) with 5000 bigrams and 5000 trigrams. The bigrams and tri-grams are selected based on their WLLR computed over the positive reviews and negative reviews in the training set for each CV run.</Paragraph> <Paragraph position="6"> Results using this augmented feature set are shown in row 2 of Table 1. 
<Paragraph position="6"> Results using this augmented feature set are shown in row 2 of Table 1. We see that accuracy rises significantly from 87.1% to 89.2% for Dataset A and from 82.7% to 84.7% for Dataset B.</Paragraph>
<Paragraph position="7"> This provides evidence that polarity classification can indeed benefit from higher-order n-grams.</Paragraph>
<Paragraph position="8"> Adding dependency relations. While bigrams and trigrams are good at capturing local dependencies, dependency relations can be used to capture non-local dependencies among the constituents of a sentence. Hence, we hypothesized that our n-gram-based polarity classifier would benefit from the addition of dependency-based features.</Paragraph>
<Paragraph position="9"> Unlike most previous work on polarity classification, which has largely focused on exploiting adjective-noun (AN) relations (e.g., Dave et al. (2003), Popescu and Etzioni (2005)), we hypothesized that subject-verb (SV) and verb-object (VO) relations would also be useful for the task. The following (one-sentence) review illustrates why.</Paragraph>
<Paragraph position="10"> While I really like the actors, the plot is rather uninteresting.</Paragraph>
<Paragraph position="11"> A unigram-based polarity classifier could be confused by the simultaneous presence of the positive term like and the negative term uninteresting when classifying this review. However, incorporating the VO relation (like, actors) as a feature may allow the learner to learn that the author likes the actors and not necessarily the movie.</Paragraph>
<Paragraph position="12"> In our experiments, the SV, VO and AN relations are extracted from each document by the MINIPAR dependency parser (Lin, 1998). As with n-grams, instead of using all the SV, VO and AN relations as features, we select among them the best 5000 according to their WLLR and re-train the polarity classifier with our n-gram-based feature set augmented by these 5000 dependency-based features. Results in row 3 of Table 1 are somewhat surprising: the addition of dependency-based features does not offer any improvements over the simple n-gram-based classifier.</Paragraph>
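For concreteness, here is one way such SV, VO and AN triples can be pulled out of a parse. The paper uses MINIPAR; this sketch substitutes spaCy purely as a stand-in, and the model name, dependency labels and function are our assumptions rather than the authors' setup:

```python
import spacy

nlp = spacy.load("en_core_web_sm")   # stand-in parser; the paper itself uses MINIPAR (Lin, 1998)

def dependency_features(text):
    """Collect subject-verb, verb-object and adjective-noun pairs from one document."""
    feats = []
    for tok in nlp(text):
        if tok.dep_ == "nsubj" and tok.head.pos_ == "VERB":              # SV
            feats.append(("SV", tok.head.lemma_, tok.lemma_))
        elif tok.dep_ in ("dobj", "obj") and tok.head.pos_ == "VERB":    # VO
            feats.append(("VO", tok.head.lemma_, tok.lemma_))
        elif tok.dep_ == "amod" and tok.head.pos_ in ("NOUN", "PROPN"):  # AN
            feats.append(("AN", tok.lemma_, tok.head.lemma_))
    return feats

# dependency_features("While I really like the actors, the plot is rather uninteresting.")
# should include ("VO", "like", "actor"), the relation discussed in the example above.
```

As described above, the resulting pairs are then ranked by WLLR and the best 5000 are added to the feature set, just as with the n-grams.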
<Paragraph position="13"> Incorporating manually tagged term polarity.</Paragraph>
<Paragraph position="14"> Next, we consider incorporating a set of features that are computed based on the polarity of adjectives. As noted before, we desire a high-precision, high-coverage lexicon. So, instead of exploiting a learned lexicon, we manually develop one.</Paragraph>
<Paragraph position="15"> To construct the lexicon, we take Pang et al.'s pool of unprocessed documents (see Section 3), remove those that appear in either Dataset A or Dataset B, and compile a list of adjectives from the remaining documents. Then, based on heuristics proposed in psycholinguistics, we hand-annotate each adjective with its prior polarity (i.e., polarity in the absence of context). Out of the 45592 adjectives we collected, 3599 were labeled as positive, 3204 as negative, and 38789 as neutral. A closer look at these adjectives reveals that they are by no means domain-dependent despite the fact that they were taken from movie reviews.</Paragraph>
<Paragraph position="16"> Now let us consider a simple procedure P for deriving a feature set that incorporates information from our lexicon: (1) collect all the bigrams from the training set; (2) for each bigram that contains at least one adjective labeled as positive or negative according to our lexicon, create a new feature that is identical to the bigram except that each adjective is replaced with its polarity label; (3) merge the list of newly generated features with the list of bigrams and select the top 5000 features from the merged list according to their WLLR. (A newly generated feature could be misleading for the learner if the contextual polarity, i.e., polarity in the presence of context, of the adjective involved differs from its prior polarity (see Wilson et al. (2005)). The motivation behind merging with the bigrams is to create a feature set that is more robust in the face of potentially misleading generalizations.)</Paragraph>
<Paragraph position="17"> We then repeat procedure P for the trigrams and also the dependency features, resulting in a total of 15000 features. Our new feature set comprises these 15000 features as well as the 10000 unigrams we used in the previous experiments.</Paragraph>
<Paragraph position="18"> Results of the polarity classifier that incorporates term polarity information are encouraging (see row 4 of Table 1). In comparison to the classifier that uses only n-grams and dependency-based features (row 3), accuracy increases significantly (p = .1) from 89.2% to 90.4% for Dataset A, and from 84.7% to 86.2% for Dataset B. These results suggest that the classifier has benefited from the use of features that are less sparse than n-grams.</Paragraph>
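Step (2) of procedure P is the only non-obvious part; a minimal sketch, assuming the hand-built lexicon is available as a dictionary mapping adjectives to 'positive', 'negative' or 'neutral' (names and data structures are illustrative, not the authors' implementation):

```python
def generalize_bigrams(bigrams, lexicon):
    """For each bigram containing a positive or negative adjective, emit a copy in
    which every such adjective is replaced by its prior-polarity label (step 2)."""
    def polar(word):
        label = lexicon.get(word)
        return label if label in ("positive", "negative") else None

    generalized = []
    for bigram in bigrams:
        if any(polar(w) for w in bigram):
            generalized.append(tuple(polar(w) or w for w in bigram))
    return generalized

# With lexicon = {"uninteresting": "negative"}:
# generalize_bigrams([("rather", "uninteresting"), ("the", "plot")], lexicon)
# -> [("rather", "negative")]
```

The generalized features are then pooled with the original bigrams and the top 5000 of the merged list are kept by WLLR (step 3); the same routine is repeated for trigrams and dependency features.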
<Paragraph position="19"> Using objective information. Some of the 25000 features we generated above correspond to n-grams or dependency relations that do not contain subjective information. We hypothesized that not employing these objective features in the feature set would improve system performance.</Paragraph>
<Paragraph position="20"> More specifically, our goal is to use procedure P again to generate 25000 subjective features by ensuring that the objective ones are not selected for incorporation into our feature set.</Paragraph>
<Paragraph position="21"> To achieve this goal, we first use the following rote-learning procedure to identify objective material: (1) extract all unigrams that appear in objective documents, which in our case are the 2000 non-reviews used in review identification (see Section 3); (2) from these objective unigrams, we take the best 20000 according to their WLLR computed over the non-reviews and the reviews in the training set for each CV run; (3) repeat steps 1 and 2 separately for bigrams, trigrams and dependency relations; (4) merge these four lists to create our 80000-element list of objective material.</Paragraph>
<Paragraph position="22"> Now, we can employ procedure P to get a list of 25000 subjective features by ensuring that those that appear in our 80000-element list are not selected for incorporation into our feature set.</Paragraph>
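A rough sketch of this filtering step, reusing the hypothetical wllr_scores helper from the earlier sketch; the extractor functions passed in (unigram, bigram, trigram and dependency extractors) and all names are ours, not the authors':

```python
def objective_blacklist(non_reviews, reviews, extractors, per_type=20000):
    """Steps (1)-(4): for each feature type, rank candidate features found in the
    non-reviews by WLLR (non-reviews vs. reviews) and pool the top 20000 per type."""
    labels = ["objective"] * len(non_reviews) + ["review"] * len(reviews)
    blacklist = set()
    for extract in extractors:          # e.g. unigram, bigram, trigram, dependency extractors
        docs = [extract(d) for d in non_reviews + reviews]
        scores = wllr_scores(docs, labels, target="objective")   # helper from the sketch above
        blacklist.update(sorted(scores, key=scores.get, reverse=True)[:per_type])
    return blacklist

def select_subjective(candidate_scores, blacklist, k=25000):
    """The usual WLLR ranking, skipping anything on the objective blacklist."""
    ranked = sorted(candidate_scores, key=candidate_scores.get, reverse=True)
    return [f for f in ranked if f not in blacklist][:k]
```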
<Paragraph position="23"> Results of our classifier trained using these subjective features are shown in row 5 of Table 1.</Paragraph>
<Paragraph position="24"> Somewhat surprisingly, in comparison to row 4, we see that our method for filtering objective features does not help improve performance on the two datasets. We will examine the reasons in the following subsection.</Paragraph>
<Section position="1" start_page="615" end_page="617" type="sub_section"> <SectionTitle> 4.3 Discussion and Further Analysis </SectionTitle>
<Paragraph position="0"> Using the four types of knowledge sources previously described, our polarity classifier significantly outperforms a unigram-based baseline classifier. In this subsection, we analyze some of these results and conduct additional experiments in an attempt to gain further insight into the polarity classification task. Due to space limitations, we will simply present results on Dataset A below, and show results on Dataset B only in cases where a different trend is observed.</Paragraph>
<Paragraph position="1"> The role of feature selection. In all of our experiments we used the best k features obtained via WLLR. An interesting question is: how will these results change if we do not perform feature selection? To investigate this question, we conduct two experiments. First, we train a polarity classifier using all unigrams from the training set. Second, we train another polarity classifier using all unigrams, bigrams, and trigrams. We obtain an accuracy of 87.2% and 79.5% for the first and second experiments, respectively.</Paragraph>
<Paragraph position="2"> In comparison to our baseline classifier, which achieves an accuracy of 87.1%, we can see that using all unigrams does not hurt performance, but performance drops abruptly with the addition of all bigrams and trigrams. These results suggest that feature selection is critical when bigrams and trigrams are used in conjunction with unigrams for training a polarity classifier.</Paragraph>
<Paragraph position="3"> The role of bigrams and trigrams. So far we have seen that training a polarity classifier using only unigrams gives us reasonably good, though not outstanding, results. Our question, then, is: would bigrams alone do a better job at capturing the sentiment of a document than unigrams? To answer this question, we train a classifier using all bigrams (without feature selection) and obtain an accuracy of 83.6%, which is significantly worse than that of a unigram-only classifier. Similar results were also obtained by Pang et al. (2002).</Paragraph>
<Paragraph position="4"> It is possible that the worse result is due to the presence of a large number of irrelevant bigrams.</Paragraph>
<Paragraph position="5"> To test this hypothesis, we repeat the above experiment except that we only use the best 10000 bigrams selected according to WLLR. Interestingly, the resulting classifier gives us a lower accuracy of 82.3%, suggesting that the poor accuracy is not due to the presence of irrelevant bigrams.</Paragraph>
<Paragraph position="6"> To understand why using bigrams alone does not yield a good classification model, we examine a number of test documents and find that the feature vectors corresponding to some of these documents (particularly the short ones) contain only zeroes. In other words, none of the bigrams from the training set appears in these reviews. This suggests that the main problem with the bigram model is likely to be data sparseness. Additional experiments show that the trigram-only classifier yields even worse results than the bigram-only classifier, probably for the same reason.</Paragraph>
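The all-zero check just described is straightforward to reproduce; a minimal sketch, assuming tokenized test documents and a set of selected bigram features (names are illustrative):

```python
def zero_vector_rate(test_docs, selected_bigrams):
    """Fraction of test documents whose bigram feature vector is all zeroes, i.e.,
    documents that share no bigram with the training-derived feature set."""
    feature_set = set(selected_bigrams)
    empty = sum(
        1 for tokens in test_docs                              # tokens: list of words
        if not (set(zip(tokens, tokens[1:])) & feature_set)
    )
    return empty / len(test_docs) if test_docs else 0.0
```

A high rate, particularly among short reviews, is the data-sparseness symptom described above.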
<Paragraph position="7"> Nevertheless, these higher-order n-grams play a non-trivial role in polarity classification: we have shown that the addition of bigrams and trigrams selected via WLLR to a unigram-based classifier significantly improves its performance.</Paragraph>
<Paragraph position="8"> The role of dependency relations. In the previous subsection we saw that dependency relations do not contribute to overall performance on top of bigrams and trigrams. There are two plausible reasons. First, dependency relations are simply not useful for polarity classification. Second, the higher-order n-grams and the dependency-based features capture essentially the same information, and so using either of them would be sufficient.</Paragraph>
<Paragraph position="9"> To test the first hypothesis, we train a classifier using only 10000 unigrams and 10000 dependency-based features (both selected according to WLLR). For Dataset A, the classifier achieves an accuracy of 87.1%, which is statistically indistinguishable from our baseline result. On the other hand, the accuracy for Dataset B is 83.5%, which is significantly better than the corresponding baseline (82.7%) at the p = .1 level.</Paragraph>
<Paragraph position="10"> These results indicate that dependency information is somewhat useful for the task when bigrams and trigrams are not used. So the first hypothesis is not entirely true.</Paragraph>
<Paragraph position="11"> It seems, then, that dependency relations fail to provide useful knowledge for polarity classification only in the presence of bigrams and trigrams. This is somewhat surprising, since these n-grams do not capture the non-local dependencies (such as those that may be present in certain SV or VO relations) that should intuitively be useful for polarity classification.</Paragraph>
<Paragraph position="12"> To better understand this issue, we again examine a number of test documents. Our initial investigation suggests that the problem might stem from the fact that MINIPAR returns dependency relations in which all verb inflections are removed. For instance, given the sentence My cousin Paul really likes this long movie, MINIPAR will return the VO relation (like, movie). To see why this can be a problem, consider another sentence: I like this long movie. From this sentence, MINIPAR will also extract the VO relation (like, movie). Hence, the same VO relation captures two different situations: in one, the author's cousin likes the movie; in the other, the author himself does. The over-generalization resulting from these stemmed relations renders dependency information not useful for polarity classification. Additional experiments are needed to determine the role of dependency relations when stemming in MINIPAR is disabled.</Paragraph>
<Paragraph position="13"> The role of objective information. Results from the previous subsection suggest that our method for extracting objective materials and removing them from the reviews is not effective in terms of improving performance. To determine the reason, we examine the n-grams and the dependency relations that are extracted from the non-reviews. We find that only in a few cases do these extracted objective materials appear in our set of 25000 features obtained in Section 4.2. This explains why our method is not as effective as we originally thought. We conjecture that more sophisticated methods would be needed in order to take advantage of objective information in polarity classification (e.g., Koppel and Schler (2005)).</Paragraph>
</Section> </Section> </Paper>