<?xml version="1.0" standalone="yes"?> <Paper uid="P06-2079"> <Title>Examining the Role of Linguistic Knowledge Sources in the Automatic Identification and Classification of Reviews</Title> <Section position="5" start_page="612" end_page="613" type="metho"> <SectionTitle> 3 Review Identification </SectionTitle> <Paragraph position="0"> Recall that the goal of review identification is to determine whether a given document is a review or not. Given this definition, two immediate questions come to mind. First, should this problem be addressed in a domain-specific or domain-independent manner? In other words, should a review identification system take as input documents coming from the same domain or not? Arguably, this is a design question with no definite answer, but our decision is to perform domain-specific review identification. The reason is that the primary motivation of review identification is the need to identify reviews for further analysis by a polarity classification system. Since polarity classification has almost exclusively been addressed in a domain-specific fashion, it seems natural that its immediate upstream component, review identification, should also assume domain specificity. Note, however, that assuming domain specificity is not a self-imposed limitation. In fact, we envision that the review identification system will have as its upstream component a text classification system, which will classify documents by topic and pass to the review identifier only those documents that fall within its domain.</Paragraph> <Paragraph position="1"> Given our choice of domain specificity, the next question is: which documents are non-reviews? Here, we adopt a simple and natural definition: a non-review is any document that belongs to the given domain but is not a review.</Paragraph> <Paragraph position="2"> Dataset. Now, recall from the introduction that we cast review identification as a classification task. To train and test our review identifier, we use 2000 reviews and 2000 non-reviews from the movie domain. The 2000 reviews are taken from Pang et al.'s polarity dataset (version 2.0)3, which consists of an equal number of positive and negative reviews. We collect the non-reviews for the movie domain from the Internet Movie Database website4, randomly selecting any documents from this site that are on the movie topic but are not reviews themselves. With this criterion in mind, the 2000 non-review documents we end up with are either movie ads or plot summaries.</Paragraph> <Paragraph position="4"> Training and testing the review identifier. We perform 10-fold cross-validation (CV) experiments on the above dataset, using Joachims' (1999) SVMlight package5 to train an SVM classifier for distinguishing reviews from non-reviews. All learning parameters are set to their default values.6 Each document is first tokenized and downcased, and then represented as a vector of unigrams with length normalization.7 Following Pang et al. (2002), we use presence rather than frequency. In other words, the ith element of the document vector is 1 if the corresponding unigram is present in the document and 0 otherwise. The resulting classifier achieves an accuracy of 99.8%.</Paragraph>
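For illustration, the training setup just described can be approximated with off-the-shelf tools. The sketch below uses scikit-learn's LinearSVC as a stand-in for SVMlight, with a binary CountVectorizer plus L2 normalization approximating the presence-based, length-normalized unigram vectors; all names and the commented usage lines are illustrative assumptions, not part of the original experiments.

# Minimal sketch of the review-identification classifier described above.
# LinearSVC substitutes for Joachims' SVMlight; default parameters are kept.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import Normalizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

def build_review_identifier():
    return make_pipeline(
        # binary=True records unigram presence (1/0) rather than frequency
        CountVectorizer(lowercase=True, binary=True),
        # L2 normalization approximates the paper's length normalization
        Normalizer(),
        LinearSVC(),
    )

# docs: list of raw document strings; labels: 1 = review, 0 = non-review
# scores = cross_val_score(build_review_identifier(), docs, labels, cv=10)
# print(scores.mean())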
<Paragraph position="6"> Classifying neutral reviews and non-reviews.</Paragraph> <Paragraph position="7"> Admittedly, the high accuracy achieved using such a simple set of features is somewhat surprising, although it is consistent with previous results on document-level subjectivity classification, in which accuracies of 94-97% were obtained (Yu and Hatzivassiloglou, 2003; Wiebe et al., 2004).</Paragraph> <Paragraph position="8"> Before concluding that review classification is an easy task, we conduct an additional experiment: we train a review identifier on a new dataset where we keep the same 2000 non-reviews but replace the positive/negative reviews with 2000 neutral reviews (i.e., reviews with a mediocre rating). Intuitively, a neutral review contains fewer terms with strong polarity than a positive/negative review. Hence, this additional experiment allows us to investigate whether the lack of strongly polarized terms in neutral reviews increases the difficulty of the learning task.</Paragraph> <Paragraph position="9"> Our neutral reviews are randomly chosen from Pang et al.'s pool of 27886 unprocessed movie reviews8 that have either a rating of 2 (on a 4-point scale) or 2.5 (on a 5-point scale). Each review then undergoes a semi-automatic preprocessing stage in which (1) HTML tags and any header and trailer information (such as date and author identity) are removed; (2) the document is tokenized and downcased; (3) the rating information is removed using regular expressions; and (4) the document is manually checked to ensure that the rating information has been successfully removed. When trained on this new dataset, the review identifier also achieves an accuracy of 99.8%, suggesting that this learning task is no harder than the previous one.</Paragraph> <Paragraph position="10"> Discussion. We hypothesize that the high accuracies are attributable to the different vocabulary used in reviews and non-reviews. As part of our verification of this hypothesis, we plot the learning curve for each of the above experiments.9 We observe that 99% accuracy is achieved in all cases even when only 200 training instances are used to acquire the review identifier. The ability to separate the two classes with such a small amount of training data seems to imply that features strongly indicative of one or both classes are present. To test this hypothesis, we examine the informative features for both classes. To get these informative features, we rank the features by their weighted log-likelihood ratio (WLLR)10:</Paragraph> <Paragraph position="11"> $P(w_t \mid c_j)\,\log\frac{P(w_t \mid c_j)}{P(w_t \mid \bar{c}_j)}$,</Paragraph> <Paragraph position="12"> where $w_t$ and $c_j$ denote the tth word in the vocabulary and the jth class, respectively, and $\bar{c}_j$ denotes the complement of class $c_j$. Informally, a feature (in our case a unigram) w will have a high rank with respect to a class c if it appears frequently in c and infrequently in other classes. This correlates reasonably well with what we think an informative feature should be. WLLR has been shown to be effective at selecting good features for text classification; other commonly-used feature selection metrics are discussed in Yang and Pedersen (1997). A closer examination of the feature lists sorted by WLLR confirms our hypothesis that each of the two classes has its own set of distinguishing features.</Paragraph>
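The WLLR ranking can be computed directly from class-conditional unigram counts. The function below is a minimal sketch of this computation; the presence-based counts and the add-one-style smoothing are our assumptions, since the paper does not give these implementation details.

# Sketch of ranking unigrams for one class by weighted log-likelihood ratio.
import math
from collections import Counter

def wllr_ranking(docs_by_class, target_class, smoothing=1.0):
    # count each unigram once per document (presence), separately per class
    counts = {c: Counter() for c in docs_by_class}
    totals = {}
    for c, docs in docs_by_class.items():
        for doc in docs:
            counts[c].update(set(doc.lower().split()))
        totals[c] = sum(counts[c].values())
    vocab = set().union(*counts.values())
    other = [c for c in docs_by_class if c != target_class]
    scores = {}
    for w in vocab:
        # smoothed estimates of P(w | c) and P(w | not c)
        p_in = (counts[target_class][w] + smoothing) / (totals[target_class] + smoothing * len(vocab))
        n_out = sum(counts[c][w] for c in other)
        t_out = sum(totals[c] for c in other)
        p_out = (n_out + smoothing) / (t_out + smoothing * len(vocab))
        scores[w] = p_in * math.log(p_in / p_out)
    return sorted(vocab, key=scores.get, reverse=True)

# e.g. wllr_ranking({"review": review_docs, "non-review": other_docs}, "review")[:20]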
<Paragraph position="13"> Experiments with the book domain. To understand whether these good review identification results hold only for the movie domain, we conduct similar experiments with book reviews and non-reviews. Specifically, we collect 1000 book reviews (consisting of a mixture of positive, negative, and neutral reviews) from the Barnes and Noble website11, and 1000 non-reviews that are on the book topic (mostly book summaries) from Amazon.12 We then perform 10-fold CV experiments using these 2000 documents as before, achieving a high accuracy of 96.8%. These results suggest that automatic review identification can be achieved with high accuracy.</Paragraph> </Section> <Section position="6" start_page="613" end_page="613" type="metho"> <SectionTitle> 4 Polarity Classification </SectionTitle> <Paragraph position="0"> Compared to review identification, polarity classification appears to be a much harder task. This section examines the role of various linguistic knowledge sources in our learning-based polarity classification system.</Paragraph> <Section position="1" start_page="613" end_page="613" type="sub_section"> <SectionTitle> 4.1 Experimental Setup </SectionTitle> <Paragraph position="0"> Like several previous studies (e.g., Mullen and Collier (2004), Pang and Lee (2004), Whitelaw et al. (2005)), we view polarity classification as a supervised learning task. As in review identification, we use SVMlight with default parameter settings to train polarity classifiers13, reporting all results as 10-fold CV accuracy.</Paragraph> <Paragraph position="2"> We evaluate our polarity classifiers on two movie review datasets, each of which consists of 1000 positive reviews and 1000 negative reviews.</Paragraph> <Paragraph position="3"> The first one, which we will refer to as Dataset A, is the Pang et al. polarity dataset (version 2.0). The second one (Dataset B) was created by us with the sole purpose of providing additional experimental results. Reviews in Dataset B were randomly chosen from Pang et al.'s pool of 27886 unprocessed movie reviews (see Section 3) that have either a positive or a negative rating. We followed Pang et al.'s guidelines exactly when determining whether a review is positive or negative.14 We also took care to ensure that reviews included in Dataset B do not appear in Dataset A. We applied to these reviews the same four pre-processing steps that we applied to the neutral reviews in the previous section.</Paragraph>
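The four pre-processing steps referred to above are the ones listed in Section 3. As a rough sketch of their automatic portion (steps 1-3), assuming illustrative regular expressions that the paper does not actually specify, one might write:

# Sketch of steps (1)-(3); step (4), the manual check, is not automated here.
import re

def preprocess_review(raw_html: str) -> str:
    # (1) strip HTML tags (site-specific header/trailer removal omitted)
    text = re.sub(r"<[^>]+>", " ", raw_html)
    # (2) downcase; tokenization is reduced to whitespace splitting below
    text = text.lower()
    # (3) remove rating information; these patterns (e.g. "3 out of 4",
    #     "rating: 2.5/5") are assumptions, not the paper's actual expressions
    text = re.sub(r"\b\d(\.\d)?\s*(/|out of)\s*\d\b", " ", text)
    text = re.sub(r"rating\s*:?\s*\S+", " ", text)
    return " ".join(text.split())

</Section> </Section> </Paper>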