<?xml version="1.0" standalone="yes"?> <Paper uid="P06-2079"> <Title>Examining the Role of Linguistic Knowledge Sources in the Automatic Identification and Classification of Reviews</Title> <Section position="3" start_page="0" end_page="611" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Sentiment analysis involves the identification of positive and negative opinions from a text segment. The task has recently received a lot of attention, with applications ranging from multi-perspective question-answering (e.g., Cardie et al. (2004)) to opinion-oriented information extraction (e.g., Riloff et al. (2005)) and summarization (e.g., Hu and Liu (2004)). Research in sentiment analysis has generally proceeded at three levels, aiming to identify and classify opinions from documents, sentences, and phrases. This paper examines two problems in document-level sentiment analysis, focusing on analyzing a particular type of opinionated document: reviews.</Paragraph> <Paragraph position="1"> The first problem, polarity classification, has the goal of determining a review's polarity: positive ("thumbs up") or negative ("thumbs down").</Paragraph> <Paragraph position="2"> Recent work has expanded the polarity classification task to additionally handle documents expressing a neutral sentiment. Although studied fairly extensively, polarity classification remains a challenge to natural language processing systems.</Paragraph> <Paragraph position="3"> We will focus on an important linguistic aspect of polarity classification: examining the role of a variety of simple, yet under-investigated, linguistic knowledge sources in a learning-based polarity classification system.
Specifically, we will show how to build a high-performing polarity classifier by exploiting information provided by (1) high-order n-grams, (2) a lexicon composed of adjectives manually annotated with their polarity information (e.g., happy is annotated as positive and terrible as negative), (3) dependency relations derived from dependency parses, and (4) objective terms and phrases extracted from neutral documents.</Paragraph> <Paragraph position="4"> As mentioned above, the majority of work on document-level sentiment analysis to date has focused on polarity classification, assuming as input a set of reviews to be classified. A relevant question is: what if we don't know that an input document is a review in the first place? The second task we will examine in this paper, review identification, attempts to address this question.</Paragraph> <Paragraph position="5"> Specifically, review identification seeks to determine whether a given document is a review or not.</Paragraph> <Paragraph position="6"> We view both review identification and polarity classification as classification tasks. For review identification, we train a classifier to distinguish movie reviews from movie-related non-reviews (e.g., movie ads, plot summaries) using only unigrams as features, obtaining an accuracy of over 99% via 10-fold cross-validation. Similar experiments using documents from the book domain also yield an accuracy as high as 97%.</Paragraph> <Paragraph position="7"> An analysis of the results reveals that the high accuracy can be attributed to the difference in the vocabulary employed in reviews and non-reviews: while reviews can be composed of a mixture of subjective and objective language, our non-review documents rarely contain subjective expressions.</Paragraph> <Paragraph position="8"> Next, we learn our polarity classifier using positive and negative reviews taken from two movie review datasets, one assembled by Pang and Lee (2004) and the other by ourselves.
The resulting classifier, when trained on a feature set derived from the four types of linguistic knowledge sources mentioned above, achieves 10-fold cross-validation accuracies of 90.5% and 86.1% on Pang et al.'s dataset and ours, respectively. To our knowledge, our result on Pang et al.'s dataset is one of the best reported to date. Perhaps more importantly, an analysis of these results shows that the various types of features interact in an interesting manner, allowing us to draw conclusions that provide new insights into polarity classification.</Paragraph> </Section> </Paper>