<?xml version="1.0" standalone="yes"?> <Paper uid="W06-1652"> <Title>Feature Subsumption for Opinion Analysis</Title> <Section position="5" start_page="443" end_page="443" type="metho"> <SectionTitle> 3 Data Sets </SectionTitle> <Paragraph position="0"> We used three opinion-related data sets for our analyses and experiments: the OP data set created by Wiebe et al. (2004), the Polarity data set created by Pang and Lee (2004), and the MPQA data set created by Wiebe et al. (2005). The OP and Polarity data sets involve document-level opinion classification, while the MPQA data set involves sentence-level classification.</Paragraph> <Paragraph position="1"> The OP data consists of 2,452 documents from the Penn Treebank (Marcus et al., 1993). Metadata tags assigned by the Wall Street Journal define the opinion/non-opinion classes: the class of any document labeled Editorial, Letter to the Editor, Arts & Leisure Review, or Viewpoint by the Wall Street Journal is opinion, and the class of documents in all other categories (such as Business and News) is non-opinion. This data set is highly skewed, with only 9% of the documents belonging to the opinion class. Consequently, a trivial (but useless) opinion classifier that labels all documents as non-opinion articles would achieve 91% accuracy.</Paragraph> <Paragraph position="2"> The Polarity data consists of 700 positive and 700 negative reviews from the Internet Movie Database (IMDb) archive. The positive and negative classes were derived from author ratings expressed in stars or numerical values. The MPQA data consists of English-language versions of articles from the world press. It contains 9,732 sentences that have been manually annotated for subjective expressions. The opinion/non-opinion classes are derived from the lower-level annotations: a sentence is an opinion if it contains a subjective expression of medium or higher intensity; otherwise, it is a non-opinion sentence. Of these sentences, 55% belong to the opinion class.</Paragraph> </Section>
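To make the effect of this class skew concrete, the following minimal sketch (ours, not taken from the paper) shows how the trivial-baseline figures quoted above follow directly from the class distributions: a classifier that always predicts the majority class scores roughly 91% accuracy on the OP data but only about 55% on the MPQA data. The toy label lists are hypothetical and merely mirror the reported proportions.

from collections import Counter

def majority_baseline_accuracy(labels):
    # Accuracy of a trivial classifier that always predicts the most frequent class.
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

# Toy label lists mirroring the skews reported above (9% opinion for OP,
# 55% opinion for MPQA); the real data sets are of course much larger.
op_labels = ["opinion"] * 9 + ["non-opinion"] * 91
mpqa_labels = ["opinion"] * 55 + ["non-opinion"] * 45

print(majority_baseline_accuracy(op_labels))    # 0.91
print(majority_baseline_accuracy(mpqa_labels))  # 0.55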
<Section position="6" start_page="443" end_page="445" type="metho"> <SectionTitle> 4 Using the Subsumption Hierarchy for Analysis </SectionTitle> <Paragraph position="0"> In this section, we illustrate how the subsumption hierarchy can be used as an analytic tool to automatically identify features that substantially outperform simpler counterparts. These features represent specialized usages and expressions that would be good candidates for addition to a subjectivity lexicon. Figure 6 shows pairs of features, where the first is more general and the second is more specific. These feature pairs were identified by the subsumption hierarchy as being representationally similar but behaviorally different (so the more specific feature was retained). The IGain column shows the information gain values produced from the training set of one cross-validation fold. The Class column shows the class that the more specific feature is correlated with (the more general feature is usually not strongly correlated with either class).</Paragraph> <Paragraph position="1"> The top table in Figure 6 contains examples for the opinion/non-opinion classification task from the OP data (in the feature labels, 1 denotes a unigram, 2 a bigram, and EP an extraction pattern). The more specific features are more strongly correlated with opinion articles. Surprisingly, simply adding a determiner can dramatically change behavior. Consider A2. There are many subjective idioms involving "the line" (two are shown in the table; others include "toe the line" and "draw the line"), while objective language about credit lines, phone lines, etc. uses the determiner less often. Similarly, consider B2. Adding "a" to "nation" often corresponds to an abstract reference used when making an argument (e.g., "a nation of ascetics"), whereas other instances of "nation" are used more literally (e.g., "the 6th largest in the nation"). Of feature B1's instances, 21% appear in opinion articles, while 70% of feature B2's instances do.</Paragraph> <Paragraph position="2"> "Begin with" (C2) captures an adverbial phrase used in argumentation ("To begin with...") but does not match objective usages such as "will begin an action." The word "benefits" alone (D1) matches phrases like "tax benefits" and "employee benefits" that are not opinion expressions, while DEP typically matches positive senses of the word "benefits". Interestingly, the bigram "benefits to" is not highly correlated with opinions because it matches infinitive phrases such as "tax benefits to provide" and "health benefits to cut". In this case, the extraction pattern NP Prep(benefits to) is more discriminating than the bigram for opinion classification. The extraction pattern EEP is also highly correlated with opinions, while the unigram "due" and the bigram "due to" are not.</Paragraph> <Paragraph position="3"> The bottom table in Figure 6 shows feature pairs identified for their behavioral differences on the Polarity data set, where the task is to distinguish positive reviews from negative reviews. F2 and G2 are bigrams that behave differently from their component unigrams. The expression "nothing short (of)" is typically used to express positive sentiments, while "nothing" and "short" by themselves are not. The word "ugly" is often used as a descriptive modifier that does not express a sentiment per se, while "and ugly" appears in predicate adjective constructions that do express a negative sentiment. The extraction pattern HEP is more discriminating than H1 because it distinguishes negative sentiments ("the film is a disaster!") from plot descriptions ("the disaster movie..."). IEP shows that active-voice usages of "work" are strong positive indicators, while the unigram "work" appears in a variety of both positive and negative contexts. Finally, JEP shows that the expression "manages to keep" is a strong positive indicator, while "manages" by itself is much less discriminating.</Paragraph> <Paragraph position="4"> These examples illustrate that the subsumption hierarchy can be a powerful tool to better understand the behaviors of different kinds of features, and to identify specific features that may be desirable for inclusion in specialized lexical resources.</Paragraph> </Section>
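The comparison behind Figure 6 rests on information gain computed over binary presence/absence features. The sketch below is our own illustration of that kind of comparison, not the authors' code; the retention test (keep the more specific feature only when its information gain exceeds that of its more general counterpart by a small margin delta) is our reading of the behavioral criterion, guided by the delta values used later in Section 5.1.

import math
from collections import Counter

def entropy(labels):
    # Entropy of a non-empty list of class labels.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_present, labels):
    # IG of a binary presence/absence feature with respect to the class labels.
    n = len(labels)
    gain = entropy(labels)
    for value in (True, False):
        subset = [lab for f, lab in zip(feature_present, labels) if f == value]
        if subset:
            gain -= (len(subset) / n) * entropy(subset)
    return gain

def retain_specific(ig_general, ig_specific, delta=0.001):
    # Assumed retention test: keep the more specific feature (e.g., the bigram
    # "a nation") only if it is measurably more predictive of the class than
    # its more general counterpart (the unigram "nation").
    return ig_specific > ig_general + delta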
<Section position="7" start_page="445" end_page="447" type="metho"> <SectionTitle> 5 Using the Subsumption Hierarchy to Reduce Feature Sets </SectionTitle> <Paragraph position="0"> When creating opinion classifiers, people often throw in a variety of features and trust the machine learning algorithm to figure out how to make the best use of them. However, we hypothesized that classifiers may perform better if we can proactively eliminate features that are not necessary because they are subsumed by other features. In this section, we present a series of experiments to explore this hypothesis. First, we present the results for an SVM classifier trained using different sets of unigram, bigram, and extraction pattern features, both before and after subsumption. Next, we evaluate a standard feature selection approach as an alternative to subsumption, and then show that combining subsumption with standard feature selection produces the best results of all.</Paragraph> <Section position="1" start_page="445" end_page="446" type="sub_section"> <SectionTitle> 5.1 Classification Experiments </SectionTitle> <Paragraph position="0"> To see whether feature subsumption can improve classification performance, we trained an SVM classifier for each of the three opinion data sets.</Paragraph> <Paragraph position="1"> We used the SVMlight (Joachims, 1998) package with a linear kernel. For the Polarity and OP data we discarded all features with frequency < 5, and for the MPQA data we discarded features with frequency < 2 because this data set is substantially smaller. All of our experimental results are averages over 3-fold cross-validation.</Paragraph> <Paragraph position="2"> First, we created 4 baseline classifiers: a 1Gram classifier that uses only the unigram features; a 1+2Gram classifier that uses unigram and bigram features; a 1+EP classifier that uses unigram and extraction pattern features; and a 1+2+EP classifier that uses all three types of features. Next, we created analogous 1+2Gram, 1+EP, and 1+2+EP classifiers but applied the subsumption hierarchy first to eliminate unnecessary features before training the classifier. We experimented with three delta values for the subsumption process: δ = .0005, .001, and .002.</Paragraph> <Paragraph position="3"> Figures 7, 8, and 9 show the results. The subsumption process produced small but consistent improvements on all 3 data sets. For example, Figure 8 shows the results on the OP data, where all of the accuracy values produced after subsumption (the rightmost 3 columns) are higher than the accuracy values produced without subsumption (the Base[line] column). For all three data sets, the best overall accuracy (shown in boldface) was always achieved after subsumption.</Paragraph> <Paragraph position="4"> We also observed that subsumption had a dramatic effect on the F-measure scores on the OP data, which are shown in Figure 10. The OP data set is fundamentally different from the other data sets because it is so highly skewed, with 91% of the documents belonging to the non-opinion class.</Paragraph> <Paragraph position="5"> Without subsumption, the classifier was conservative about assigning documents to the opinion class, achieving F-measure scores in the 82-88 range. After subsumption, the overall accuracy improved, but the F-measure scores increased more dramatically. These numbers show that the subsumption process produced not only a more accurate classifier, but a more useful classifier that identifies more documents as being opinion articles.</Paragraph> <Paragraph position="6"> For the MPQA data, we get a very small improvement of 0.1% (74.8% → 74.9%) using subsumption. But note that without subsumption the performance actually decreased when bigrams and extraction patterns were added! The subsumption process counteracted the negative effect of adding the more complex features.</Paragraph> </Section>
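For concreteness, the following is a schematic reconstruction of the experimental setup just described; it is our sketch, not the authors' code. The paper used SVMlight with a linear kernel, for which scikit-learn's LinearSVC stands in here, and apply_subsumption is a hypothetical placeholder for the pruning step described in this section. Note one simplification: the paper computes the pruning statistics per training fold, whereas this sketch applies pruning once, before cross-validation, to keep the example short.

from collections import Counter
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

def frequency_filter(docs, min_freq):
    # Drop features occurring fewer than min_freq times over the whole data set
    # (min_freq = 5 for the OP and Polarity data, 2 for the smaller MPQA data).
    totals = Counter()
    for doc in docs:
        totals.update(doc)
    keep = {f for f, c in totals.items() if c >= min_freq}
    return [{f: c for f, c in doc.items() if f in keep} for doc in docs]

def evaluate(docs, labels, min_freq, apply_subsumption=None, delta=0.001):
    # docs: list of {feature: count} dicts mixing unigram, bigram, and
    # extraction-pattern features; labels: list of class labels.
    docs = frequency_filter(docs, min_freq)
    if apply_subsumption is not None:
        docs = apply_subsumption(docs, labels, delta)  # hypothetical pruning step
    X = DictVectorizer().fit_transform(docs)
    clf = LinearSVC()  # linear-kernel SVM, standing in for SVMlight
    return cross_val_score(clf, X, labels, cv=3).mean()  # 3-fold cross-validation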
<Section position="2" start_page="446" end_page="447" type="sub_section"> <SectionTitle> 5.2 Feature Selection Experiments </SectionTitle> <Paragraph position="0"> We conducted a second series of experiments to determine whether a traditional feature selection approach would produce the same, or better, improvements as subsumption. For each feature, we computed its information gain (IG) and then selected the N features with the highest scores. We experimented with values of N ranging from 1,000 to 10,000 in increments of 1,000.</Paragraph> <Paragraph position="1"> We hypothesized that applying subsumption before traditional feature selection might also help to identify a more diverse set of high-performing features. In a parallel set of experiments, we explored this hypothesis by first applying subsumption to reduce the size of the feature set and then selecting the best N features using information gain.</Paragraph> <Paragraph position="2"> Figures 11, 12, and 13 show the results of these experiments for the 1+2+EP classifiers. Each graph shows four lines. One line corresponds to the baseline classifier with no subsumption, and another corresponds to the baseline classifier with subsumption using the best δ value for that data set. Each of these two lines corresponds to just a single data point (accuracy value), but we drew that value as a line across the graph for the sake of comparison. The other two lines on the graph correspond to (a) feature selection for different values of N (shown on the x-axis), and (b) subsumption followed by feature selection for different values of N.</Paragraph> <Paragraph position="3"> On all 3 data sets, traditional feature selection performs worse than the baseline in some cases, and it virtually never outperforms the best classifier trained after subsumption (but without feature selection). Furthermore, the combination of subsumption plus feature selection generally performs best of all, and nearly always outperforms feature selection alone. For all 3 data sets, our best accuracy results were achieved by performing subsumption prior to feature selection. The best accuracy results are 99.0% on the OP data, 83.1% on the Polarity data, and 75.4% on the MPQA data.</Paragraph> <Paragraph position="4"> For the OP data, the improvements over the baseline in both accuracy and F-measure are statistically significant at the p < 0.05 level (paired t-test). For the MPQA data, the improvement over the baseline is statistically significant at the p < 0.10 level.</Paragraph> </Section> </Section> </Paper>