<?xml version="1.0" standalone="yes"?>
<Paper uid="P04-1034">
  <Title>The Sentimental Factor: Improving Review Classification via Human-Provided Information</Title>
  <Section position="3" start_page="0" end_page="2" type="metho">
    <SectionTitle>
2 A Logistic Model for Sentiment
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Turney's Sentiment Classifier
</SectionTitle>
      <Paragraph position="0"> In Turney's model, the &amp;quot;sentiment orientation&amp;quot; of word w is estimated as follows.</Paragraph>
      <Paragraph position="2"> Here, Na is the total number of sites on the Internet that contain an occurrence of a - a feature that can be a word type or a phrase. N(w;a) is the number of sites in which features w and a appear &amp;quot;near&amp;quot; each other, i.e. in the same passage of text, within a span of ten words. Both numbers are obtained from the hit count that results from a query of the AltaVista search engine. The rationale for this estimate is that words that express similar sentiment often co-occur, while words that express conflicting sentiment co-occur more rarely. Thus, a word that co-occurs more frequently with excellent than poor is estimated to have a positive sentiment orientation.</Paragraph>
      <Paragraph position="3"> To extrapolate from words to documents, the estimated sentiment ^s 2 f 1; 1g of a review document d is the sign of the average sentiment orientation of its constituent features.1 To represent this estimate formally, we introduce the following notation: W is a &amp;quot;dictionary&amp;quot; of features: (w1;::: ;wp). Each feature's respective sentiment orientation is represented as an entry in the vector ^ of length p:</Paragraph>
      <Paragraph position="5"> Given a collection of n review documents, the i-th each di is also represented as a vector of length p, with dij equal to the number of times that feature wj occurs in di. The length of a document is its total number of features, jdij = Ppj=1 dij.</Paragraph>
      <Paragraph position="6"> Turney's classifier for the i-th document's sentiment si can now be written:</Paragraph>
      <Paragraph position="8"> Using a carefully chosen collection of features, this classifier produces correct results on 65.8% of a collection of 120 movie reviews, where 60 are labeled positive and 60 negative. Although this is not a particularly encouraging result, movie reviews tend to be a difficult domain. Accuracy on sentiment classification in other domains exceeds 80% (Turney, 2002).</Paragraph>
      <Paragraph position="9"> 1Note that not all words or phrases need to be considered as features. In Turney (2002), features are selected according to part-of-speech labels.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Naive Bayes Classification
</SectionTitle>
      <Paragraph position="0"> Bayes' Theorem provides a convenient framework for predicting a binary response s 2 f 1; 1g from a feature vector x:</Paragraph>
      <Paragraph position="2"> For a labeled sample of data (xi;si);i = 1;:::;n, a class's marginal probability k can be estimated trivially as the proportion of training samples belonging to the class. Thus the critical aspect of classification by Bayes' Theorem is to estimate the conditional distribution of x given s. Naive Bayes simplifies this problem by making a &amp;quot;naive&amp;quot; assumption: within a class, the different feature values are taken to be independent of one another.</Paragraph>
      <Paragraph position="4"> As a result, the estimation problem is reduced to univariate distributions.</Paragraph>
      <Paragraph position="5"> Naive Bayes for a Multinomial Distribution We consider a &amp;quot;bag of words&amp;quot; model for a document that belongs to class k, where features are assumed to result from a sequence of jdij independent multinomial draws with outcome probability vector qk = (qk1;::: ;qkp).</Paragraph>
      <Paragraph position="6"> Given a collection of documents with labels, (di;si);i = 1;::: ;n, a natural estimate for qkj is the fraction of all features in documents of class k that equal wj:</Paragraph>
      <Paragraph position="8"> In the two-class case, the logit transformation provides a revealing representation of the class posterior probabilities of the Naive Bayes model.</Paragraph>
      <Paragraph position="9"> dlogit(sjd) , log cPr(s = 1jd)c</Paragraph>
      <Paragraph position="11"> Observe that the estimate for the logit in Equation 9 has a simple structure: it is a linear function of d. Models that take this form are commonplace in classification.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="2" type="sub_section">
      <SectionTitle>
2.3 Turney's Classifier as Naive Bayes
</SectionTitle>
      <Paragraph position="0"> Although Naive Bayes classification requires a labeled corpus of documents, we show in this section that Turney's approach corresponds to a Naive Bayes model. The necessary documents and their corresponding labels are built from the spans of text that surround the anchor words excellent and poor.</Paragraph>
      <Paragraph position="1"> More formally, a labeled corpus may be produced by the following procedure:  1. For a particular anchor ak, locate all of the sites on the Internet where it occurs.</Paragraph>
      <Paragraph position="2"> 2. From all of the pages within a site, gather the features that occur within ten words of an occurrence of ak, with any particular feature included at most once. This list comprises a new &amp;quot;document,&amp;quot; representing that site.2 3. Label this document +1 if ak = excellent, -1  if ak = poor.</Paragraph>
      <Paragraph position="3"> When a Naive Bayes model is fit to the corpus described above, it results in a vector ^ of length p, consisting of coefficient estimates for all features. In Propositions 1 and 2 below, we show that Turney's estimates of sentiment orientation ^ are closely related to ^ , and that both estimates produce identical classifiers.</Paragraph>
      <Paragraph position="5"> Proof: Because a feature is restricted to at most one occurrence in a document,</Paragraph>
      <Paragraph position="7"> observation to its most probable class. This is equivalent to classifying according to the sign of the estimated logit. So for any document, we must show that both the logit estimate and the average sentiment orientation are identical in sign.</Paragraph>
      <Paragraph position="8"> When 1 = 0:5, 0 = 0. Thus the estimated logit</Paragraph>
      <Paragraph position="10"> This is a positive multiple of Turney's classifier (Equation 3), so they clearly match in sign. 2</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="2" end_page="2" type="metho">
    <SectionTitle>
3 A More Versatile Model
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
3.1 Desired Extensions
</SectionTitle>
      <Paragraph position="0"> By understanding Turney's model within a Naive Bayes framework, we are able to interpret its output as a probability model for document classes. In the presence of labeled examples, this insight also makes it possible to estimate the intercept term 0.</Paragraph>
      <Paragraph position="1"> Further, we are able to view this model as a member of a broad class: linear estimates for the logit.</Paragraph>
      <Paragraph position="2"> This understanding facilitates further extensions, in particular, utilizing the following:  1. Labeled documents 2. More anchor words  The reason for using labeled documents is straightforward; labels offer validation for any chosen model. Using additional anchors is desirable in part because it is inexpensive to produce lists of words that are believed to reflect positive sentiment, perhaps by reference to a thesaurus. In addition, a single anchor may be at once too general and too specific.</Paragraph>
      <Paragraph position="3"> An anchor may be too general in the sense that many common words have multiple meanings, and not all of them reflect a chosen sentiment orientation. For example, poor can refer to an objective economic state that does not necessarily express negative sentiment. As a result, a word such as income appears 4.18 times as frequently with poor as excellent, even though it does not convey negative sentiment. Similarly, excellent has a technical meaning in antiquity trading, which causes it to appear 3.34 times as frequently with furniture.</Paragraph>
      <Paragraph position="4"> An anchor may also be too specific, in the sense that there are a variety of different ways to express sentiment, and a single anchor may not capture them all. So a word like pretentious carries a strong negative sentiment but co-occurs only slightly more frequently (1.23 times) with excellent than poor. Likewise, fascination generally reflects a positive sentiment, yet it appears slightly more frequently (1.06 times) with poor than excellent.</Paragraph>
    </Section>
    <Section position="2" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
3.2 Other Sources of Unlabeled Data
</SectionTitle>
      <Paragraph position="0"> The use of additional anchors has a drawback in terms of being resource-intensive. A feature set may contain many words and phrases, and each of them requires a separate AltaVista query for every chosen anchor word. In the case of 30,000 features and ten queries per minute, downloads for a single anchor word require over two days of data collection.</Paragraph>
      <Paragraph position="1"> An alternative approach is to access a large collection of documents directly. Then all co-occurrences can be counted in a single pass.</Paragraph>
      <Paragraph position="2"> Although this approach dramatically reduces the amount of data available, it does offer several advantages. null Increased Query Options Search engine queries of the form phrase NEAR anchor may not produce all of the desired co-occurrence counts. For instance, one may wish to run queries that use stemmed words, hyphenated words, or punctuation marks. One may also wish to modify the definition of NEAR, or to count individual co-occurrences, rather than counting sites that contain at least one co-occurrence.</Paragraph>
      <Paragraph position="3"> Topic Matching Across the Internet as a whole, features may not exhibit the same correlation structure as they do within a specific domain. By restricting attention to documents within a domain, one may hope to avoid co-occurrences that are primarily relevant to other subjects.</Paragraph>
      <Paragraph position="4"> Reproducibility On a fixed corpus, counts of word occurrences produce consistent results.</Paragraph>
      <Paragraph position="5"> Due to the dynamic nature of the Internet, numbers may fluctuate.</Paragraph>
    </Section>
    <Section position="3" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
3.3 Co-Occurrences and Derived Features
</SectionTitle>
      <Paragraph position="0"> The Naive Bayes coefficient estimate ^ j may itself be interpreted as an intercept term plus a linear combination of features of the form log N(wj;ak).</Paragraph>
      <Paragraph position="1"> Num. of Labeled Occurrences Correlation</Paragraph>
      <Paragraph position="3"> We generalize this estimate as follows: for a collection of K different anchor words, we consider a general linear combination of logged co-occurrence counts.</Paragraph>
      <Paragraph position="5"> In the special case of a Naive Bayes model, k = 1 when the k-th anchor word ak conveys positive sentiment, 1 when it conveys negative sentiment.</Paragraph>
      <Paragraph position="6"> Replacing the logit estimate in Equation 9 with an estimate of this form, the model becomes:</Paragraph>
      <Paragraph position="8"> This model has only K + 1 parameters: 0; 1;::: ; K. These can be learned straightforwardly from labeled documents by a method such as logistic regression.</Paragraph>
      <Paragraph position="9"> Observe that a document receives a score for each anchor word Ppj=1 dj log N(wj;ak). Effectively, the predictor variables in this model are no longer counts of the original features dj. Rather, they are</Paragraph>
    </Section>
    <Section position="4" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
cient Estimates
</SectionTitle>
      <Paragraph position="0"> inner products between the entire feature vector d and the logged co-occurence vector N(w;ak). In this respect, the vector of logged co-occurrences is used to produce derived feature.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="2" end_page="2" type="metho">
    <SectionTitle>
4 Data Analysis
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
4.1 Accuracy of Unsupervised Coefficients
</SectionTitle>
      <Paragraph position="0"> By means of a Perl script that uses the Lynx browser, Version 2.8.3rel.1, we download AltaVista hit counts for queries of the form &amp;quot;target NEAR anchor.&amp;quot; The initial list of targets consists of 44,321 word types extracted from the Pang corpus of 1400 labeled movie reviews. After preprocessing, this number is reduced to 28,629.3 In Figure 1, we compare estimates produced by two Naive Bayes procedures. For each feature wj, we estimate j by using Turney's procedure, and by fitting a traditional Naive Bayes model to the labeled documents. The traditional estimates are smoothed by assuming a Beta prior distribution that is equivalent to having four previous observations of wj in documents of each class.</Paragraph>
      <Paragraph position="2"> Here, dij is used to indicate feature presence:</Paragraph>
      <Paragraph position="4"> 3We eliminate extremely rare words by requiring each target to co-occur at least once with each anchor. In addition, certain types, such as words containing hyphens, apostrophes, or other punctuation marks, do not appear to produce valid counts, so they are discarded.</Paragraph>
      <Paragraph position="5">  We choose this fitting procedure among several candidates because it performs well in classifying test documents.</Paragraph>
      <Paragraph position="6"> In Figure 1, each entry in the right-hand column is the observed correlation between these two estimates over a subset of features. For features that occur in five documents or fewer, the correlation is very weak (0.022). This is not surprising, as it is difficult to estimate a coefficient from such a small number of labeled examples. Correlations are stronger for more common features, but never strong. As a baseline for comparison, Naive Bayes coefficients can be estimated using a subset of their labeled occurrences. With two independent sets of 51-75 occurrences, Naive Bayes coefficient estimates had a correlation of 0.475.</Paragraph>
      <Paragraph position="7"> Figure 2 is a scatterplot of the same coefficient estimates for word types that appear in 51 to 100 documents. The great majority of features do not have large coefficients, but even for the ones that do, there is not a tight correlation.</Paragraph>
    </Section>
    <Section position="2" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
4.2 Additional Anchors
</SectionTitle>
      <Paragraph position="0"> We wish to learn how our model performance depends on the choice and number of anchor words.</Paragraph>
      <Paragraph position="1"> Selecting from WordNet synonym lists (Fellbaum, 1998), we choose five positive anchor words and five negative (Figure 3). This produces a total of 25 different possible pairs for use in producing co-efficient estimates.</Paragraph>
      <Paragraph position="2"> Figure 4 shows the classification performance of unsupervised procedures using the 1400 labeled Pang documents as test data. Coefficients ^ j are estimated as described in Equation 22. Several different experimental conditions are applied. The methods labeled &amp;quot;Count&amp;quot; use the original un-normalized coefficients, while those labeled &amp;quot;Norm.&amp;quot; have been normalized so that the number of co-occurrences with each anchor have identical variance. Results are shown when rare words (with three or fewer occurrences in the labeled corpus) are included and omitted. The methods &amp;quot;pair&amp;quot; and &amp;quot;10&amp;quot; describe whether all ten anchor coefficients are used at once, or just the ones that correspond to a single pair of</Paragraph>
    </Section>
    <Section position="3" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
Unsupervised Approaches
</SectionTitle>
      <Paragraph position="0"> anchor words. For anchor pairs, the mean error across all 25 pairs is reported, along with its standard deviation.</Paragraph>
      <Paragraph position="1"> Patterns are consistent across the different conditions. A relatively large improvement comes from using all ten anchor words. Smaller benefits arise from including rare words and from normalizing model coefficients.</Paragraph>
      <Paragraph position="2"> Models that use the original pair of anchor words, excellent and poor, perform slightly better than the average pair. Whereas mean performance ranges from 37.3% to 39.6%, misclassification rates for this pair of anchors ranges from 37.4% to 38.1%.</Paragraph>
    </Section>
    <Section position="4" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
4.3 A Smaller Unlabeled Corpus
</SectionTitle>
      <Paragraph position="0"> As described in Section 3.2, there are several reasons to explore the use of a smaller unlabeled corpus, rather than the entire Internet. In our experiments, we use additional movie reviews as our documents. For this domain, Pang makes available 27,886 reviews.4 Because this corpus offers dramatically fewer instances of anchor words, we modify our estimation procedure. Rather than discarding words that rarely co-occur with anchors, we use the same feature set as before and regularize estimates by the same procedure used in the Naive Bayes procedure described earlier.</Paragraph>
      <Paragraph position="1"> Using all features, and ten anchor words with normalized scores, test error is 35.0%. This suggests that comparable results can be attained while referring to a considerably smaller unlabeled corpus. Rather than requiring several days of downloads, the count of nearby co-occurrences was completed in under ten minutes.</Paragraph>
      <Paragraph position="2"> Because this procedure enables fast access to counts, we explore the possibility of dramatically enlarging our collection of anchor words. We col- null tor model with estimated coefficients. The dashed curve uses a Naive Bayes classifier. The two horizontal lines represent unsupervised estimates; the upper one is for the original unsupervised classifier, and the lower is for the most successful unsupervised method.</Paragraph>
      <Paragraph position="3"> lect data for the complete set of WordNet synonyms for the words good, best, bad, boring, and dreadful. This yields a total of 83 anchor words, 35 positive and 48 negative. When all of these anchors are used in conjunction, test error increases to 38.3%. One possible difficulty in using this automated procedure is that some synonyms for a word do not carry the same sentiment orientation. For instance, intense is listed as a synonym for bad, even though its presence in a movie review is a strongly positive indication.5</Paragraph>
    </Section>
    <Section position="5" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
4.4 Methods with Supervision
</SectionTitle>
      <Paragraph position="0"> As demonstrated in Section 3.3, each anchor word ak is associated with a coefficient k. In unsupervised models, these coefficients are assumed to be known. However, when labeled documents are available, it may be advantageous to estimate them.</Paragraph>
      <Paragraph position="1"> Figure 5 compares the performance of a model with estimated coefficient vector , as opposed to unsupervised models and a traditional supervised approach. When a moderate number of labeled documents are available, it offers a noticeable improvement. null The supervised method used for reference in this case is the Naive Bayes model that is described in section 4.1. Naive Bayes classification is of particular interest here because it converges faster to its asymptotic optimum than do discriminative methods (Ng, A. Y. and Jordan, M., 2002). Further, with 5In the labeled Pang corpus, intense appears in 38 positive reviews and only 6 negative ones.</Paragraph>
      <Paragraph position="2"> a larger number of labeled documents, its performance on this corpus is comparable to that of Support Vector Machines and Maximum Entropy models (Pang et al., 2002).</Paragraph>
      <Paragraph position="3"> The coefficient vector is estimated by regularized logistic regression. This method has been used in other text classification problems, as in Zhang and Yang (2003). In our case, the regularization6 is introduced in order to enforce the beliefs that:  For further information on regularized model fitting, see for instance, Hastie et al. (2001).</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML