<?xml version="1.0" standalone="yes"?> <Paper uid="P99-1032"> <Title>Development and Use of a Gold-Standard Data Set for Subjectivity Classifications</Title> <Section position="5" start_page="246" end_page="246" type="metho"> <SectionTitle> 4. The South African Broadcasting Corp. </SectionTitle> <Paragraph position="0"> said the song &quot;Freedom Now&quot; was &quot;undesirable for broadcasting.&quot; Subjective speech-event sentence.</Paragraph> <Paragraph position="1"> In sentence 4, there is no uncertainty or evaluation expressed toward the speaking event.</Paragraph> <Paragraph position="2"> Thus, from one point of view, one might have considered this sentence to be objective. However, the object of the sentence is not presented as material that is factual to the reporter, so the sentence is classified as subjective.</Paragraph> <Paragraph position="3"> Linguistic categorizations usually do not cover all instances perfectly. For example, sentences may fall on the borderline between two categories. [Footnote 1: The category specifications in the coding manual are based on our previous work on tracking point of view (Wiebe, 1994), which builds on Banfield's (1982) linguistic theory of subjectivity.]</Paragraph> <Paragraph position="4"> To allow for uncertainty in the annotation process, the specific tags used in this work include certainty ratings, ranging from 0, for least certain, to 3, for most certain. As discussed below in section 3.2, the certainty ratings allow us to investigate whether a model positing additional categories provides a better description of the judges' annotations than a binary model does.</Paragraph> <Paragraph position="5"> Subjective and objective categories are potentially important for many text processing applications, such as information extraction and information retrieval, where the evidential status of information is important. In generation and machine translation, it is desirable to generate text that is appropriately subjective or objective (Hovy, 1987). In summarization, subjectivity judgments could be included in document profiles, to augment automatically produced document summaries, and to help the user make relevance judgments when using a search engine. In addition, they would be useful in text categorization. In related work (Wiebe et al., in preparation), we found that article types, such as announcement and opinion piece, are significantly correlated with the subjective and objective classification.</Paragraph> <Paragraph position="6"> Our subjective category is related to but differs from the statement-opinion category of the Switchboard-DAMSL discourse annotation project (Jurafsky et al., 1997), as well as the gives opinion category of Bales' (1950) model of small-group interaction. All involve expressions of opinion, but while our category specifications focus on evidentiality in text, theirs focus on how conversational participants interact with one another in dialog.</Paragraph> </Section> <Section position="6" start_page="246" end_page="248" type="metho"> <SectionTitle> 3 Statistical Tools </SectionTitle> <Paragraph position="0"> Table 1 presents data for two judges. The rows correspond to the tags assigned by judge 1 and the columns correspond to the tags assigned by judge 2. Let $n_{ij}$ denote the number of sentences that judge 1 classifies as $i$ and judge 2 classifies as $j$, and let $p_{ij}$ be the probability that a randomly selected sentence is categorized as $i$ by judge 1 and $j$ by judge 2.
Then, the maximum likelihood estimate of $p_{ij}$ is $n_{ij}/n_{++}$, where $n_{++} = \sum_{i,j} n_{ij}$ is the total number of sentences.</Paragraph> <Paragraph position="2"> Table 1 shows a four-category data configuration, in which certainty ratings 0 and 1 are combined and ratings 2 and 3 are combined.</Paragraph> <Paragraph position="5"> Note that the analyses described in this section cannot be performed on the two-category data configuration (in which the certainty ratings are not considered), due to insufficient degrees of freedom (Bishop et al., 1975).</Paragraph> <Paragraph position="6"> Evidence of confusion among the classifications in Table 1 can be found in the marginal totals, $n_{i+}$ and $n_{+j}$. We see that judge 1 has a relative preference, or bias, for objective, while judge 2 has a bias for subjective. Relative bias is one aspect of agreement among judges. A second is whether the judges' disagreements are systematic, that is, correlated. One pattern of systematic disagreement is symmetric disagreement. When disagreement is symmetric, the differences between the actual counts and the counts expected if the judges' decisions were not correlated are symmetric; that is, $\delta n_{ij} = \delta n_{ji}$ for $i \neq j$, where $\delta n_{ij}$ is the difference from independence. Our goal is to correct correlated disagreements automatically. We are particularly interested in systematic disagreements resulting from relative bias. We test for evidence of such correlations by fitting probability models to the data. Specifically, we study bias using the model for marginal homogeneity, and symmetric disagreement using the model for quasi-symmetry. When there is such evidence, we propose using the latent class model to correct the disagreements; this model posits an unobserved (latent) variable to explain the correlations among the judges' observations.</Paragraph> <Paragraph position="7"> The remainder of this section describes these models in more detail. All models can be evaluated using the freeware package CoCo, which was developed by Badsberg (1995) and is available at: http://web.math.auc.dk/~jhb/CoCo.</Paragraph> <Section position="1" start_page="247" end_page="248" type="sub_section"> <SectionTitle> 3.1 Patterns of Disagreement </SectionTitle> <Paragraph position="0"> A probability model enforces constraints on the counts in the data. The degree to which the counts in the data conform to the constraints is called the fit of the model. In this work, model fit is reported in terms of the likelihood ratio statistic, $G^2$, and its significance (Read and Cressie, 1988; Dunning, 1993). The higher the $G^2$ value, the poorer the fit. We will consider model fit to be acceptable if its reference significance level is greater than 0.01 (i.e., if there is greater than a 0.01 probability that the data sample was randomly selected from a population described by the model).</Paragraph> <Paragraph position="1"> Bias of one judge relative to another is evidenced as a discrepancy between the marginal totals for the two judges (i.e., $n_{i+}$ and $n_{+j}$ in Table 1). Bias is measured by testing the fit of the model for marginal homogeneity: $p_{i+} = p_{+i}$ for all $i$. The larger the $G^2$ value, the greater the bias. The fit of the model can be evaluated as described on pages 293-294 of Bishop et al. (1975).</Paragraph> <Paragraph position="3"> Judges who show a relative bias do not always agree, but their judgments may still be correlated. As an extreme example, judge 1 may assign the subjective tag whenever judge 2 assigns the objective tag.
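As a minimal illustration (not part of the original study), the counts $n_{ij}$, their marginal totals, and a $G^2$ statistic might be computed as sketched below; only the independence model is fit here, whereas the marginal homogeneity and quasi-symmetry models discussed in this section are fit with CoCo in the paper. The judge tag lists and the four-category labels are hypothetical.

# Minimal sketch: contingency counts, MLEs, marginal totals, and a G^2
# fit statistic for the independence model. The paper's actual tests
# (marginal homogeneity, quasi-symmetry) are run in CoCo; this only
# illustrates the quantities those tests are built on.
import numpy as np

CATS = ["subj-low", "subj-high", "obj-low", "obj-high"]  # 4-category configuration

def contingency(tags1, tags2, cats=CATS):
    """n[i, j] = number of sentences tagged cats[i] by judge 1 and cats[j] by judge 2."""
    idx = {c: k for k, c in enumerate(cats)}
    n = np.zeros((len(cats), len(cats)))
    for a, b in zip(tags1, tags2):
        n[idx[a], idx[b]] += 1
    return n

def g_squared(n):
    """Likelihood ratio statistic G^2 for the independence model."""
    n_pp = n.sum()
    expected = np.outer(n.sum(axis=1), n.sum(axis=0)) / n_pp  # n_i+ * n_+j / n_++
    mask = n > 0  # cells with n_ij = 0 contribute nothing
    return 2.0 * np.sum(n[mask] * np.log(n[mask] / expected[mask]))

# Hypothetical tag lists for two judges.
judge1 = ["subj-high", "obj-low", "obj-high", "subj-low", "obj-high"]
judge2 = ["subj-high", "subj-low", "obj-high", "subj-low", "obj-low"]
n = contingency(judge1, judge2)
p_hat = n / n.sum()                              # MLE of p_ij
print("row marginals n_i+ :", n.sum(axis=1))     # judge 1's preferences
print("col marginals n_+j :", n.sum(axis=0))     # judge 2's preferences
print("G^2 (independence) :", round(g_squared(n), 3))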
In such an extreme case, there is a kind of symmetry in the judges' responses, but their agreement would be low. Patterns of symmetric disagreement can be identified using the model for quasi-symmetry. This model constrains the off-diagonal counts, i.e., the counts that correspond to disagreement. It states that these counts are the product of a table for independence and a symmetric table, $n_{ij} = \lambda_{i+} \times \lambda_{+j} \times \lambda_{ij}$, such that $\lambda_{ij} = \lambda_{ji}$. In this formula, $\lambda_{i+} \times \lambda_{+j}$ is the model for independence and $\lambda_{ij}$ is the symmetric interaction term. Intuitively, $\lambda_{ij}$ represents the difference between the actual counts and those predicted by independence. This model can be evaluated using CoCo as described on pages 289-290 of Bishop et al. (1975).</Paragraph> </Section> <Section position="2" start_page="248" end_page="248" type="sub_section"> <SectionTitle> 3.2 Producing Bias-Corrected Tags </SectionTitle> <Paragraph position="0"> We use the latent class model to correct symmetric disagreements that appear to result from bias. The latent class model was first introduced by Lazarsfeld (1966) and was later made computationally efficient by Goodman (1974).</Paragraph> <Paragraph position="1"> Goodman's procedure is a specialization of the EM algorithm (Dempster et al., 1977), which is implemented in the freeware program CoCo (Badsberg, 1995). Since its development, the latent class model has been widely applied, and is the underlying model in various unsupervised machine learning algorithms, including AutoClass (Cheeseman and Stutz, 1996).</Paragraph> <Paragraph position="2"> The form of the latent class model is that of naive Bayes: the observed variables are all conditionally independent of one another, given the value of the latent variable. The latent variable represents the true state of the object, and is the source of the correlations among the observed variables.</Paragraph> <Paragraph position="3"> As applied here, the observed variables are the classifications assigned by the judges. Let B, D, J, and M be these variables, and let L be the latent variable. Then, the latent class model is: $P(b, d, j, m, l) = P(b|l) \, P(d|l) \, P(j|l) \, P(m|l) \, P(l)$.</Paragraph> <Paragraph position="5"> The parameters of the model are $\{P(b|l), P(d|l), P(j|l), P(m|l), P(l)\}$. Once estimates of these parameters are obtained, each clause can be assigned the most probable latent category given the tags assigned by the judges.</Paragraph> <Paragraph position="6"> The EM algorithm takes as input the number of latent categories hypothesized, i.e., the number of values of L, and produces estimates of the parameters. For a description of this process, see Goodman (1974), Dawid & Skene (1979), or Pedersen & Bruce (1998).</Paragraph> <Paragraph position="7"> Three versions of the latent class model are considered in this study, one with two latent categories, one with three latent categories, and one with four. We apply these models to three data configurations: one with two categories (subjective and objective with no certainty ratings), one with four categories (subjective and objective with coarse-grained certainty ratings, as shown in Table 1), and one with eight categories (subjective and objective with fine-grained certainty ratings). All combinations of model and data configuration are evaluated, except the four-category latent class model with the two-category data configuration, due to insufficient degrees of freedom.</Paragraph> <Paragraph position="8"> In all cases, the models fit the data well, as measured by $G^2$.
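As a minimal sketch of the estimation just described, the following assumes the standard EM updates for a latent class (naive Bayes) model over the judges' tags, in the spirit of Goodman (1974) and Dawid & Skene (1979); the study itself fits the model with CoCo. The tag matrix, the judge ordering, and the number of iterations below are hypothetical.

# Minimal sketch: EM for a latent class (naive Bayes) model over the tags
# of several judges, used to infer a "bias-corrected" latent category per
# sentence. This mirrors the model form described above, not the exact
# CoCo implementation used in the paper.
import numpy as np

def latent_class_em(tags, n_latent, n_cats, n_iter=200, seed=0):
    """tags: (n_items, n_judges) array of integer category labels in [0, n_cats)."""
    rng = np.random.default_rng(seed)
    n_items, n_judges = tags.shape
    # Parameters: P(l) and, for each judge, P(category | l).
    p_l = np.full(n_latent, 1.0 / n_latent)
    p_c_given_l = rng.dirichlet(np.ones(n_cats), size=(n_judges, n_latent))
    for _ in range(n_iter):
        # E-step: posterior distribution over the latent class for each item.
        log_post = np.tile(np.log(p_l), (n_items, 1))
        for j in range(n_judges):
            log_post += np.log(p_c_given_l[j][:, tags[:, j]]).T
        log_post -= log_post.max(axis=1, keepdims=True)
        post = np.exp(log_post)
        post /= post.sum(axis=1, keepdims=True)
        # M-step: re-estimate P(l) and P(category | l) for each judge.
        p_l = post.mean(axis=0) + 1e-12
        for j in range(n_judges):
            for c in range(n_cats):
                p_c_given_l[j, :, c] = post[tags[:, j] == c].sum(axis=0)
            p_c_given_l[j] = p_c_given_l[j] + 1e-9
            p_c_given_l[j] /= p_c_given_l[j].sum(axis=1, keepdims=True)
    return post.argmax(axis=1), p_l, p_c_given_l

# Hypothetical tags from four judges (B, D, J, M); 0 = objective, 1 = subjective.
tags = np.array([[1, 1, 1, 1], [0, 0, 1, 0], [1, 0, 1, 1],
                 [0, 0, 0, 0], [1, 1, 0, 1], [0, 1, 0, 0]])
corrected, class_priors, _ = latent_class_em(tags, n_latent=2, n_cats=2)
print("bias-corrected latent tags:", corrected)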
The model chosen as final is the one for which the agreement among the latent categories assigned to the three data configurations is highest, that is, the model that is most consistent across the three data configurations.</Paragraph> </Section> </Section> <Section position="7" start_page="248" end_page="250" type="metho"> <SectionTitle> 4 Improving Agreement in </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="248" end_page="250" type="sub_section"> <SectionTitle> Discourse Tagging </SectionTitle> <Paragraph position="0"> Our annotation project consists of the following steps: 1. A first draft of the coding instructions is developed.</Paragraph> <Paragraph position="1"> 2. Four judges annotate a corpus according to the first coding manual, each spending about four hours.</Paragraph> <Paragraph position="2"> 3. The annotated corpus is statistically analyzed using the methods presented in section 3, and bias-corrected tags are produced. 4. The judges are given lists of sentences for which their tags differ from the bias-corrected tags. Judges M, D, and J participate in interactive discussions centered around the differences. In addition, after reviewing his or her list of differences, each judge provides feedback, agreeing with the bias-corrected tag in many cases, but arguing for his or her own tag in some cases. Based on the judges' feedback, 22 of the 504 bias-corrected tags are changed, and a second draft of the coding manual is written. 5. A second corpus is annotated by the same four judges according to the new coding manual. Each spends about five hours. 6. The results of the second tagging experiment are analyzed using the methods described in section 3, and bias-corrected tags are produced for the second data set. Two disjoint corpora are used in steps 2 and 5, both consisting of complete articles taken from the Wall Street Journal Treebank Corpus (Marcus et al., 1993). In both corpora, judges assign tags to each non-compound sentence and to each conjunct of each compound sentence, 504 in the first corpus and 500 in the second. The segmentation of compound sentences was performed manually before the judges received the data.</Paragraph> <Paragraph position="3"> Judges J and B, the first two authors of this paper, are NLP researchers. Judge M is an undergraduate computer science student, and judge D has no background in computer science or linguistics. Judge J, with help from M, developed the original coding instructions and directed the process in step 4.</Paragraph> <Paragraph position="4"> The analysis performed in step 3 reveals strong evidence of relative bias among the judges. Each pairwise comparison of judges also shows a strong pattern of symmetric disagreement. The two-category latent class model produces the most consistent clusters across the data configurations; it is therefore used to define the bias-corrected tags.</Paragraph> <Paragraph position="5"> In step 4, judge B was excluded from the interactive discussion for logistical reasons. Discussion is apparently important: although B's Kappa values for the first study are on par with the others', B's Kappa values for agreement with the other judges change very little from the first to the second study (this is true across the range of certainty values). In contrast, agreement among the other judges noticeably improves.
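The pairwise agreement figures discussed here and reported in Table 2 are Kappa values; a minimal sketch of Cohen's Kappa for one pair of judges follows. The tag lists are hypothetical, and the study's reported values of course come from its own annotations.

# Minimal sketch: Cohen's kappa for one pair of judges, the pairwise
# agreement measure discussed here. Tag lists are hypothetical.
from collections import Counter

def cohen_kappa(tags1, tags2):
    assert len(tags1) == len(tags2) and tags1
    n = len(tags1)
    p_observed = sum(a == b for a, b in zip(tags1, tags2)) / n
    c1, c2 = Counter(tags1), Counter(tags2)
    # Chance agreement from each judge's marginal tag distribution.
    p_expected = sum(c1[c] * c2[c] for c in set(tags1) | set(tags2)) / (n * n)
    return (p_observed - p_expected) / (1.0 - p_expected)

judge_d = ["subj", "subj", "obj", "obj", "subj", "obj"]
judge_m = ["subj", "subj", "obj", "subj", "subj", "obj"]
print("kappa(D, M) =", round(cohen_kappa(judge_d, judge_m), 3))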
Because judge B's poor performance in the second tagging experiment is linked to a difference in procedure, judge B's tags are excluded from our subsequent analysis of the data gathered during the second tagging experiment.</Paragraph> <Paragraph position="6"> Table 2 shows the changes, from study 1 to study 2, in the Kappa values for pairwise agreement among the judges. The best results are clearly for the two judges who are not authors of this paper (D and M). The Kappa value for the agreement between D and M considering all certainty ratings reaches .76, which allows tentative conclusions on Krippendorff's (1980) scale. If we exclude the sentences with certainty rating 0, the Kappa values for pairwise agreement between M and D and between J and M are both over .8, which allows definite conclusions on Krippendorff's scale. Finally, if we only consider sentences with certainty 2 or 3, the pairwise agreements among M, D, and J all have high Kappa values, 0.87 and over.</Paragraph> <Paragraph position="7"> We are aware of only one previous project reporting intercoder agreement results for similar categories, the Switchboard-DAMSL project mentioned above. While their Kappa results are very good for other tags, the opinion-statement tagging was not very successful: &quot;The distinction was very hard to make by labelers, and accounted for a large proportion of our interlabeler error&quot; (Jurafsky et al., 1997). In step 6, as in step 3, there is strong evidence of relative bias among judges D, J and M. Each pairwise comparison of judges also shows a strong pattern of symmetric disagreement. The results of this analysis are presented in Table 3. [Footnote 3: For the analysis in Table 3, certainty ratings 0 and 1 are combined, as are ratings 2 and 3. Similar results are obtained when all ratings are treated as distinct.] Also as in step 3, the two-category latent class model produces the most consistent clusters across the data configurations. Thus, it is used to define the bias-corrected tags for the second data set as well.</Paragraph> </Section> </Section> <Section position="8" start_page="250" end_page="251" type="metho"> <SectionTitle> 5 Machine Learning Results </SectionTitle> <Paragraph position="0"> Recently, there have been many successful applications of machine learning to discourse processing, such as (Litman, 1996; Samuel et al., 1998). In this section, we report the results of machine learning experiments in which we develop probabilistic classifiers to automatically perform the subjective and objective classification. In the method we use for developing classifiers (Bruce and Wiebe, 1999), a search is performed to find a probability model that captures important interdependencies among features. Because features can be dropped and added during search, the method also performs feature selection.</Paragraph> <Paragraph position="1"> In these experiments, the system considers naive Bayes, full independence, full interdependence, and models generated from those using forward and backward search. The model selected is the one with the highest accuracy on a held-out portion of the training data.</Paragraph> <Paragraph position="2"> 10-fold cross-validation is performed. The data is partitioned randomly into 10 different sets. On each fold, one set is used for testing, and the other nine are used for training.</Paragraph>
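A minimal sketch of this cross-validation protocol follows, with feature selection and model fitting redone inside each fold (as the next paragraph notes). A plain Bernoulli naive Bayes classifier and a generic chi-square feature-selection step stand in for the model search of Bruce and Wiebe (1999); the binary feature matrix and labels are random placeholders, not the study's data.

# Minimal sketch: 10-fold cross-validation in which a (stand-in) classifier
# and a simple per-fold feature-selection step are refit on each fold.
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import KFold
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1004, 8))   # placeholder binary features (e.g. pronoun, adjective, ...)
y = rng.integers(0, 2, size=1004)        # 0 = objective, 1 = subjective (placeholder labels)

accuracies = []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    # Feature selection and parameter estimation are redone on each fold.
    selector = SelectKBest(chi2, k=5).fit(X[train_idx], y[train_idx])
    clf = BernoulliNB().fit(selector.transform(X[train_idx]), y[train_idx])
    accuracies.append(clf.score(selector.transform(X[test_idx]), y[test_idx]))
print("average accuracy over 10 folds:", round(float(np.mean(accuracies)), 4))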
<Paragraph position="3"> Feature selection, model selection, and parameter estimation are performed anew on each fold.</Paragraph> <Paragraph position="4"> The following are the potential features considered on each fold. A binary feature is included for each of the following: the presence in the sentence of a pronoun, an adjective, a cardinal number, a modal other than &quot;will&quot;, and an adverb other than &quot;not&quot;. We also include a binary feature representing whether or not the sentence begins a new paragraph. Finally, a feature is included representing the co-occurrence of word tokens and punctuation marks with the subjective and objective classification. [Footnote 4: The per-class enumerated feature representation from (Wiebe et al., 1998) is used, with 60% as the conditional independence cutoff threshold.] There are many other features to investigate in future work, such as features based on tags assigned to previous utterances (see, e.g., (Wiebe et al., 1997; Samuel et al., 1998)), and features based on semantic classes, such as positive and negative polarity adjectives (Hatzivassiloglou and McKeown, 1997) and reporting verbs (Bergler, 1992).</Paragraph> <Paragraph position="5"> The data consists of the concatenation of the two corpora annotated with bias-corrected tags as described above. The baseline accuracy, i.e., the frequency of the more frequent class, is only 51%.</Paragraph> <Paragraph position="6"> The results of the experiments are very promising. The average accuracy across all folds is 72.17%, more than 20 percentage points higher than the baseline accuracy. Interestingly, the system performs better on the sentences for which the judges are certain. In a post hoc analysis, we consider the sentences from the second data set for which judges M, J, and D rate their certainty as 2 or 3. There are 299/500 such sentences. For each fold, we calculate the system's accuracy on the subset of the test set consisting of such sentences. The average accuracy over these subsets across folds is 81.5%.</Paragraph> <Paragraph position="7"> Taking human performance as an upper bound, the system has room for improvement.</Paragraph> <Paragraph position="8"> The average pairwise percentage agreement between D, J, and M and the bias-corrected tags in the entire data set is 89.5%, while the system's percentage agreement with the bias-corrected tags (i.e., its accuracy) is 72.17%.</Paragraph> </Section> </Paper>