<?xml version="1.0" standalone="yes"?>
<Paper uid="E06-1026">
  <Title>Latent Variable Models for Semantic Orientations of Phrases</Title>
  <Section position="4" start_page="201" end_page="204" type="metho">
    <SectionTitle>
3 Latent Variable Models for Semantic Orientations of Phrases
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="201" end_page="202" type="sub_section">
      <SectionTitle/>
      <Paragraph position="0"> As mentioned in the Introduction, the semantic orientation of a phrase is not a mere sum of its component words. If we know that &amp;quot;low risk&amp;quot; is positive, and that &amp;quot;risk&amp;quot; and &amp;quot;mortality&amp;quot;, in some sense, belong to the same semantic cluster, we can infer that &amp;quot;low mortality&amp;quot; is also positive. Therefore, we propose to use latent variable models to extract such latent semantic clusters and to realize an accurate classification of phrases (we focus  Each node indicates a random variable. Arrows indicate statistical dependency between variables. N, A, Z and C respectively correspond to nouns, adjectives, latent clusters and semantic orientations. on two-term phrases in this paper). The models adopted in this paper are also used for collaborative filtering by Hofmann (2004).</Paragraph>
      <Paragraph position="1"> With these models, the nouns (e.g., &amp;quot;risk&amp;quot; and &amp;quot;mortality&amp;quot;) that become positive by reducing their degree or amount would make a cluster. On the other hand, the adjectives or verbs (e.g., &amp;quot;reduce&amp;quot; and &amp;quot;decrease&amp;quot;) that are related to reduction would also make a cluster.</Paragraph>
      <Paragraph position="2"> Figure 1 shows graphical representations of statistical dependencies of models with a latent variable. N, A, Z and C respectively correspond to nouns, adjectives, latent clusters and semantic orientations. Figure 1-(a) is the PLSI model, which cannot be used in this task due to the absence of a variable for semantic orientations. Figure 1-(b) is the naive bayes model, in which nouns and adjectives are statistically independent of each other given the semantic orientation. Figure 1-(c) is, what we call, the 3-PLSI model, which is the 3observable variable version of the PLSI. We call Figure 1-(d) the triangle model, since three of its four variables make a triangle. We call Figure 1(e) the U-shaped model. In the triangle model and the U-shaped model, adjectives directly influence semantic orientations (rating categories) through the probability P(c|az). While nouns and adjectives are associated with the same set of clusters Z in the 3-PLSI and the triangle models, only nouns are clustered in the U-shaped model.</Paragraph>
      <Paragraph position="3"> In the following, we construct a probability model for the semantic orientations of phrases using each model of (b) to (e) in Figure 1. We explain in detail the triangle model and the U-shaped model, which we will propose to use for this task.</Paragraph>
    </Section>
    <Section position="2" start_page="202" end_page="202" type="sub_section">
      <SectionTitle>
3.1 Triangle Model
</SectionTitle>
      <Paragraph position="0"> Suppose that a set D of tuples of noun n, adjective a (predicate, generally) and the rating c is given :</Paragraph>
      <Paragraph position="2"> where c [?] {[?]1,0,1}, for example. This can be easily expanded to the case ofc [?] {1,***,5}. Our purposeistopredicttheratingcforunknownpairs of n and a.</Paragraph>
      <Paragraph position="3"> According to Figure 1-(d), the generative probability of n,a,c,z is the following :</Paragraph>
      <Paragraph position="5"> We use the Expectation-Maximization (EM) algorithm (Dempster et al., 1977) to estimate the parameters of the model. According to the theory of the EM algorithm, we can increase the likelihood of the model with latent variables by iteratively increasing the Q-function. The Q-function (i.e., the expected log-likelihood of the joint probability of complete data with respect to the conditional posterior of the latent variable) is expressed as :</Paragraph>
      <Paragraph position="7"> where th denotes the set of the new parameters.</Paragraph>
      <Paragraph position="8"> fnac denotes the frequency of a tuple n,a,c in the data. -P represents the posterior computed using the current parameters.</Paragraph>
      <Paragraph position="9"> The E-step (expectation step) corresponds to simple posterior computation :</Paragraph>
      <Paragraph position="11"> For derivation of update rules in the M-step (maximization step), we use a simple Lagrange method for this optimization problem with constraints :</Paragraph>
      <Paragraph position="13"> These steps are iteratively computed until convergence. IfthedifferenceofthevaluesofQ-function before and after an iteration becomes smaller than a threshold, we regard it as converged.</Paragraph>
      <Paragraph position="14"> For classification of an unknown pair n,a, we compare the values of</Paragraph>
      <Paragraph position="16"> Then the rating categorycthat maximizeP(c|na) is selected.</Paragraph>
    </Section>
    <Section position="3" start_page="202" end_page="202" type="sub_section">
      <SectionTitle>
3.2 U-shaped Model
</SectionTitle>
      <Paragraph position="0"> We suppose that the conditional probability of c and z given n and a is expressed as :</Paragraph>
      <Paragraph position="2"> We compute parameters above using the EM algorithm with the Q-function :</Paragraph>
      <Paragraph position="4"> We obtain the following update rules :</Paragraph>
      <Paragraph position="6"/>
    </Section>
    <Section position="4" start_page="202" end_page="202" type="sub_section">
      <SectionTitle>
3.3 Other Models for Comparison
</SectionTitle>
      <Paragraph position="0"> We will also test the 3-PLSI model corresponding to Figure 1-(c).</Paragraph>
      <Paragraph position="1"> In addition to the latent models, we test a base-line classifier, which uses the posterior probabil-</Paragraph>
      <Paragraph position="3"> This baseline model is equivalent to the 2-term naivebayes classifier(Mitchell, 1997). Thegraphical representation of the naive bayes model is (b) in Figure 1. The parameters are estimated as :</Paragraph>
      <Paragraph position="5"> where |N |and |A |are the numbers of the words for n and a, respectively.</Paragraph>
      <Paragraph position="6"> Thus, we have four different models : naive bayes (baseline), 3-PLSI, triangle, and U-shaped.</Paragraph>
    </Section>
    <Section position="5" start_page="202" end_page="204" type="sub_section">
      <SectionTitle>
3.4 Discussions on the EM Computation, the Models and the Task
</SectionTitle>
      <Paragraph position="0"> Models and the Task In the actual EM computation, we use the tempered EM (Hofmann, 2001) instead of the standard EM explained above, because the tempered EM can avoid an inaccurate estimation of the model caused by &amp;quot;over-confidence&amp;quot; in computing the posterior probabilities. The tempered EM can be realized by a slight modification to the E-step, which results in a new E-step :</Paragraph>
      <Paragraph position="2"> for the U-shaped model, where b is a positive hyper-parameter, called the inverse temperature.</Paragraph>
      <Paragraph position="3"> The new E-steps for the other models are similarly expressed.</Paragraph>
      <Paragraph position="4"> Now we have two hyper-parameters : inverse temperature b, and the number of possible values M of latent variables. We determine the values of these hyper-parameters by splitting the  giventrainingdatasetintotwodatasets(thetemporary training dataset 90% and the held-out dataset 10%), and by obtaining the classification accuracy for the held-out dataset, which is yielded by the classifier with the temporary training dataset.</Paragraph>
      <Paragraph position="5"> We should also note that Z (or any variable) should not have incoming arrows simultaneously from N and A, because the model with such arrows has P(z|na), which usually requires an excessively large memory.</Paragraph>
      <Paragraph position="6"> To work with numerical scales of the rating variable (i.e., the difference between c = [?]1 and c = 1 should be larger than that of c = [?]1 and c = 0), Hofmann (2004) used also a GaussiandistributionforP(c|az)incollaborativefilter- null ing. However, we do not employ a Gaussian, becauseinourdataset,thenumberofratingclassesis null  only 3, which is so small that a Gaussian distribution cannot be a good approximation of the actual probability density function. We conducted preliminary experiments with the model with Gaussians, but failed to obtain good results. For other datasets with more classes, Gaussians might be a good model for P(c|az).</Paragraph>
      <Paragraph position="7"> The task we address in this paper is somewhat similar to the trigram prediction task, in the sense that both are classification tasks given two words. However, we should note the difference between these two tasks. In our task, the actual answer given two specific words are fixed as illustrated by the fact 'high+salary' is always positive, while the answer for the trigram prediction task is randomly distributed. We are therefore interested in thesemanticorientations ofunseen pairsof words, while the main purpose of the trigram prediction is accurately estimate the probability of (possibly seen) word sequences.</Paragraph>
      <Paragraph position="8"> In the proposed models, only the words that appeared in the training dataset can be classified. An attempt to deal with the unseen words is an interesting task. For example, we could extend our models to semi-supervised models by regarding C as a partially observable variable. We could also use distributional similarity of words (e.g., based on window-size cooccurrence) to find an observed wordthatismostsimilartothegivenunseenword.</Paragraph>
      <Paragraph position="9"> However, such methods would not work for the semantic orientation classification, because those methods are designed for simple cooccurrence and  cannotdistinguish&amp;quot;survival-rate&amp;quot;from&amp;quot;infectionrate&amp;quot;. In fact, the similarity-based method mentioned above failed to work efficiently in our preliminary experiments. To solve the problem of unseen words, we would have to use other linguistic resources such as a thesaurus or a dictionary.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="204" end_page="204" type="metho">
    <SectionTitle>
4 Experiments
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="204" end_page="204" type="sub_section">
      <SectionTitle>
4.1 Experimental Settings
</SectionTitle>
      <Paragraph position="0"> We extracted pairs of a noun (subject) and an adjective (predicate), from Mainichi newspaper articles (1995) written in Japanese, and annotated the pairs with semantic orientation tags : positive, neutral or negative. We thus obtained the labeled dataset consisting of 12066 pair instances (7416 different pairs). The dataset contains 4459 negative instances, 4252 neutral instances, and 3355 positiveinstances. Thenumberofdistinctnounsis 4770 and the number of distinct adjectives is 384.</Paragraph>
      <Paragraph position="1"> To check the inter-annotator agreement between two annotators, we calculated k statistics, which was 0.640. This value is allowable, but not quite high. However, positive-negative disagreement is observedforonly0.7%ofthedata. Inotherwords,  thisstatisticsmeansthatthetaskofextractingneutral examples, which has hardly been explored, is intrinsically difficult.</Paragraph>
      <Paragraph position="2"> We employ 10-fold cross-validation to obtain the average value of the classification accuracy.</Paragraph>
      <Paragraph position="3"> We split the dataset such that there is no overlapping pair (i.e., any pair in the training dataset does not appear in the test dataset).</Paragraph>
      <Paragraph position="4"> If either of the two words in a pair in the test dataset does not appear in the training dataset, we excluded the pair from the test dataset since the problem of unknown words is not in the scope of this research. Therefore, we evaluate the pairs that are not in the training dataset, but whose component words appear in the training dataset.</Paragraph>
      <Paragraph position="5"> In addition to the original dataset, which we call the standard dataset, we prepared another dataset inordertoexaminethepowerofthelatentvariable model. The new dataset, which we call the hard dataset, consists only of examples with 17 difficult adjectives such as &amp;quot;high&amp;quot;, &amp;quot;low&amp;quot;, &amp;quot;large&amp;quot;, &amp;quot;small&amp;quot;, &amp;quot;heavy&amp;quot;, and &amp;quot;light&amp;quot;. 1 The semantic orientations of pairs including these difficult words often shift depending on the noun they modify. Thus, the harddatasetisasubsetofthestandarddataset. The size of the hard dataset is 4787. Please note that the hard dataset is used only as a test dataset. For training, we always use the standard dataset in our experiments.</Paragraph>
      <Paragraph position="6"> We performed experiments with all the values of b in {0.1,0.2,***,1.0} and with all the values of M in {10,30,50,70,100,200,300,500}, and predicted the best values of the hyper-parameters with the held-out method in Section 3.4.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>