<?xml version="1.0" standalone="yes"?> <Paper uid="P05-1017"> <Title>Extracting Semantic Orientations of Words using Spin Model</Title> <Section position="4" start_page="133" end_page="136" type="metho"> <SectionTitle> 3 Spin Model and Mean Field Approximation </SectionTitle>
<Paragraph position="0"> We give a brief introduction to the spin model and the mean field approximation, which are well-studied subjects in both the statistical mechanics and machine learning communities (Geman and Geman, 1984; Inoue and Carlucci, 2001; Mackay, 2003).</Paragraph>
<Paragraph position="1"> A spin system is an array of N electrons, each of which has a spin with one of two values: +1 (up) or -1 (down). Two electrons next to each other energetically tend to have the same spin. This model is called the Ising spin model, or simply the spin model (Chandler, 1987). The energy function of a spin system can be represented as
$$E(x, W) = -\frac{1}{2}\sum_{i,j} w_{ij} x_i x_j,$$
where $x_i$ and $x_j$ ($\in x$) are the spins of electrons $i$ and $j$, and the matrix $W = \{w_{ij}\}$ represents the weights between pairs of electrons.</Paragraph>
<Paragraph position="4"> In a spin system, the variable vector $x$ follows the Boltzmann distribution
$$P(x \mid W) = \frac{\exp(-\beta E(x, W))}{Z(W)},$$
where $Z(W) = \sum_x \exp(-\beta E(x, W))$ is the normalization factor, which is called the partition function, and $\beta$ is a constant called the inverse temperature. As this distribution function suggests, a configuration with a higher energy value has a smaller probability.</Paragraph>
<Paragraph position="7"> Although we have a distribution function, computing various probability values is computationally difficult. The bottleneck is the evaluation of $Z(W)$, since there are $2^N$ configurations of spins in this system. We therefore approximate $P(x \mid W)$ with a simple function $Q(x; \theta)$. The set of parameters $\theta$ for $Q$ is determined such that $Q(x; \theta)$ becomes as similar to $P(x \mid W)$ as possible. As a measure of the distance between $P$ and $Q$, the variational free energy $F$ is often used, which is defined as the difference between the mean energy with respect to $Q$ and the entropy of $Q$:
$$F(\theta) = \beta \sum_x Q(x; \theta) E(x, W) + \sum_x Q(x; \theta) \log Q(x; \theta).$$</Paragraph>
<Paragraph position="11"> The parameters $\theta$ that minimize the variational free energy will be chosen. It has been shown that minimizing $F$ is equivalent to minimizing the Kullback-Leibler divergence between $P$ and $Q$ (Mackay, 2003).</Paragraph>
<Paragraph position="12"> We next assume that the function $Q(x; \theta)$ has the factorial form
$$Q(x; \theta) = \prod_i Q(x_i; \theta_i).$$
Simple substitution and transformation leads us to the following variational free energy:
$$F(\theta) = -\frac{\beta}{2}\sum_{i,j} w_{ij} \bar{x}_i \bar{x}_j + \sum_i \sum_{x_i} Q(x_i; \theta_i) \log Q(x_i; \theta_i),$$
where $\bar{x}_i = \sum_{x_i} x_i Q(x_i; \theta_i)$ is the average of $x_i$ with respect to $Q$.</Paragraph>
<Paragraph position="16"> With the usual method of Lagrange multipliers, we obtain the mean field equation
$$\bar{x}_i = \frac{\sum_{x_i} x_i \exp\left(\beta x_i \sum_j w_{ij} \bar{x}_j\right)}{\sum_{x_i} \exp\left(\beta x_i \sum_j w_{ij} \bar{x}_j\right)} = \tanh\left(\beta \sum_j w_{ij} \bar{x}_j\right).$$</Paragraph>
<Paragraph position="17"> We use the spin model to extract semantic orientations of words. Each spin has a direction taking one of two values: up or down. Two neighboring spins tend to have the same direction for an energetic reason. Regarding each word as an electron and its semantic orientation as the spin of the electron, we construct a lexical network by connecting two words if, for example, one word appears in the gloss of the other word. The intuition behind this is that if a word is semantically oriented in one direction, then the words in its gloss tend to be oriented in the same direction.</Paragraph>
<Paragraph position="18"> Using the mean-field method developed in statistical mechanics, we determine the semantic orientations on the network in a global manner. The global optimization enables the incorporation of possibly noisy resources such as glosses and corpora, whereas existing simple methods such as the shortest-path method and the bootstrapping method cannot work in the presence of such noisy evidence; those methods depend on less-noisy data such as a thesaurus.</Paragraph>
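To make the mean field iteration above concrete, the following is a minimal sketch (not part of the original work) of the fixed-point update $\bar{x}_i \leftarrow \tanh(\beta \sum_j w_{ij}\bar{x}_j)$; the toy weight matrix, initialization, and iteration count are assumptions made only for this illustration.

```python
import numpy as np

def mean_field(W, beta, n_iter=100, tol=1e-6):
    """Iterate the mean field equation x_bar[i] = tanh(beta * sum_j W[i, j] * x_bar[j])."""
    n = W.shape[0]
    x_bar = np.random.uniform(-0.1, 0.1, size=n)  # small random initial averages
    for _ in range(n_iter):
        x_new = np.tanh(beta * W.dot(x_bar))      # synchronous update of all averages
        if np.max(np.abs(x_new - x_bar)) < tol:   # stop when the averages stabilise
            x_bar = x_new
            break
        x_bar = x_new
    return x_bar

# A toy 4-spin system: spins 0-1 and 2-3 are coupled positively,
# spins 1-2 negatively, so the two pairs settle on opposite signs.
W = np.array([[0.0,  1.0,  0.0, 0.0],
              [1.0,  0.0, -1.0, 0.0],
              [0.0, -1.0,  0.0, 1.0],
              [0.0,  0.0,  1.0, 0.0]])
print(mean_field(W, beta=1.0))
```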
<Section position="1" start_page="134" end_page="135" type="sub_section"> <SectionTitle> 4.1 Construction of Lexical Networks </SectionTitle>
<Paragraph position="0"> We construct a lexical network by linking two words if one word appears in the gloss of the other word.</Paragraph>
<Paragraph position="1"> Each link belongs to one of two groups: the same-orientation links SL and the different-orientation links DL. If at least one word precedes a negation word (e.g., not) in the gloss of the other word, the link is a different-orientation link. Otherwise, the link is a same-orientation link.</Paragraph>
<Paragraph position="2"> We next set the weights $W = (w_{ij})$ of the links:
$$w_{ij} = \begin{cases} \dfrac{1}{\sqrt{d(i)d(j)}} & (l_{ij} \in SL) \\ -\dfrac{1}{\sqrt{d(i)d(j)}} & (l_{ij} \in DL), \end{cases}$$
where $l_{ij}$ denotes the link between word $i$ and word $j$, and $d(i)$ denotes the degree of word $i$, that is, the number of words linked with word $i$. Two words without a connection are regarded as being connected by a link of weight 0. We call this network the gloss network (G).</Paragraph>
<Paragraph position="5"> We construct another network, the gloss-thesaurus network (GT), by linking synonyms, antonyms and hypernyms, in addition to the gloss-based links above. Only antonym links are in DL.</Paragraph>
<Paragraph position="6"> We enhance the gloss-thesaurus network with cooccurrence information extracted from a corpus. As mentioned in Section 2, Hatzivassiloglou and McKeown (1997) used conjunctive expressions in a corpus. Following their method, we connect two adjectives if they appear in a conjunctive form in the corpus. If the adjectives are connected by "and", the link belongs to SL. If they are connected by "but", the link belongs to DL. We call this network the gloss-thesaurus-corpus network (GTC).</Paragraph> </Section>
<Section position="2" start_page="135" end_page="135" type="sub_section"> <SectionTitle> 4.2 Extraction of Orientations </SectionTitle>
<Paragraph position="0"> We suppose that a small number of seed words are given. In other words, we know beforehand the semantic orientations of those given words. We incorporate this small labeled dataset by modifying the previous update rule.</Paragraph>
<Paragraph position="1"> Instead of $\beta E(x, W)$ in the Boltzmann distribution above, we use the following function $H(\beta, x, W)$:
$$H(\beta, x, W) = \beta E(x, W) + \alpha \sum_{i \in L} (x_i - a_i)^2,$$
where $L$ is the set of seed words, $a_i$ is the orientation of seed word $i$, and $\alpha$ is a positive constant. This expression means that if $x_i$ ($i \in L$) is different from $a_i$, the state is penalized.</Paragraph>
<Paragraph position="4"> Using the function $H$, we obtain the new update rule for $x_i$ ($i \in L$):
$$\bar{x}_i^{new} = \frac{\sum_{x_i} x_i \exp\left(\beta x_i s_i^{old} - \alpha (x_i - a_i)^2\right)}{\sum_{x_i} \exp\left(\beta x_i s_i^{old} - \alpha (x_i - a_i)^2\right)},$$
where $s_i^{old} = \sum_j w_{ij} \bar{x}_j^{old}$, and $\bar{x}_i^{old}$ and $\bar{x}_i^{new}$ are the averages of $x_i$ before and after the update, respectively. The averages of the words not in $L$ are updated with the mean field equation of Section 3.</Paragraph>
<Paragraph position="7"> The formulation here follows the work of Inoue and Carlucci (2001), who applied the spin glass model to image restoration.</Paragraph>
<Paragraph position="8"> Initially, the averages of the seed words are set according to their given orientations. The other averages are set to 0.</Paragraph>
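As an illustration of the seeded update, here is a small sketch (the toy network, seed set, and constants are assumptions, not the authors' implementation) in which seed words use the penalized two-term sum over $x_i \in \{+1, -1\}$ and all other words use the plain mean field rule, starting from the initialization described above.

```python
import numpy as np

def sweep(W, x_bar, seeds, beta, alpha):
    """One synchronous update of all averages under H(beta, x, W).

    seeds: dict mapping word index -> seed orientation a_i in {+1, -1}.
    """
    s = W.dot(x_bar)                                   # s_i = sum_j w_ij * x_bar_j
    x_new = np.tanh(beta * s)                          # default (non-seed) mean field update
    for i, a_i in seeds.items():                       # seed words get the penalized update
        num, den = 0.0, 0.0
        for x_i in (+1.0, -1.0):
            w = np.exp(beta * x_i * s[i] - alpha * (x_i - a_i) ** 2)
            num += x_i * w
            den += w
        x_new[i] = num / den
    return x_new

# Toy lexical network of 5 "words": word 0 is a positive seed, word 4 a negative seed.
W = np.array([[0,  1,  0,  0, 0],
              [1,  0,  1,  0, 0],
              [0,  1,  0, -1, 0],
              [0,  0, -1,  0, 1],
              [0,  0,  0,  1, 0]], dtype=float)
seeds = {0: +1.0, 4: -1.0}
x_bar = np.zeros(5)
x_bar[0], x_bar[4] = +1.0, -1.0                        # seeds start at their given orientations
for _ in range(50):
    x_bar = sweep(W, x_bar, seeds, beta=1.0, alpha=1.0)
print(np.sign(x_bar))                                  # high averages -> positive, low -> negative
```

In practice the matrix W would be built from the gloss, thesaurus, and corpus links of Section 4.1 rather than this hand-made example.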
<Paragraph position="9"> When the difference in the value of the variational free energy before and after an update is smaller than a threshold, we regard the computation as converged.</Paragraph>
<Paragraph position="10"> The words with high final average values are classified as positive words. The words with low final average values are classified as negative words.</Paragraph> </Section>
<Section position="3" start_page="135" end_page="135" type="sub_section"> <SectionTitle> 4.3 Hyper-parameter Prediction </SectionTitle>
<Paragraph position="0"> The performance of the proposed method largely depends on the value of the hyper-parameter $\beta$. In order to make the method more practical, we propose criteria for determining its value.</Paragraph>
<Paragraph position="1"> When a large labeled dataset is available, we can obtain a reliable pseudo leave-one-out error rate:
$$\frac{1}{|L|} \sum_{i \in L} \left[\, a_i \bar{x}'_i \,\right],$$
where $[t]$ is 1 if $t$ is negative and 0 otherwise, and $\bar{x}'_i$ is calculated with the right-hand side of the update rule above, with the penalty term $\alpha(\bar{x}_i - a_i)^2$ ignored. We choose the $\beta$ that minimizes this value.</Paragraph>
<Paragraph position="4"> However, when a large amount of labeled data is unavailable, the pseudo leave-one-out error rate is not reliable. In such cases, we use the magnetization $m$ for hyper-parameter prediction:
$$m = \frac{1}{N} \sum_i \bar{x}_i.$$</Paragraph>
<Paragraph position="6"> At a high temperature, spins are randomly oriented (paramagnetic phase, $m \approx 0$). At a low temperature, most of the spins have the same direction (ferromagnetic phase, $m \neq 0$). It is known that at some intermediate temperature, the ferromagnetic phase suddenly changes to the paramagnetic phase. This phenomenon is called the phase transition.</Paragraph>
<Paragraph position="7"> Slightly before the phase transition, spins are locally polarized: strongly connected spins have the same polarity, but not in a global way.</Paragraph>
<Paragraph position="8"> Intuitively, the state of the lexical network should be locally polarized in this way. Therefore, we calculate the values of $m$ for several different values of $\beta$ and select the value just before the phase transition.</Paragraph> </Section>
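The magnetization criterion can be sketched as follows (a rough illustration on an assumed toy network with an assumed near-zero cutoff, not the paper's implementation): run the update for a grid of $\beta$ values, compute $m$ for each, and keep the largest $\beta$ at which $|m|$ is still close to zero, i.e., the value just before the transition.

```python
import numpy as np

def magnetization(W, beta, n_iter=200, rng=None):
    """Run the plain mean field update and return m = mean of the spin averages."""
    if rng is None:
        rng = np.random.default_rng(0)
    x_bar = rng.uniform(-0.01, 0.01, size=W.shape[0])   # tiny random start to break symmetry
    for _ in range(n_iter):
        x_bar = np.tanh(beta * W.dot(x_bar))
    return x_bar.mean()

# Densely, positively coupled toy network: above some beta the system "freezes"
# into one orientation (|m| >> 0); below it, m stays near zero.
n = 30
W = (np.ones((n, n)) - np.eye(n)) / n                    # uniform ferromagnetic couplings

betas = np.linspace(0.1, 3.0, 30)
ms = [abs(magnetization(W, b, rng=np.random.default_rng(1))) for b in betas]

threshold = 0.1                                          # assumed cutoff for "m is still ~0"
chosen = max(b for b, m in zip(betas, ms) if m < threshold)
print(f"beta just before the transition: {chosen:.2f}")
```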
<Section position="4" start_page="135" end_page="136" type="sub_section"> <SectionTitle> 4.4 Discussion on the Model </SectionTitle>
<Paragraph position="0"> In our model, the semantic orientations of words are determined according to the average values of the spins. Despite the heuristic flavor of this decision rule, it has a theoretical background related to maximizer of posterior marginal (MPM) estimation, or 'finite-temperature decoding' (Iba, 1999; Marroquin, 1985). In MPM, the average corresponds to the marginal distribution over $x_i$, obtained from the distribution over $x$. We should note that finite-temperature decoding is quite different from annealing-type algorithms or 'zero-temperature decoding', which correspond to maximum a posteriori (MAP) estimation and are also often used in natural language processing (Cowie et al., 1992).</Paragraph>
<Paragraph position="1"> Since the model estimation has been reduced to simple update calculations, the proposed model is similar to conventional spreading activation approaches, which have been applied, for example, to word sense disambiguation (Veronis and Ide, 1990).</Paragraph>
<Paragraph position="2"> In fact, the proposed model can be regarded as a spreading activation model with a specific update rule, as long as we are dealing with the 2-class model (2-Ising model).</Paragraph>
<Paragraph position="3"> However, our modelling has several advantages. The largest advantage is its theoretical background. We have an objective function and an approximation method for it. We thus have a measure of goodness in model estimation and can adopt other, better approximation methods, such as the Bethe approximation (Tanaka et al., 2003). The theory tells us which update rule to use. We also have the notion of magnetization, which can be used for hyper-parameter estimation. We can draw on the wealth of knowledge, methods, and algorithms developed in the field of statistical mechanics. We can also extend our model to a multiclass model (Q-Ising model).</Paragraph>
<Paragraph position="4"> Another interesting point is the relation to the maximum entropy model (Berger et al., 1996), which is popular in the natural language processing community. Our model can be obtained by maximizing the entropy of the probability distribution Q(x) under constraints regarding the energy function.</Paragraph> </Section> </Section> </Paper>