<?xml version="1.0" standalone="yes"?>
<Paper uid="P04-1062">
  <Title>Annealing Techniques for Unsupervised Statistical Language Learning</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
2 Deterministic annealing
</SectionTitle>
    <Paragraph position="0"> Suppose our data consist of a pairs of random variables X and Y, where the value of X is observed and Y is hidden. For example, X might range over sentences in English and Y over POS tag sequences. We use X and Y to denote the sets of possible values of X and Y, respectively. We seek to build a model that assigns probabilities to each (x,y)[?]XxY. Letvectorx ={x1,x2,...,xn}be a corpus of unlabeled examples. Assume the class of models is fixed (for example, we might consider only first-order HMMs with s states, corresponding notionally to POS tags). Then the task is to find good parameters vectorth[?]RN for the model. The criterion most commonly used in building such models from unlabeled data is maximum likelihood (ML); we seek the parameters vectorth[?]: argmax</Paragraph>
    <Paragraph position="2"> entropy hilltop. They argue that to account for partiallyobserved (unlabeled) data, one should choose the distribution with the highest Shannon entropy, subject to certain data-driven constraints. They show that this desirable distribution is one of the local maxima of likelihood. Whether high-entropy local maxima really predict test data better is an empirical question.</Paragraph>
    <Paragraph position="3"> Input: vectorx, vectorth(0) Output: vectorth[?]  Each parameter thj corresponds to the conditional probability of a single model event, e.g., a state transition in an HMM or a rewrite in a PCFG. Many NLP models make it easy to maximize the likelihood of supervised training data: simply count the model events in the observed (xi,yi) pairs, and set the conditional probabilities thi to be proportional to the counts. In our unsupervised setting, the yi are unknown, but solving (1) is almost as easy provided that we can obtain the posterior distribution of Y given each xi (that is, Pr(y  |xi) for each y [?] Y and each xi). The only difference is that we must now count the model events fractionally, using the expected number of occurrences of each (xi,y) pair.</Paragraph>
    <Paragraph position="4"> This intuition leads to the EM algorithm in Fig. 1.</Paragraph>
    <Paragraph position="5"> It is guaranteed that Pr(vectorx|vectorth(i+1))[?]Pr(vectorx|vectorth(i)). For language-structure models like HMMs and SCFGs, efficient dynamic programming algorithms (forward-backward, inside-outside) are available to compute the distribution ~p at the E step of Fig. 1 and use it at the M step. These algorithms run in polynomial time and space by structure-sharing the possible y (tag sequences or parse trees) for each xi, of which there may be exponentially many in the length of xi. Even so, the majority of time spent by EM for such models is on the E steps. In this paper, we can fairly compare the runtime of EM and other training procedures by counting the number of E steps they take on a given training set and model.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Generalizing EM
</SectionTitle>
      <Paragraph position="0"> Figure 2 shows the deterministic annealing (DA) algorithm derived from the framework of Rose et al.</Paragraph>
      <Paragraph position="1"> (1990). It is quite similar to EM.2 However, DA adds an outer loop that iteratively increases a value b, and computation of the posterior in the E step is modified to involve this b.</Paragraph>
      <Paragraph position="2"> 2Other expositions of DA abound; we have couched ours in data-modeling language. Readers interested in the Lagrangian-based derivations and analogies to statistical physics (including phase transitions and the role of b as the inverse of temperature in free-energy minimization) are referred to Rose (1998) for a thorough discussion.</Paragraph>
      <Paragraph position="3"> Input: vectorx, vectorth(0), bmax&gt;bmin&gt;0, a&gt;1 Output: vectorth[?]  Whenb = 1, DA's inner loop will behave exactly like EM, computing ~p at the E step by the same formula that EM uses. When b [?] 0, ~p will be close to a uniform distribution over the hidden variable vectory, since each numerator Pr(vectorx,vectory  |vectorth)b [?] 1. At such b-values, DA effectively ignores the current parameters th when choosing the posterior ~p and the new parameters. Finally, as b - +[?], ~p tends to place nearly all of the probability mass on the single most likely vectory. This winner-take-all situation is equivalent to the &amp;quot;Viterbi&amp;quot; variant of the EM algorithm.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Gradated difficulty
</SectionTitle>
      <Paragraph position="0"> In both the EM and DA algorithms, the E step selects a posterior ~p over the hidden variable vectorY and the M step selects parameters vectorth. Neal and Hinton (1998) show how the EM algorithm can be viewed as optimizing a single objective function over bothvectorth and ~p. DA can also be seen this way; DA's objective function at a given b is</Paragraph>
      <Paragraph position="2"> (2) The EM version simply sets b = 1. A complete derivation is not difficult but is too lengthy to give here; it is a straightforward extension of that given by Neal and Hinton for EM.</Paragraph>
      <Paragraph position="3"> It is clear that the value of b allows us to manipulate the relative importance of the two terms when maximizing F. When b is close to 0, only the H term matters. The H term is the Shannon entropy of the posterior distribution ~p, which is known to be concave in ~p. Maximizing it is simple: set allxto be equiprobable (the uniform distribution). Therefore a sufficiently small b drives up the importance of H relative to the other term, and the entire problem becomes concave with a single global maximum to which we expect to converge.</Paragraph>
      <Paragraph position="4"> In gradually increasing b from near 0 to 1, we start out by solving an easy concave maximization problem and use the result to initialize the next maximization problem, which is slightly more difficult (i.e., less concave). This continues, with the solution to each problem in the series being used to initialize the subsequent problem. When b reaches 1, DA behaves just like EM. Since the objective function is continuous in b where b &gt; 0, we can visualize DA as gradually morphing the easy concave objective function into the one we really care about (likelihood); we hope to &amp;quot;ride the maximum&amp;quot; as b moves toward 1.</Paragraph>
      <Paragraph position="5"> DA guarantees iterative improvement of the objective function (see Ueda and Nakano (1998) for proofs). But it does not guarantee convergence to a global maximum, or even to a better local maximum than EM will find, even with extremely slow b-raising. A new mountain on the surface of the objective function could arise at any stage that is preferable to the one that we will ultimately find.</Paragraph>
      <Paragraph position="6"> To run DA, we must choose a few control parameters. In this paper we set bmax = 1 so that DA will approach EM and finish at a local maximum of likelihood. bmin and the b-increase factor a can be set high for speed, but at a risk of introducing local maxima too quickly for DA to work as intended.</Paragraph>
      <Paragraph position="7"> (Note that a &amp;quot;fast&amp;quot; schedule that tries only a few b values is not as fast as one might expect, since it will generally take longer to converge at each b value.) To conclude the theoretical discussion of DA, we review its desirable properties. DA is robust to initial parameters, since when b is close to 0 the objective hardly depends onvectorth. DA gradually increases the difficulty of search, which may lead to the avoidance of some local optima. By modifying the annealing schedule, we can change the runtime of the DA algorithm. DA is almost exactly like EM in implementation, requiring only a slight modification to the E step (seeSS3) and an additional outer loop.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.3 Prior work
</SectionTitle>
      <Paragraph position="0"> DA was originally described as an algorithm for clustering data in RN (Rose et al., 1990). Its predecessor, simulated annealing, modifies the objective function during search by applying random perturbations of gradually decreasing size (Kirkpatrick et al., 1983). Deterministic annealing moves the randomness &amp;quot;inside&amp;quot; the objective function by taking expectations. DA has since been applied to many problems (Rose, 1998); we describe two key applications in language and speech processing.</Paragraph>
      <Paragraph position="1"> Pereira, Tishby, and Lee (1993) used DA for soft hierarchical clustering of English nouns, based on the verbs that select them as direct objects. In their case, when b is close to 0, each noun is fuzzily placed in each cluster so that Pr(cluster  |noun) is nearly uniform. On the M step, this results in clusters that are almost exactly identical; there is one effective cluster. As b is increased, it becomes increasingly attractive for the cluster centroids to move apart, or &amp;quot;split&amp;quot; into two groups (two effective clusters), and eventually they do so. Continuing to increase b yields a hierarchical clustering through repeated splits. Pereira et al. describe the tradeoff given through b as a control on the locality of influence of each noun on the cluster centroids, so that as b is raised, each noun exerts less influence on more distant centroids and more on the nearest centroids.</Paragraph>
      <Paragraph position="2"> DA has also been applied in speech recognition.</Paragraph>
      <Paragraph position="3"> Rao and Rose (2001) used DA for supervised discriminative training of HMMs. Their goal was to optimize not likelihood but classification error rate, a difficult objective function that is piecewiseconstant (hence not differentiable everywhere) and riddled with shallow local minima. Rao and Rose applied DA,3 moving from training a nearly uniform classifier with a concave cost surface (b[?]0) toward the desired deterministic classifier (b +[?]). They reported substantial gains in spoken letter recognition accuracy over both a ML-trained classifier and a localized error-rate optimizer.</Paragraph>
      <Paragraph position="4"> Brown et al. (1990) gradually increased learning difficulty using a series of increasingly complex models for machine translation. Their training algorithm began by running an EM approximation on the simplest model, then used the result to initialize the next, more complex model (which had greater predictive power and many more parameters), and so on. Whereas DA provides gradated difficulty in parameter search, their learning method involves gradated difficulty among classes of models. The two are orthogonal and could be used together.</Paragraph>
    </Section>
  </Section>
</Paper>