File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/04/p04-1062_metho.xml
Size: 22,906 bytes
Last Modified: 2025-10-06 14:08:58
<?xml version="1.0" standalone="yes"?> <Paper uid="P04-1062"> <Title>Annealing Techniques for Unsupervised Statistical Language Learning</Title>
<Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 DA with dynamic programming </SectionTitle>
<Paragraph position="0"> We turn now to the practical use of deterministic annealing in NLP. Readers familiar with the EM algorithm will note that, for typical stochastic models of language structure (e.g., HMMs and SCFGs), the bulk of the computational effort is required by the E step, which is accomplished by a two-pass dynamic programming (DP) algorithm (like the forward-backward algorithm). The M step for these models normalizes the posterior expected counts from the E step to get probabilities.4 [Footnote 4: ... such models; if we generalize to Markov random fields (also known as log-linear or maximum entropy models), the M step, while still concave, might entail an auxiliary optimization routine such as iterative scaling or a gradient-based method.] Running DA for such models is quite simple and requires no modifications to the usual DP algorithms. The only change to make is in the values of the parameters passed to the DP algorithm: simply replace each $\theta_j$ by $\theta_j^\beta$. For a given x, the forward pass of the DP computes (in a dense representation) $\Pr(y \mid x, \vec{\theta})$ for all y. Each $\Pr(y \mid x, \vec{\theta})$ is a product of some of the $\theta_j$ (each $\theta_j$ is multiplied in once for each time its corresponding model event is present in (x, y)). Raising the $\theta_j$ to a power will also raise their product to that power, so the forward pass will compute $\Pr(y \mid x, \vec{\theta})^\beta$ when given $\vec{\theta}^\beta$ as the parameter values. The backward pass normalizes to the sum; in this case it is the sum of the $\Pr(y \mid x, \vec{\theta})^\beta$, and we have the E step described in Figure 2. We therefore expect an EM iteration of DA to take the same amount of time as a normal EM iteration.5</Paragraph>
</Section>
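To see the $\theta_j \rightarrow \theta_j^\beta$ recipe in action, the sketch below builds a toy HMM (all numbers invented, not from the paper) and checks that the ordinary forward-backward recursions, run on temperature-raised parameters, give exactly the posteriors of the DA E step $\tilde{p}(y) \propto \Pr(y \mid x, \vec{\theta})^\beta$ obtained by brute-force enumeration. It is a minimal illustration, not the authors' implementation.

```python
# A minimal sketch (not the authors' code): running forward-backward on
# temperature-raised parameters theta**beta yields the same state posteriors
# as the DA posterior p_tilde(y) ∝ Pr(y | x, theta)**beta computed by
# brute-force enumeration.  The 2-state, 3-symbol HMM, the observation
# sequence, and the beta value are illustrative assumptions.
import itertools
import numpy as np

pi = np.array([0.6, 0.4])                          # initial state probabilities
A = np.array([[0.7, 0.3], [0.2, 0.8]])             # transition probabilities
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])   # emission probabilities
x = [0, 2, 1, 2]                                   # an observation sequence
beta = 0.3                                         # DA temperature parameter

def forward_backward(pi, A, B, x):
    """Standard forward-backward on (possibly unnormalized) HMM parameters;
    returns per-position state posteriors."""
    T, K = len(x), len(pi)
    fwd = np.zeros((T, K))
    bwd = np.zeros((T, K))
    fwd[0] = pi * B[:, x[0]]
    for t in range(1, T):
        fwd[t] = (fwd[t - 1] @ A) * B[:, x[t]]
    bwd[-1] = 1.0
    for t in range(T - 2, -1, -1):
        bwd[t] = A @ (B[:, x[t + 1]] * bwd[t + 1])
    post = fwd * bwd
    return post / post.sum(axis=1, keepdims=True)

# (1) DA E step via the DP trick: exponentiate every parameter by beta.
dp_post = forward_backward(pi ** beta, A ** beta, B ** beta, x)

# (2) Brute force: p_tilde(y) ∝ Pr(x, y | theta)**beta over all state sequences.
def joint(y):
    p = pi[y[0]] * B[y[0], x[0]]
    for t in range(1, len(x)):
        p *= A[y[t - 1], y[t]] * B[y[t], x[t]]
    return p

ys = list(itertools.product(range(2), repeat=len(x)))
w = np.array([joint(y) ** beta for y in ys])
w /= w.sum()
brute_post = np.zeros((len(x), 2))
for weight, y in zip(w, ys):
    for t, s in enumerate(y):
        brute_post[t, s] += weight

assert np.allclose(dp_post, brute_post)            # the two E steps agree
```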
<Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Part-of-speech tagging </SectionTitle>
<Paragraph position="0"> We turn now to the task of inducing a trigram POS tagging model (second-order HMM) from an unlabeled corpus. This experiment is inspired by the experiments in Merialdo (1994). As in that work, complete knowledge of the tagging dictionary is assumed. The task is to find the trigram transition probabilities $\Pr(\text{tag}_i \mid \text{tag}_{i-1}, \text{tag}_{i-2})$ and emission probabilities $\Pr(\text{word}_i \mid \text{tag}_i)$. Merialdo's key result was this:6 if some labeled data were used to initialize the parameters (by taking the ML estimate), then it was not helpful to improve the model's likelihood through EM iterations, because this almost always hurt the accuracy of the model's Viterbi tagging on a held-out test set. If only a small amount of labeled data was used (200 sentences), then some accuracy improvement was possible using EM, but only for a few iterations. When no labeled data were used, EM was able to improve the accuracy of the tagger, and this improvement continued in the long term.</Paragraph>
<Paragraph position="1"> Our replication of Merialdo's experiment used the Wall Street Journal portion of the Penn Treebank corpus, reserving a randomly selected 2,000 sentences (48,526 words) for testing. The remaining 47,208 sentences (1,125,240 words) were used in training, without any tags. The tagging dictionary was constructed using the entire corpus (as done by Merialdo). To initialize, the conditional transition and emission distributions in the HMM were set to uniform with slight perturbation. Every distribution was smoothed using add-0.1 smoothing (at every M step). The convergence criterion was that the relative increase in the objective function between two iterations fall below $10^{-9}$.</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Experiment </SectionTitle>
<Paragraph position="0"> In the DA condition, we set $\beta_{\min} = 0.0001$, $\beta_{\max} = 1$, and $\alpha = 1.2$. Results for the completely unsupervised condition (no labeled data) are shown in Figure 3 and Table 1. Accuracy was nearly monotonic: the final model is approximately the most accurate.</Paragraph>
<Paragraph position="1"> DA happily obtained a 10% reduction in tag error rate on training data, and an 11% reduction on test data. On the other hand, it did not manage to improve likelihood over EM. So was the accuracy gain mere luck? Perhaps not. DA may be more resistant to overfitting, because it may favor models whose posteriors $\tilde{p}$ have high entropy. At least in this experiment, its initial bias toward such models carried over to the final learned model.7 In other words, the higher-entropy local maximum found by DA, in this case, explained the observed data almost as well without overcommitting to particular tag sequences. The maximum entropy and latent maximum entropy principles (Wang et al., 2003, discussed in footnote 1) are best justified as ways to avoid overfitting.</Paragraph>
<Paragraph position="2"> For a supervised tagger, the maximum entropy principle prefers a conditional model $\Pr(\vec{y} \mid \vec{x})$ that is maximally unsure about what tag sequence $\vec{y}$ to apply to the training word sequence $\vec{x}$ (but expects the same feature counts as the true $\vec{y}$). Such a model is hoped to generalize better to unsupervised data.</Paragraph>
<Paragraph position="3"> We can make the same argument. But in our case, the split between supervised and unsupervised data is not the split between training and test data. Our supervised data are, roughly, the fragments of the training corpus that are unambiguously tagged thanks to the tag dictionary.8 The EM model may overfit some parameters to these fragments. The higher-entropy DA model may be less likely to overfit, allowing it to do better on the unsupervised data--i.e., the rest of the training corpus and the entire test corpus. [Footnote: ... accuracy is compared at either the word level or the sentence level. (Significance at $p < 10^{-6}$ under a binomial sign test in each case. E.g., on the test set, the DA model correctly tagged 1,652 words that EM's model missed, while EM correctly tagged 726 words that DA missed. Similarly, the DA model had higher accuracy on 850 sentences, while EM had higher accuracy on only 287. These differences are extremely unlikely to occur due to chance.) The differences in cross-entropy, compared by sentence, were significant in the training set but not the test set ($p < 0.01$ under a binomial sign test). Recall that lower cross-entropy means higher likelihood.]</Paragraph>
<Paragraph position="4"> We conclude that DA has settled on a local maximum of the likelihood function that (unsurprisingly) corresponds well with the entropy criterion, and perhaps as a result, does better on accuracy.</Paragraph>
</Section>
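The annealing schedule used above (start at $\beta_{\min}$, run EM-style iterations to convergence at each temperature, multiply $\beta$ by $\alpha$, stop at $\beta_{\max}$) can be written as a small outer loop around an ordinary EM implementation. The sketch below is illustrative rather than the authors' code: `e_step`, `m_step`, and `objective` are hypothetical callbacks, and the exact interaction of the convergence test with the schedule is an assumption.

```python
# A sketch (not the authors' code) of the DA outer loop used in Section 4.1:
# beta starts at beta_min = 0.0001, EM-style iterations run to convergence at
# each temperature, and beta is multiplied by alpha = 1.2 until it reaches
# beta_max = 1.  `e_step`, `m_step`, and `objective` are hypothetical callbacks.
def deterministic_annealing(theta, e_step, m_step, objective,
                            beta_min=0.0001, beta_max=1.0, alpha=1.2,
                            rel_tol=1e-9):
    beta = beta_min
    while True:
        prev = objective(theta, beta)
        while True:                            # inner loop: EM at fixed beta
            counts = e_step(theta, beta)       # DP run on theta**beta (Section 3)
            theta = m_step(counts)             # e.g., normalize add-0.1 smoothed counts
            cur = objective(theta, beta)
            if abs(cur - prev) <= rel_tol * abs(prev):
                break                          # converged at this temperature
            prev = cur
        if beta >= beta_max:                   # beta = 1 is ordinary EM
            return theta
        beta = min(beta * alpha, beta_max)     # anneal: raise beta
```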
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Significance </SectionTitle>
<Paragraph position="0"> Seeking to determine how well this result generalized, we randomly split the corpus into ten equally sized, nonoverlapping parts. EM and DA were run on each portion;9 the results were inconclusive. DA achieved better test accuracy than EM on three of ten trials, better training likelihood on five trials, and better test likelihood on all ten trials.10 Certainly, decreasing the amount of data by an order of magnitude results in increased variance in the performance of any algorithm--so ten small corpora were not enough to determine whether to expect an improvement from DA more often than not. [Footnote 9: ... names as interchangeable and could not reasonably be evaluated on gold-standard accuracy.] [Footnote 10: ... many E steps; DA, 1.3 times). Perhaps the algorithms traveled farther to find a local maximum. We know of no study of the effect of unlabeled training set size on the likelihood surface, but suggest two issues for future exploration. Larger datasets contain more idiosyncrasies but provide a stronger overall signal. Hence, we might expect them to yield a bumpier likelihood surface whose local maxima are more numerous but also differ more noticeably in height. Both these tendencies of larger datasets would in theory increase DA's advantage over EM.]</Paragraph>
</Section>
<Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 Mixing labeled and unlabeled data (I) </SectionTitle>
<Paragraph position="0"> In the other conditions described by Merialdo, varying amounts of labeled data (ranging from 100 sentences to nearly half of the corpus) were used to initialize the parameters $\vec{\theta}$, which were then trained using EM on the remaining unlabeled data. Only in the case where 100 labeled examples were used, and only for a few iterations, did EM improve the accuracy of this model. We replicated these experiments and compared EM with DA; DA damaged the models even more than EM did. This is unsurprising; as noted before, DA effectively ignores the initial parameters $\vec{\theta}^{(0)}$. Therefore, even if initializing with a model trained on small amounts of labeled data had helped EM, DA would have missed out on this benefit. In the next section we address this issue.</Paragraph>
</Section>
</Section>
<Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Skewed deterministic annealing </SectionTitle>
<Paragraph position="0"> The EM algorithm is quite sensitive to the initial parameters $\vec{\theta}^{(0)}$. We touted DA's insensitivity to those parameters as an advantage, but in scenarios where well-chosen initial parameters can be provided (as in &#167;4.3), we wish for DA to be able to exploit them.</Paragraph>
<Paragraph position="1"> In particular, there are at least two cases where &quot;good&quot; initializers might be known. One is the case explored by Merialdo, where some labeled data were available to build an initial model. The other is a situation where a good distribution is known over the labels y; we will see an example of this in &#167;6.</Paragraph>
<Paragraph position="2"> We wish to find a way to incorporate an initializer into DA and still reap the benefit of gradated difficulty. To see how this will come about, consider again the E step for DA, which chooses, for all y:

$$\tilde{p}(y) \leftarrow \frac{u(y)^{1-\beta}\,\Pr(y \mid \vec{x}, \vec{\theta})^{\beta}}{Z(\vec{\theta}, \beta)},$$

where the uniform distribution u(y) and $Z(\vec{\theta}, \beta)$ are normalizing terms. (Note that $Z(\vec{\theta}, \beta)$ does not depend on y because u(y) is constant with respect to y.) Of course, when $\beta$ is close to 0, DA chooses the uniform posterior because it has the highest entropy.</Paragraph>
<Paragraph position="3"> Seen this way, DA is interpolating in the log domain between two posteriors: the posterior given by $\vec{\theta}$ and the uniform one u; the interpolation coefficient is $\beta$. To generalize DA, we will replace the uniform u with another posterior, the &quot;skew&quot; posterior $\check{p}$, which is an input to the algorithm. This posterior might be specified directly, as it will be in &#167;6, or it might be computed using an M step from some good initial $\vec{\theta}^{(0)}$. The skewed DA (SDA) E step is then, for all y:

$$\tilde{p}(y) \leftarrow \frac{\check{p}(y)^{1-\beta}\,\Pr(y \mid \vec{x}, \vec{\theta})^{\beta}}{Z(\vec{\theta}, \beta)}.$$

</Paragraph>
<Paragraph position="4"> When $\beta$ is close to 0, the E step will choose $\tilde{p}$ to be very close to $\check{p}$. With small $\beta$, SDA is a &quot;cautious&quot; EM variant that is wary of moving too far from the initializing posterior $\check{p}$ (or, equivalently, the initial parameters $\vec{\theta}^{(0)}$). As $\beta$ approaches 1, the effect of $\check{p}$ will diminish, and when $\beta = 1$, the algorithm becomes identical to EM. The overall objective matches (2) except for the boxed term.</Paragraph>
<Paragraph position="5"> Mixing labeled and unlabeled data (II). Returning to Merialdo's mixed conditions (&#167;4.3), we found that SDA repaired the damage done by DA but did not offer any benefit over EM. Its behavior in the 100-labeled-sentence condition was similar to EM's, with a slightly but not significantly higher peak in training-set accuracy. In the other conditions, SDA behaved like EM, with steady degradation of accuracy as training proceeded. It ultimately damaged performance only as much as EM did, or did slightly better than EM (but still hurt).</Paragraph>
<Paragraph position="6"> This is unsurprising: Merialdo's result demonstrated that maximizing likelihood (ML) and maximizing accuracy are generally not the same; the EM algorithm consistently degraded the accuracy of his supervised models. SDA is simply another search algorithm with the same criterion as EM. SDA did do what it was expected to do--it used the initializer, repairing DA's damage.</Paragraph>
</Section>
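As a concrete illustration of this log-domain interpolation, the sketch below computes the SDA E step for a small, explicitly enumerated label set. The label distributions and the $\beta$ value are invented inputs (not from the paper), and setting the skew posterior to uniform recovers plain DA.

```python
# A minimal sketch of the SDA E step of Section 5 (illustrative only).
# For a finite set of labels y, it computes
#     p_tilde(y)  ∝  p_model(y | x, theta)**beta  *  p_skew(y)**(1 - beta),
# i.e., log-domain interpolation between the model posterior and the skew
# posterior with coefficient beta.  A uniform p_skew gives ordinary DA.
import numpy as np

def sda_e_step(p_model, p_skew, beta):
    """p_model, p_skew: posterior probabilities over the same label set."""
    log_p = beta * np.log(p_model) + (1.0 - beta) * np.log(p_skew)
    log_p -= log_p.max()                      # for numerical stability
    p = np.exp(log_p)
    return p / p.sum()                        # divide by Z(theta, beta)

p_model = np.array([0.70, 0.20, 0.10])        # Pr(y | x, theta), made up
p_skew  = np.array([0.10, 0.10, 0.80])        # skew posterior from an initializer

print(sda_e_step(p_model, p_skew, beta=0.01))   # nearly p_skew
print(sda_e_step(p_model, p_skew, beta=1.0))    # exactly p_model (EM's E step)
```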
<Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 6 Grammar induction </SectionTitle>
<Paragraph position="0"> We turn next to the problem of statistical grammar induction: inducing parse trees over unlabeled text. An excellent recent result is by Klein and Manning (2002). The constituent-context model (CCM) they present is a generative, deficient channel model of POS tag strings given binary tree bracketings. We first review the model and describe a small modification that reduces the deficiency, then compare both models under EM and DA.</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 6.1 Constituent-context model </SectionTitle>
<Paragraph position="0"> Let (x, y) be a (tag sequence, binary tree) pair. $x_i^j$ denotes the subsequence of x from the ith to the jth word. Let $y_{i,j}$ be 1 if the yield from i to j is a constituent in the tree y and 0 if it is not. The CCM gives to a pair (x, y) the following probability:

$$\Pr(x, y) = \Pr(y) \prod_{i \le j} \psi\big(x_i^j \mid y_{i,j}\big)\,\chi\big(x_{i-1}, x_{j+1} \mid y_{i,j}\big),$$

where $\psi$ is a conditional distribution over possible tag-sequence yields (given whether the yield is a constituent or not) and $\chi$ is a conditional distribution over possible contexts of one tag on either side of the yield (given whether the yield is a constituent or not). There are therefore four distributions to be estimated; Pr(y) is taken to be uniform.</Paragraph>
<Paragraph position="1"> The model is initialized using expected counts of the constituent and context features given that all the trees are generated according to a random-split model.11 The CCM generates each tag not once but $O(n^2)$ times, once by every constituent or non-constituent span that dominates it. We suggest the following modification to alleviate some of the deficiency:

$$\Pr(x, y) = \Pr(y) \prod_{i \le j} \psi\big(x_i^j \mid y_{i,j},\, j - i + 1\big)\,\chi\big(x_{i-1}, x_{j+1} \mid y_{i,j}\big).$$

The change is to condition the yield feature $\psi$ on the length of the yield. This decreases deficiency by disallowing, for example, a constituent over a four-tag yield to generate a seven-tag sequence. It also decreases inter-parameter dependence by breaking the constituent (and non-constituent) distributions into a separate bin for each possible constituent length. We will refer to Klein and Manning's CCM and our version as models 1 and 2, respectively. [Footnote 11: We refer readers to Klein and Manning (2002) or Cover and Thomas (1991, p. 72) for details; computing expected counts for a sentence is a closed-form operation. Klein and Manning's argument for this initialization step is that it is less biased toward balanced trees than the uniform model used during learning; we also found that it works far better in practice.]</Paragraph>
</Section>
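To make the CCM factorization concrete, the sketch below scores a (tag sequence, bracketing) pair as a product of yield and context factors over every span, with an optional flag for model 2's length conditioning. It is a toy illustration under assumed conventions (0-based inclusive spans, a boundary symbol for contexts, invented and unnormalized probability tables), not the authors' implementation.

```python
# A toy illustration (not the authors' code) of the CCM factorization: the
# score of a (tag sequence, bracketing) pair is a product, over every span, of
# a yield factor psi(yield | constituent?) and a context factor
# chi((left tag, right tag) | constituent?).
SENTINEL = "#"   # boundary symbol used as the context outside the sentence

def ccm_score(tags, constituents, psi, chi, condition_on_length=False):
    """constituents: set of (i, j) spans (0-based, inclusive) that are bracketed."""
    score = 1.0
    n = len(tags)
    for i in range(n):
        for j in range(i, n):
            is_const = (i, j) in constituents
            yield_ = tuple(tags[i:j + 1])
            context = (tags[i - 1] if i > 0 else SENTINEL,
                       tags[j + 1] if j + 1 < n else SENTINEL)
            if condition_on_length:
                key = (yield_, is_const, len(yield_))   # model 2: length bins
            else:
                key = (yield_, is_const)                # model 1
            score *= psi.get(key, 1e-6) * chi.get((context, is_const), 1e-6)
    return score

tags = ["DT", "NN", "VBZ"]                      # a toy three-tag sentence
constituents = {(0, 2), (0, 1)}                 # bracketing [[DT NN] VBZ]
psi = {(("DT", "NN"), True): 0.4}               # invented, incomplete tables;
chi = {(("#", "VBZ"), True): 0.3}               # unseen events fall back to 1e-6
print(ccm_score(tags, constituents, psi, chi))
```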
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 6.2 Experiment </SectionTitle>
<Paragraph position="0"> We ran experiments using both CCM models on the tag sequences of length ten or less in the Wall Street Journal Penn Treebank corpus, after extracting punctuation. This corpus consists of 7,519 sentences (52,837 tag tokens, 38 types). We report PARSEVAL scores averaged by constituent (rather than by sentence), and do not give the learner credit for getting full sentences or single tags as constituents.12 Because the E step for this model is computationally intensive, we set the DA parameters at $\beta_{\min} = 0.01$ and $\alpha = 1.5$ so that fewer E steps would be necessary.13 The convergence criterion was relative improvement $< 10^{-9}$ in the objective. [Footnote 12: This is why the CCM 1 performance reported here differs from Klein and Manning's; our implementation of the EM condition gave virtually identical results under either evaluation scheme (D. Klein, personal communication).]</Paragraph>
<Paragraph position="1"> The results are shown in Table 2. [Table 2 caption (fragment): ... to SDA initialized with a uniform distribution. The third line corresponds to the setup reported by Klein and Manning (2002). UR is unlabeled recall, UP is unlabeled precision, F is their harmonic mean, and CB is the average number of crossing brackets per sentence. All evaluation is on the same data used for unsupervised learning (i.e., there is no training/test split). The high cross-entropy values arise from the deficiency of models 1 and 2, and are not comparable across models.] The first point to notice is that a uniform initializer is a bad idea, as Klein and Manning predicted. All conditions but one find better structure when initialized with Klein and Manning's random-split model. (The exception is SDA on model 1; possibly the high deficiency of model 1 interacts poorly with SDA's search in some way.) Next we note that with the random-split initializer, our model 2 is a bit better than model 1 on PARSEVAL measures and converges more quickly.</Paragraph>
<Paragraph position="2"> Every instance of DA or SDA achieved higher log-likelihood than the corresponding EM condition. This is what we hoped to gain from annealing: better local maxima. In the case of model 2 with the random-split initializer, SDA significantly outperformed EM (comparing both matches and crossing brackets per sentence under a binomial sign test, $p < 10^{-6}$); we see a > 5% reduction in average crossing brackets per sentence. Thus, our strategy of using DA but modifying it to accept an initializer worked as desired in this case, yielding our best overall performance.</Paragraph>
<Paragraph position="3"> The systematic results we describe next suggest that these patterns persist across different training sets in this domain.</Paragraph>
</Section>
<Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 6.3 Significance </SectionTitle>
<Paragraph position="0"> The difficulty we experienced in finding generalization to small datasets, discussed in &#167;4.2, was apparent here as well. For 10-way and 3-way random, nonoverlapping splits of the dataset, we did not have consistent results in favor of either EM or SDA. Interestingly, we found that training model 2 (using EM or SDA) on 10% of the corpus resulted on average in models that performed nearly as well on their respective training sets as the full-corpus condition did on its training set; see Table 3. In addition, SDA sometimes performed as well as EM under model 1. For a random two-way split, EM and SDA converged to almost identical solutions on one of the sub-corpora, and SDA outperformed EM significantly on the other (on model 2).</Paragraph>
<Paragraph position="1"> In order to get multiple points of comparison of EM and SDA on this task with a larger amount of data, we jack-knifed the WSJ-10 corpus by splitting it randomly into ten equally sized, nonoverlapping parts and then training models on the corpus with each of the ten sub-corpora excluded.14 These trials are not independent of each other; any two of the resulting training sets have 8/9 of their data in common. Aggregate results are shown in Table 3. [Table 3 caption (fragment): ... performance for 10 trials using 10% of the corpus and 10 jack-knifed trials using 90% of the corpus.] Using model 2, SDA always outperformed EM, and in 8 of 10 cases the difference was significant when comparing matching constituents per sentence (7 of 10 when comparing crossing constituents).15 The variance of SDA was far less than that of EM; SDA not only always performed better with model 2, but its performance was also more consistent over the trials. [Footnote 14: Note that this is not a cross-validation experiment; results are reported on the unlabeled training set, and the excluded sub-corpus remains unused.] [Footnote 15: Binomial sign test, with significance defined as $p < 0.05$, though all significant results had $p < 0.001$.]</Paragraph>
<Paragraph position="2"> We conclude this experimental discussion by cautioning that both CCM models are highly deficient models, and it is unknown how well they generalize to corpora of longer sentences, other languages, or corpora of words (rather than POS tags).</Paragraph>
</Section>
</Section>
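The significance claims in &#167;4.2 and &#167;6.3 rest on two-sided binomial sign tests over paired comparisons. The sketch below is illustrative rather than the authors' code; it reuses the sentence-level counts quoted in the footnote fragment in &#167;4.1 (DA better on 850 sentences, EM better on 287).

```python
# A short sketch (not the authors' code) of the two-sided binomial sign test
# behind the significance claims in Sections 4.2 and 6.3.  Ties are discarded;
# under the null hypothesis each remaining item favors either system with
# probability 1/2.
from math import comb

def sign_test_p(wins_a, wins_b):
    n, k = wins_a + wins_b, min(wins_a, wins_b)
    tail = sum(comb(n, i) for i in range(k + 1))   # numerator of P(X <= k), X ~ Bin(n, 1/2)
    return min(1.0, 2 * tail / 2 ** n)             # two-sided p-value (exact big-int arithmetic)

# Sentence-level counts from Section 4.1's footnote: DA better on 850 sentences,
# EM better on 287.
print(sign_test_p(850, 287) < 1e-6)                # True, consistent with the reported p < 10^-6
```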
<Section position="8" start_page="0" end_page="0" type="metho"> <SectionTitle> 7 Future work </SectionTitle>
<Paragraph position="0"> There are a number of interesting directions for future work. Noting the simplicity of the DA algorithm, we hope that current devotees of EM will run comparisons of their models with DA (or SDA). Not only might this improve performance of existing systems, it will contribute to the general understanding of the likelihood surface for a variety of problems (e.g., this paper has raised the question of how factors like dataset size and model deficiency affect the likelihood surface).</Paragraph>
<Paragraph position="1"> DA provides a very natural way to gradually introduce complexity to clustering models (Rose et al., 1990; Pereira et al., 1993). This comes about by manipulating the $\beta$ parameter; as it rises, the number of effective clusters is allowed to increase. An open question is whether the analogues of &quot;clusters&quot; in tagging and parsing models--tag symbols and grammatical categories, respectively--might be treated in a similar manner under DA. For instance, we might begin with the CCM, the original formulation of which posits only one distinction about constituency (whether a span is a constituent or not), and gradually allow splits in constituent-label space, resulting in multiple grammatical categories that, we hope, arise naturally from the data.</Paragraph>
<Paragraph position="2"> In this paper, we used $\beta_{\max} = 1$. It would be interesting to explore the effect on accuracy of &quot;quenching,&quot; a phase at the end of optimization that rapidly raises $\beta$ from 1 to the winner-take-all (Viterbi) variant at $\beta = +\infty$.</Paragraph>
<Paragraph position="3"> Finally, certain practical speedups may be possible. For instance, increasing $\beta_{\min}$ and $\alpha$, as noted in &#167;2.2, will vary the number of E steps required for convergence. We suggested that the change might result in slower or faster convergence; optimizing the schedule using an online algorithm (or determining precisely how these parameters affect the schedule in practice) may prove beneficial. Another possibility is to relax the convergence criterion for earlier $\beta$ values, requiring fewer E steps before increasing $\beta$, or even to raise $\beta$ slightly after every E step (collapsing the outer and inner loops).</Paragraph>
</Section>
</Paper>