<?xml version="1.0" standalone="yes"?> <Paper uid="P05-1044"> <Title>Contrastive Estimation: Training Log-Linear Models on Unlabeled Data</Title>
<Section position="3" start_page="354" end_page="355" type="metho"> <SectionTitle> 2 Implicit Negative Evidence </SectionTitle>
<Paragraph position="0"> Natural language is a delicate thing. For any plausible sentence, there are many slight perturbations of it that will make it implausible. Consider, for example, the first sentence of this section. Suppose we choose one of its six words at random and remove it; on this example, the odds are two to one that the resulting sentence will be ungrammatical. Or we could randomly choose two adjacent words and transpose them; none of the results are valid conversational English. The learner we describe here takes into account not only the observed positive example, but also a set of similar but deprecated negative examples.</Paragraph>
<Section position="1" start_page="354" end_page="354" type="sub_section"> <SectionTitle> 2.1 Learning setting </SectionTitle>
<Paragraph position="0"> Let x = ⟨x_1, x_2, ...⟩ be our observed example sentences, where each x_i ∈ X, and let y_i* ∈ Y be the unobserved correct hidden structure for x_i (e.g., a POS sequence). We seek a model, parameterized by θ, such that the (unknown) correct analysis y_i* is the best analysis for x_i (under the model). If y_i* were observed, a variety of training criteria would be available (see Tab. 1), but y_i* is unknown, so none apply.</Paragraph>
<Paragraph position="1"> Typically one turns to the EM algorithm (Dempster et al., 1977), which locally maximizes

  \prod_i p(X = x_i \mid \vec{\theta}) \;=\; \prod_i \sum_{y \in \mathbb{Y}} p(X = x_i, Y = y \mid \vec{\theta})    (1)

where X is a random variable over sentences and Y a random variable over analyses (the notation is often abbreviated, eliminating the random variables).</Paragraph>
<Paragraph position="4"> An often-used alternative to EM is the class of so-called Viterbi approximations, which iteratively find the probabilistically best ŷ and then, on each iteration, solve a supervised problem (see Tab. 1).</Paragraph>
<Paragraph position="5"> [Tab. 1: supervised training criteria, such as joint likelihood (JL), \prod_i p(x_i, y_i^* \mid \vec{\theta}), and conditional likelihood (CL), \prod_i p(y_i^* \mid x_i, \vec{\theta}); all are written so as to be maximized. None of these criteria are available for unsupervised estimation because they all depend on the correct label, y_i*.]</Paragraph>
</Section>
<Section position="2" start_page="354" end_page="355" type="sub_section"> <SectionTitle> 2.2 A new approach: contrastive estimation </SectionTitle>
<Paragraph position="0"> Our approach instead maximizes

  \prod_i p\bigl(X = x_i \bigm| X \in N(x_i), \vec{\theta}\bigr)    (2)

where the "neighborhood" N(x_i) ⊆ X is a set of implicit negative examples plus the example x_i itself. As in EM, p(x_i | ..., θ) is found by marginalizing over hidden variables (Eq. 1). Note that the x′ ∈ N(x_i) are not treated as hard negative examples; we merely seek to move probability mass from them to the observed x_i.</Paragraph>
<Paragraph position="2"> The neighborhood of x, N(x), contains examples that are perturbations of x. We refer to the mapping N : X → 2^X as the neighborhood function, and the optimization of Eq. 2 as contrastive estimation (CE). CE seeks to move probability mass from the neighborhood of an observed x_i to x_i itself. The learner hypothesizes that good models are those which discriminate an observed example from its neighborhood. Put another way, the learner assumes not only that x_i is good, but that x_i is locally optimal in example space (X), and that alternative, similar examples (from the neighborhood) are inferior. Rather than explain all of the data, the model must only explain (using hidden variables) why the observed sentence is better than its neighbors. Of course, the validity of this hypothesis will depend on the form of the neighborhood function.</Paragraph>
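As a concrete illustration of Eq. 2, the following minimal Python sketch (not from the paper; the sentence scores and the neighborhood are toy stand-ins for a trained model and a real neighborhood function) computes the per-example CE objective in log space: the log of the observed sentence's total score minus the log of the total score of its whole neighborhood.

```python
import math

def ce_log_objective(observed, neighborhood, score):
    """Per-example contrastive objective (cf. Eq. 2), in log space.

    `score(x)` plays the role of sum_y u(x, y | theta): the total
    unnormalized score of sentence x under the model.
    """
    numerator = score(observed)
    denominator = sum(score(xp) for xp in neighborhood)  # includes the observed x
    return math.log(numerator) - math.log(denominator)

# Toy stand-in for a trained model's total score of each sentence.
toy_scores = {
    "natural language is a delicate thing": 9.0,
    "language natural is a delicate thing": 1.0,   # adjacent words transposed
    "natural language is a thing delicate": 2.0,   # adjacent words transposed
}
observed = "natural language is a delicate thing"
neighborhood = list(toy_scores)  # N(x) = perturbations plus x itself
print(ce_log_objective(observed, neighborhood, toy_scores.get))
# Raising the observed sentence's score, or lowering a neighbor's, increases
# the objective: only relative mass within N(x) matters.
```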
<Paragraph position="4"> Consider, as a concrete example, learning natural language syntax. In Smith and Eisner (2005), we define a sentence's neighborhood to be a set of slightly altered sentences that use the same lexemes, as suggested at the start of this section. While their syntax is degraded, the inferred meaning of any of these altered sentences is typically close to the intended meaning, yet the speaker chose x and not one of the other x′ ∈ N(x). Why? Deletions are likely to violate subcategorization requirements, and transpositions are likely to violate word-order requirements--both of which have something to do with syntax. x was the most grammatical option that conveyed the speaker's meaning, hence (we hope) roughly the most grammatical option in the neighborhood N(x), and the syntactic model should make it so.</Paragraph>
</Section> </Section>
<Section position="4" start_page="355" end_page="357" type="metho"> <SectionTitle> 3 Log-Linear Models </SectionTitle>
<Paragraph position="0"> We have not yet specified the form of our probabilistic model, only that it is parameterized by θ ∈ R^n. Log-linear models, which we will show are a natural fit for CE, assign probability to an (example, label) pair (x, y) according to

  p(x, y \mid \vec{\theta}) \;=\; \frac{u(x, y \mid \vec{\theta})}{Z(\vec{\theta})}    (3)

where the "unnormalized score" u(x, y | θ) is

  u(x, y \mid \vec{\theta}) \;=\; \exp\bigl(\vec{\theta} \cdot \vec{f}(x, y)\bigr) \;=\; \exp \sum_{j=1}^{n} \theta_j f_j(x, y)    (4)

The notation above is defined as follows. f : X × Y → R^n_{≥0} is a nonnegative vector feature function, and θ ∈ R^n are the corresponding feature weights (the model's parameters). Because the features can take any form and need not be orthogonal, log-linear models can capture arbitrary dependencies in the data and cleanly incorporate them into a model.</Paragraph>
<Paragraph position="5"> Z(θ) (the partition function) is chosen so that Σ_{(x,y)} p(x, y | θ) = 1; that is, Z(θ) = Σ_{(x,y)} u(x, y | θ). u is typically easy to compute for a given (x, y), but Z may be much harder to compute. All the objective functions in this paper take the form

  \prod_i \frac{\sum_{(x,y) \in A_i} p(x, y \mid \vec{\theta})}{\sum_{(x,y) \in B_i} p(x, y \mid \vec{\theta})}    (5)

where A_i ⊆ B_i (for each i). For log-linear models this is simply

  \prod_i \frac{\sum_{(x,y) \in A_i} u(x, y \mid \vec{\theta})}{\sum_{(x,y) \in B_i} u(x, y \mid \vec{\theta})}    (6)

So there is no need to compute Z(θ), but we do need to compute sums over A_i and B_i. Tab. 2 summarizes some concrete examples; see also §3.1-3.2.</Paragraph>
<Paragraph position="11">
[Tab. 2: Estimation with log-linear models in terms of Eq. 5.
  likelihood criterion    A_i              B_i
  joint                   {(x_i, y_i*)}    X × Y
  conditional             {(x_i, y_i*)}    {x_i} × Y
  marginal (à la EM)      {x_i} × Y        X × Y
  contrastive             {x_i} × Y        N(x_i) × Y ]
</Paragraph>
<Paragraph position="12"> We would prefer to choose an objective function such that these sums are easy to compute. CE focuses on choosing appropriate small contrast sets B_i, both for efficiency and to guide the learner. The natural choice for A_i (which is usually easier to sum over) is the set of (x, y) that are consistent with what was observed (partially or completely) about the ith training example; i.e., the numerator Σ_{(x,y)∈A_i} p(x, y | θ) is designed to find p(observation i | θ). The idea is to focus the probability mass within B_i on the subset A_i where the ith training example is known to be.</Paragraph>
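The following sketch (ours, not the paper's; the feature names and weights are invented) shows Eqs. 4 and 6 directly: an unnormalized log-score θ · f(x, y), and one factor of the objective as a difference of log-sum-exp terms over A_i and B_i, so the partition function never appears.

```python
import math
from collections import Counter

def u_log(theta, features):
    """log u(x, y | theta) = theta . f(x, y)  (cf. Eq. 4)."""
    return sum(theta.get(name, 0.0) * value for name, value in features.items())

def logsumexp(vals):
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))

def log_objective_term(theta, A_i, B_i):
    """One factor of Eq. 6 in log space: log sum_{A_i} u - log sum_{B_i} u.

    A_i and B_i are lists of feature dictionaries, one per (x, y) pair.
    Since A_i is a subset of B_i, each factor is at most 0.
    """
    return logsumexp([u_log(theta, f) for f in A_i]) - \
           logsumexp([u_log(theta, f) for f in B_i])

# Hypothetical feature vectors for two (sentence, tagging) pairs.
fx_y1 = Counter({"tag_bigram:DT_NN": 1, "emit:NN_thing": 1})
fx_y2 = Counter({"tag_bigram:DT_VB": 1, "emit:VB_thing": 1})
theta = {"tag_bigram:DT_NN": 0.5, "emit:NN_thing": 0.2}
print(log_objective_term(theta, A_i=[fx_y1], B_i=[fx_y1, fx_y2]))
```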
<Paragraph position="13"> It is possible to build log-linear models where each x_i is a sequence. (Footnote 2: Such models are exemplified by CRFs (Lafferty et al., 2001), which can be viewed alternately as undirected dynamic graphical models with a chain topology, as log-linear models over entire sequences with local features, or as WFSAs. Because "CRF" implies CL estimation, we use the term "WFSA.") In this paper, each model is a weighted finite-state automaton (WFSA) whose states correspond to POS tags. The parameter vector θ ∈ R^n specifies a weight for each of the n transitions in the automaton. y is a hidden path through the automaton (determining a POS sequence), and x is the string it emits. u(x, y | θ) is defined by applying exp to the total weight of all transitions in y. This is an example of Eqs. 4 and 6 where f_j(x, y) is the number of times the path y takes the jth transition.</Paragraph>
<Paragraph position="14"> The partition function Z(θ) of the WFSA is found by adding up the u-scores of all paths through the WFSA. For a k-state WFSA, this equates to solving a linear system of k equations in k variables (Tarjan, 1981). But if the WFSA contains cycles, this infinite sum may diverge. Alternatives to exact computation, like random sampling (see, e.g., Abney, 1997), will not help to avoid this difficulty; in addition, convergence rates are in general unknown and bounds are difficult to prove. We would prefer to sum over finitely many paths in B_i.</Paragraph>
<Section position="1" start_page="356" end_page="356" type="sub_section"> <SectionTitle> 3.1 Parameter estimation (supervised) </SectionTitle>
<Paragraph position="0"> For log-linear models, both CL and JL estimation (Tab. 1) are available. In terms of Eq. 5, both set A_i = {(x_i, y_i*)}. The difference is in B_i: for JL, B_i = X × Y, so summing over B_i is equivalent to computing the partition function Z(θ). Because that sum is typically difficult, CL is preferred; for CL, B_i = {x_i} × Y, which is often tractable. For sequence models like WFSAs, this sum is computed using a dynamic programming algorithm (the forward algorithm for WFSAs). Klein and Manning (2002) argue for CL on grounds of accuracy, but see also Johnson (2001). See Tab. 2; other contrast sets B_i are also possible.</Paragraph>
<Paragraph position="2"> When B_i contains only x_i paired with the current best competitor (ŷ) to y_i*, we have a technique that resembles maximum margin training (Crammer and Singer, 2001). Note that ŷ will then change across training iterations, making B_i dynamic.</Paragraph>
</Section>
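For a sequence model, the sum over B_i = {x_i} × Y is exactly a forward-algorithm computation. The sketch below is ours and is simplified: it uses a first-order (bigram) tag model rather than the paper's trigram tagger, with hypothetical transition and emission feature weights, and sums u(x, y | θ) over all tag paths for one sentence in log space.

```python
import math

def log_sum_over_tags(words, tags, trans_w, emit_w):
    """Forward algorithm in log space: log of sum over all tag paths y of
    u(x, y | theta), where u is exp of the total transition/emission weight.

    trans_w[(t_prev, t)] and emit_w[(t, word)] are feature weights
    (defaulting to 0, i.e., a u-contribution of 1).
    """
    def lse(vals):
        m = max(vals)
        return m + math.log(sum(math.exp(v - m) for v in vals))

    # alpha[t] = log of total u-score of all tag prefixes ending in tag t
    alpha = {t: emit_w.get((t, words[0]), 0.0) for t in tags}
    for word in words[1:]:
        alpha = {
            t: lse([alpha[tp] + trans_w.get((tp, t), 0.0) for tp in tags])
               + emit_w.get((t, word), 0.0)
            for t in tags
        }
    return lse(list(alpha.values()))

# In CL estimation this sum is the denominator (B_i = {x_i} x Y); it is also
# the numerator of the unsupervised CE objective in Section 3.2.
print(log_sum_over_tags(["a", "thing"], ["DT", "NN"],
                        {("DT", "NN"): 1.2}, {("DT", "a"): 0.8}))
```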
<Section position="2" start_page="356" end_page="356" type="sub_section"> <SectionTitle> 3.2 Parameter estimation (unsupervised) </SectionTitle>
<Paragraph position="0"> The difference between supervised and unsupervised learning is that in the latter case, A_i is forced to sum over label sequences y because they were not observed. In the unsupervised case, CE maximizes

  L_N(\vec{\theta}) \;=\; \prod_i \frac{\sum_{y \in \mathbb{Y}} u(x_i, y \mid \vec{\theta})}{\sum_{(x,y) \in N(x_i) \times \mathbb{Y}} u(x, y \mid \vec{\theta})}    (7)

In terms of Eq. 5, A_i = {x_i} × Y and B_i = N(x_i) × Y.</Paragraph>
<Paragraph position="3"> EM's objective function (Eq. 1) is a special case where N(x_i) = X for all i, and the denominator becomes Z(θ). An alternative is to restrict the neighborhood to the set of observed training examples rather than all possible examples (Riezler, 1999; Johnson et al., 1999; Riezler et al., 2000). Viewed as a CE method, this approach (though effective when there are few hypotheses) seems misguided; the objective says to move mass to each example at the expense of all other training examples. Another variant is conditional EM: let x_i be a pair (x_{i,1}, x_{i,2}) and define the neighborhood so that one component of the pair is held fixed while the other ranges over all of its possible values, so that Eq. 2 becomes a conditional likelihood (8). This criterion has been applied to conditional densities (Jebara and Pentland, 1998) and conditional training of acoustic models with hidden variables (Valtchev et al., 1997).</Paragraph>
<Paragraph position="6"> Generally speaking, CE is equivalent to some kind of EM when N(·) is an equivalence relation on examples, so that the neighborhoods partition X. Then, if q is any fixed (untrained) distribution over neighborhoods, CE equates to running EM on the model defined by

  p(x \mid \vec{\theta}) \;=\; q\bigl(N(x)\bigr) \cdot p\bigl(x \bigm| X \in N(x), \vec{\theta}\bigr)    (9)

CE may also be viewed as an importance-sampling approximation to EM, where the sample space X is replaced by N(x_i). We will demonstrate experimentally that CE is not just an approximation to EM; it makes sense from a modeling perspective.</Paragraph>
<Paragraph position="10"> In §4, we will describe neighborhoods of sequences that can be represented as acyclic lattices built directly from an observed sequence. The sum over B_i is then the total u-score, under our model, of all paths in the neighborhood lattice. To compute it, intersect the WFSA and the lattice, obtaining a new acyclic WFSA, and sum the u-scores of all its paths (Eisner, 2002) using a simple dynamic programming algorithm akin to the forward algorithm. The sum over A_i may be computed similarly.</Paragraph>
<Paragraph position="11"> CE with lattice neighborhoods is not confined to the WFSAs of this paper; when estimating weighted CFGs, the key algorithm is the inside algorithm for lattice parsing (Smith and Eisner, 2005).</Paragraph>
</Section>
<Section position="3" start_page="356" end_page="357" type="sub_section"> <SectionTitle> 3.3 Numerical optimization </SectionTitle>
<Paragraph position="0"> To maximize the neighborhood likelihood (Eq. 7), we apply a standard numerical optimization method (L-BFGS) that iteratively climbs the function using knowledge of its value and gradient (Liu and Nocedal, 1989). The partial derivative of log L_N with respect to the jth feature weight θ_j is

  \frac{\partial \log L_N}{\partial \theta_j} \;=\; \sum_i \Bigl( E_{\vec{\theta}}\bigl[f_j \mid A_i\bigr] - E_{\vec{\theta}}\bigl[f_j \mid B_i\bigr] \Bigr)    (10)

where E_θ[f_j | S] is the expectation of f_j under the model restricted to the set S. This looks similar to the gradient of log-linear likelihood functions on complete data, though the expectation on the left is in those cases replaced by an observed feature value f_j(x_i, y_i*). In this paper, the expectations in Eq. 10 are computed by the forward-backward algorithm generalized to lattices.</Paragraph>
<Paragraph position="1">
[Fig. 1: For each neighborhood (a. DEL1WORD, etc.), a small transducer is applied to the example sentence "natural language is a delicate thing"; the resulting neighborhood lattice (via composition with the sentence, followed by determinization and minimization) is shown to its right.]
</Paragraph>
<Paragraph position="2"> We emphasize that the function L_N is not globally concave; our search will lead only to a local optimum. Therefore, as with all unsupervised statistical learning, the bias in the initialization of θ will affect the quality of the estimate and the performance of the method. In future work we might wish to apply techniques for avoiding local optima, such as deterministic annealing (Smith and Eisner, 2004).</Paragraph>
</Section> </Section>
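A brute-force sketch of Eq. 10 (ours; explicit enumeration of (x, y) pairs stands in for the forward-backward algorithm on lattices): the per-example gradient is the difference between the expected feature vector under the model restricted to A_i and the one restricted to B_i.

```python
import math

def expected_features(theta, pairs, feats):
    """E_theta[f | S]: expectation of each feature under the model
    renormalized over the set S of (x, y) pairs.

    `feats(x, y)` returns a dict of feature counts; brute-force enumeration
    stands in for forward-backward on a lattice (cf. Eq. 10).
    """
    scores = [math.exp(sum(theta.get(k, 0.0) * v for k, v in feats(x, y).items()))
              for x, y in pairs]
    z = sum(scores)
    expect = {}
    for (x, y), s in zip(pairs, scores):
        for k, v in feats(x, y).items():
            expect[k] = expect.get(k, 0.0) + (s / z) * v
    return expect

def ce_gradient(theta, A_i, B_i, feats):
    """Per-example gradient of log L_N: E[f | A_i] - E[f | B_i]."""
    ea = expected_features(theta, A_i, feats)
    eb = expected_features(theta, B_i, feats)
    return {k: ea.get(k, 0.0) - eb.get(k, 0.0) for k in set(ea) | set(eb)}
```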
<Section position="5" start_page="357" end_page="358" type="metho"> <SectionTitle> 4 Lattice Neighborhoods </SectionTitle>
<Paragraph position="0"> We next consider some non-classical neighborhood functions for sequences. When X = Σ^+ for some symbol alphabet Σ, certain kinds of neighborhoods have natural, compact representations. Given an input string x = ⟨x_1, x_2, ..., x_m⟩, we write x_i^j for the substring ⟨x_i, x_{i+1}, ..., x_j⟩ and x_1^m for the whole string. Consider first the neighborhood consisting of all sequences generated by deleting a single symbol from the m-length sequence x_1^m:

  \mathrm{DEL1WORD}(x_1^m) \;=\; \bigl\{ x_1^{i-1}\, x_{i+1}^m : 1 \le i \le m \bigr\} \cup \bigl\{ x_1^m \bigr\}    (11)

This set consists of m + 1 strings and can be compactly represented as a lattice (see Fig. 1a). Another neighborhood, TRANS1, transposes any pair of adjacent words:

  \mathrm{TRANS1}(x_1^m) \;=\; \bigl\{ x_1^{i-1}\, x_{i+1}\, x_i\, x_{i+2}^m : 1 \le i < m \bigr\} \cup \bigl\{ x_1^m \bigr\}    (12)

This set can also be compactly represented as a lattice (Fig. 1b). We can combine DEL1WORD and TRANS1 by taking their union; this gives a larger neighborhood, DELORTRANS1. (Footnote 4: In general, the lattices are obtained by composing the observed sequence with a small FST and determinizing and minimizing the result; the relevant transducers are shown in Fig. 1.)</Paragraph>
<Paragraph position="4"> The DEL1SUBSEQ neighborhood allows the deletion of any contiguous subsequence of words that is strictly smaller than the whole sequence. This lattice is similar to that of DEL1WORD, but adds some arcs (Fig. 1c); the size of this neighborhood is O(m^2).</Paragraph>
<Paragraph position="5"> A final neighborhood we will consider is LENGTH, which consists of Σ^m (all sequences of the same length as x). CE with the LENGTH neighborhood is very similar to EM; it is equivalent to using EM to estimate the parameters of a model defined by Eq. 9, where q is any fixed (untrained) distribution over lengths. When the vocabulary Σ is the set of words in a natural language, it is never fully known; approximations for defining LENGTH = Σ^m include using the observed Σ from the training set (as we do) or adding a special OOV symbol.</Paragraph>
</Section>
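The neighborhoods of Eqs. 11 and 12 are easy to enumerate explicitly for illustration; the paper represents them compactly as lattices built by finite-state composition, so the sketch below (ours) simply lists the member strings.

```python
def del1word(words):
    """DEL1WORD (Eq. 11): every single-word deletion, plus the sentence itself."""
    out = {tuple(words)}
    for i in range(len(words)):
        out.add(tuple(words[:i] + words[i + 1:]))
    return out

def trans1(words):
    """TRANS1 (Eq. 12): every adjacent transposition, plus the sentence itself."""
    out = {tuple(words)}
    for i in range(len(words) - 1):
        swapped = list(words)
        swapped[i], swapped[i + 1] = swapped[i + 1], swapped[i]
        out.add(tuple(swapped))
    return out

def delortrans1(words):
    """DELORTRANS1: the union of the two neighborhoods above."""
    return del1word(words) | trans1(words)

x = "natural language is a delicate thing".split()
assert len(del1word(x)) == len(x) + 1   # m + 1 strings, as noted in Section 4
print(len(delortrans1(x)))
```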
<Section position="6" start_page="358" end_page="360" type="metho"> <SectionTitle> 5 Experiments </SectionTitle>
<Paragraph position="0"> We compare CE (using neighborhoods from §4) with EM on POS tagging using unlabeled data.</Paragraph>
<Section position="1" start_page="358" end_page="359" type="sub_section"> <SectionTitle> 5.1 Comparison with EM </SectionTitle>
<Paragraph position="0"> Our experiments are inspired by those in Merialdo (1994); we train a trigram tagger using only unlabeled data, assuming complete knowledge of the tagging dictionary. In our experiments, we varied the amount of data available (12K-96K words of WSJ), the heaviness of smoothing, and the estimation criterion. In all cases, training stopped when the relative change in the criterion fell below 10^-4 between steps (typically about 100 steps). For this corpus and tag set, on average, a tagger must decide among 2.3 tags for a given token.</Paragraph>
<Paragraph position="1"> The generative model trained by EM was identical to Merialdo's: a second-order HMM. We smoothed using a flat Dirichlet prior with a single parameter λ for all distributions (λ-values from 0 to 10 were tested). The model was initialized uniformly.</Paragraph>
<Paragraph position="2"> The log-linear models trained by CE used the same feature set, though the feature weights are no longer log-probabilities and there are no sum-to-one constraints. In addition to an unsmoothed trial, we tried diagonal Gaussian priors (quadratic penalty) with σ² ranging from 0.1 to 10. The models were initialized with all θ_j = 0.</Paragraph>
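A small sketch (ours; function and variable names are illustrative) of how the diagonal Gaussian prior enters training: the quadratic penalty subtracts θ_j²/(2σ²) from the log objective and θ_j/σ² from each gradient component.

```python
def penalized_objective_and_grad(log_LN, grad_LN, theta, sigma2):
    """Apply a zero-mean diagonal Gaussian prior (quadratic penalty) with
    variance sigma2 to an unpenalized log-objective value `log_LN` and its
    gradient dict `grad_LN` (cf. Eq. 10)."""
    obj = log_LN - sum(v * v for v in theta.values()) / (2.0 * sigma2)
    keys = set(grad_LN) | set(theta)
    grad = {k: grad_LN.get(k, 0.0) - theta.get(k, 0.0) / sigma2 for k in keys}
    return obj, grad
```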
<Paragraph position="3"> Unsupervised model selection. For each (criterion, dataset) pair, we selected the smoothing trial that gave the highest estimation-criterion score on a 5K-word development set (also unlabeled).</Paragraph>
<Paragraph position="4"> Results. The plot in Fig. 2 shows the Viterbi accuracy of each criterion trained on the 96K-word dataset as smoothing was varied; the table shows, for each (criterion, dataset) pair, the performance of the selected λ or σ² and of the one chosen by an oracle. LENGTH, TRANS1, and DELORTRANS1 are consistently the best, far outstripping EM. These gains dwarf the performance of EM on over 1.1M words (66.6% as reported by Smith and Eisner (2004)), even when the latter uses improved search (70.0%). DEL1WORD and DEL1SUBSEQ, on the other hand, are poor, even worse than EM on larger datasets.</Paragraph>
<Paragraph position="5">
[Fig. 2: Viterbi tagging accuracy of each criterion on the 96K-word dataset as smoothing (λ for EM, σ² for the CE cases) varies. The model selected from each criterion using unlabeled development data is circled in the plot. Dataset size is varied in the table at right, which shows models selected using unlabeled development data ("sel.") and using an oracle ("oracle," the highest point on a curve). Across conditions, some neighborhood roughly splits the difference between supervised models and EM.]
</Paragraph>
<Paragraph position="6"> An important result is that neighborhoods do not succeed by virtue of approximating log-linear EM; if that were so, we would expect larger neighborhoods (like DEL1SUBSEQ) to outperform smaller ones (like TRANS1), and this is not so. DEL1SUBSEQ and DEL1WORD are poor because they do not give helpful classes of negative evidence: deleting a word or a short subsequence often does very little damage. Put another way, models that do a good job of explaining why no word or subsequence should be deleted do not do so using the familiar POS categories.</Paragraph>
<Paragraph position="7"> The LENGTH neighborhood is as close to log-linear EM as it is practical to get. The inconsistencies in the LENGTH curve (Fig. 2) are notable and also appeared at the other training-set sizes. Believing this might be indicative of brittleness in Viterbi label selection, we computed the expected accuracy of the LENGTH models; the same "dips" were present. This could indicate that the learner was trapped in a local maximum; since the other criteria did not exhibit this behavior, the LENGTH objective may simply have a bumpier surface. It would be interesting to measure the bumpiness (sensitivity to initial conditions) of different contrastive objectives. (Footnote 7: A reviewer suggested including a table comparing different criterion values for each learned model, i.e., each neighborhood evaluated on each other neighborhood. This table contained no big surprises; we note only that most models were the best on their own criterion, and among unsupervised models, LENGTH performed best on the CL criterion.)</Paragraph>
<Paragraph position="8">
[Tab. 3: Accuracy on all words for DELORTRANS1, TRANS1, and LENGTH (trigram and trigram+spelling models) and EM (trigram), under each tagging-dictionary condition; unsupervised ("sel.") and oracle ("oracle") model selection across smoothing parameters are shown. Note that we evaluated on all words (unlike Fig. 3) and used 17 coarse tags, giving higher scores than in Fig. 2.]
</Paragraph>
</Section>
<Section position="2" start_page="359" end_page="360" type="sub_section"> <SectionTitle> 5.2 Removing knowledge, adding features </SectionTitle>
<Paragraph position="0"> The assumption that the tagging dictionary is completely known is difficult to justify. While a POS lexicon might be available for a new language, it will certainly not give exhaustive information about all word types in a corpus. We experimented with removing knowledge from the tagging dictionary, thereby increasing the difficulty of the task, to see how well the various objective functions could recover. One means to recovery is the addition of features to the model; this is easy with log-linear models but not with classical generative models.</Paragraph>
<Paragraph position="2"> We compared the performance of the best neighborhoods (LENGTH, DELORTRANS1, and TRANS1) from the first experiment, plus EM, using three diluted dictionaries and the original one, on the 24K dataset. A diluted dictionary adds (tag, word) entries so that rare words are allowed with any tag, simulating zero prior knowledge about the word. "Rare" might be defined in different ways; we used three definitions: words unseen in the first 500 sentences (about half of the 24K training corpus); singletons (words with count ≤ 1); and words with count ≤ 2. To allow more trials, we projected the original 45 tags onto a coarser set of 17 (e.g., RB → ADV).</Paragraph>
<Paragraph position="4"> To take better advantage of the power of log-linear models--specifically, their ability to incorporate novel features--we also ran trials augmenting the model with spelling features, allowing exploitation of correlations between parts of a word and a possible tag. Our spelling features included all observed 1-, 2-, and 3-character suffixes, initial capitalization, containing a hyphen, and containing a digit.</Paragraph>
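A minimal sketch of a spelling-feature extractor of the kind just described (the exact feature templates and names below are our own, not the paper's):

```python
def spelling_features(word):
    """Spelling features of the kind described above: 1-, 2-, and 3-character
    suffixes, initial capitalization, a hyphen, and a digit.
    Feature names are illustrative."""
    feats = {}
    for k in (1, 2, 3):
        if len(word) >= k:
            feats[f"suffix{k}={word[-k:].lower()}"] = 1.0
    if word[:1].isupper():
        feats["initial-cap"] = 1.0
    if "-" in word:
        feats["has-hyphen"] = 1.0
    if any(ch.isdigit() for ch in word):
        feats["has-digit"] = 1.0
    return feats

print(spelling_features("Well-known"))
# {'suffix1=n': 1.0, 'suffix2=wn': 1.0, 'suffix3=own': 1.0,
#  'initial-cap': 1.0, 'has-hyphen': 1.0}
```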
<Paragraph position="5"> Results. Fig. 3 plots tagging accuracy (on ambiguous words) for each dictionary on the 24K dataset. The x-axis is the smoothing parameter (λ for EM, σ² for CE). Note that the different plots are not comparable, because their y-axes are based on different sets of ambiguous words.</Paragraph>
<Paragraph position="6"> So that models under different dilution conditions could be compared, we computed accuracy on all words; these results are shown in Tab. 3. The reader will notice that there is often a large gap between unsupervised and oracle model selection; this draws attention to a need for better unsupervised regularization and model selection techniques.</Paragraph>
<Paragraph position="7"> Without spelling features, all models perform worse as knowledge is removed, but LENGTH suffers most substantially, relative to its initial performance. Why is this? LENGTH (like EM) requires the model to explain why a given sentence was seen instead of some other sentence of the same length.</Paragraph>
<Paragraph position="8"> One way to make this explanation is to manipulate emission weights (i.e., the weights of (tag, word) features): the learner can construct a good class-based unigram model of the text (where classes are tags). This is good for the LENGTH objective, but not for learning good POS tag sequences.</Paragraph>
<Paragraph position="9"> In contrast, DELORTRANS1 and TRANS1 do not allow the learner to manipulate emission weights for words not in the sentence. The sentence's goodness must be explained in some way other than by the words it contains: namely, through the POS tags. To check this intuition, we built locally normalized models p(word | tag) from the parameters learned by TRANS1 and LENGTH. For each tag, these were compared by KL divergence to the empirical lexical distributions (from labeled data). For the ten tags accounting for 95.6% of the data, LENGTH more closely matched the empirical lexical distributions. LENGTH is learning a correct distribution, but that distribution is not helpful for the task.</Paragraph>
<Paragraph position="10"> The improvement from adding spelling features is striking: DELORTRANS1 and TRANS1 recover nearly completely (modulo the model selection problem) from the diluted dictionaries. LENGTH sees far less recovery. Hence even our improved feature sets cannot compensate for the choice of neighborhood. This highlights our argument that a neighborhood is not an approximation to log-linear EM; LENGTH tries very hard to approximate log-linear EM but requires a good dictionary to be on par with the other criteria. Good neighborhoods, rather, perform well in their own right.</Paragraph>
</Section> </Section>
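A sketch of the check described above (the paper does not give code; we assume per-(tag, word) emission weights, which may differ from the paper's exact parameterization): locally normalize emission weights into p(word | tag) and compare against an empirical lexical distribution by KL divergence.

```python
import math

def lexical_distributions(emit_w, vocab, tags):
    """Locally normalize emission weights into p(word | tag).
    emit_w[(tag, word)] is a learned feature weight (assumed form)."""
    p = {}
    for t in tags:
        scores = {w: math.exp(emit_w.get((t, w), 0.0)) for w in vocab}
        z = sum(scores.values())
        p[t] = {w: s / z for w, s in scores.items()}
    return p

def kl(p, q, eps=1e-12):
    """KL(p || q) between two distributions over the same vocabulary."""
    return sum(pw * math.log(pw / max(q.get(w, 0.0), eps))
               for w, pw in p.items() if pw > 0.0)

# e.g., compare the empirical lexical distribution for one tag (relative
# frequencies from labeled data) to a learned model's p(word | tag):
# kl(empirical["NN"], lexical_distributions(emit_w, vocab, ["NN"])["NN"])
```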
<Section position="7" start_page="360" end_page="361" type="metho"> <SectionTitle> 6 Future Work </SectionTitle>
<Paragraph position="0"> Foremost for future work is the "minimally supervised" paradigm in which a small amount of labeled data is available (see, e.g., Clark et al. (2003)). Unlike well-known "bootstrapping" approaches (Yarowsky, 1995), EM and CE have the possible advantage of maintaining posteriors over hidden labels (or structure) throughout learning; bootstrapping either chooses, for each example, a single label, or remains completely agnostic. One can envision a mixed objective function that tries to fit the labeled examples while discriminating unlabeled examples from their neighborhoods. (Footnote 8: Zhu and Ghahramani (2002) explored the semi-supervised classification problem for spatially distributed data, where some data are labeled, using a Boltzmann machine to model the dataset. For them, the Markov random field is over labeling configurations for all examples, not, as in our case, complex structured labels for a particular example. Hence their B (Eq. 5), though very large, was finite and could be sampled.)</Paragraph>
<Paragraph position="1"> Regardless of how much (if any) data are labeled, the question of good smoothing techniques requires more attention. Here we used a single zero-mean, constant-variance Gaussian prior for all parameters. Better performance might be achieved by allowing different variances for different feature types. This leads to a need for more efficient tuning of the prior parameters on development data.</Paragraph>
<Paragraph position="2">
[Fig. 3: Tagging accuracy on ambiguous words (17 coarse tags) on the 24K dataset, as the dictionary is diluted and with spelling features. Each graph corresponds to a different level of dilution; panel titles include "All train & development words are in the tagging dictionary" and "Tagging dictionary taken from the first 500 sentences." Models selected using unlabeled development data are circled. These plots (unlike Tab. 3) are not comparable to each other because each is measured on a different set of ambiguous words.]
</Paragraph>
<Paragraph position="3"> The effectiveness of CE (and of different neighborhoods) for dependency grammar induction is explored in Smith and Eisner (2005) with considerable success. We introduce there the notion of designing neighborhoods to guide learning for particular tasks. Instead of guiding an unsupervised learner to match linguists' annotations, the choice of neighborhood might be made to direct the learner toward hidden structure that is helpful for error-correction tasks, such as spelling correction and punctuation restoration, that may benefit from a grammatical model.</Paragraph>
<Paragraph position="4"> Wang et al. (2002) discuss the latent maximum entropy principle. They advocate running EM many times and selecting the local maximum that maximizes entropy. One might do the same for the local maxima of any CE objective, though theoretical and experimental support for this idea remain for future work.</Paragraph>
</Section> </Paper>