Contrastive Estimation: Training Log-Linear Models on Unlabeled Data*

*This work was supported by a Fannie and John Hertz Foundation fellowship to the first author and NSF ITR grant IIS-0313193 to the second author. The views expressed are not necessarily endorsed by the sponsors. The authors also thank three anonymous ACL reviewers for helpful comments, colleagues at JHU CLSP (especially David Smith and Roy Tromble) and Miles Osborne for insightful feedback, and Eric Goldlust and Markus Dreyer for Dyna language support.

1 Introduction

Finding linguistic structure in raw text is not easy. The classical forward-backward and inside-outside algorithms try to guide probabilistic models to discover structure in text, but they tend to get stuck in local maxima (Charniak, 1993). Even when they avoid local maxima (e.g., through clever initialization), they typically deviate from human ideas of what the "right" structure is (Merialdo, 1994).

One strategy is to incorporate domain knowledge into the model's structure. Instead of blind HMMs or PCFGs, one could use models whose features are crafted to pay attention to a range of domain-specific linguistic cues. Log-linear models can be so crafted and have already achieved excellent performance when trained on annotated data, where they are known as "maximum entropy" models (Ratnaparkhi et al., 1994; Rosenfeld, 1994).

Our goal is to learn log-linear models from unannotated data. Since the forward-backward and inside-outside algorithms are instances of Expectation-Maximization (EM) (Dempster et al., 1977), a natural approach is to construct EM algorithms that handle log-linear models. Riezler (1999) did so, then resorted to an approximation because the true objective function was hard to normalize.

Stepping back from EM, we may generally envision parameter estimation for probabilistic modeling as pushing probability mass toward the training examples. We must consider not only where the learner pushes the mass, but also from where the mass is taken. In this paper, we describe an alternative to EM: contrastive estimation (CE), which (unlike EM) explicitly states the source of the probability mass that is to be given to an example.¹ One reason is to make normalization efficient. Indeed, CE generalizes EM and other practical techniques used to train log-linear models, including conditional estimation (for the supervised case) and Riezler's approximation (for the unsupervised case).

The other reason to use CE is to improve accuracy. CE offers an additional way to inject domain knowledge into unsupervised learning (Smith and Eisner, 2005). CE hypothesizes that each positive example in training implies a domain-specific set of examples which are (for the most part) degraded (§2). This class of implicit negative evidence provides the source of probability mass for the observed example. We discuss the application of CE to log-linear models in §3.
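Schematically, for a log-linear model with feature vector f, weights θ, and hidden structure y (e.g., a tag sequence), an objective of this kind takes roughly the following form (a sketch only; the precise objective and the choice of neighborhood N(x_i) are taken up in §3 and §4):

\[
\mathcal{L}_{\mathrm{CE}}(\theta) \;=\; \sum_i \log
\frac{\sum_{y} \exp\!\big(\theta^{\top} f(x_i, y)\big)}
     {\sum_{x' \in N(x_i)} \sum_{y} \exp\!\big(\theta^{\top} f(x', y)\big)}
\]

Because the denominator ranges only over the neighborhood N(x_i) of each observed example, rather than over the entire space of possible examples, normalization can be made efficient; particular choices of neighborhood recover special cases such as conditional estimation.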
We are particularly interested in log-linear models over sequences, like the conditional random fields (CRFs) of Lafferty et al. (2001) and weighted CFGs (Miyao and Tsujii, 2002). For a given sequence, implicit negative evidence can be represented as a lattice derived by finite-state operations (§4). We demonstrate the effectiveness of the approach on POS tagging using unlabeled data (§5), discuss future work (§6), and conclude (§7).
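As a toy illustration of the kind of neighborhood involved (a hypothetical Python sketch, not the construction used in the paper, which builds these sets compactly as lattices via finite-state operations), consider the variants of a sentence obtained by swapping one adjacent pair of words:

# Hypothetical helper (for illustration only): the neighborhood of a sentence,
# taken here to be the sentence itself plus every variant produced by swapping
# one adjacent pair of words, so the observed example competes against its
# slightly degraded perturbations.
from typing import List

def adjacent_swap_neighborhood(sentence: List[str]) -> List[List[str]]:
    neighborhood = [list(sentence)]  # include the observed sentence itself
    for i in range(len(sentence) - 1):
        variant = list(sentence)
        variant[i], variant[i + 1] = variant[i + 1], variant[i]
        neighborhood.append(variant)
    return neighborhood

# Example: a 6-word sentence yields itself plus 5 degraded variants.
for variant in adjacent_swap_neighborhood("the cat sat on the mat".split()):
    print(" ".join(variant))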