<?xml version="1.0" standalone="yes"?> <Paper uid="N03-1027"> <Title>Supervised and unsupervised PCFG adaptation to novel domains</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> A fundamental concern for nearly all data-driven approaches to language processing is the sparsity of labeled training data. The sparsity of syntactically annotated corpora is widely remarked upon, and some recent papers present approaches to improving performance in the absence of large amounts of annotated training data.</Paragraph> <Paragraph position="1"> Johnson and Riezler (2000) looked at adding features to a maximum entropy model for stochastic unification-based grammars (SUBG), from corpora that are not annotated with the SUBG, but rather with simpler treebank annotations for which there are much larger treebanks. Hwa (2001) demonstrated how active learning techniques can reduce the amount of annotated data required to converge on the best performance, by selecting from among the candidate strings to be annotated in ways which promote more informative examples for earlier annotation. Hwa (1999) and Gildea (2001) looked at adapting parsing models trained on large amounts of annotated data from outside of the domain of interest (out-of-domain), through the use of a relatively small amount of in-domain annotated data. Hwa (1999) used a variant of the inside-outside algorithm presented in Pereira and Schabes (1992) to exploit a partially labeled out-of-domain treebank, and found an advantage to adaptation over direct grammar induction. Gildea (2001) simply added the out-of-domain treebank to his in-domain training data, and derived a very small benefit for his high accuracy, lexicalized parser, concluding that even a large amount of out-of-domain data is of little use for lexicalized parsing.</Paragraph> <Paragraph position="2"> Statistical model adaptation based on sparse in-domain data, however, is neither a new problem nor unique to parsing. It has been studied extensively by researchers working on acoustic modeling for automatic speech recognition (ASR) (Legetter and Woodland, 1995; Gauvain and Lee, 1994; Gales, 1998; Lamel et al., 2002). One of the methods that has received much attention in the ASR literature is maximum a posteriori (MAP) estimation (Gauvain and Lee, 1994). In MAP estimation, the parameters of the model are considered to be random variables themselves with a known distribution (the prior). The prior distribution and the maximum likelihood distribution based on the in-domain observations then give a posterior distribution over the parameters, from which the mode is selected. If the amount of in-domain (adaptation) data is large, the mode of the posterior distribution is mostly defined by the adaptation sample; if the amount of adaptation data is small, the mode will nearly coincide with the mode of the prior distribution. The intuition behind MAP estimation is that once there are sufficient observations, the prior model need no longer be relied upon.</Paragraph> <Paragraph position="3"> Bacchiani and Roark (2003) investigated MAP adaptation of n-gram language models, in a way that is straight-forwardly applicable to probabilistic context-free grammars (PCFGs). 
<Paragraph position="3"> Bacchiani and Roark (2003) investigated MAP adaptation of n-gram language models, in a way that is straightforwardly applicable to probabilistic context-free grammars (PCFGs). Indeed, this approach can be used for any generative probabilistic model, such as part-of-speech taggers.</Paragraph>
<Paragraph position="4"> In their language modeling approach, in-domain counts are mixed with the out-of-domain model, so that, when the number of in-domain observations is small, the out-of-domain model is relied upon, whereas when the number of in-domain observations is large, the model moves toward a maximum likelihood (ML) estimate on the in-domain data alone. The case of a parsing model trained via relative frequency estimation is identical: in-domain counts can be combined with the out-of-domain model in just such a way. We will show below that weighted count merging is a special case of MAP adaptation; hence the approach of Gildea (2001) cited above is also a special case of MAP adaptation, with a particular parameterization of the prior.</Paragraph>
<Paragraph position="5"> This parameterization is not necessarily the one that optimizes performance.</Paragraph>
<Paragraph position="6"> In the next section, MAP estimation for PCFGs is presented. This is followed by a brief presentation of the PCFG model that is being learned, and of the parser that is used for the empirical trials. We will present empirical results for multiple MAP adaptation schemata, both starting from the Penn Wall St. Journal treebank and adapting to the Brown corpus, and vice versa. We will compare our supervised adaptation performance with the results presented in Gildea (2001). In addition to supervised adaptation, i.e. with a manually annotated treebank, we will present results for unsupervised adaptation, i.e. with an automatically annotated treebank. We investigate a number of unsupervised approaches, including multiple iterations, increased sample sizes, and self-adaptation.</Paragraph>
</Section>
</Paper>
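[Editor's note: the following is a minimal, hypothetical Python sketch of the weighted count merging described in the introduction, applied to relative-frequency PCFG estimation; it is not the paper's implementation. The function name, data layout, and the value of the mixing weight beta are illustrative assumptions; in the MAP view, choosing beta corresponds to choosing a particular parameterization of the prior.]

```python
from collections import defaultdict

def count_merged_pcfg(in_counts, out_counts, beta=0.2):
    """Relative-frequency PCFG estimate from weighted merged counts.

    in_counts, out_counts: dicts mapping (lhs, rhs) rule pairs to counts,
    where rhs is a tuple of symbols. beta scales the out-of-domain counts:
    beta -> 0 recovers the in-domain ML estimate, while a large beta keeps
    the estimate close to the out-of-domain model.
    """
    merged = defaultdict(float)
    for rule, count in out_counts.items():
        merged[rule] += beta * count
    for rule, count in in_counts.items():
        merged[rule] += count

    # Normalize per left-hand-side nonterminal to obtain rule probabilities.
    lhs_totals = defaultdict(float)
    for (lhs, _), count in merged.items():
        lhs_totals[lhs] += count
    return {rule: count / lhs_totals[rule[0]] for rule, count in merged.items()}


# Toy example: the sparse in-domain sample pulls P(NP -> PRP | NP) upward,
# but the down-weighted out-of-domain counts still dominate the estimate.
out_counts = {("NP", ("DT", "NN")): 9000.0, ("NP", ("PRP",)): 1000.0}
in_counts = {("NP", ("DT", "NN")): 20.0, ("NP", ("PRP",)): 30.0}
print(count_merged_pcfg(in_counts, out_counts, beta=0.1))
```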