<?xml version="1.0" standalone="yes"?> <Paper uid="N06-1020"> <Title>Effective Self-Training for Parsing</Title> <Section position="3" start_page="0" end_page="152" type="metho"> <SectionTitle> 2 Previous work </SectionTitle> <Paragraph position="0"> A simple method of incorporating unlabeled data into a new model is self-training. In self-training, the existing model first labels the unlabeled data. The newly labeled data is then treated as truth and combined with the actual labeled data to train a new model. This process can be iterated over different sets of unlabeled data if desired. It is not surprising that self-training is not normally effective: Charniak (1997) and Steedman et al. (2003) report either minor improvements or significant damage from using self-training for parsing. Clark et al. (2003) apply self-training to POS-tagging and report the same outcomes. One would assume that errors in the original model would be amplified in the new model.</Paragraph> <Paragraph position="1"> Parser adaptation can be framed as a semi-supervised or unsupervised learning problem. In parser adaptation, one is given annotated training data from a source domain and unannotated data from a target domain. In some cases, some annotated data from the target domain is available as well. The goal is to use the various data sets to produce a model that accurately parses the target domain data despite seeing little or no annotated data from that domain.</Paragraph> <Paragraph position="2"> Gildea (2001) and Bacchiani et al. (2006) show that out-of-domain training data can improve parsing accuracy. The unsupervised adaptation experiment by Bacchiani et al. (2006) is the only successful instance of parsing self-training that we have found.</Paragraph> <Paragraph position="3"> Our work differs in that all of our data is in-domain, while Bacchiani et al. use the Brown corpus as labeled data; these correspond to different scenarios. Additionally, we explore the use of a reranker.</Paragraph> <Paragraph position="4"> Co-training is another way to train models from unlabeled data (Blum and Mitchell, 1998). Unlike self-training, co-training requires multiple learners, each with a different "view" of the data. When one learner is confident of its predictions about the data, its predicted labels are added to the training sets of the other learners. A variation suggested by Dasgupta et al. (2001) is to add data to the training set only when multiple learners agree on the label. In that case, we can be more confident that the data was labeled correctly than if only one learner had labeled it.</Paragraph> <Paragraph position="5"> Sarkar (2001) and Steedman et al. (2003) investigated using co-training for parsing. These studies suggest that this type of co-training is most effective when only small amounts of labeled training data are available. Additionally, co-training for parsing can be helpful for parser adaptation.</Paragraph> </Section>
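To make the self-training procedure described above concrete, here is a minimal sketch of the basic loop in Python. The train and parse callables are hypothetical stand-ins for a parser's training and decoding routines; they are not part of any particular toolkit.

    # Minimal self-training sketch. `train` maps a list of (sentence, tree)
    # pairs to a model; `parse` maps (model, sentence) to a predicted tree.
    # Both are caller-supplied placeholders, not a real parser API.
    def self_train(train, parse, labeled_data, unlabeled_sentences, rounds=1):
        model = train(labeled_data)  # initial model from gold annotations only
        for _ in range(rounds):
            # Label the unlabeled sentences with the current model and treat
            # the predictions as if they were gold trees.
            auto_labeled = [(s, parse(model, s)) for s in unlabeled_sentences]
            model = train(labeled_data + auto_labeled)
        return model

In the experiments reported below, the unlabeled data is labeled in a single pass rather than iteratively and, crucially, with the two-stage reranking parser rather than the first-stage parser alone.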
<Section position="4" start_page="152" end_page="153" type="metho"> <SectionTitle> 3 Experimental Setup </SectionTitle> <Paragraph position="0"> Our parsing model consists of two phases. First, we use a generative parser to produce a list of the top n parses. Next, a discriminative reranker reorders the n-best list. These components constitute two views of the data, though the reranker's view is restricted to the parses suggested by the first-stage parser. The reranker is not able to suggest new parses and, moreover, uses the probability of each parse tree according to the parser as a feature to perform the reranking. Nevertheless, the reranker's value comes from its ability to make use of more powerful features.</Paragraph> <Section position="1" start_page="152" end_page="152" type="sub_section"> <SectionTitle> 3.1 The first-stage 50-best parser </SectionTitle> <Paragraph position="0"> The first stage of our parser is the lexicalized probabilistic context-free parser described in (Charniak, 2000) and (Charniak and Johnson, 2005). The parser's grammar is a smoothed third-order Markov grammar, enhanced with lexical heads, their parts of speech, and parent and grandparent information. The parser uses five probability distributions, one each for heads, their parts-of-speech, head constituents, left-of-head constituents, and right-of-head constituents. As all distributions are conditioned on five or more features, they are all heavily backed off using Chen back-off (the average-count method from Chen and Goodman (1996)). Also, the statistics are lightly pruned to remove those that are statistically less reliable or useful. As in (Charniak and Johnson, 2005), the parser has been modified to produce n-best parses. However, the n-best parsing algorithm described in that paper has been replaced by the much more efficient algorithm described in (Jimenez and Marzal, 2000; Huang and Chiang, 2005).</Paragraph> </Section> <Section position="2" start_page="152" end_page="152" type="sub_section"> <SectionTitle> 3.2 The MaxEnt Reranker </SectionTitle> <Paragraph position="0"> The second stage of our parser is a Maximum Entropy reranker, as described in (Charniak and Johnson, 2005). The reranker takes the 50-best parses for each sentence produced by the first-stage 50-best parser and selects the best parse from those 50 parses. It does this using the reranking methodology described in Collins (2000), with a Maximum Entropy model with Gaussian regularization as described in Johnson et al. (1999). Our reranker classifies each parse with respect to 1,333,519 features (most of which occur on only a few parses).</Paragraph> <Paragraph position="1"> The features consist of those described in (Charniak and Johnson, 2005), together with an additional 601,577 features. These features consist of the parts-of-speech, possibly together with the words, that surround (i.e., precede or follow) the left and right edges of each constituent. The features actually used consist of all singletons and pairs of such features that have different values for at least one of the best and non-best parses of at least 5 sentences in the training data. There are 147,456 such features involving only parts-of-speech and 454,101 features involving parts-of-speech and words. These additional features are largely responsible for improving the reranker's performance on section 23 to 91.3% f-score (Charniak and Johnson (2005) reported an f-score of 91.0% on section 23).</Paragraph> </Section>
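As an illustration of the selection step the reranker performs, the following sketch scores each candidate in an n-best list with a linear (MaxEnt-style) model and returns the highest-scoring tree. The feature names, the weight dictionary, and the treatment of the parser's log probability as a single feature are simplified stand-ins for the roughly 1.3 million sparse features and the regularized training procedure described above.

    from typing import Any, Dict, List, Tuple

    # A candidate is (tree, parser_log_prob, sparse_feature_vector).
    Candidate = Tuple[Any, float, Dict[str, float]]

    def rerank(nbest: List[Candidate], weights: Dict[str, float]) -> Any:
        """Return the tree with the highest linear score under the reranker.

        The first-stage parser's log probability is itself treated as one
        feature ("parser_log_prob"), mirroring the description above.
        """
        def score(cand: Candidate) -> float:
            tree, log_prob, feats = cand
            s = weights.get("parser_log_prob", 0.0) * log_prob
            s += sum(weights.get(name, 0.0) * value for name, value in feats.items())
            return s
        return max(nbest, key=score)[0]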
<Section position="3" start_page="152" end_page="153" type="sub_section"> <SectionTitle> 3.3 Corpora </SectionTitle> <Paragraph position="0"> Our labeled data comes from the Penn Treebank (Marcus et al., 1993) and consists of about 40,000 sentences from Wall Street Journal (WSJ) articles annotated with syntactic information. We use the standard divisions: sections 2 through 21 are used for training, section 24 is held-out development, and section 23 is used for final testing. Our unlabeled data is the North American News Text corpus, NANC (Graff, 1995), which consists of approximately 24 million unlabeled sentences from various news sources. NANC contains no syntactic information. Sentence boundaries in NANC are induced by a simple discriminative model. We also perform some basic cleanups on NANC to ease parsing. NANC contains news articles from various news sources including the Wall Street Journal, though for this paper we only use articles from the LA Times.</Paragraph> </Section> </Section> <Section position="5" start_page="153" end_page="154" type="metho"> <SectionTitle> 4 Experimental Results </SectionTitle> <Paragraph position="0"> We use the reranking parser to produce 50-best parses of unlabeled news articles from NANC. Next, we produce two sets of one-best lists from these 50-best lists. The parser-best and reranker-best lists represent the best parse for each sentence according to the parser and reranker, respectively. Finally, we mix a portion of the parser-best or reranker-best lists with the standard Wall Street Journal training data (sections 2-21) to retrain a new parser (but not reranker) model. The Wall Street Journal training data is combined with the NANC data in the following way: the count of each parsing event is the (optionally weighted) sum of the counts of that event in the Wall Street Journal and NANC data. Bacchiani et al. (2006) show that count merging is more effective than creating multiple models and calculating weights for each model (model interpolation). Intuitively, this corresponds to concatenating our training sets, possibly with multiple copies of each to account for weighting.</Paragraph>
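A minimal sketch of this count-merging scheme follows, under the assumption that parsing events can be represented as arbitrary hashable objects; the function and variable names are illustrative, not the parser's actual internals.

    from collections import Counter
    from typing import Hashable, Iterable

    def merge_counts(wsj_events: Iterable[Hashable],
                     nanc_events: Iterable[Hashable],
                     wsj_weight: int = 1) -> Counter:
        """Sum event counts from the two corpora, optionally up-weighting WSJ.

        A weight of 1 corresponds to simply concatenating the training sets;
        a weight of k corresponds to concatenating k copies of the WSJ data.
        """
        merged = Counter()
        for event, count in Counter(wsj_events).items():
            merged[event] += wsj_weight * count
        merged.update(Counter(nanc_events))
        return merged

A WSJ weight of one corresponds to plain concatenation of the training sets; the relative weighting is revisited below, where a weight of 5 is ultimately chosen.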
<Paragraph position="1"> Some notes regarding evaluations: all numbers reported are f-scores. In some cases, we evaluate only the parser's performance to isolate it from the reranker. In other cases, we evaluate the reranking parser as a whole; in these cases, we will use the term reranking parser.</Paragraph> <Paragraph position="2"> Table 1 shows the difference in the parser's (not the reranker's) performance when trained on parser-best output versus reranker-best output, i.e., after adding parser-best or reranker-best sentences from NANC to the WSJ training data. While the reranker was used to produce the reranker-best sentences, we performed this evaluation using only the first-stage parser to parse all sentences from section 22. We did not train a model where we added 2,000k parser-best sentences.</Paragraph> <Paragraph position="3"> Adding parser-best sentences recreates previous self-training experiments and confirms that it is not beneficial.</Paragraph> <Paragraph position="4"> However, we see a large improvement from adding reranker-best sentences. One may expect to see a monotonic improvement from this technique, but this is not quite the case, as seen when we add 1,000k sentences. This may be due to some sections of NANC being less similar to WSJ or containing more noise. Another possibility is that these sections contain harder sentences which we cannot parse as accurately and which are thus not as useful for self-training. For our remaining experiments, we will only use reranker-best lists.</Paragraph> <Paragraph position="5"> We also attempt to discover the optimal number of sentences to add from NANC. Much of the improvement comes from the addition of the initial 50,000 sentences, showing that even small amounts of new data can have a significant effect. As we add more data, it becomes clear that the maximum benefit to parsing accuracy from strictly adding reranker-best sentences is about 0.7% and that f-scores asymptote around 91.0%. We will return to this when we consider the relative weightings of WSJ and NANC data.</Paragraph> <Paragraph position="6"> One hypothesis we consider is that the reranked NANC data incorporated some of the features from the reranker. If this were the case, we would not see an improvement when evaluating a reranking parser on the same models. In Table 2, which reports f-scores of the reranking parser on all sentences after adding reranked sentences from NANC to WSJ training, we see that the new NANC data contains some information orthogonal to the reranker and improves parsing accuracy of the reranking parser.</Paragraph> <Paragraph position="9"> Up to this point, we have only considered giving our true training data a relative weight of one. Increasing the weight of the Wall Street Journal data should improve, or at least not hurt, parsing performance. Indeed, this is the case for both the parser (figure not shown) and the reranking parser (Figure 1).</Paragraph> <Paragraph position="10"> Adding more weight to the Wall Street Journal data ensures that the counts of our events will be closer to our more accurate data source while still incorporating new data from NANC. While it appears that the performance still levels off after adding about one million sentences from NANC, the curves corresponding to higher WSJ weights achieve a higher asymptote. Looking at the performance of various weights across sections 1, 22, and 24, we decided that the best combination of training data is to give WSJ a relative weight of 5 and use the first 1,750k reranker-best sentences from NANC.</Paragraph> <Paragraph position="11"> Finally, we evaluate our new model on the test section of the Wall Street Journal. In Table 3, we note that the baseline system (i.e., the parser and reranker trained purely on Wall Street Journal data) has improved by 0.3% over Charniak and Johnson (2005). The 92.1% f-score is the 1.1% absolute improvement mentioned in the abstract. The improvement from self-training is significant in both macro and micro evaluations. In Table 3, fparser and freranker are the evaluation of the parser and the reranking parser on all sentences, respectively, and "WSJ + NANC" represents the system trained on WSJ training data (with a relative weight of 5) and 1,750k sentences from the reranker-best list of NANC.</Paragraph> </Section> </Paper>