<?xml version="1.0" standalone="yes"?> <Paper uid="N06-1020"> <Title>Effective Self-Training for Parsing</Title> <Section position="6" start_page="154" end_page="157" type="evalu"> <SectionTitle> 5 Analysis </SectionTitle> <Paragraph position="0"> We performed several types of analysis to better understand why the new model performs better. We first look at global changes, and then at changes at the sentence level.</Paragraph> <Section position="1" start_page="154" end_page="155" type="sub_section"> <SectionTitle> 5.1 Global Changes </SectionTitle> <Paragraph position="0"> It is important to keep in mind that while the reranker seems to be key to our performance improvement, the reranker per se never sees the extra data. It only sees the 50-best lists produced by the first-stage parser. Thus, the nature of the changes to this output is important.</Paragraph> <Paragraph position="1"> We have already noted that the first-stage parser's one-best has significantly improved (see Table 1). In Table 4, we see that the 50-best oracle rate also im- null by baseline, a small self-trained parser, and the &quot;best&quot; parser proves from 95.5% for the original first-stage parser, to 96.4% for our final model. We do not show it here, but if we self-train using first-stage one-best, there is no change in oracle rate.</Paragraph> <Paragraph position="2"> The first-stage parser also becomes more &quot;decisive.&quot; The average (geometric mean) of log2(Pr(1best) / Pr(50th-best)) (i.e. the ratios between the probabilities in log space) increases from 11.959 for the baseline parser, to 14.104 for the final parser. We have seen earlier that this &quot;confidence&quot; is deserved, as the first-stage one-best is so much better.</Paragraph> </Section> <Section position="2" start_page="155" end_page="157" type="sub_section"> <SectionTitle> 5.2 Sentence-level Analysis </SectionTitle> <Paragraph position="0"> To this point we have looked at bulk properties of the data fed to the reranker. It has higher one best and 50-best-oracle rates, and the probabilities are more skewed (the higher probabilities get higher, the lows get lower). We now look at sentence-level properties. In particular, we analyzed the parsers' behavior on 5,039 sentences in sections 1, 22 and 24 of the Penn treebank. Specifically, we classified each sentence into one of three classes: those where the self-trained parser's f-score increased relative to the baseline parser's f-score, those where the f-score remained the same, and those where the self-trained parser's f-score decreased relative to the baseline parser's f-score. We analyzed the distribution of sentences into these classes with respect to four factors: sentence length, the number of unknown words (i.e., words not appearing in sections 2-21 of the Penn treebank) in the sentence, the number of coordinating conjunctions (CC) in the sentence, and the number of prepositions (IN) in the sentence. The distributions of classes (better, worse, no change) with respect to each of these factors individually are graphed in Figures 2 to 5.</Paragraph> <Paragraph position="1"> Figure 2 shows how the self-training affects f-score as a function of sentence length. The top line as a function of sentence length shows that the f-score of most sentences remain unchanged. The middle line is the number of sentences that improved their f-score, and the bottom are those which got worse. So, for example, for sentences of length 30, about 80 were unchanged, 25 improved, and 22 worsened. 
It seems clear that there is no improvement for either very short sentences or for very long ones. (For long ones the graph is hard to read; the regression analysis later in this section confirms the statement.) While we did not predict this effect, in retrospect it seems reasonable: the parser was already doing very well on short sentences, the very long ones are hopeless, and the middle ones are just right. We call this the Goldilocks effect.</Paragraph>
<Paragraph position="2"> As for the other three graphs, their stories are by no means clear. Figure 3 seems to indicate that the number of unknown words in the sentence does not predict whether the reranker will help. Figure 4 might indicate that the self-trained parser improves prepositional-phrase attachment, but the graph looks suspiciously like that for sentence length, so the improvements might just be due to the Goldilocks effect. Finally, the improvement in Figure 5 is hard to judge.</Paragraph>
<Paragraph position="3"> To get a better handle on these effects we did a factor analysis. The factors we consider are the number of CCs, INs, and unknown words, plus sentence length.</Paragraph>
<Paragraph position="4"> As Figure 2 makes clear, the relative performance of the self-trained and baseline parsers does not vary linearly with sentence length, so we introduced binned sentence length (with each bin of width 10) as a factor.</Paragraph>
<Paragraph position="5"> Because the self-trained and baseline parsers produced equivalent output on 3,346 (66%) of the sentences, we restricted attention to the 1,693 sentences on which the self-trained and baseline parsers' f-scores differ. We asked the program to consider the following factors: binned sentence length, number of PPs, number of unknown words, and number of CCs. The results are shown in Table 5. The factor analysis models the log odds (here, of the self-trained parser doing better rather than worse) as a sum of linearly weighted factors. The first column of Table 5 gives the factor, the second the change in the log-odds resulting from this factor being present (in the case of CCs and INs, multiplied by the number of them), and the last column the probability that this factor is really non-zero.</Paragraph>
<Paragraph position="6"> Note that there is no row for either PPs or unknown words. This is because we also asked the program to do a model search using the Akaike Information Criterion (AIC) over all single and pairwise factors. The model it chooses predicts that the self-trained parser is likely to produce a better parse than the baseline only for sentences of length 20-40 or for sentences containing several CCs. The number of unknown words and the number of INs did not receive weights significantly different from zero, and the AIC model search dropped them as factors from the model.</Paragraph>
<Paragraph position="7"> In other words, the self-trained parser is more likely to be correct for sentences of length 20-40, and more likely to be correct as the number of CCs in the sentence increases. It does not improve prepositional-phrase attachment or the handling of unknown words.</Paragraph>
<Paragraph position="8"> This result is mildly perplexing. It is fair to say that neither we, nor anyone we talked to, thought conjunction handling would be improved. Conjunctions are among the hardest things in parsing, and we have no grip on exactly what it takes to help parse them.
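For concreteness, the factor analysis described above can be set up as a logistic regression. The sketch below uses pandas and statsmodels, though the paper does not name the program it used; the file name and column names are hypothetical, and the AIC comparison is simplified to a few nested formulas rather than the paper's full search over all single and pairwise factors.

```python
import pandas as pd
import statsmodels.formula.api as smf

# One row per sentence on which the two parsers' f-scores differ
# (1,693 sentences here). "better" is 1 if the self-trained parser
# won, 0 if the baseline won. File and column names are hypothetical.
df = pd.read_csv("sentence_factors.csv")
df["len_bin"] = (df["length"] // 10) * 10   # binned sentence length, width 10

# Candidate models over the four factors, compared by AIC.
formulas = [
    "better ~ C(len_bin)",
    "better ~ C(len_bin) + num_cc",
    "better ~ C(len_bin) + num_cc + num_in + num_unknown",
]
fits = {f: smf.logit(f, data=df).fit(disp=0) for f in formulas}
best = min(fits, key=lambda f: fits[f].aic)

# Coefficients are changes in log-odds per factor (the second column
# of Table 5); p-values indicate whether they differ from zero.
print(best)
print(fits[best].params)
print(fits[best].pvalues)
```

On the paper's data, a search of this kind kept binned length and the CC count as factors and dropped the IN and unknown-word counts.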
Conversely, everyone expected improvements on unknown words, as the self-training should drastically reduce the number of them. It is also the case that we thought PP attachment might be improved because of the increased coverage of preposition-noun and preposition-verb combinations that work such as (Hindle and Rooth, 1993) shows to be so important.
Currently, our best conjecture is that unknowns are not improved because the words that are unknown in the WSJ are not significantly represented in the LA Times data we used for self-training. CCs are difficult for parsers because each conjunct has only one secure boundary. This is particularly the case with longer conjunctions, those of VPs and Ss.</Paragraph>
<Paragraph position="9"> One thing we know is that self-training always improves the performance of the parsing model when it is used as a language model. We think the CC improvement is connected with this fact and with our earlier point that the probabilities of the 50-best parses are becoming more skewed. In essence, the model is learning, in general, what VPs and Ss look like, so it is becoming easier to pull them out of the stew surrounding the conjunct. Conversely, language modeling has comparatively less reason to help PP attachment: as long as the parser does it consistently, attaching the PP either way will work almost as well.</Paragraph> </Section> </Section> </Paper>