<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-1043">
  <Title>Reranking and Self-Training for Parser Adaptation</Title>
  <Section position="3" start_page="0" end_page="337" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Modern statistical parsers require treebanks to train their parameters, but their performance declines when one parses genres more distant from the training data's domain. Furthermore, the tree-banks required to train said parsers are expensive and difficult to produce.</Paragraph>
    <Paragraph position="1"> Naturally, one of the goals of statistical parsing is to produce a broad-coverage parser which is relatively insensitive to textual domain. But the lack of corpora has led to a situation where much of the current work on parsing is performed on a single domain using training data from that domain -- the Wall Street Journal (WSJ) section of the Penn Treebank (Marcus et al., 1993). Given the aforementioned costs, it is unlikely that many significant treebanks will be created for new genres.</Paragraph>
    <Paragraph position="2"> Thus, parser adaptation attempts to leverage existing labeled data from one domain and create a parser capable of parsing a different domain.</Paragraph>
    <Paragraph position="3"> Unfortunately, the state of the art in parser portability (i.e. using a parser trained on one domain to parse a different domain) is not good. The &amp;quot;Charniak parser&amp;quot; has a labeled precision-recall f-measure of 89.7% on WSJ but a lowly 82.9% on the test set from the Brown corpus treebank.</Paragraph>
    <Paragraph position="4"> Furthermore, the treebanked Brown data is mostly general non-fiction and much closer to WSJ than, e.g., medical corpora would be. Thus, most work on parser adaptation resorts to using some labeled in-domain data to fortify the larger quantity of out-of-domain data.</Paragraph>
    <Paragraph position="5"> In this paper, we present some encouraging results on parser adaptation without any in-domain data. (Though we also present results with in-domain data as a reference point.) In particular we note the effects of two comparatively recent techniques for parser improvement.</Paragraph>
    <Paragraph position="6"> The first of these, parse-reranking (Collins, 2000; Charniak and Johnson, 2005) starts with a &amp;quot;standard&amp;quot; generative parser, but uses it to generate the n-best parses rather than a single parse. Then a reranking phase uses more detailed features, features which would (mostly) be impossible to incorporate in the initial phase, to reorder  the list and pick a possibly different best parse.</Paragraph>
    <Paragraph position="7"> At first blush one might think that gathering even more fine-grained features from a WSJ treebank would not help adaptation. However, we find that reranking improves the parsers performance from 82.9% to 85.2%.</Paragraph>
    <Paragraph position="8"> The second technique is self-training -- parsing unlabeled data and adding it to the training corpus. Recent work, (McClosky et al., 2006), has shown that adding many millions of words of machine parsed and reranked LA Times articles does, in fact, improve performance of the parser on the closely related WSJ data. Here we show that it also helps the father-afield Brown data. Adding it improves performance yet-again, this time from 85.2% to 87.8%, for a net error reduction of 28%. It is interesting to compare this to our results for a completely Brown trained system (i.e. one in which the first-phase parser is trained on just Brown training data, and the second-phase reranker is trained on Brown 50-best lists). This system performs at a 88.4% level -- only slightly higher than that achieved by our system with only WSJ data.</Paragraph>
  </Section>
class="xml-element"></Paper>