<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-0206">
  <Title>Sydney, July 2006. ©2006 Association for Computational Linguistics. Data Selection in Semi-supervised Learning for Name Tagging</Title>
  <Section position="4" start_page="0" end_page="48" type="intro">
    <SectionTitle>
2 Prior Work
</SectionTitle>
    <Paragraph position="0"> The work presented here extends a substantial body of previous work (Blum and Mitchell, 1998; Riloff and Jones, 1999; Ando and Zhang, 2005), all of which focuses on reducing annotation requirements. For the specific task of named entity annotation, some researchers have emphasized the creation of taggers from minimal seed sets (Strzalkowski and Wang, 1996; Collins and Singer, 1999; Lin et al., 2003), while another line of inquiry (which we are pursuing) has sought to improve on high-performance baseline taggers (Miller et al., 2004).</Paragraph>
    <Paragraph position="1"> Banko and Brill (2001) suggested that developing very large training corpora may be the most effective route to progress in empirical natural language processing. Their experiments show a logarithmic trend in performance as corpus size increases, with performance not reaching an upper bound. Recent work has replicated their findings on thesaurus extraction (Curran and Moens, 2002) and is-a relation extraction (Ravichandran et al., 2004), showing that collecting data over a very large corpus significantly improves system performance. However, Curran (2002) and Curran and Osborne (2002) claimed that the choice of statistical model is more important than relying upon large corpora.</Paragraph>
  </Section>
</Paper>