<?xml version="1.0" standalone="yes"?> <Paper uid="W06-2204"> <Title>Transductive Pattern Learning for Information Extraction</Title> <Section position="3" start_page="25" end_page="25" type="intro"> <SectionTitle> 2 Related work </SectionTitle> <Paragraph position="0"> A review of machine learning for information extraction is beyond the scope of this paper; see e.g. (Cardie, 1997; Kushmerick and Thomas, 2003).</Paragraph> <Paragraph position="1"> A number of researchers have previously developed bootstrapping or semi-supervised approaches to information extraction, named entity recognition, and related tasks (Riloff, 1996; Brin, 1998; Riloff and Jones, 1999; Agichtein et al., 2001; Yangarber et al., 2002; Stevenson and Greenwood, 2005; Etzioni et al., 2005).</Paragraph> <Paragraph position="2"> Several approaches for learning from both labeled and unlabeled data have been proposed (Yarowsky, 1995; Blum and Mitchell, 1998; Collins and Singer, 1999), in which the unlabeled data is utilised to boost the performance of the algorithm. Collins and Singer (1999) show that unlabeled data can be used to reduce the level of supervision required for named entity classification. However, their approach relies on the presence of redundancy in the named entities to be identified.</Paragraph> <Paragraph position="3"> TPLEX is most closely related to the NOMEN algorithm (Yangarber et al., 2002). NOMEN has a very simple iterative structure: at each step, a very small number of high-quality new fragments are extracted, which are treated in the next step as equivalent to seeds from the labeled documents. NOMEN has a number of parameters which must be carefully tuned to ensure that it does not over-generalise. Erroneous additions to the set of trusted fragments can lead to a snowballing of errors.</Paragraph> <Paragraph position="4"> Also, NOMEN uses a binary scoring mechanism, which works well in dense corpora with substantial redundancy. 
However, many information extraction tasks feature sparse corpora with little or no redundancy. We have extended NOMEN by allowing it to make finer-grained (as opposed to binary) scoring decisions at each iteration. Instead of definitively assigning a position to a given field, we calculate the likelihood that it belongs to the field over multiple iterations.</Paragraph> </Section> </Paper>