<?xml version="1.0" standalone="yes"?> <Paper uid="P05-1046"> <Title>Unsupervised Learning of Field Segmentation Models for Information Extraction</Title> <Section position="2" start_page="0" end_page="371" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Information extraction is potentially one of the most useful applications enabled by current natural language processing technology. However, unlike general tools like parsers or taggers, which generalize reasonably beyond their training domains, extraction systems must be entirely retrained for each application. As an example, consider the task of turning a set of diverse classified advertisements into a queryable database; each type of ad would require tailored training data for a supervised system. Approaches which required little or no training data would therefore provide substantial resource savings and extend the practicality of extraction systems.</Paragraph> <Paragraph position="1"> The term information extraction was introduced in the MUC evaluations for the task of finding short pieces of relevant information within a broader text that is mainly irrelevant, and returning it in a structured form. For such &quot;nugget extraction&quot; tasks, the use of unsupervised learning methods is difficult and unlikely to be fully successful, in part because the nuggets of interest are determined only extrinsically by the needs of the user or task. However, the term information extraction was in time generalized to a related task that we distinguish as field segmentation. In this task, a document is regarded as a sequence of pertinent fields, and the goal is to segment the document into fields, and to label the fields. For example, bibliographic citations, such as the one in Figure 1(a), exhibit clear field structure, with fields such as author, title, and date. 
Classified advertisements, such as the one in Figure 1(b), also exhibit field structure, if less rigidly: an ad consists of descriptions of attributes of an item or offer, and ads for similar items share the same attributes. In these cases, the fields present a salient, intrinsic form of linguistic structure, and it is reasonable to hope that field segmentation models could be learned in an unsupervised fashion.</Paragraph> <Paragraph position="2"> In this paper, we investigate unsupervised learning of field segmentation models in two domains: bibliographic citations and classified advertisements for apartment rentals. General, unconstrained induction of HMMs using the EM algorithm fails to detect useful field structure in either domain. However, we demonstrate that small amounts of prior knowledge can be used to greatly improve the learned model. In both domains, we find that unsupervised methods trained on 400 unlabeled examples can attain accuracies comparable to those attained by supervised methods trained on 50 labeled examples, and that semi-supervised methods can make good use of small amounts of labeled data.</Paragraph> <Paragraph position="3"> Figure 1 (caption fragment): ... apartment rentals shown in (b) exhibit field structure. Contrast these with part-of-speech tagging in (c), which does not.</Paragraph> </Section> </Paper>