<?xml version="1.0" standalone="yes"?>
<Paper uid="P98-1067">
<Title>Toward General-Purpose Learning for Information Extraction</Title>
<Section position="3" start_page="404" end_page="404" type="intro">
<SectionTitle>2 SRV</SectionTitle>
<Paragraph position="0">To be suitable for the widest possible variety of textual domains, including collections made up of informal E-mail messages, World Wide Web pages, or netnews posts, a learner must avoid any assumptions about document structure that might be invalidated by new domains. It is not safe to assume, for example, that text will be grammatical, or that all tokens encountered will have entries in a lexicon available to the system. Fundamentally, a document is simply a sequence of terms. Beyond this, it becomes difficult to make assumptions that are not violated by some common and important domain of interest.</Paragraph>
<Paragraph position="1">At the same time, however, when structural assumptions are justified, they may be critical to the success of the system. It should therefore be possible to make structural information available to the learner as input for training. The machine learning method with which we experiment here, SRV, was designed with these considerations in mind. In experiments reported elsewhere, we have applied SRV to collections of electronic seminar announcements and World Wide Web pages (Freitag, 1998), which also gives a more thorough description of the system. Here, we list its most salient characteristics:

* Lack of structural assumptions. SRV assumes nothing about the structure of a field instance [1] or the text in which it is embedded--only that an instance is an unbroken fragment of text. During learning and prediction, SRV inspects every fragment of appropriate size.</Paragraph>
<Paragraph position="2">* Token-oriented features. Learning is guided by a feature set that is separate from the core algorithm. Features describe aspects of individual tokens, such as capitalized, numeric, or noun. Rules can posit feature values for individual tokens, or for all tokens in a fragment, and can constrain the ordering and positioning of tokens.</Paragraph>
<Paragraph position="3">[1] In a newswire article about a corporate acquisition, for example, a field instance might be the text fragment listing the amount paid as part of the deal.</Paragraph>
<Paragraph position="4">* Relational features. The feature set also includes a notion of relational features, such as next-token, which map a given token to another token in its environment. SRV uses such features to explore the context of fragments under investigation.</Paragraph>
<Paragraph position="5">* Top-down greedy rule search. SRV constructs rules from general to specific, as in FOIL (Quinlan, 1990). Top-down search is more sensitive to patterns in the data, and less dependent on heuristics, than the bottom-up search used by similar systems (Soderland, 1996; Califf and Mooney, 1997).</Paragraph>
<Paragraph position="6">* Rule validation. Training is followed by validation, in which individual rules are tested on a reserved portion of the training documents. Statistics collected in this way are used to associate a confidence with each prediction; these confidences can then be used to manage the accuracy-coverage trade-off.</Paragraph>
</Section>
</Paper>
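
To make the token-oriented and relational features described above concrete, here is a minimal sketch. It is not SRV's implementation: the feature names capitalized, numeric, and next_token echo the paper's examples, but the toy rule, the sample sentence, and the fragment enumeration are invented purely for illustration.

```python
# Minimal sketch (not SRV itself) of token-oriented and relational features.
# Feature names follow the paper's examples; the rule format is hypothetical.

def capitalized(token):
    # Another per-token feature from the paper's examples (unused by the toy rule).
    return token[:1].isupper()

def numeric(token):
    return token.isdigit()

def next_token(tokens, i):
    """Relational feature: map token i to the token that follows it."""
    return tokens[i + 1] if i + 1 < len(tokens) else None

# A toy "rule": the fragment is exactly one token long, that token is numeric,
# and the token immediately after the fragment is "p.m." (a context constraint
# reached through the relational next_token feature).
def toy_rule(tokens, start, length):
    if length != 1:
        return False
    if not numeric(tokens[start]):
        return False
    return next_token(tokens, start) == "p.m."

doc = "The seminar starts at 3 p.m. in Wean Hall".split()

# During learning and prediction, SRV inspects every fragment of appropriate
# size; here we simply enumerate single-token fragments and apply the toy rule.
matches = [
    " ".join(doc[i:i + 1])
    for i in range(len(doc))
    if toy_rule(doc, i, 1)
]
print(matches)  # ['3']
```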
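
The top-down, general-to-specific direction of the rule search can be miniaturized in the same spirit. The sketch below is neither FOIL nor SRV's search; the feature tests, the toy labeled tokens, and the precision-based selection criterion are all assumptions standing in for the gain-driven literal selection the real systems use. It shows only the core idea: begin with a rule that covers everything and greedily add the test that most sharpens it.

```python
# Sketch (hypothetical tests and data) of general-to-specific greedy rule growth:
# start from the empty, maximally general rule and repeatedly add the single
# feature test that most improves precision over labeled one-token fragments.

def capitalized(tok): return tok[:1].isupper()
def numeric(tok): return tok.isdigit()
def alphabetic(tok): return tok.isalpha()

TESTS = {"capitalized": capitalized, "numeric": numeric, "alphabetic": alphabetic}

# Toy training data: (token, is_positive). Invented for illustration only.
examples = [("3", True), ("4", True), ("Wean", False),
            ("at", False), ("1998", True), ("Hall", False)]

def precision(rule, data):
    """Precision and coverage of a rule (a list of test names) on the data."""
    covered = [(tok, pos) for tok, pos in data if all(TESTS[t](tok) for t in rule)]
    if not covered:
        return 0.0, 0
    return sum(pos for _, pos in covered) / len(covered), len(covered)

rule = []  # the empty rule covers every example
best_prec, _ = precision(rule, examples)
improved = True
while improved:
    improved = False
    for name in TESTS:
        if name in rule:
            continue
        prec, cov = precision(rule + [name], examples)
        if cov > 0 and prec > best_prec:
            best_prec, best_test, improved = prec, name, True
    if improved:
        rule.append(best_test)

print(rule, best_prec)  # ['numeric'] 1.0
```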
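
Finally, the accuracy-coverage trade-off produced by rule validation can be sketched with invented numbers: each prediction carries a confidence, estimated in SRV's case from its rule's performance on a reserved portion of the training documents, and raising a confidence threshold trades coverage for accuracy. The Prediction class, the figures, and the thresholds below are hypothetical.

```python
# Sketch (hypothetical data) of a confidence-based accuracy-coverage trade-off.
# In SRV, a rule's confidence comes from its statistics on held-out training documents.

from dataclasses import dataclass

@dataclass
class Prediction:
    text: str
    confidence: float  # estimated from the rule's validation statistics
    correct: bool      # known here only because this is a toy evaluation

predictions = [
    Prediction("3 p.m.", 0.95, True),
    Prediction("Wean Hall", 0.90, True),
    Prediction("seminar", 0.55, False),
    Prediction("4 p.m.", 0.80, True),
    Prediction("starts at", 0.40, False),
]

def accuracy_and_coverage(preds, threshold):
    """Keep only predictions at or above the confidence threshold."""
    kept = [p for p in preds if p.confidence >= threshold]
    if not kept:
        return None, 0.0
    accuracy = sum(p.correct for p in kept) / len(kept)
    coverage = len(kept) / len(preds)
    return accuracy, coverage

for threshold in (0.0, 0.6, 0.9):
    acc, cov = accuracy_and_coverage(predictions, threshold)
    print(f"threshold={threshold:.1f}  accuracy={acc:.2f}  coverage={cov:.2f}")
```

Sweeping the threshold in this way traces out the kind of accuracy-coverage curve that the validation statistics make available.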