<?xml version="1.0" standalone="yes"?> <Paper uid="W95-0112"> <Title>Automatically Acquiring Conceptual Patterns Without an Annotated Corpus</Title> <Section position="2" start_page="0" end_page="148" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> In the last few years, significant progress has been made toward automatically acquiring conceptual patterns for information extraction (e.g., [Riloff, 1993; Kim and Moldovan, 1993]). However, previous approaches require an annotated training corpus or some other type of manually encoded training data. Annotated training corpora are expensive to build, in terms of both the time and the expertise required to create them. Furthermore, training corpora for information extraction are typically annotated with domain-specific tags, in contrast to general-purpose annotations such as part-of-speech tags or noun-phrase bracketing (e.g., the Brown Corpus [Francis and Kucera, 1982] and the Penn Treebank [Marcus et al., 1993]). Consequently, a new training corpus must be annotated for each domain.</Paragraph> <Paragraph position="1"> We have begun to explore the possibility of using an untagged corpus to automatically acquire conceptual patterns for information extraction. Our approach uses a combination of domain-independent linguistic rules and statistics. The linguistic rules are based on our previous system, AutoSlog [Riloff, 1993], which automatically constructs dictionaries for information extraction using an annotated training corpus. We have put a new spin on the original system by applying it exhaustively to an untagged but preclassified training corpus (i.e., a corpus in which the texts have been manually classified as either relevant or irrelevant). Statistics are then used to sift through the myriad patterns that it produces. 
The new system, AutoSlog-TS, can generate a conceptual dictionary of extraction patterns for a domain from a preclassified text corpus.</Paragraph> <Paragraph position="2"> First, we give a brief overview of information extraction and the CIRCUS sentence analyzer that we used in these experiments. Second, we describe the original AutoSlog system for automated dictionary construction and explain how AutoSlog was adapted to generate patterns from an untagged corpus. Next, we present empirical results from experiments with AutoSlog-TS using the MUC-4 text corpus.</Paragraph> </Section> </Paper>