<?xml version="1.0" standalone="yes"?> <Paper uid="C02-1070"> <Title>Inducing Information Extraction Systems for New Languages via Cross-Language Projection</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 2 Information Extraction </SectionTitle> <Paragraph position="0"> The goal of information extraction systems is to identify and extract facts from natural language text.</Paragraph> <Paragraph position="1"> IE systems are usually designed for a specific domain, and the types of facts to be extracted are defined in advance. In this paper, we will focus on the domain of plane crashes and will try to extract descriptions of the vehicle involved in the crash, victims of the crash, and the location of the crash.</Paragraph> <Paragraph position="2"> Most IE systems use some form of extraction patterns to recognize and extract relevant information. Many techniques have been developed to generate extraction patterns for a new domain automatically, including PALKA (Kim & Moldovan, 1993), AutoSlog (Riloff, 1993), CRYSTAL (Soderland et al., 1995), RAPIER (Califf, 1998), SRV (Freitag, 1998), meta-bootstrapping (Riloff & Jones, 1999), and ExDisco (Yangarber et al., 2000). For this work, we will use AutoSlog-TS (Riloff, 1996b) to generate IE patterns for the plane crash domain.</Paragraph> <Paragraph position="3"> AutoSlog-TS is a derivative of AutoSlog that automatically generates extraction patterns by gathering statistics from a corpus of relevant texts (within the domain) and irrelevant texts (outside the domain).</Paragraph> <Paragraph position="4"> Each extraction pattern represents a linguistic expression that can extract noun phrases from one of three syntactic positions: subject, direct object, or object of a prepositional phrase. 
For example, the following patterns could extract vehicles involved in a plane crash: &quot;&lt;subject&gt; crashed&quot;, &quot;hijacked &lt;direct-object&gt;&quot;, and &quot;wreckage of &lt;np&gt;&quot;.</Paragraph> <Paragraph position="5"> We trained AutoSlog-TS using AP news stories about plane crashes as the relevant texts and AP news stories that do not mention plane crashes as the irrelevant texts. AutoSlog-TS generates a list of extraction patterns, ranked according to their association with the domain. A human must review this list to decide which patterns are useful for the IE task and which ones are not. We manually reviewed the top patterns and used the accepted patterns for the experiments described in this paper. To apply the extraction patterns to new text, we used a shallow parser called Sundance that also performs information extraction.</Paragraph> </Section> </Paper>