<?xml version="1.0" standalone="yes"?>
<Paper uid="H05-2012">
  <Title>Extracting Information about Outbreaks of Infectious Epidemics</Title>
  <Section position="2" start_page="0" end_page="0" type="metho">
    <SectionTitle>
PLUS Epidemiological Fact Base.
</SectionTitle>
    <Paragraph position="0"> The facts are automatically extracted from plain-text reports about outbreaks of infectious epidemics around the world. The system collects new reports, extracts new facts, and updates the database in real time. The extracted database is available on-line through a Web server.</Paragraph>
  </Section>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Information Extraction (IE) is a technology for finding facts in plain text, and coding them in a logical representation, such as a relational database.</Paragraph>
    <Paragraph position="1"> Much published work on IE reports on &quot;closed&quot; experiments; systems are built and evaluated based on carefully annotated corpora, at most a few hundred documents.[1] The goal of the work presented here is to explore the IE process in the large: the system integrates a number of off-line and on-line components around the core IE engine, and serves as a base for research on a wide range of problems.</Paragraph>
    <Paragraph position="2"> The system is applied to a large dynamic collection of documents in the epidemiological domain, containing tens of thousands of documents. The topic is outbreaks of infectious epidemics, affecting humans, animals and plants. To our knowledge, this is the first large-scale IE database in the epidemiological domain publicly accessible on-line.[2]</Paragraph>
    <Paragraph position="3"> [1] Cf., e.g., the MUC and ACE IE evaluation programmes. [2] On-line IE databases do exist, e.g., CiteSeer, but none that extract multi-argument events from plain natural-language text.</Paragraph>
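    <Paragraph position="4"> As a minimal sketch of what "coding facts in a relational database" can look like, the following stores one multi-argument event as a row and queries it back. The table name, column names, and the sample record are illustrative assumptions, not the schema or data of the system described here.

```python
import sqlite3

# Hypothetical schema: one row per extracted incident. The column names are
# assumptions made for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE incident (
           disease TEXT, location TEXT, report_date TEXT,
           victim_count INTEGER, source_report TEXT)"""
)

# A made-up fact, as an IE engine might fill it from one sentence of a report.
fact = ("cholera", "Exampleland", "2005-01-15", 12, "report-0001")
conn.execute("INSERT INTO incident VALUES (?, ?, ?, ?, ?)", fact)

# Once facts are rows, ordinary SQL selects and filters them.
rows = conn.execute(
    "SELECT disease, location, victim_count FROM incident WHERE disease = ?",
    ("cholera",),
).fetchall()
print(rows)
```

Keeping a source_report column on every row is one simple way to link each fact back to the text it came from.</Paragraph>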
  </Section>
  <Section position="4" start_page="0" end_page="22" type="metho">
    <SectionTitle>
2 System Description
</SectionTitle>
    <Paragraph position="0"> The architecture of the ProMED-PLUS system is shown in Fig. 1. The core IE Engine (center) is implemented as a sequence, or &quot;pipeline,&quot; of stages. The IE engine is based in part on earlier work (Grishman et al., 2003). Novel components use machine learning at several stages to enhance the performance of the system and the quality of the extracted data: acquisition of domain knowledge for populating the knowledge bases (left side in Fig. 1), and automatic post-validation of extracted facts for detecting and reducing errors (upper right). Novel features include the notion of confidence, and aggregation of separate facts into outbreaks across multiple reports, based on confidence.</Paragraph>
    <Paragraph position="1"> Operating in the large is essential, because the learning components in the system rely on the [...] acquisition (Yangarber et al., 2002; Yangarber, 2003) requires a large corpus of domain-specific and general-topic texts. On the other hand, automatic error reduction requires a critical mass of extracted facts. Tighter integration between IE and knowledge discovery in databases (KDD) components, for mutual benefit, is advocated in recent related research, e.g., (Nahm and Mooney, 2000; McCallum and Jensen, 2003). In this system we have demonstrated that redundancy in the extracted data (despite the noise) can be leveraged to improve quality, by analyzing global trends and correcting erroneous fills that are due to local mis-analysis (Yangarber and Jokipii, 2005). For this kind of approach to work, it is necessary to aggregate over a large body of extracted records.</Paragraph>
    <Paragraph position="2"> The interface to the DB is accessible on-line at doremi.cs.helsinki.fi/plus/ (lower right of Fig. 1). It allows the user to view, select and sort the extracted outbreaks, as well as the individual incidents that make up the aggregated outbreaks. All facts in the database are linked back to the original reports from which they were extracted. The distribution of the outbreaks may also be plotted and queried through the Geographic Map view.</Paragraph>
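    <Paragraph position="3"> The idea of using redundancy to correct erroneous fills, and of aggregating facts into outbreaks, can be illustrated with a toy sketch. This is not the method of the cited work: the Fact fields, the confidence scale, the 0.5 threshold, and the weighted-majority rule are all assumptions chosen to keep the example small.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Fact:
    disease: str
    location: str
    confidence: float  # assumed scale: 0.0 (unreliable) to 1.0 (certain)
    report_id: str     # link back to the source report


def correct_fills(facts, threshold=0.5):
    """Toy redundancy check: a low-confidence disease fill is replaced by the
    confidence-weighted majority disease among all facts for that location."""
    votes = defaultdict(Counter)
    for f in facts:
        votes[f.location][f.disease] += f.confidence
    corrected = []
    for f in facts:
        majority = votes[f.location].most_common(1)[0][0]
        if threshold > f.confidence and f.disease != majority:
            f = replace(f, disease=majority)
        corrected.append(f)
    return corrected


def aggregate(facts):
    """Group facts into outbreaks keyed by (disease, location)."""
    outbreaks = defaultdict(list)
    for f in facts:
        outbreaks[(f.disease, f.location)].append(f.report_id)
    return dict(outbreaks)


facts = [
    Fact("anthrax", "Exampleland", 0.9, "r1"),
    Fact("anthrax", "Exampleland", 0.8, "r2"),
    Fact("rabies", "Exampleland", 0.2, "r3"),  # made-up local mis-analysis
]
outbreaks = aggregate(correct_fills(facts))
print(outbreaks)
```

With only three records the majority is fragile; a vote of this kind becomes meaningful only over a large body of extracted records, which is the point made above about operating in the large.</Paragraph>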
  </Section>
</Paper>