<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-2094">
<Title>Sydney, July 2006. &#169;2006 Association for Computational Linguistics. On-Demand Information Extraction</Title>
<Section position="4" start_page="0" end_page="732" type="intro">
<SectionTitle> 2 Overview </SectionTitle>
<Paragraph position="0"> The basic functionality of the system is as follows. The user types a query / topic description in keywords (for example, &quot;merge&quot; or &quot;merger&quot;). Tables are then created automatically within several minutes, rather than after a month of human labor. These tables are expected to show information about the salient relations for the topic.</Paragraph>
<Paragraph position="1"> Figure 1 shows the components of the system and how it works. There are six major components. We briefly describe each component and how the data is processed; then, in the next section, the four most important components are described in more detail.</Paragraph>
<Paragraph position="2"> 1) IR system: Based on the query given by the user, it retrieves relevant documents from the document database. We used a simple TF/IDF IR system that we developed.</Paragraph>
<Paragraph position="3"> 2) Pattern discovery: First, the texts in the retrieved documents are analyzed using a POS tagger, a dependency analyzer and an Extended NE (Named Entity) tagger, which will be described later. This component then extracts sub-trees of dependency trees which are relatively frequent in the retrieved documents compared to the entire corpus. It counts the frequencies, in the retrieved texts, of all sub-trees with more than a certain number of nodes and scores them using TF/IDF methods. The top-ranking sub-trees which contain NEs are called patterns; these are expected to indicate salient relationships of the topic and are used in the later components.</Paragraph>
<Paragraph position="4"> 3) Paraphrase discovery: In order to find semantic relationships between patterns, i.e. 
to find patterns which should be used to build the same table, we use paraphrase discovery techniques.</Paragraph>
<Paragraph position="5"> The paraphrase discovery was conducted off-line and created a paraphrase knowledge base. 4) Table construction: In this component, the patterns created in (2) are linked based on the paraphrase knowledge base created in (3), producing sets of patterns which are semantically equivalent. Once the sets of patterns are created, these patterns are applied to the documents retrieved by the IR system (1). The matched patterns pull out the entity instances, and these entities are aligned to build the final tables.</Paragraph>
<Paragraph position="6"> 5) Language analyzers: We use a POS tagger and a dependency analyzer to analyze the text. The analyzed texts are used in pattern discovery and paraphrase discovery.</Paragraph>
<Paragraph position="7"> 6) Extended NE tagger: Most of the participants in events are likely to be Named Entities.</Paragraph>
<Paragraph position="8"> However, the traditional NE categories are not sufficient to cover most participants of various events. For example, the seven standard MUC NE categories (i.e. person, location, organization, percent, money, time and date) miss product names (e.g. Windows XP, Boeing 747), event names (Olympics, World War II), numerical expressions other than monetary expressions, etc. We used an Extended NE set with 140 categories and a tagger based on those categories.</Paragraph>
</Section>
</Paper>