File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/93/h93-1089_metho.xml
Size: 5,036 bytes
Last Modified: 2025-10-06 14:13:25
<?xml version="1.0" standalone="yes"?> <Paper uid="H93-1089"> <Title>SHOGUN- MULTILINGUAL DATA EXTRACTION FOR TIPSTER</Title> <Section position="1" start_page="0" end_page="0" type="metho"> <SectionTitle> SHOGUN- MULTILINGUAL DATA EXTRACTION FOR TIPSTER </SectionTitle> <Paragraph position="0"/> </Section> <Section position="2" start_page="0" end_page="0" type="metho"> <SectionTitle> GE Research and Development Center 1 River Rd., Schenectady, NY 12301 PROJECT GOALS </SectionTitle> <Paragraph position="0"> The TIPSTER/SHOGUN project aims at substantive improvements in coverage and accuracy for automatic data extraction through innovative strategies in knowledge acquisition, run-time integration, and control. One of four teams in the data extraction component of the TIPSTER program, TIPSTER/SHOGUN includes GE Corporate Research and Development, Carnegie Mellon</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> University - Center for Machine Translation, and GE Management and Data Systems. </SectionTitle> <Paragraph position="0"> Data extraction systems interpret the key content of natural language text, producing a structured representation of items that range from high-level business relationships to detailed knowledge coding of technologies and industry classifications. This task applies to both Japanese and English in each of two domains--joint ventures and micro-electronics. 
As such, TIPSTER is considerably more detailed and comprehensive than previous text interpretation experiments, including prior MUC (Message Understanding Conference) evaluations.</Paragraph> <Paragraph position="1"> The goals for SHOGUN are the following: * Accuracy significantly ahead of MUC-4, with levels near those of trained human analysts at about 100 times human speed using conventional hardware and software.</Paragraph> <Paragraph position="2"> * Automated knowledge acquisition and extensibility tools that support customization times of a few weeks for new applications.</Paragraph> <Paragraph position="3"> * Multi-lingual performance, with comparable levels in both languages and the highest possible overlap between languages.</Paragraph> <Paragraph position="4"> The project is now within a few months of completion, and is on target toward all of these goals.</Paragraph> </Section> </Section> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> RECENT RESULTS </SectionTitle> <Paragraph position="0"> During the early stages of the project, the team reached very good initial levels of performance on MUC-4 by successfully integrating methods used at GE and CMU, marking the first time that parsing systems of this level of coverage have been effectively combined. This provided an important testbed and also allowed for multi-lingual development. In recent months, Japanese performance has remained close to English performance.</Paragraph> <Paragraph position="1"> As the system coverage and accuracy have continued to improve, the most important recent thrust has been the incorporation of most of the knowledge and control strategies of the system into a finite-state driven analyzer, effectively replacing the traditional parsing layer with a detailed knowledge base of finite-state rules compiled from syntactic and lexical resources. 
While this seems close to work done in the speech community, it is an unusual approach for text, where the high perplexity and long sentence length have seemed to favor semantics-driven and high-level syntactic models.</Paragraph> <Paragraph position="2"> The finite-state model allows different knowledge sources, particularly corpus-based knowledge, to have more of an impact on interpretation. Data extraction is a knowledge-intensive task, and it has been much simpler to augment the finite-state rules with corpus data than it was for the more abstract rules.</Paragraph> <Paragraph position="3"> While the performance on all tasks still lags behind human analysts, closing this gap may not be as hard as we first expected. Much of the difference comes from portions of the work that are still incomplete. In addition, the ability to use automatically-acquired corpus data gives the programs a distinct advantage on certain portions of the task.</Paragraph> </Section> <Section position="4" start_page="0" end_page="395" type="metho"> <SectionTitle> PLANS FOR THE COMING YEAR </SectionTitle> <Paragraph position="0"> As the project nears completion, the team is approaching the goal of near-human accuracy mostly by finishing certain key details, such as better reference resolution and word sense discrimination. At the same time, we are close to some significant advances in corpus-based training methods that will not only isolate the context required to discriminate nuances of meaning but also significantly reduce development time by acquiring domain knowledge from the corpus. This may be the key to future applications of TIPSTER technology.</Paragraph> </Section> </Paper>