<?xml version="1.0" standalone="yes"?> <Paper uid="M93-1011"> <Title>GE-CMU: DESCRIPTION OF THE SHOGUN SYSTEM USED FOR MUC-5</Title> <Section position="3" start_page="0" end_page="109" type="intro"> <SectionTitle> INTRODUCTION </SectionTitle> <Paragraph position="0"> The GE-CMU TIPSTER/SHOGUN system is the result of a two-year research effort, part of the ARPA-sponsored TIPSTER data extraction program. The project's main goals were: (1) to develop algorithms that would advance the state of the art in coverage and accuracy in data extraction, and (2) to demonstrate high performance across languages and domains and to develop methods for easing the adaptation of the system to new languages and domains.</Paragraph> <Paragraph position="1"> The system as used in MUC-5 represents a considerable shift from those used in earlier stages of the program and in previous MUCs. The original SHOGUN design integrated several different approaches by combining different knowledge sources, such as syntax, semantics, phrasal rules, and domain knowledge, at run-time. This allowed the system to achieve a good level of performance very quickly, and made it easy to test different modules and methods; however, it proved very difficult to make all the changes necessary to improve the system, especially across languages, when system knowledge was so distributed at run-time.</Paragraph> <Paragraph position="2"> As a result, the team adopted a new approach, relying heavily on finite-state approximation. This method combines several earlier pieces of work, including Pereira's research on grammar approximation [4], some of the original ideas on parser compilation from Tomita [5], and GE's representation of the dynamic lexicon [3, 1]. Like Pereira's model, the system uses a finite-state grammar as a loose version of a context-free 1 This research was sponsored (in part) by the Advanced Research Projects Agency (DOD) and other government agencies. 
The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Advanced Research Projects Agency or the US Government.</Paragraph> <Paragraph position="3"> grammar, under the assumption that the finite-state grammar will cover all the inputs that the general grammar would recognize but perhaps be more tolerant. However, the system also includes methods for compiling different knowledge sources into the finite-state model, particularly emphasizing lexical knowledge and domain knowledge as reflected in a corpus.</Paragraph> <Paragraph position="4"> This model, in which knowledge is combined at development time to be used by a finite-state pattern matching engine at run-time, makes it easier to tune the system to a new language or domain without sacrificing the benefit of having general linguistic and conceptual knowledge in the system.</Paragraph> <Paragraph position="5"> While the GE systems, and more recently, the GE-CMU systems, have done well in all the MUC evaluations, our rate of progress has never been so great as it has been in the period before MUC-5. This is in spite of the fact that the team's diagnostic and debugging efforts had to be divided across languages and domains (handling Japanese, for example, presented a significant overhead in simply being able to follow the rules and analyze the results). We attribute this progress to the current focus on facilitating and automating the knowledge acquisition process, especially on the use of a corpus.</Paragraph> <Paragraph position="6"> This paper will give a very brief overview of the configuration of the system, followed by the analysis of the examples, and some conclusions about the results.</Paragraph> </Section> </Paper>