<?xml version="1.0" standalone="yes"?>
<Paper uid="H94-1098">
  <Title>THE TIPSTER/SHOGUN PROJECT</Title>
  <Section position="1" start_page="0" end_page="0" type="metho">
    <SectionTitle>
THE TIPSTER/SHOGUN PROJECT
</SectionTitle>
    <Paragraph position="0"/>
  </Section>
  <Section position="2" start_page="0" end_page="0" type="metho">
    <SectionTitle>
GE Research and Development Center
1 River Rd.
</SectionTitle>
    <Paragraph position="0"/>
  </Section>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
PROJECT GOALS
</SectionTitle>
    <Paragraph position="0"> TIPSTER/SHOGUN, part of the ARPA TIPSTER Text (Phase I) program, was led by GE Corporate Research and Development, with Carnegie Mellon University and Martin Marietta</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Management and Data Systems (formerly GE Aerospace).
</SectionTitle>
      <Paragraph position="0"> The project ended at the beginning of 1994, with TIPSTER Phase II expected to begin in March. The TIPSTER/SHOGUN system is thus the end result of a two-year research effort. The project's main goals were: (1) to develop algorithms that would advance the state of the art in coverage and accuracy in data extraction, and (2) to demonstrate high performance and adaptability across languages and domains.</Paragraph>
      <Paragraph position="1"> The team concentrated its research on the development of a model of finite-state approximation, within which the performance of more detailed models of language could be realized in a simple, efficient framework, and on automated knowledge acquisition. The ability of programs to extract data from free text is, in general, limited by the coverage of domain and world knowledge. We chose to focus on knowledge acquisition from corpus data, thereby expanding the coverage of the system while also helping to tune each configuration.</Paragraph>
      <Paragraph position="2"> Like other TIPSTER contractors, the TIPSTER/SHOGUN team ran its system on a series of benchmarks, ending with the MUC-5 evaluation in August 1993. MUC-5 included tests in four configurations, comprising two domains (joint ventures and micro-electronics) in each of two languages (English and Japanese). Although many of the research results of SHOGUN had little or no impact on the benchmarks, MUC-5 provided a comprehensive test of system performance.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
RECENT RESULTS
</SectionTitle>
    <Paragraph position="0"> The finite-state approximation method developed under TIPSTER was inspired by earlier work at GE and SRI, along with experiments near the mid-point of our project, which showed that tighter control, particularly in parsing, contributed very little to text interpretation while greatly inhibiting knowledge acquisition. This shift toward finite-state methods was also influenced by the demands of Japanese language processing, where our existing knowledge resources were less refined than in English.</Paragraph>
    <Paragraph position="1"> The relationship between representation, i.e., the finite-state patterns in our system, and acquisition, i.e., the method by which new knowledge is added, is critical. We chose to emphasize the finite-state patterns in part because they help to exploit the most critical source of knowledge available to us: the corpus.</Paragraph>
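To make the notion of a finite-state extraction pattern concrete, here is a minimal Python sketch. The regular expression, the sample sentence, and the function name are our own illustrations and stand in for SHOGUN's actual pattern language, which the paper does not specify:

```python
import re

# Hypothetical finite-state pattern in the spirit of the paper's approach:
# recognize "X and Y formed a joint venture" and capture the partner names.
# Illustrative only; not the actual SHOGUN pattern formalism.
JV_PATTERN = re.compile(
    r"(?:^|\.\s+)([A-Z][\w.]*(?:\s+[A-Z][\w.]*)*)"   # partner 1
    r"\s+and\s+"
    r"([A-Z][\w.]*(?:\s+[A-Z][\w.]*)*)"              # partner 2
    r"\s+(?:will\s+)?(?:form(?:ed)?|announced)\s+a\s+joint\s+venture"
)

def extract_joint_ventures(text):
    """Return (partner1, partner2) tuples matched by the pattern."""
    return [(m.group(1), m.group(2)) for m in JV_PATTERN.finditer(text)]

sentence = "Toyota and GM formed a joint venture to build small cars."
print(extract_joint_ventures(sentence))
```

A cascade of such patterns can be applied in a single left-to-right pass, which is what gives the finite-state approach the simplicity and efficiency the paragraph above claims.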
    <Paragraph position="2"> The corpus-based acquisition strategy used statistical methods to help identify key phrases and other lexical relations in the corpus, and to assign these lexical relations to word groups with similar interpretations. This approach worked best for task components that required large amounts of knowledge, particularly determining the product or service of each joint venture. We believe this accounts for some of the large differences in coverage between SHOGUN and other systems.</Paragraph>
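As an illustration of the kind of corpus statistics involved, the sketch below scores adjacent word pairs by pointwise mutual information (PMI) over a toy corpus. The toy corpus and the function are our own construction; the paper does not state which statistical measures SHOGUN actually used:

```python
import math
from collections import Counter

# Score adjacent word pairs by pointwise mutual information, one common
# way to surface key phrases and lexical relations in a corpus.
def pmi_bigrams(sentences):
    unigrams, bigrams, n = Counter(), Counter(), 0
    for sent in sentences:
        words = sent.lower().split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
        n += len(words)
    scores = {}
    for (w1, w2), count in bigrams.items():
        p_pair = count / max(n - 1, 1)
        p1, p2 = unigrams[w1] / n, unigrams[w2] / n
        scores[(w1, w2)] = math.log2(p_pair / (p1 * p2))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

corpus = [
    "the venture will produce semiconductor wafers",
    "the company announced a joint venture",
    "semiconductor wafers remain the main product",
]
for pair, score in pmi_bigrams(corpus)[:3]:
    print(pair, round(score, 2))
```

Pairs that co-occur more often than their individual frequencies predict, such as "semiconductor wafers" here, receive high scores and become candidate key phrases, which is the kind of signal the product-or-service slot filling described above would draw on.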
    <Paragraph position="3"> In addition to helping coverage, the corpus-based acquisition strategy greatly eased portability across languages. In most cases, we built each English component first, then used the English as a way of bootstrapping the Japanese. For example, we would take each important &quot;pivot&quot; word in English, try to identify the corresponding &quot;pivot&quot; in Japanese, then use the corpus to identify the relevant contexts in which that word occurred in Japanese. SHOGUN's accuracy in Japanese was somewhat higher, on average, than in English.</Paragraph>
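The context-collection step of this bootstrapping procedure can be sketched as a simple routine that gathers windows around a pivot word. The function name, window size, and toy corpus below are our illustration rather than the project's actual tooling:

```python
# Collect corpus contexts around a "pivot" word, so those contexts can
# seed extraction patterns in a second language. Illustrative sketch only.
def pivot_contexts(sentences, pivot, window=2):
    """Return (left_words, right_words) context windows around each pivot."""
    contexts = []
    for sent in sentences:
        words = sent.lower().split()
        for i, word in enumerate(words):
            if word == pivot:
                left = words[max(i - window, 0):i]
                right = words[i + 1:i + 1 + window]
                contexts.append((left, right))
    return contexts

corpus = [
    "the two firms formed a joint venture in tokyo",
    "a venture between the companies was announced",
]
print(pivot_contexts(corpus, "venture"))
```

Running the same routine over a Japanese corpus with the translated pivot would yield the Japanese contexts from which patterns can be derived, which is the cross-language transfer the paragraph above describes.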
    <Paragraph position="4"> SHOGUN, on average, extracted 37% more information correctly (37% higher recall) than any other system in each of the four MUC-5 configurations. On average, SHOGUN's precision was 13% lower than the next best system. Recall improved by 37% on average between the TIPSTER 18-month evaluation and the MUC-5 test 6 months later, and was 10% higher in the TIPSTER final test than in MUC-4 (which was a much simpler task). We are particularly satisfied by the consistently improving coverage of our system across languages and domains.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="455" type="metho">
    <SectionTitle>
PLANS FOR THE COMING YEAR
</SectionTitle>
    <Paragraph position="0"> As TIPSTER Phase II begins this year, the emphasis will be on developing an architecture that incorporates some of our Phase I results into an open framework that promotes delivery as well as further technical advances. In addition, our research will continue to integrate methods from information retrieval (detection) with more detailed language processing strategies.</Paragraph>
  </Section>
</Paper>