<?xml version="1.0" standalone="yes"?>
<Paper uid="A97-1051">
  <Title>Mixed-Initiative Development of Language Processing Systems</Title>
  <Section position="2" start_page="0" end_page="348" type="intro">
    <SectionTitle>
1. Introduction
</SectionTitle>
    <Paragraph position="0"> In the absence of complete and deep text understanding, implementing information extraction systems remains a delicate balance between general theories of language processing and domain-specific heuristics. Recent developments in the area of corpus-based language processing systems indicate that the successful application of any system to a new task depends to a very large extent on the careful and frequent evaluation of the evolving system against training and test corpora.</Paragraph>
    <Paragraph position="1"> This has focused increased attention on the importance of obtaining reliable training corpora. Unfortunately, acquiring such data has usually been a labor-intensive and time-consuming exercise.</Paragraph>
    <Paragraph position="2"> The goal of the Alembic Workbench is to dramatically accelerate the process by which language processing systems are tailored to perform new tasks. The philosophy motivating our work is to maximally reuse and re-apply every kernel of knowledge available at each step of the tailoring process. In particular, our approach applies a bootstrapping procedure to the development of the training corpus itself. By re-investing the knowledge available in the earliest training data to pre-tag subsequent un-tagged data, the Alembic Workbench can transform the process of manual tagging into one dominated by manual review. In the limit, if the pre-tagging process performs well enough, it becomes the domain-specific automatic tagging procedure itself, and can be applied to those new documents from which information is to be extracted.</Paragraph>
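    As an illustration only, the bootstrapping loop described above might be sketched as follows in Python. This is not the Alembic Workbench implementation: the lookup-table tagger, the function names (train, pre_tag, review, bootstrap), the tag labels and the toy documents are all assumptions made for the example, and the human reviewer is simulated by comparing proposed tags against known correct ones.

```python
from collections import Counter, defaultdict


def train(reviewed):
    """Learn the most frequent tag for each token from reviewed documents."""
    counts = defaultdict(Counter)
    for doc in reviewed:
        for token, tag in doc:
            counts[token][tag] += 1
    return {tok: c.most_common(1)[0][0] for tok, c in counts.items()}


def pre_tag(model, tokens):
    """Propose a tag for every token; unknown tokens get 'O' (no tag)."""
    return [(tok, model.get(tok, "O")) for tok in tokens]


def review(proposed, gold):
    """Hypothetical stand-in for the human reviewer: count and fix wrong tags."""
    corrections = sum(1 for (_, p), (_, g) in zip(proposed, gold) if p != g)
    return list(gold), corrections


def bootstrap(gold_docs):
    """Tag documents one at a time, retraining on everything reviewed so far."""
    reviewed, effort = [], []
    for gold in gold_docs:
        model = train(reviewed)                        # reuse earlier knowledge
        proposed = pre_tag(model, [tok for tok, _ in gold])
        corrected, n = review(proposed, gold)          # review, not tagging from scratch
        reviewed.append(corrected)
        effort.append(n)
    return train(reviewed), effort


# Toy corpus (invented for this sketch): token/tag pairs per document.
docs = [
    [("MITRE", "ORG"), ("hired", "O"), ("Smith", "PER")],
    [("Smith", "PER"), ("left", "O"), ("MITRE", "ORG")],
    [("MITRE", "ORG"), ("hired", "O"), ("Smith", "PER")],
]
model, effort = bootstrap(docs)
print(effort)  # corrections needed per document: [2, 0, 0]
```

    In this toy run the first document needs two manual corrections and the later ones need none, which mirrors the shift from manual tagging to manual review. The final model returned by bootstrap plays the role of the automatic tagger that could be applied to new documents.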
    <Paragraph position="3"> As we and others in the information extraction arena have noticed, the quality of text processing heuristics is influenced critically not only by the power of one's linguistic theory, but also by the ability to evaluate those theories quickly and reliably. Therefore, building new information extraction systems requires an integrated environment that supports: (1) the development of a domain-specific annotated corpus; (2) the multi-faceted analysis of that corpus; (3) the ability to quickly generate hypotheses as to how to extract or tag information in that corpus; and (4) the ability to quickly evaluate and analyze the performance of those hypotheses. The Alembic Workbench is our attempt to build such an environment.</Paragraph>
    <Paragraph position="4"> As the Message Understanding Conferences move into their tenth year, we have seen a growing recognition of the value of balanced evaluations against a common test corpus. What is unique in our approach is the integration of system development with the corpus annotation process itself. The early indications are that, at the very least, this integration can significantly increase the productivity of the corpus annotator. We believe that the benefits will flow in the other direction as well, and that a concomitant increase in system performance will follow as one applies the same mixed-initiative development environment to the problem of domain-specific tailoring of the language processing system.</Paragraph>
  </Section>
</Paper>