<?xml version="1.0" standalone="yes"?>
<Paper uid="A97-1051">
  <Title>Mixed-Initiative Development of Language Processing Systems</Title>
  <Section position="3" start_page="348" end_page="349" type="metho">
    <SectionTitle>
2. Alembic Workbench: A brief description
</SectionTitle>
    <Paragraph position="0"> The Alembic Workbench provides a graphical user interface by which texts can be annotated using the mouse and user-defined key bindings. The Workbench mouse interface is engineered specifically to minimize hand motion. This allows text markup to proceed very quickly. Once a text has been marked up, the user's annotations are highlighted in colors specified by the user. A &amp;quot;mouse line&amp;quot; at the bottom of the text window provides further visual feedback indicating all of the annotations associated with the location under the mouse cursor, including document structure markup, if available. An example screen image from a typical session with the Workbench is shown above.</Paragraph>
    <Paragraph position="1"> Our focus in building the Alembic Workbench is to provide a natural but powerful environment for annotating texts in the service of developing natural language processing systems. To this end we have incorporated a growing number of analysis and reporting features. The current set of utilities includes:  * A string-matching mechanism that can automatically replicate new markup to identical instances elsewhere in the document.</Paragraph>
    <Paragraph position="2"> * A rule language for constructing task-specific phrase tagging and/or pre-tagging rule sets.</Paragraph>
    <Paragraph position="3"> * A tool that generates phrase-based KWlC (&amp;quot;keyword in context&amp;quot;) reports to help the user identify common patterns in the markup.</Paragraph>
    <Paragraph position="4"> * A procedure that generates word lists based on their frequency. This tool also measures the degree to which a word occurs in different markup contexts.</Paragraph>
    <Paragraph position="5">  * A visualization component for viewing inter-annotator (or key/answer) agreement.</Paragraph>
    <Paragraph position="6"> * A scorer that allows arbitrary SGML markup to be selected for scoring.</Paragraph>
    <Paragraph position="7"> * A full-featured interface to the multi-stage architecture of the Alembic text processing system. * An interface to Alembic's phrase-rule learning system for generating new application-specific rule sets.</Paragraph>
    <Paragraph position="8"> * The Alembic Workbench also provides specialized  interfaces for supporting more complex, linked markup such as that needed for coreference. Another interface is geared towards capturing arbitrary n-ary relations between tagged elements in a text (these have been called &amp;quot;Scenario Templates&amp;quot; in MUC). More details about the implementation of the Workbench are provided in Section 7.</Paragraph>
    <Paragraph position="9"> The development of the Alembic Workbench environment came about as a result of MYrRE's efforts at refining and modifying our natural language processing system, Alembic \[1,7\], to new tasks: the Message Understanding Conferences (MUC5 and MUC6), and the</Paragraph>
  </Section>
  <Section position="4" start_page="349" end_page="349" type="metho">
    <SectionTitle>
TIPSTER Multi-lingual Entity Task (MET1). (See \[6\]
</SectionTitle>
    <Paragraph position="0"> for an overview and history of MUC6 and the ''Named Entity Task&amp;quot;.) The Alembic text processing system applies Eric Brill's notion of ru/e sequences \[2,3\] at almost every one of its processing stages, from part-of-speech tagging to phrase tagging, and even to some portions of semantic interpretation and inference.</Paragraph>
    <Paragraph position="1"> While its name indicates its lineage, we do not view the Alembic Workbench as wetkkxt to the Alembic text processing system alone. We intend to provide a well-documented API in the near future for external utilities to be incorporated smoothly into the corpus/system development environment. We envision two classes of external utilities: tagging utilities and analysis utilities. By integrating other tagging modules (including complete NLP systems), we hope those systems can be more efficiently customized when the cycle of analysis, hypothesis generation and testing is tightened into a well-integrated loop. The current version of the tool supports viewing, annotating and analyzing documents in 7-bit, 8-bit and 2-byte character sets. Current support includes the Latin-1 languages, Japanese (JIS), Chinese (GB1232), Russian, Greek and Thai.</Paragraph>
    <Paragraph position="2"> 3. Increasing manual annotation productivity through pre-tagging A motivating idea in the design of the Alembic Workbench is to apply any available information as early and as often as possible to reduce the burden of manual tagging. In addition to careful interface design and support for user-customization, a core mechanism for enhancing this process is through pre-tagging.</Paragraph>
    <Paragraph position="3"> The generation of reliably tagged text corpora requires that a human annotator read and certify all of the annotations applied to a document. This is especially true if the annotations are to be used for subsequent manual or automatic training procedures. However, much of the drudgery of this process can be removed if the most obvious and/or oft-repeated expressions can be tagged prior to the annotator's efforts. One way of doing this is to apply tags to any and all strings in a document that match a given string. This is the nature of the &amp;quot;auto-tagging ~' facility built-in to the Workbench interface. For example, in annotating journalistic document collections with &amp;quot;Named Entity&amp;quot; tags, one might want to simply pre-tag every occurrence of &amp;quot;President Clinton&amp;quot; with Person.. ~ Of course, these actions should be taken with some care, since mistagging entities throughout a document might actually lead to an increase in effort required to accurately fix or remove tags in the document.</Paragraph>
    <Paragraph position="4"> A more powerful approach is to allow patterns, or rules, to form the basis for this pre-tagging. The Alembic phrase-rule interpreter provides the basis for developing rule-based pre-tagging heuristics in the Workbench. In the current version of the Workbench, the user is free to compose these &amp;quot;phraser&amp;quot; rules and group them into specialized rule sets. Figure 2 shows an example sequence of rules that could be composed for pre-tagging a corpus with Person tags. The Brill control regime interprets these rules strictly sequentially: rule n is applied wherever in the text it can be; it is then discarded and rule n+l is consulted. There is no unconstrained forward chaining using a &amp;quot;soup&amp;quot; of rules as in a standard production (or rule-based) system. The Alembic &amp;quot;phraser&amp;quot; rule interpreter has been applied to tagging named entities, sentence chunks, simple entity relations (&amp;quot;template element&amp;quot; in the parlance of MUC6), and other varieties of phrases.</Paragraph>
    <Paragraph position="5">  produces Person phrases around any word immediately to the fight of a title and/or honorific, and then (2) grows the extent of the phrase to the fight one lexeme, if that word i s a proper noun.</Paragraph>
  </Section>
  <Section position="5" start_page="349" end_page="350" type="metho">
    <SectionTitle>
4. Mixed-initiative text annotation
</SectionTitle>
    <Paragraph position="0"> In addition to allowing users to define pre-tagging rules, we have developed a learning procedure that can be used to induce these rules from small training corpora.</Paragraph>
    <Paragraph position="1"> Operationally, an annotator starts by generating a small initial corpus and then invokes the learner to derive a set of pre-tagging rules. These rules can then be applied to new, unseen texts to pre-tag them. Figure 3 illustrates this bootstrapping cycle.</Paragraph>
    <Paragraph position="2"> i The Named Entity task from MUC6 consists of adding tags to indicate expressions of type Person, Location, Organization, Date, Time and Money, see \[6\].</Paragraph>
    <Paragraph position="3">  The earlier we can extract heuristic rules on the basis of manually tagged data, the earlier the user can be relieved from some portion of the chore of physically marking up the text--the user will need to edit and/or add only a fraction of the total phrases in a given document. In our experience of applying the Alembic phrase rule learner to named-entity and similar problems, our errorreduction learning method requires only modest amounts of training data. (We present performance details in  the user in a corpus development cycle, making use of pre-tagging facilities, analysis facilities, and the automatic generation of pre-tagging rule sets through machine learning.</Paragraph>
    <Paragraph position="4"> As the human annotator continues generating reliable training data, she may, at convenient intervals, reinvoke the learning process. As the amount of training data increases, the performance of the learned rules tends to increase, and so the amount of labor saved in pre-tagging subsequent training data is further increased. The bootstrapping effect tends to increase over time.</Paragraph>
    <Paragraph position="5"> For the &amp;quot;named entity&amp;quot; task in MUC6 approximately 25,000 words were provided as annotated training data by the conference organizers (&amp;quot;formal training&amp;quot; and &amp;quot;dryrun&amp;quot; data sets). Prior to developing the Alembic Workbench, we were able to use this amount of data in Alembic to generate a system performing at 85.2 P&amp;R on unseen test data. 2 Based on the tagging rates we have measured thus far using the Workbench, it would take somewhere between 1.5 to 2.5 hours to tag these 25,000 words of data.</Paragraph>
    <Paragraph position="6"> There is a limit on how much one can reduce the timerequirements for generating reliable training data--this is the rate required by a human domain expert to carefully read and edit a perfectly pre-annotated training corpus. Training data cannot be generated without this 2 P&amp;R (or F-measure) is a weighted combination of recall and precision.</Paragraph>
    <Paragraph position="7"> human investment. 3 Indeed, in situations where the quality of the data is particularly important (as it is in, say a multi-system evaluation such as MUC), it is typical that multiple reviews of the same corpus is performed by various annotators, especially given the known ambiguity of any annotation task definition.</Paragraph>
  </Section>
  <Section position="6" start_page="350" end_page="351" type="metho">
    <SectionTitle>
5. Manual refinement of automatically derived pre-tagging heuristics
</SectionTitle>
    <Paragraph position="0"> In the previous section we presented our approach to mixed-initiative corpus development and tagging heuristics without assuming any sophistication on the part of the human user beyond a clear understanding of the information extraction task being addressed.</Paragraph>
    <Paragraph position="1"> Usually, however, even a lay end-user is likely to have a number of intuitions about how the un-annotated data could be pre-tagged to reduce the burden of manual tagging. Hand-coded rules can be applied in concert with the machine-derived rules mentioned earlier. One way this can be done is by invoking the rule learning subsequent to the application of the hand-cxxted pre-tagging rules. On the other hand, if the user notices a consistent mistake being made by the machine-learned rules early in the bootstrapping process, the user can augment the machine-derived rule sequence with manually composed rules. In fact, every rule composed by the learning procedure is completely inspectable by the user, and so some users may want to modify individual machine-derived rules, perhaps to expand their generality beyond the particular data available in the emerging corpus.</Paragraph>
    <Paragraph position="2"> This is another way, then, that the Alembic Workbench environment enables and encourages the mixed, or cooperative, application of human and machine skills to the combined task of developing a domain-specific corpus and set of extraction heuristics.</Paragraph>
    <Paragraph position="3"> Of course, composing rules is somewhat akin to programming, and not all users will be inclined, or well-equipped, to become involved in this process. One impediment to end-users composing their own rules is the particular syntax of Alembic's phraser rules, so we anticipate exploring other, simpler rule languages that will encourage end-user participation. Another approach that we are interested in exploring involves supporting more indirect feedback or directives from the user that are rooted more closely to examples in the data.</Paragraph>
    <Paragraph position="4">  cannot be generated faster than this, and that these data may indeed be helpful in the learning procedure of some other systems. But all such data will remain suspect as far as being considered part of an annotated training corpus until inspected by a human, given the vagaries of genre and style that can easily foil the most sophisticated systems.  Similarities and differences between manual and automatic rule formation The automatic rule-learuing procedure uses a generate-and-test approach to learn a sequence of rules. A set of rule schemata, defining a set of possible rule instances determines the rule space that the learning procedure explores. The learner uses indexing based on the actual data present in the corpus to help it explore the rule space efficiently. The learning process is initiated by deriving and applying an initial labeling function based on the differences between an un-annotated version and a correctly annotated version of the corpus. Then, during each learning cycle, the learner tries out applicable rule instances and selects the rule that most improves the score when applied to the corpus. The score is determined by evaluating the corpus as currently annotated against the correctly annotated version, using some evaluation function (generally precision, recall or F-measure). The corpus annotation is updated by applying the chosen rule, and the learning cycle repeats. This cycle is continued until a stopping criterion is re, ached, which is usually defined as the point where performance improvement falls below a threshold, or ceases. Other alternatives include setting a strict limit on the number of rules, and testing the performance improvement of a rule on a corpus distinct from the training set.</Paragraph>
    <Paragraph position="5"> Of course, there are two important advantages that a human expert might have over the machine algorithm: linguistic intuition and world knowledge. Rules that include references to a single lexeme can be expanded to more general applicability by the human expert who is able to predict alternatives that lie outside the current corpus available to the machine. By supporting multiple ways in which rules can be hypothesized, refined and tested, the strengths of both sources of knowledge can be brought to bear.</Paragraph>
  </Section>
  <Section position="7" start_page="351" end_page="353" type="metho">
    <SectionTitle>
6. Experimental Results
</SectionTitle>
    <Paragraph position="0"> We are still in the early stages of evaluating the performance of the Alembic Workbench along a number of different dimensions. However, the results from early experiments are encouraging. Figure 4 compares the productivity rates using different corpus development utilities. These are indicated by the four categories on the X-axis: (1) using SGML-mode in emacs (by an expert user); (2) using the Workbench interface and &amp;quot;auto-tag&amp;quot; string-matching utility only; (3) using the Workbench following the application of learned tagging rules derived from 5 short documents--approximately 1,500 words, and (4) using the Workbench following the application of learned tagging rules again, but this time with the learned rules having trained on 100 documents (approximately 48,000 words), instead of only five documents.</Paragraph>
    <Paragraph position="1"> As can be seen in these experiments, there is a clear increase in the productivity as a function of both the user interface (second column) and the application of pre-tagging rules (third and fourth columns). The large step in performance between columns three and four indicate that repeated invocation of the learning process during the intermediate stages of the corpus development cycle will likely result in acceleration of the annotation rate. (As it happens, these results are probably underestimating the pre-tagging productivity.</Paragraph>
    <Paragraph position="2"> The reason for this is that the version of the Workbench used was not yet able to incorporate date and time annotations generated by a separate pre-processing step; this date and time tagger performs at an extremely high level of precision for this genre---in the high nineties P&amp;R.) These initial experiments involved a single expert annotator on a single tagging task (MUC6 named entity). The annotator was very familiar with the tagging task.</Paragraph>
    <Paragraph position="3">  bootstrap method with 5-document training set; (4) AWB rule-learning bootstrap method with 100-document trainin\[ set. See discussion in text.</Paragraph>
    <Paragraph position="4"> To place this in the perspective of the human annotator, after only about 15 minutes of named entity tagging, having annotated some 1,500 words of text with approximately 150 phrases, the phrase rule learner can derive heuristic rules that produce a pre-tagging performance rate (P&amp;R) of between 50 and 60 percent.</Paragraph>
    <Paragraph position="5"> Of course, this performance is far short of what is needed for a practical extraction system, but it already constitutes a major source for labor savings, since 50 to 60 percent of the annotations that need to be moused (or clicked) in are already there. Since the precision at this early stage is only around 60 percent, there will be extra phrases that need (1) to be removed, (2) their assigned category changed (from, say,  organization to person), or (3) their boundaries adjusted. It turns out that for the first two of these kinds of precision errors, the manual corrections are extremely quick to perform. (Boundaries are not really difficult to modify, but the time required is approximately the same as inserting a tag from scratch.) In addition, making these corrections removes both a precision and a recall error at the same time. Therefore, it turns out that even at this very early stage, the modest pre-tagging performance gained from applying the learning procedure provides measurable performance improvement.</Paragraph>
    <Paragraph position="6"> In order to obtain more detailed results on the effect of pre-tagging corpora, we conducted another experiment in which we made direct use of the iterative automatic generation of rules from a growing manually-tagged corpus. Using the same skilled annotator, we inlroduced a completely new corpus for which named-entity tagging happened to be needed within our company. We randomly divided approximately 50 documents of varying sizes into five groups. The word counts for these five groups were: Groupl: 19,300; Group2: 13,800; Group3: 6,3130; Group4: 15,800; Group5: 8,000; for a total of 63,000 words. After manually tagging the first group, we invoked the rule learning procedure. Applying the learning procedure on each training set required two to three hours of elapsed time on a Sun Sparc Ultra. The new tagging rules were then applied to the next ten documents prior to being manually tagged/edited. This enlarged corpus was then used to derive a new rule set to be applied to the next group of documents, and so on. A summarization of the results are presented in Figure 5.</Paragraph>
    <Paragraph position="7"> Clearly, more experiments are called for we plan to conduct these across different annotators, task types, and languages, to better evaluate productivity, quality and other aspects of the annotation process.</Paragraph>
    <Paragraph position="8"> It is extremely difficult to control many of the features that influence the annotation process, such as the intrinsic complexity of the topic in a particular document, the variation in tag-density (tags per word) that may occur, the user's own training effect as the structure and content of documents become more familiar, office distractions, etc. In order to gain a better understanding of the underlying tagging performance of the rule learner, and so separate out some of these human factors issues, we ran an automated experiment in which different random subsets of sentences were used to train rule sets, which were then evaluated on a static test corpus. The results shown in Figure 6 give some indication of the ability of the rule-sequence learning procedure to glean useful generalizations from meager amounts of training data.</Paragraph>
    <Paragraph position="9">  incremental application of automatically acquired rule sets. The first observation we make is that there is a clear and obvious direction of improvement--by the time 30 documents have been tagged, the annotation rate on Group 4 has increased considerably. It is important to note, however, that there is still noise in the curve. In addition, the granularity is perhaps still too coarse to measure the incremental influences of pre-tagging rules.  One clear effect of increasing training set size is a reduction in the sensitivity of the learning procedure to particular training sets. We hypothesize that this effect is partly indicative of the generalization behavior on which the learning procedure is based, which amplifies  the effects of choosing more or less representative training sentences by chance. Since the learning process is not merely memorizing phrases, but generating contextual rules to try to predict phrase types and extents, the rules are very sensitive to extremely small selections of training sentences. Figure 7 shows the F-measure performance smoothed by averaging neighboring data points, to get a clearer picture of the general tendency.</Paragraph>
    <Paragraph position="10"> We should note that the Alembic Workbench, having been developed only recently in our laboratory, was not available to us in the course of our effort to apply the Alembic system to the MUC6 and MET tasks.</Paragraph>
    <Paragraph position="11"> Therefore we have not been able to measure its influence in preparing for a particular new text processing task. We intend to use the system to prepare for future evaluations (including MUC7 and MET2) and to carefully evaluate the Alembic Workbench as an environment for the mixed-initiative development of information extraction systems in multiple languages.</Paragraph>
  </Section>
  <Section position="8" start_page="353" end_page="354" type="metho">
    <SectionTitle>
7. Implementation
</SectionTitle>
    <Paragraph position="0"> The Alembic Workbench interface has been written in Tci/Tk. Some of the analysis and reporting utilities (available from within the interface as well as Unix command-line utilities) are written in Perl, C or Lisp.</Paragraph>
    <Paragraph position="1"> The separate Alembic NLP system consists of C pre-processing taggers (for dates, word and sentence tokenizafion and part-of-speech assignments) and a Lisp image that incorporates the rest of Alembic: the phrase-rule interpreter, the phrase rule learner, and a number of discourse-level inference mechanisms described in \[8\].</Paragraph>
    <Paragraph position="2"> This code currently runs on Sun workstations running Sun OS 4.1.3 and Solaris 2.4 (Sun OS 5.4) and greater; we have begun porting the system to Windows NT/Windows 95. We anticipate providing an API for integrating other NLP systems in the near future.</Paragraph>
    <Paragraph position="3"> The Workbench reads and saves its work in the form of SGML-encoded files, though the original document need not contain any SGML mark-up at all. These files me parsed with the help of an SGML normalizer. 4 During the course of the annotation process the Workbench uses a &amp;quot;Parallel Tag File&amp;quot; (PTF) format, which separates out the embedded annotations from the source text, and organizes user-defined sets of annotations within distinct &amp;quot;tag files.&amp;quot; While these files are generally hidden from the user, they provide a basis for the combination and separation of document annotations (&amp;quot;tagsets&amp;quot;) without needing to modify or otherwise disturb the base document. This allows the user to view 4 In cases where documents use some of the more complex aspects of SGML, the user supplies a Document Type Description (DTD) file for use in normalization. For simple SGML documents, or documents with no original SGML markup at all, no DTD needs to be specified.</Paragraph>
    <Paragraph position="4"> only Named Entity tags, or only tokenization tags, or any desired subset of tagsets. Thus, the Workbench is written to be TIPSTER-compliant, though it is not itself a document manager as envisioned by that architecture (see \[5\]). We anticipate integrating the Workbench with other TIPSTER compliant modules and document managers via the exchange of SGMLformatted documents. The Parallel Tag File (PTF) format used by the Workbench provides another means by which a translator could be written.</Paragraph>
    <Paragraph position="5"> 8. Future Work Broadly defined, there are two distinct types of users who we imagine will find the Workbench useful: NLP researchers and information extraction system end-users.</Paragraph>
    <Paragraph position="6"> While our dominant focus so far has been on supporting the language research community, it is important to remember that new domains for language processing generally, and information extraction in particular, will have their own domain experts, and we want the text annotation aspects of the tool to be quite usable by a wide population. In this vein we would like to enable virtually any user to be able to compose new patterns (rules) for performing pre-tagging on the data. While the current rule language has a simple syntax, as well as an extremely simple control regimen, we do not imagine all users will want to engage directly in an exploration for pre-tagging rules. A goal for our future research is to explore new methods for incorporating end-user feedback to the learning procedure. This feedback might include modifying a very simplified form of a single rule for greater generality by integrating thesauri to construct word-list suggestions.</Paragraph>
    <Paragraph position="7"> We also would like to give users immediate feedback as to how a single rule applies (correctly and incorrectly) to many different phrases in the corpus.</Paragraph>
    <Paragraph position="8"> In this paper we have concentrated on the named entity task as a generic case of corpus annotation. Of course, there are many different ways in which corpora are being annotated for many different tasks. Some of the specific extensions to the user interface that we have already begun building include part-of-speech tagging (and &amp;quot;dense&amp;quot; markup more generally), and full parse syntactic tagging (where we believe reliable training data can be obtained much more quickly than heretofore). In these and other instances the tagging process can be accelerated by applying partial knowledge early on, transforming the task once again into that of editing and correcting. Most of these tagging tasks would be improved by making use of methods that preferentially select ambiguous data for manual annotation--for example, as described in \[4\].</Paragraph>
    <Paragraph position="9"> There are a number of psychological and human factors issues that arise when one considers how the pre-annotated data in a mixed-initiative system may affect  the human editing or post-processing. If the pre-tagging process has a relatively high recall, then we hypothesize that the human will tend increasingly to trust the pre-annotations, and thereby forget to read the texts carefully to discover any phrases that escaped being annotated. A similar effect seems possible for relatively high precision systems, though proper interface design (to highlight the type assigned to a particular phrase) should be able to mitigate these tendencies. A more subtle interaction is &amp;quot;theory creep,&amp;quot; where the heuristics induced by the machine learning component begin to be adopted by the human annotator, due, in many cases, to the intrinsic ambiguity of defining annotation tasks in the first place. In all of these cases the most reliable method for detecting these human/machine interactions is probably to use some representative sub-population of the corpus documents to measure and analyze the inter-annotator agreement between human annotators who have and who have not been exposed to the machine derived heuristics for assigning annotations.</Paragraph>
  </Section>
class="xml-element"></Paper>