<?xml version="1.0" standalone="yes"?>
<Paper uid="J93-2004">
  <Title>Building a Large Annotated Corpus of English: The Penn Treebank</Title>
  <Section position="4" start_page="319" end_page="328" type="concl">
    <SectionTitle>
4. Bracketing
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="319" end_page="319" type="sub_section">
      <SectionTitle>
4.1 Basic Methodology
</SectionTitle>
      <Paragraph position="0"> The methodology for bracketing the corpus is completely parallel to that for tagging-hand correction of the output of an errorful automatic process. Fidditch, a deterministic parser developed by Donald Hindle first at the University of Pennsylvania and subsequently at AT&amp;T Bell Labs (Hindle 1983, 1989), is used to provide an initial parse of the material. Annotators then hand correct the parser's output using a mouse-based interface implemented in GNU Emacs Lisp. Fidditch has three properties that make it ideally suited to serve as a preprocessor to hand correction: * Fidditch always provides exactly one analysis for any given sentence, so that annotators need not search through multiple analyses.</Paragraph>
      <Paragraph position="1"> * Fidditch never attaches any constituent whose role in the larger structure it cannot determine with certainty. In cases of uncertainty, Fidditch chunks the input into a string of trees, providing only a partial structure for each sentence.</Paragraph>
      <Paragraph position="2"> * Fidditch has rather good grammatical coverage, so that the grammatical chunks that it does build are usually quite accurate.</Paragraph>
      <Paragraph position="3"> Because of these properties, annotators do not need to rebracket much of the parser's output--a relatively time-consuming task. Rather, the annotators' main task is to &amp;quot;glue&amp;quot; together the syntactic chunks produced by the parser. Using a mouse-based interface, annotators move each unattached chunk of structure under the node to which it should be attached. Notational devices allow annotators to indicate uncertainty concerning constituent labels, and to indicate multiple attachment sites for ambiguous modifiers. The bracketing process is described in more detail in Section 4.3.</Paragraph>
    </Section>
    <Section position="2" start_page="319" end_page="320" type="sub_section">
      <SectionTitle>
4.2 The Syntactic Tagset
</SectionTitle>
      <Paragraph position="0"> Table 3 shows the set of syntactic tags and null elements that we use in our skeletal bracketing. More detailed information on the syntactic tagset and guidelines concerning its use are to be found in Santorini and Marcinkiewicz (1991).</Paragraph>
      <Paragraph position="1"> Although different in detail, our tagset is similar in delicacy to that used by the Lancaster Treebank Project, except that we allow null elements in the syntactic annotation. Because of the need to achieve a fairly high output per hour, it was decided not to require annotators to create distinctions beyond those provided by the parser.</Paragraph>
      <Paragraph position="2"> Our approach to developing the syntactic tagset was highly pragmatic and strongly influenced by the need to create a large body of annotated material given limited human resources. Despite the skeletal nature of the bracketing, however, it is possible to make quite delicate distinctions when using the corpus by searching for combinations of structures. For example, an SBAR containing the word to immediately before the VP will necessarily be infinitival, while an SBAR containing a verb or auxiliary with a 10 We would like to emphasize that the percentage given for the modified output of PARTS does not represent an error rate for PARTS. It reflects not only true mistakes in PARTS performance, but also the many and important differences in the usage of Penn Treebank POS tags and the usage of tags in the original Brown Corpus material on which PARTS was trained.</Paragraph>
      <Paragraph position="3">  1. ADJP 2. ADVP 3. NP 4. PP 5. S 6. SBAR 7. SBARQ 8. SINV 9. SQ 10. VP 11. WHADVP 12. WHNP 13. WHPP 14. X Null elements 2. 0 3. T 4. NIL  tense feature will necessarily be tensed. To take another example, so-called that-clauses can be identified easily by searching for SBARs containing the word that or the null element 0 in initial position.</Paragraph>
      <Paragraph position="4"> As can be seen from Table 3, the syntactic tagset used by the Penn Treebank includes a variety of null elements, a subset of the null elements introduced by Fidditch. While it would be expensive to insert null elements entirely by hand, it has not proved overly onerous to maintain and correct those that are automatically provided. We have chosen to retain these null elements because we believe that they can be exploited in many cases to establish a sentence's predicate-argument structure; at least one recipient of the parsed corpus has used it to bootstrap the development of lexicons for particular NLP projects and has found the presence of null elements to be a considerable aid in determining verb transitivity (Robert Ingria, personal communication). While these null elements correspond more directly to entities in some grammatical theories than in others, it is not our intention to lean toward one or another theoretical view in producing our corpus. Rather, since the representational framework for grammatical structure in the Treebank is a relatively impoverished flat context-free notation, the easiest mechanism to include information about predicate-argument structure, although indirectly, is by allowing the parse tree to contain explicit null items.</Paragraph>
    </Section>
    <Section position="3" start_page="320" end_page="325" type="sub_section">
      <SectionTitle>
4.3 Sample Bracketing Output
</SectionTitle>
      <Paragraph position="0"> Below, we illustrate the bracketing process for the first sentence of our sample text.</Paragraph>
      <Paragraph position="1"> Figure 3 shows the output of Fidditch (modified slightly to include our POS tags).</Paragraph>
      <Paragraph position="2"> As Figure 3 shows, Fidditch leaves very many constituents unattached, labeling them as &amp;quot;?&amp;quot;, and its output is perhaps better thought of as a string of tree fragments than as a single tree structure. Fidditch only builds structure when this is possible for a purely syntactic parser without access to semantic or pragmatic information, and it</Paragraph>
      <Paragraph position="4"> Sample bracketed text--full structure provided by Fidditch.</Paragraph>
      <Paragraph position="5"> always errs on the side of caution. Since determining the correct attachment point of prepositional phrases, relative clauses, and adverbial modifiers almost always requires extrasyntactic information, Fidditch pursues the very conservative strategy of always leaving such constituents unattached, even if only one attachment point is syntactically possible. However, Fidditch does indicate its best guess concerning a fragment's attachment site by the fragment's depth of embedding. Moreover, it attaches prepositional phrases beginning with of if the preposition immediately follows a noun; thus, tale of... and boatload of... are parsed as single constituents, while first of... is not.</Paragraph>
      <Paragraph position="6"> Since Fidditch lacks a large verb lexicon, it cannot decide whether some constituents serve as adjuncts or arguments and hence leaves subordinate clauses such as infini- null Mitchell P. Marcus et al. Building a Large Annotated Corpus of English tives as separate fragments. Note further that Fidditch creates adjective phrases only when it determines that more than one lexical item belongs in the ADJP. Finally, as is well known, the scope of conjunctions and other coordinate structures can only be determined given the richest forms of contextual information; here again, Fidditch simply turns out a string of tree fragments around any conjunction. Because all decisions within Fidditch are made locally, all commas (which often signal conjunction) must disrupt the input into separate chunks.</Paragraph>
      <Paragraph position="7"> The original design of the Treebank called for a level of syntactic analysis comparable to the skeletal analysis used by the Lancaster Treebank, but a limited experiment was performed early in the project to investigate the feasibility of providing greater levels of structural detail. While the results were somewhat unclear, there was evidence that annotators could maintain a much faster rate of hand correction if the parser output was simplified in various ways, reducing the visual complexity of the tree representations and eliminating a range of minor decisions. The key results of this experiment were as follows: * Annotators take substantially longer to learn the bracketing task than the POS tagging task, with substantial increases in speed occurring even after two months of training.</Paragraph>
      <Paragraph position="8"> * Annotators can correct the full structure provided by Fidditch at an average speed of approximately 375 words per hour after three weeks and 475 words per hour after six weeks.</Paragraph>
      <Paragraph position="9"> * Reducing the output from the full structure shown in Figure 3 to a more skeletal representation similar to that used by the Lancaster UCREL Treebank Project increases annotator productivity by approximately 100-200 words per hour.</Paragraph>
      <Paragraph position="10"> * It proved to be very difficult for annotators to distinguish between a verb's arguments and adjuncts in all cases. Allowing annotators to ignore this distinction when it is unclear (attaching constituents high) increases productivity by approximately 150-200 words per hour.</Paragraph>
      <Paragraph position="11"> Informal examination of later annotation showed that forced distinctions cannot be made consistently.</Paragraph>
      <Paragraph position="12"> As a result of this experiment, the originally proposed skeletal representation was adopted, without a forced distinction between arguments and adjuncts. Even after extended training, performance varies markedly by annotator, with speeds on the task of correcting skeletal structure without requiring a distinction between arguments and adjuncts ranging from approximately 750 words per hour to well over 1,000 words per hour after three or four months' experience. The fastest annotators work in bursts of well over 1,500 words per hour alternating with brief rests. At an average rate of 750 words per hour, a team of five part-time annotators annotating three hours a day should maintain an output of about 2.5 million words a year of &amp;quot;treebanked&amp;quot; sentences, with each sentence corrected once.</Paragraph>
      <Paragraph position="13"> It is worth noting that experienced annotators can proofread previously corrected material at very high speeds. A parsed subcorpus of over I million words was recently proofread at an average speed of approximately 4,000 words per annotator per hour. At this rate of productivity, annotators are able to find and correct gross errors in parsing, but do not have time to check, for example, whether they agree with all prepositional phrase attachments.</Paragraph>
      <Paragraph position="14">  Sample bracketed text--after simplification, before correction.</Paragraph>
      <Paragraph position="15"> The process that creates the skeletal representations to be corrected by the annotators simplifies and flattens the structures shown in Figure 3 by removing POS tags, nonbranching lexical nodes, and certain phrasal nodes, notably NBAR. The output of the first automated stage of the bracketing task is shown in Figure 4.</Paragraph>
      <Paragraph position="16"> Annotators correct this simplified structure using a mouse-based interface. Their primary job is to &amp;quot;glue&amp;quot; fragments together, but they must also correct incorrect parses and delete some structure. Single mouse clicks perform the following tasks, among  constituent tag and then sweeping out the intended constituent with the mouse. The tag is checked to assure that it is a legal label.</Paragraph>
      <Paragraph position="17"> * Change the label of a constituent. The new tag is checked to assure that it is legal.</Paragraph>
      <Paragraph position="18"> The bracketed text after correction is shown in Figure 5. The fragments are now connected together into one rooted tree structure. The result is a skeletal analysis in that much syntactic detail is left unannotated. Most prominently, all internal structure of the NP up through the head and including any single-word post-head modifiers is left unannotated.</Paragraph>
      <Paragraph position="19"> As noted above in connection with POS tagging, a major goal of the Treebank project is to allow annotators only to indicate structure of which they were certain. The Treebank provides two notational devices to ensure this goal: the X constituent label and so-called &amp;quot;pseudo-attachment.&amp;quot; The X constituent label is used if an annotator is sure that a sequence of words is a major constituent but is unsure of its syntactic category; in such cases, the annotator simply brackets the sequence and labels it X. The second notational device, pseudo-attachment, has two primary uses. On the one hand,  Computational Linguistics Volume 19, Number 2 it is used to annotate what Kay has called permanent predictable ambiguities, allowing an annotator to indicate that a structure is globally ambiguous even given the surrounding context (annotators always assign structure to a sentence on the basis of its context). An example of this use of pseudo-attachment is shown in Figure 5, where the participial phrase blown ashore 375 years ago modifies either warriors or boatload, but there is no way of settling the question--both attachments mean exactly the same thing. In the case at hand, the pseudo-attachment notation indicates that the annotator of the sentence thought that VP-1 is most likely a modifier of warriors, but that it is also possible that it is a modifier of boatload. 11 A second use of pseudo-attachment is to allow annotators to represent the &amp;quot;underlying&amp;quot; position of extraposed elements; in addition to being attached in its superficial position in the tree, the extraposed constituent is pseudoattached within the constituent to which it is semantically related. Note that except for the device of pseudo-attachment, the skeletal analysis of the Treebank is entirely restricted to simple context-free trees.</Paragraph>
      <Paragraph position="20"> The reader may have noticed that the ADJP brackets in Figure 4 have vanished in Figure 5. For the sake of the overall efficiency of the annotation task, we leave all ADJP brackets in the simplified structure, with the annotators expected to remove many of them during annotation. The reason for this is somewhat complex, but provides a good example of the considerations that come into play in designing the details of annotation methods. The first relevant fact is that Fidditch only outputs ADJP brackets within NPs for adjective phrases containing more than one lexical item. To be consistent, the final structure must contain ADJP nodes for all adjective phrases within NPs or for none; we have chosen to delete all such nodes within NPs under normal circumstances. (This does not affect the use of the ADJP tag for predicative adjective phrases outside of NPs.) In a seemingly unrelated guideline, all coordinate structures are annotated in the Treebank; such coordinate structures are represented by Chomsky-adjunction when the two conjoined constituents bear the same label.</Paragraph>
      <Paragraph position="21"> This means that if an NP contains coordinated adjective phrases, then an ADJP tag will be used to tag that coordination, even though simple ADJPs within NPs will not bear an APJP tag. Experience has shown that annotators can delete pairs of brackets extremely quickly using the mouse-based tools, whereas creating brackets is a much slower operation. Because the coordination of adjectives is quite common, it is more efficient to leave in ADJP labels, and delete them if they are not part of a coordinate structure, than to reintroduce them if necessary.</Paragraph>
    </Section>
    <Section position="4" start_page="325" end_page="327" type="sub_section">
      <SectionTitle>
5. Progress to Date
5.1 Composition and Size of Corpus
</SectionTitle>
      <Paragraph position="0"> Table 4 shows the output of the Penn Treebank project at the end of its first phase. All the materials listed in Table 4 are available on CD-ROM to members of the Linguistic Data Consortium. 12 About 3 million words of POS-tagged material and a small sampiing of skeletally parsed text are available as part of the first Association for Computational Linguistics/Data Collection Initiative CD-ROM, and a somewhat larger subset of materials is available on cartridge tape directly from the Penn Treebank Project. For information, contact the first author of this paper or send e-mail to treebank@unagi.cis.upenn.edu. null  book chapters, from a variety of American authors including Mark Twain, Henry Adams, Willa Cather, Herman Melville, W. E. B. Dubois, and Ralph Waldo Emerson.</Paragraph>
      <Paragraph position="1"> * The MUC-3 texts are all news stories from the Federal News Service about terrorist activities in South America. Some of these texts are translations of Spanish news stories or transcripts of radio broadcasts.</Paragraph>
      <Paragraph position="2"> They are taken from training materials for the Third Message  collected as training materials for the DARPA Air Travel Information System project.</Paragraph>
      <Paragraph position="3"> The entire corpus has been tagged for POS information, at an estimated error rate  Computational Linguistics Volume 19, Number 2 of approximately 3%. The POS-tagged version of the Library of America texts and the Department of Agriculture bulletins have been corrected twice (each by a different annotator), &amp;quot;and the corrected files were then carefully adjudicated; we estimate the error rate of the adjudicated version at well under 1%. Using a version of PARTS retrained on the entire preliminary corpus and adjudicating between the output of the retrained version and the preliminary version of the corpus, we plan to reduce the error rate of the final version of the corpus to approximately 1%. All the skeletally parsed materials have been corrected once, except for the Brown materials, which have been quickly proofread an additional time for gross parsing errors.</Paragraph>
    </Section>
    <Section position="5" start_page="327" end_page="328" type="sub_section">
      <SectionTitle>
5.2 Future Directions
</SectionTitle>
      <Paragraph position="0"> A large number of research efforts, both at the University of Pennsylvania and elsewhere, have relied on the output of the Penn Treebank Project to date. A few examples already in print: a number of projects investigating stochastic parsing have used either the POS-tagged materials (Magerman and Marcus 1990; Brill et al. 1990; Brill 1991) or the skeletally parsed corpus (Weischedel et al. 1991; Pereira and Schabes 1992). The POS-tagged corpus has also been used to train a number of different POS taggers including Meteer, Schwartz, and Weischedel (1991), and the skeletally parsed corpus has been used in connection with the development of new methods to exploit intonational cues in disambiguating the parsing of spoken sentences (Veilleux and Ostendorf 1992).</Paragraph>
      <Paragraph position="1"> The Penn Treebank has been used to bootstrap the development of lexicons for particular applications (Robert Ingria, personal communication) and is being used as a source of examples for linguistic theory and psychological modelling (e.g. Niv 1991). To aid in the search for specific examples of grammatical phenomena using the Treebank, Richard Pito has developed tgrep, a tool for very fast context-free pattern matching against the skeletally parsed corpus, which is available through the Linguistic Data Consortium.</Paragraph>
      <Paragraph position="2"> While the Treebank is being widely used, the annotation scheme employed has a variety of limitations. Many otherwise clear argument/adjunct relations in the corpus are not indicated because of the current Treebank's essentially context-free representation. For example, there is at present no satisfactory representation for sentences in which complement noun phrases or clauses occur after a sentential level adverb. Either the adverb is trapped within the VP, so that the complement can occur within the VP where it belongs, or else the adverb is attached to the S, closing off the VP and forcing the complement to attach to the S. This &amp;quot;trapping&amp;quot; problem serves as a limitation for groups that currently use Treebank material semiautomatically to derive lexicons for particular applications. For most of these problems, however, solutions are possible on the basis of mechanisms already used by the Treebank Project. For example, the pseudo-attachment notation can be extended to indicate a variety of crossing dependencies. We have recently begun to use this mechanism to represent various kinds of dislocations, and the Treebank annotators themselves have developed a detailed proposal to extend pseudo-attachment to a wide range of similar phenomena.</Paragraph>
      <Paragraph position="3"> A variety of inconsistencies in the annotation scheme used within the Treebank have also become apparent with time. The annotation schemes for some syntactic categories should be unified to allow a consistent approach to determining predicate-argument structure. To take a very simple example, sentential adverbs attach under VP when they occur between auxiliaries and predicative ADJPs, but attach under S when they occur between auxiliaries and VPs. These structures need to be regularized.</Paragraph>
      <Paragraph position="4"> As the current Treebank has been exploited by a variety of users, a significant number have expressed a need for forms of annotation richer than provided by the project's first phase. Some users would like a less skeletal form of annotation of surface  Mitchell P. Marcus et al. Building a Large Annotated Corpus of English grammatical structure, expanding the essentially context-free analysis of the current Penn Treebank to indicate a wide variety of noncontiguous structures and dependencies. A wide range of Treebank users now strongly desire a level of annotation that makes explicit some form of predicate-argument structure. The desired level of representation would make explicit the logical subject and logical object of the verb, and would indicate, at least in clear cases, which subconstituents serve as arguments of the underlying predicates and which serve as modifiers.</Paragraph>
      <Paragraph position="5"> During the next phase of the Treebank project, we expect to provide both a richer analysis of the existing corpus and a parallel corpus of predicate-argument structures.</Paragraph>
      <Paragraph position="6"> This will be done by first enriching the annotation of the current corpus, and then automatically extracting predicate-argument structure, at the level of distinguishing logical subjects and objects, and distinguishing arguments from adjuncts for clear cases. Enrichment will be achieved by automatically transforming the current Penn Treebank into a level of structure close to the intended target, and then completing the conversion by hand.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>