<?xml version="1.0" standalone="yes"?>
<Paper uid="W97-0105">
  <Title>Probabilistic Parsing of Unrestricted English Text, With a Highly-Detailed Grammar</Title>
  <Section position="4" start_page="19" end_page="24" type="intro">
    <SectionTitle>
4. EXPERIMENTAL RESULTS
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="19" end_page="21" type="sub_section">
      <SectionTitle>
4.1. Evaluation Methodology
</SectionTitle>
      <Paragraph position="0"> In our view, any effective evv~luation methodology for automatic grammatical analysis must confront head-on the problem of multiple correct ~n~wers in tagging and parsing. That is, it is often the case that there is more than one &amp;quot;correct tag&amp;quot; for a word in context, where that word could be considered to be functioning as: a proper or a common noun; ~n adjective or a noun; a participle or an adjective; a gerundial noun or a noun; e an adverbial particle or a locative adverb; and even an adjective or an adverb. This is true even where there are highly detailed and well-understood eterminology of (Long, 1961): for e.g. a sleeping pill vs. to make a good lim,~  guidelines for the application of each tag to text. And obviously the existence of multiple correct taggings for a word is to be expected a fortiori where a highly ramified system of semantic categories is involved. It fonows that multiple correct parses exist for many sentences, since by det~nition any change in tag means a change in parse. But other sources of multiple correct parses exist as well, and range from, say, several equally good attachment sites within a parse for a given modifier, even given full document context, to cases where the grammar itself provides several equally good parses for a sentence, through the presence of normally independent rules whose function nonetheless overlaps to some degree.</Paragraph>
      <Paragraph position="1"> Barring the recording of the set of correct tags for each word~ and of the set of correct parses for each sentence, in a treebank, the next-best solution to the problem of multiple correct ~n~wers is to at least provide such a recording in one's test set, i.e. to provide a &amp;quot;gold standard&amp;quot; test set with all correct tags and parses for each word in context. This is the solution that was adopted in creating the ATR/Lancaster English Treebank.</Paragraph>
      <Paragraph position="2"> The way we evaluate our tagger is to compare its performance to the set of correct tags for each word of each sentence of our &amp;quot;gold standard&amp;quot; test data. Thus, in all cases we are able to take into account the full set of &amp;quot;correct&amp;quot; answers. ~ Since 32% of running words in our test data have 2 or more correct tags, potential differences in performance evaluation are large vis-a-vis traditional metrics, s Similarly, in the case of the parser, we evaluate performance against a special &amp;quot;gold standard&amp;quot; test set which lists every correct parse with respect to the Grammar for each test sentence. We utilize two measures. First is exact match with any correct parse listed for the sentence. Second is &amp;quot;exact syntactic match': exact match with the bracket locations and rule names only. Notice that in a parse considered correct by our second metric, the syntax 9 of all tags must be correct.</Paragraph>
      <Paragraph position="3"> The average number of different correct &amp;quot;exact syntactic matches ''1deg per sentence in our test data is 3. Among test-data sentences, 72% have more than one correct exact syntactic matches, and 32% have 5.11 For critiques of other approaches to broad--coverage parser and tagger evaluation, see (Black, 1994).</Paragraph>
      <Paragraph position="4"> It is worth inquiring how well expert humans do at the parsing task that we are attempting here by macblne. Accordingly, we present statistics below on the consistency and accuracy of expert h, lm~ at parsing using the ATR English Grammar. The ATR/Lancaster treeb~nk~ng effort features a grammarian, who originated the Grammar, and a treebanking team, who apply the Grammar to treebank text. We can therefore distinguish two different types of evaluation as to how well expert humans do at parsing using the Grammar: consistency and accuracy. Consistency is the degree to which all team members posit the identical parse for the identical sentence in the identical document of test data. Accuracy is the expected rate of agreemnt between a treeb~-lcer and the grammarian on parsing a given sentence in a given document of test data.</Paragraph>
      <Paragraph position="5"> In a first experiment to determine consistency, we asked each of the three te~.m members to</Paragraph>
      <Paragraph position="7"> match one of the human-produced parses. &amp;quot;Cross&amp;quot; indicates percentage of test-data sentences whose top-r~nlced parse contains 0 instances of &amp;quot;crossing brackets&amp;quot; with respect to the most probable treebank parse of the sentence.</Paragraph>
      <Paragraph position="8"> been generated with respect to our Gramrn~r, by trained tmrn~.n.C/, but whose skills at parsing with the Grammar were not as good as those of our three team members. 384 sentences of test data were utilized. The result was a 6.?% expected rate of disagreement among the team members on this task. 12 In a second consistency experiment, we located all sentences occurring twice or more in the Treeb~nk; if there were more than two duplicates, we selected just two at random. We then determined the number of duplicate-sentence pairs that were exact matches in terms of the way they were parsed and tagged. ?6% of these 248 sentence pairs were such exact matches, is Finally, in an experiment to determine accuracy of our team members' parsing using the Grammar, the ATR grammarian scored for parsing and tagging accuracy some 308 sentences of Tr-eebank data from randomly-selected Treebank documents. 14 The result of this scoring was a 8.4% expected parsing error rate. 15</Paragraph>
    </Section>
    <Section position="2" start_page="21" end_page="24" type="sub_section">
      <SectionTitle>
4.2. Experimental Results
</SectionTitle>
      <Paragraph position="0"> As discussed in 3.1, our first step in parsing is to tag each sentence. The tagger currently produces an exact match 74% of the time for the 47,800-word test set, comparing against a single tag sequence for each sentence, le We present parsing results both for text which starts out correctly tagged (Table 1) 17 and for raw text (Table 2). R.esults for parsing from raw text are given for both the exact-match and exact-syntactic-match criteria described in 4.1.</Paragraph>
      <Paragraph position="1"> The performance of the parser on short sentences of correctly tagged data is extrememly good.</Paragraph>
      <Paragraph position="2"> We feel this indicates that the models are performing well in scoring the parses.</Paragraph>
      <Paragraph position="3"> The results deteriorate rapidly for longer sentences, but we believe the problem lies in the search procedure rather than the models. A measure of the performance of a search is whether it ~2In a parallel experiment to determine consistency on tagging, we asked each of the three team members to choose the first correct tag from a raaked list of tags for each word of each sentence of test data. These ranked lists were hand-constructed, and an effort was made to make them as difficult as possible to choose from. About 4,800 words (152 sentences) of test data were utilized. The result was a 3.1% expected rate of disagreement among the team members on the exact choice of tag.</Paragraph>
      <Paragraph position="4"> 1sOl these 248 sentence pairs, 85~ were exact matches in terms of the way they were tagged.</Paragraph>
      <Paragraph position="5"> 14Actually, the documents were selected from our &amp;quot;main General-English Treebank&amp;quot; of 800,000 words. l~i.e..the parse was wrong if even one tag was wrong; or, of couree, ira rule choice was wrong. For the tags assigned r,o the roughly 5000 words in these 308 sentences, expected error rate was 2.9%. Essentially none of these tagging e~rors had to do with the use of the syntactic portion of our tags; all of the errors were semantic; the same was true in the two tagging consistency experiments related above.</Paragraph>
      <Paragraph position="6"> leas noted in 4.1 fn. 8, our experience indicates that we can expect a roughly 10~o improvement in this score when we compare performance against &amp;quot;golden--standard&amp;quot; test data in which all correct answers are indicated; this would bring our tagging accuracy into the 80-percent area.</Paragraph>
      <Paragraph position="7"> lrFor the definition of the term ~crossing brackets&amp;quot; used in Table 1: see (Harrison et al., 1991).</Paragraph>
      <Paragraph position="8">  Parsing from raw text: exact match syntactic exact match i top I top 10 top top 10 i 34.5%:40.1% ......... 50.4% 62.3'~o I 1.2% ~ 3.6% 11.3% 25.6% I percentage of parses which exactly match one of the human-produced parses (&amp;quot;exact match&amp;quot;) or which match bracket locations, role names, and syntactic part-of--speech tags only (&amp;quot;syntactic exact match&amp;quot;).</Paragraph>
      <Paragraph position="9">  suggests any candidates which are as likely as the correct un~wer. If not, the parser has erred by &amp;quot;omrn~sion&amp;quot; rather than by &amp;quot;commission': it has ommitted the correct parse from consideration, but not because it seemed ,mJ~lrely. It is entirely possible that the correct parse is in fact among the highest-scoring parses. These types of search error are non--existent for exhaustive search, but become important for sentences between 11 and 15 words in length, and dominate the results for longer sentences.</Paragraph>
      <Paragraph position="10"> The results in Table 2 reflect tagging accuracy as well as the pefformaace of the parser models per se. Note that tagging accuracy is quoted on a per-word basis, as is customary. From previous work, we estimate the accuracy of the tagger on the syntactic portion of tags to be about 94%. Thus there is typically at least one error in semantic assignment in each sentence, and an error in syntactic assignment in one of every two sentences. It is not surprising, .then, that the per-sentence parsing acclzracy suffers when parses are predicted from raw text.</Paragraph>
      <Paragraph position="11"> Clearly the present research task is quite considerably harder than the parsing and tagging tasks undertaken in (Jelinek et al., 1994; Magerman, 1995; Black et al., 1993b), which would seem to be the closest work to ours, and any comparison between this work and ours must be approached with extreme caution. Table 3 shows the differences between the treebank~ utilized in (Jelinek et al., 1994) on the one hand, and in the work reported here, on the other, is Table 4 shows relevant lSFigures for Average Sentence Length ('l~raLuing Corpus) and Training Set Size, for the IBM ManuaLs Corpus, are approximate, and cz~e fzom (Black et aL, 1993a).</Paragraph>
      <Paragraph position="13"> parsing results by (Jelinek et al., 1994). Even starker contrasts obtain between the present results and those of e.g. (Magerman, 1995; Black et at., 1993b), who do not employ an exact-match evaluation criterion, further obscuring possible performance comparisons. Obviously, no direct comparisons of the results of Tables 1-2 with previous parsing work is possible, as we are the first to parse using the Treebank.</Paragraph>
      <Paragraph position="14"> In our current research, we are emphasizing the creation of decision-tree questions for predicting semantic categories in tagging, as well as continuing to develop questions for syntactic tag prediction, and for our nile-name-prediction model.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>