<?xml version="1.0" standalone="yes"?>
<Paper uid="H90-1055">
  <Title>Deducing Linguistic Structure from the Statistics of Large Corpora</Title>
  <Section position="5" start_page="278" end_page="280" type="concl">
    <SectionTitle>
4 Penn Treebank
</SectionTitle>
    <Paragraph position="0"> In this section, we report some recent performance measures of the Penn Treebank Project.</Paragraph>
    <Paragraph position="1"> To date, we have tagged over 4 million words by part of speech (cf. Table 1). We are tagging this material with a much simpler tagset than used by previous projects, as discussed at the Oct. 1989 DARPA Workshop. The material is first processed using Ken Church's tagger (Church 1988), which labels it as if it were Brown Corpus material, and then is mapped to our tagset by a SEDscript. Because of fundamental differences in tagging strategy between the Penn Treebank Project and the Brown project, the resulting mapping is about 9% inaccurate, given the tagging guidelines of the Penn Tree-bank project (as given in 40 pages of explicit tagging guidelines). This material is then hand-corrected by our annotators; the result is consistent within annotators to about 3% (cf. Table 3), and correct (again, given our tagging guidelines) to about 2.5% (cf. Table 2), as will be discussed below. We intend to use this material to retrain Church's tagger, which we then believe will be accurate to less than 3% error rate. We will then adjudicate between the output of this new tagger, run on the same corpus, and the previously tagged material. We believe that this will yield well below 1% error, at an additional cost of between 5 and 10 minutes per 1000 words of material. To provide exceptionally accurate bigram frequency evidence for retraining the automatic tagger we are using, two subcorpora (Library of America, DOE abstracts) were tagged twice by different annotators, and the Library of America texts were adjudicated by a third annotator, yielding ~160,000 words tagged with an accuracy estimated to exceed 99.5%.</Paragraph>
    <Paragraph position="2"> Table 2 provides an estimate of error rate for part-of-speech annotation based on the tagging of the sample described above. Error rate is measured in terms of the  number of disagreements with a benchmark version of the sample prepared by Beatrice Santorini. We have also estimated the rate of inter-annotator inconsistency based on the tagging of the sample described above (cf.</Paragraph>
    <Paragraph position="3"> Table 3). Inconsistency is measured in terms of the proportion of disagreements of each of the annotators with each other over the total number of words in the test corpus (5,425 words).</Paragraph>
    <Paragraph position="4"> Table 4 provides an estimate of speed of part-of-speech annotation for a set of ten randomly selected texts from the DoT Jones Corpus (containing a total of 5,425 words), corrected by each of our annotators. The annotators were throughly familiar with the genre, having spent over three months immediately prior to the experiment correcting texts from the same Corpus. Given that the average productivity overall of our project has been between 3,000-3,500 words per hour of time billed by our annotators, it appears that our strategy of hiring annotators for no more than 3 hours a day has proven to be quite successful.</Paragraph>
    <Paragraph position="5"> Finally, the summary statistics in Table 5 provide an estimate of improvement of annotation speed as a function of familiarity with genre. We compared the annotators' speed on two samples of the Brown Corpus (10 texts) and the DoT Jones Corpus (100 texts). We examined the first and last samples of each genre that the  annotators tagged; in each case, more than two months of experience lay between the samples.</Paragraph>
  </Section>
class="xml-element"></Paper>