<?xml version="1.0" standalone="yes"?>
<Paper uid="W98-1101">
  <Title>Bayesian Stratified Sampling to Assess Corpus Utility</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
2 The Query
</SectionTitle>
    <Paragraph position="0"> The query we addressed in this paper grew out of an attempt to establish basic statistics for Federal Register documents. When counting documents and determining their length, we noticed that some purported documents (as judged by &lt;DOC&gt; &lt;/DOC&gt; bracketing) were not what we came to define as real Federal Register documents: documents describing the activities of the federal government. Besides real documents, the electronic Register contained pseudo-documents related to the use and publication of the paper version of the Register, such as tables of contents, indices, blank pages, and title pages.</Paragraph>
    <Paragraph position="1"> This discovery at first appeared to be a mere nuisance. We assumed that there was an easy way to separate pseudo-documents from real documents, but could not find one. The harder we looked for a way to separate the two document types, the more we realized that the distinction was of theoretical interest. Determining the percentage of real documents would serve to evaluate both the true size of the corpus and its usefulness for TIPSTER-type applications, in which documents relevant to topic queries are expected to be returned.</Paragraph>
    <Paragraph position="2"> This query met the two criteria set forth in the Introduction for the applicability of our method. As described above, there was no easy way to separate real documents from pseudo-documents. The query was also subjective, since readers might disagree about the classification of particular documents. For example, a document announcing classes on how to use the Federal Register could be considered either a real document (since notices of all sorts appear in the Register) or a pseudo-document (since it is promulgated by the Register's office and appears at regular intervals). As another example, readers might disagree about which erratum documents are significant enough to be considered real documents themselves.</Paragraph>
  </Section>
</Paper>