<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-1217">
  <Title>How Should a Large Corpus Be Built? A Comparative Study of Closure in Annotated Newspaper Corpora from Two Chinese Sources, Towards Building a Larger Representative Corpus Merged from Representative Sublanguage Collections</Title>
  <Section position="4" start_page="116" end_page="119" type="metho">
    <SectionTitle>
3 Lexical and Syntactic Closure
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="116" end_page="116" type="sub_section">
      <SectionTitle>
3.1 Tokenization in Chinese
</SectionTitle>
      <Paragraph position="0"> It should be pointed out that Chinese is an agglutinative, not an inflected, language. Moreover, while Chinese tokens can concatenate, Chinese has no extensive morphology like many Indo-European languages. Chinese, of course, has no white space separating lexemes; as a result, all Chinese text must first be segmented into words. However, once a text has been segmented, no stemming is needed, so each segmented Chinese word can be counted as it occurs without the need to find its lemma.</Paragraph>
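The counting step described above can be sketched in a few lines of Python. This is a minimal illustration, not code from the paper: it assumes segmentation has already been done (word boundaries marked with spaces), and the sample sentences are invented.

```python
from collections import Counter

def count_word_types(segmented_lines):
    """Count each segmented Chinese word as it occurs.
    No stemming or lemma lookup is needed once the text is segmented."""
    counts = Counter()
    for line in segmented_lines:
        counts.update(line.split())
    return counts

# Hypothetical pre-segmented sample; spaces mark word boundaries.
sample = ["我 喜欢 读 报纸", "他 喜欢 看 新闻"]
types = count_word_types(sample)
print(len(types))       # number of distinct word types
print(types["喜欢"])    # each surface form is counted as-is
```

Because each segmented word is its own countable unit, type counts fall out of a single pass over the text.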
    </Section>
    <Section position="2" start_page="116" end_page="116" type="sub_section">
      <SectionTitle>
3.2 Lexical Closure
</SectionTitle>
      <Paragraph position="0"> Lexical closure is that property of a collection of text whereby, given a representative sample, the number of new lexical forms seen among every additional 1,000 new tokens begins to level off at a rate below 10 per cent.</Paragraph>
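The definition above can be made concrete with a short sketch. The function names, the windowed measurement, and the synthetic example are illustrative assumptions, not the paper's actual scripts; only the 1,000-token window and the 10 per cent threshold come from the definition.

```python
def new_type_rates(tokens, window=1000):
    """For each successive window of tokens, compute the share of
    tokens that introduce a previously unseen lexical type."""
    seen = set()
    rates = []
    for start in range(0, len(tokens), window):
        chunk = tokens[start:start + window]
        new = 0
        for t in chunk:
            if t not in seen:
                new += 1
                seen.add(t)
        if chunk:
            rates.append(new / len(chunk))
    return rates

def approaches_closure(rates, threshold=0.10, tail=3):
    """Closure is approached once the new-type rate levels off
    below the threshold (here 10 per cent) over the last windows."""
    recent = rates[-tail:]
    return len(recent) == tail and all(threshold > r for r in recent)
```

Plotting these per-window rates against corpus position gives closure curves of the kind this section relies on.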
    </Section>
    <Section position="3" start_page="116" end_page="119" type="sub_section">
      <SectionTitle>
3.3 Syntactic Closure
</SectionTitle>
      <Paragraph position="0"> Syntactic closure is that property of a collection of text whereby, given a representative sample of a type of text, the number of new syntactic forms seen among every additional 1,000 new tokens begins to level off. A syntactic form is the combination of token plus type. Thus syntactic closure is approached as the number of new grammatical uses for a previously observed token, plus the number of new tokens regardless of syntactic use, levels off to a growth rate below 10 per cent. While it is common practice to attempt to build huge annotated corpora, it is of course very tedious, very expensive, and especially challenging for annotators to maintain consistency over such a huge task. Consequently one must hope that once an annotated corpus of newspaper texts is created, it can be statistically measured and confirmed to be a representative sample.</Paragraph>
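The only change from the lexical measure is the unit of counting: a form is a (token, tag) pair, so a previously seen token in a new grammatical role still counts as new. A minimal sketch, with invented tags and data:

```python
def new_form_rates(tagged_tokens, window=1000):
    """Like the lexical measure, but a 'form' is a (token, POS-tag)
    pair: a known token under a new tag counts as a new form."""
    seen = set()
    rates = []
    for start in range(0, len(tagged_tokens), window):
        chunk = tagged_tokens[start:start + window]
        new = 0
        for form in chunk:
            if form not in seen:
                new += 1
                seen.add(form)
        if chunk:
            rates.append(new / len(chunk))
    return rates

# Invented example: the same token under two tags is two forms.
tagged = [("打", "V")] * 500 + [("打", "P")] * 500
rates = new_form_rates(tagged, window=500)
```

Here the second window still contributes a new form even though the token itself was already seen, which is exactly what distinguishes syntactic from lexical closure.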
      <Paragraph position="1"> I first measured lexical and syntactic closure rates in all ASBC newspaper texts but found that when viewed as a whole this newspaper sub-collection of the ASBC does not approach closure (see graphs below).</Paragraph>
      <Paragraph position="3"> This raises the question: how can we hope for NLP applications to learn on large corpora if the corpora themselves never approach statistical closure, never approach being statistically confirmed as a representative model of the language? I then focused downward on subsections of the newspaper corpus, grouping them by similar filename. I searched the ASBC corpus looking for files of annotated newspaper text and found a total of 57 files (18.7 Mb); my findings are summarized in the following table.</Paragraph>
      <Paragraph position="4"> A: '94, academics. C: '93, various news. SSLA: '91, politics etc.</Paragraph>
      <Paragraph position="5"> T: '91-'95, sports etc.</Paragraph>
      <Paragraph position="6">  The large single file, named &amp;quot;SSLA&amp;quot;, dealt with a wide assortment of subject matter and thus was significantly different from the other three newspaper collections. Not only was its individual file size rather large; it did not even approach the size and homogeneity of the other three newspaper multi-file collections. I rejected it from further study.</Paragraph>
      <Paragraph position="7"> The other sub-collections were more similar. Topically speaking, the ASBC &amp;quot;A&amp;quot; newspaper collection was focused primarily on news (77 per cent) while at the same time focusing narrowly on academic events in 1994. The ASBC &amp;quot;C&amp;quot; newspaper collection was less narrowly focused on news (73.5 per cent) but expanded its focus to other than academia while limiting itself to events of 1993. The ASBC &amp;quot;T&amp;quot; newspaper collection, however, spanned the period 1991 through 1995 and dealt with many different subjects, the most frequent of which were sports, news, and domestic politics, but even each of these most frequent subjects only represented 9 per cent of the whole.</Paragraph>
      <Paragraph position="8"> Let us consider the three ASBC newspaper sub-collections (&amp;quot;A&amp;quot;, &amp;quot;C&amp;quot;, and &amp;quot;T&amp;quot; filenames) to be potentially representative sublanguages. If we can observe relatively high degrees of closure at various levels of description, we can propose that such sub-collections are representative sublanguages within the newspaper genre of Chinese natural language. Conversely, those which do not show a high degree of closure are definitely not sublanguage corpora and are not of further interest for this study. The following graphs depict the observed lexical and syntactic closure rates of the three ASBC newspaper sub-collections under study.</Paragraph>
      <Paragraph position="10"> It appears that the ASBC &amp;quot;A&amp;quot; newspaper sub-collection does approach lexical closure, while the &amp;quot;C&amp;quot; and &amp;quot;T&amp;quot; newspaper sub-collections definitely do not.</Paragraph>
      <Paragraph position="11"> It appears that the ASBC &amp;quot;A&amp;quot; newspaper sub-collection also approaches syntactic closure; while the &amp;quot;C&amp;quot; and &amp;quot;T&amp;quot; newspaper sub-collections do not.</Paragraph>
      <Paragraph position="12">  I next applied the same measures to the UPenn Chinese Treebank corpus. I wanted to compare the rate at which the UPenn collection approaches lexical and syntactic closure with that of the ASBC &amp;quot;A&amp;quot; and &amp;quot;T&amp;quot; sub-collections. The 329 Xinhua newswire documents in the UPenn Chinese Treebank annotated corpus came from two sub-collections and total 3,289 sentences, averaging 27 words or 47 characters per sentence, excluding newspaper headlines, which are characteristically highly abbreviated clauses.</Paragraph>
      <Paragraph position="13"> [Figure: closure curves for the UPenn Chinese Treebank corpus]</Paragraph>
      <Paragraph position="15"/>
    </Section>
  </Section>
</Paper>