File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/abstr/00/w00-1217_abstr.xml
Size: 944 bytes
Last Modified: 2025-10-06 13:41:55
<?xml version="1.0" standalone="yes"?> <Paper uid="W00-1217"> <Title>How Should a Large Corpus Be Built? A Comparative Study of Closure in Annotated Newspaper Corpora from Two Chinese Sources, Towards Building A Larger Representative Corpus Merged from Representative Sublanguage Collections</Title> <Section position="1" start_page="0" end_page="0" type="abstr"> <SectionTitle> Abstract </SectionTitle> <Paragraph position="0"> This study measures comparative lexical and syntactic closure rates in annotated Chinese newspaper corpora from the Academia Sinica Balanced Corpus and the University of Pennsylvania's Chinese Treebank. It then draws inferences as to how large such corpora need be to be representative models of subject-matter-constrained language domains within the same genre. Future large corpora should be built incrementally only by combining smaller representative sublanguage collections.</Paragraph> </Section> </Paper>