File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/96/p96-1052_intro.xml

Size: 2,352 bytes

Last Modified: 2025-10-06 14:06:10

<?xml version="1.0" standalone="yes"?>
<Paper uid="P96-1052">
  <Title>References</Title>
  <Section position="4" start_page="0" end_page="363" type="intro">
    <SectionTitle>
2 Corpus-based Approach
</SectionTitle>
    <Paragraph position="0"> The best data source for observation of grammatical punctuation usage is a large, parsed corpus. It ensures a wide range of real language is covered, and because of its size it should minimise the effect of any errors or idiosyncrasies on the part of editors, parsers and transcribers. Since these corpora are almost all hand-produced, some errors and idiosyncrasies are inevitable; one important part of the analysis is therefore to identify possible instances of these and, if they are clear, to remove them from the results.</Paragraph>
    <Paragraph position="1"> The corpus chosen was the Dow Jones section of the Penn Treebank (size: 1.95 million words). The bracketings were analysed so that each node with a punctuation mark as its immediate daughter is reported, with its other daughters abbreviated to their categories, as in (1)-(3).</Paragraph>
    <Paragraph position="2">
(1) [NP [NP the following] : ] ==&gt; [NP = NP :]
(2) [S [PP In Edinburgh] , [S ...] ] ==&gt; [S = PP , S]
(3) [NP [NP Bob] , [NP ...] , ] ==&gt; [NP = NP , NP ,]
In this fashion each sentence was broken down into a set of such category-patterns, giving a set of distinct category-patterns for each punctuation symbol. These were then processed to extract the underlying rule-patterns, which represent all the ways that punctuation behaves in this corpus and are good indicators of how the punctuation marks might behave in the rest of language.</Paragraph>
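The extraction step described above can be sketched roughly as follows. This is a minimal illustration in Python, not the procedure actually used on the Treebank: it assumes a simplified square-bracket notation like that of the examples, assumes punctuation marks are whitespace-separated tokens, and simply ignores bare words that sit directly under a node. The names PUNCT, tokenize, parse, category_patterns and extract are hypothetical helpers introduced here for illustration.

```python
# Assumed inventory of point punctuation tokens (an illustration, not the paper's list).
PUNCT = {",", ":", ";", ".", "-", "--"}


def tokenize(text):
    # Pad brackets with spaces so split() separates them from labels and words.
    return text.replace("[", " [ ").replace("]", " ] ").split()


def parse(tokens, pos=0):
    """Parse one bracketed node starting at tokens[pos], which must be "[".

    Returns ((category, children), next_pos); children are nested nodes
    (tuples) or plain string tokens (words and punctuation marks).
    """
    assert tokens[pos] == "["
    category = tokens[pos + 1]
    pos += 2
    children = []
    while tokens[pos] != "]":
        if tokens[pos] == "[":
            child, pos = parse(tokens, pos)
            children.append(child)
        else:
            children.append(tokens[pos])
            pos += 1
    return (category, children), pos + 1


def category_patterns(node):
    """Yield a pattern for every node with a punctuation mark as an immediate daughter."""
    category, children = node
    if any(isinstance(c, str) and c in PUNCT for c in children):
        parts = []
        for c in children:
            if isinstance(c, tuple):
                parts.append(c[0])   # abbreviate a subtree to its category
            elif c in PUNCT:
                parts.append(c)      # keep the punctuation mark itself
            # bare words directly under the node are ignored (assumption)
        yield "[" + category + " = " + " ".join(parts) + "]"
    for c in children:
        if isinstance(c, tuple):
            yield from category_patterns(c)


def extract(text):
    tree, _ = parse(tokenize(text))
    return list(category_patterns(tree))


print(extract("[NP [NP the following] : ]"))
print(extract("[S [PP In Edinburgh] , [S ...] ]"))
print(extract("[NP [NP Bob] , [NP ...] , ]"))
```

Applied to the bracketings in (1)-(3), the sketch yields one category-pattern per example; collecting the patterns over a whole corpus and grouping them by punctuation symbol would give the per-symbol sets described above.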
    <Paragraph position="3"> There were 12,700 unique category-patterns extracted from the corpus for the five most common marks of point punctuation, ranging from 9,320 for the comma to 425 for the dash. These were then reduced to just 137 underlying rule-patterns for the colon, semicolon, dash, comma and full stop.</Paragraph>
    <Paragraph position="4"> Even some of these underlying rule-patterns, however, were questionable, since their incidence is very low (perhaps only once in the whole corpus) or their form is so linguistically strange as to call into doubt their correctness (possibly idiosyncratic misparses), as in (4).</Paragraph>
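The low-incidence criterion can be illustrated with a small frequency filter. This is only a sketch: the function name flag_rare, the threshold max_count and the toy list of observations are assumptions made for illustration, and the counts are invented rather than taken from the corpus. Patterns that are linguistically strange but frequent would still need manual inspection.

```python
from collections import Counter


def flag_rare(patterns, max_count=1):
    """Return the patterns whose frequency does not exceed max_count, sorted."""
    counts = Counter(patterns)
    return sorted(p for p, n in counts.items() if max_count >= n)


# Toy observations with made-up frequencies, purely for illustration.
observed = ["[NP = NP :]"] * 3 + ["[S = PP , S]"] * 2 + ["[NP = NP , NP ,]"]
print(flag_rare(observed))
```

The flagged patterns are candidates for review, not automatic removals; the decision to drop a pattern remains a judgement about whether it reflects a misparse.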
  </Section>
</Paper>