File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/06/p06-3008_intro.xml

Size: 6,953 bytes

Last Modified: 2025-10-06 14:03:46

<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-3008">
  <Title>Discursive Usage of Six Chinese Punctuation Marks</Title>
  <Section position="4" start_page="0" end_page="44" type="intro">
    <SectionTitle>
2 Overview of Chinese RST treebank
</SectionTitle>
    <Paragraph position="0"> under construction</Paragraph>
    <Section position="1" start_page="0" end_page="43" type="sub_section">
      <SectionTitle>
2.1 Corpus data
</SectionTitle>
      <Paragraph position="0"> For the purpose of language engineering and linguistic investigation, we are constructing a Chinese corpus comparable to the English WSJ-RST treebank and the German Potsdam Commentary Corpus (Carlson et al. 2003; Stede 2004). Texts in our corpus were downloaded from the official website of People's Daily Caijinpinglun (CJPL) in Chinese means &quot;financial and business commentary&quot;, and usually covers various topics in social economic life, such as fiscal policies, financial reports, by major media entities were republished. With over 400 authors and editors involved, our texts can be regarded as a good indicator of the general use of Chinese by Mainland native speakers.</Paragraph>
      <Paragraph position="1"> At the moment our CJPL corpus has a total of 395 texts, 785,045 characters, and 84,182 punctuation marks (including pruned spaces).</Paragraph>
      <Paragraph position="2"> Although on average there are 9.3 characters between every two marks, sentences in CJPL are long, with 51.8 characters per common sentence delimiter (Full Stop, Question Mark and Exclamation Mark).</Paragraph>
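The average quoted above can be checked from the reported totals. The following is a minimal sketch, assuming the figure of 9.3 is simply the character count divided by the punctuation-mark count; the constant names are ours, not the authors'.

```python
# Corpus totals as reported for the CJPL corpus.
TEXTS = 395
CHARACTERS = 785_045
PUNCTUATION_MARKS = 84_182

# Assumption: "characters between every two marks" is characters / marks.
chars_per_mark = CHARACTERS / PUNCTUATION_MARKS
# Average text length in characters, derived from the same totals.
chars_per_text = CHARACTERS / TEXTS

print(round(chars_per_mark, 1))  # prints 9.3
print(round(chars_per_text))
```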
    </Section>
    <Section position="2" start_page="43" end_page="44" type="sub_section">
      <SectionTitle>
2.2 Segmentation
</SectionTitle>
      <Paragraph position="0"> We are informed of the German Potsdam Commentary Corpus construction, in which they (Reitter 2003) designed a program for automatic segmentation at clausal level after each</Paragraph>
      <Paragraph position="2"> . Human interference with the segmentation results was not allowed, but annotators could retie over-segmented bits by using the JOINT relation.</Paragraph>
      <Paragraph position="3"> Given the workload of discourse annotation, we decided to design a similar segmentation program. So we first normalized different encoding systems and variants of PMs (e.g.</Paragraph>
      <Paragraph position="4"> Dashes and Ellipses of various lengths), and then conducted a survey on the distribution (Fig. 1) and syntax of major Chinese punctuation marks (e.g. syntax of Chinese Dash in Table 1).</Paragraph>
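The normalization of PM variants described above might look roughly like the sketch below. It assumes that runs of dash-like characters collapse to a two-character Chinese Dash and that runs of dots or ellipsis characters collapse to a two-character Chinese Ellipsis; the function name, the character classes, and the target forms are our assumptions, since the paper does not list the variants it handled.

```python
import re

# Hypothetical canonical forms; the paper does not state which forms were chosen.
CANONICAL_DASH = "——"
CANONICAL_ELLIPSIS = "……"

# Runs of two or more dash-like characters (em dash, horizontal bar, ASCII hyphen).
DASH_RUN = re.compile(r"[—―\-]{2,}")
# Any run of ellipsis characters, or three or more ASCII dots or Chinese full stops.
ELLIPSIS_RUN = re.compile(r"…+|\.{3,}|。{3,}")


def normalize_pms(text):
    """Collapse dash and ellipsis variants of various lengths to one form each."""
    text = DASH_RUN.sub(CANONICAL_DASH, text)
    return ELLIPSIS_RUN.sub(CANONICAL_ELLIPSIS, text)
```

Normalizing first means that later counting and segmentation steps only need to recognize one form per mark.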
      <Paragraph position="5">  C-Comma-1 is the most frequently used PM in the Chinese corpus. While it does delimit clauses, a study on 200 randomly selected C-Comma-1 tokens in our corpus shows that 55 of them are trading, management, economic conferences, transportation, entertainment, education, etc.</Paragraph>
      <Paragraph position="6"> Collected by professional editors, most texts in our corpus are commentaries; some are of marginal genres by the Chinese standards.</Paragraph>
      <Paragraph position="7">  Dash, as a Sign= &quot;$(&quot;, was not selected as a unit delimiter in the Potsdam Commentary Corpus.</Paragraph>
      <Paragraph position="8">  PMs are counted by individual symbols.</Paragraph>
      <Paragraph position="9"> used after an independent NP or discourse marker. This rate, times the total number of C-Comma-1, means we would have to retie a huge number of over-segmented elements. So we decided not to take C-Comma-1 as a delimiter of our Elementary Unit of Discourse Analysis (EUDA) for the present.</Paragraph>
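The over-segmentation argument can be made concrete with a small calculation. This sketch uses the reported sample (55 of 200 C-Comma-1 tokens following an independent NP or discourse marker); the corpus-wide C-Comma-1 total is not given in this section, so it is left as a parameter, and the function names are ours.

```python
SAMPLE_SIZE = 200        # randomly selected C-Comma-1 tokens
NON_CLAUSAL_IN_SAMPLE = 55  # tokens used after an independent NP or discourse marker


def non_clausal_rate():
    """Share of sampled commas that do not close a clause."""
    return NON_CLAUSAL_IN_SAMPLE / SAMPLE_SIZE


def expected_reties(total_commas):
    """Estimated over-segmented units if every C-Comma-1 were a delimiter.

    total_commas is a caller-supplied count; no corpus figure is assumed here.
    """
    return round(total_commas * non_clausal_rate())


print(non_clausal_rate())  # prints 0.275
```

With a rate of 27.5%, roughly one comma in four would produce a fragment that the annotator has to retie.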
      <Paragraph position="10">  of the texts. Other than these, 56.5% of the colons are used between clausal strings; only 0.6% of the colons are used after non-clausal strings.</Paragraph>
      <Paragraph position="11"> 99.6% of the instances of Exclamation Mark, Question Mark, Dash, Ellipsis and Semicolon in the Chinese corpus are used after clausal strings. In our corpus, 4.3% of the left quotation marks do not have a right match to indicate the end of a quote. Because many articles do not give clear indications of direct or indirect quotes, it is very difficult for the annotator to supply the missing marks.</Paragraph>
      <Paragraph position="12"> Parentheses and brackets have a similar problem, with 3.2% of the marks missing their matches.  The symbol &quot;S&quot; denotes sentences with a common end mark, while &quot;s&quot; denotes structures that orthographically end with one of the PMs studied here. &quot;+&quot; means one or more occurrences, &quot;*&quot; means zero or more occurrences. The category after a bracket pair indicates the syntactic role played by the unit enclosed; for example, &quot;[......]NP&quot; means the ellipsis functions as an NP within a clausal structure. &quot;&lt;para&gt;&lt;/para&gt;&quot; denotes paragraph opening and ending.  By &quot;Structural elements&quot; we mean documentary information, such as Publishing Date, Source, Link, Editor, etc. Although these are parts of a news text, they are not the article proper, on which we annotate rhetorical relations.  After a comparative study on the rhetorical structure of news published by some Hong Kong newspapers in both English and Chinese, Scollon and Scollon (1997) observed that &quot;quotation is at best ambiguous in Chinese. No standard practice has been observed across newspapers in this set and even within a newspaper, it is not obvious which portions of the text are attributed to whom.&quot; We notice that Mainland newspapers have a similar phenomenon.</Paragraph>
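Unmatched opening marks of the kind reported above can be counted with a simple pass over the text. The sketch below is only illustrative: the set of paired marks and the function name are our assumptions, and it ignores nesting order, which a real check would probably need to consider.

```python
# Hypothetical inventory of paired marks: left quotation mark, parenthesis, bracket.
PAIRS = {"“": "”", "（": "）", "［": "］"}
OPENER_OF = {close: open_ for open_, close in PAIRS.items()}


def unmatched_openers(text):
    """Return, per opening mark, how many occurrences never get a closing mark."""
    pending = {open_: 0 for open_ in PAIRS}
    for ch in text:
        if ch in PAIRS:
            pending[ch] += 1
        elif ch in OPENER_OF and pending[OPENER_OF[ch]]:
            # A closing mark cancels one pending opener of the same kind.
            pending[OPENER_OF[ch]] -= 1
    return pending
```

Dividing the pending count by the total number of openers of each kind would give a rate comparable to the percentages cited in the text.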
      <Paragraph position="13">  Besides, 53.9% of the marks appear in structural elements that we didn't intend to analyze.</Paragraph>
      <Paragraph position="14"> Finally, we decided to use Period, the End-of-line symbol, and these six marks (Question Mark, Exclamation Mark, Colon, Semicolon, Ellipsis and Dash) as delimiters of our EUDA. Quotation Marks, Parentheses, and Brackets were not selected.</Paragraph>
      <Paragraph position="15"> A special program was designed to conduct the segmentation after each delimiter, with proper adjustment in cases when the delimiter is immediately followed by a right parenthesis, a right quotation mark, or another delimiter.</Paragraph>
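A segmentation pass of this kind can be sketched as follows. We assume that the "proper adjustment" means moving the cut to after any right parenthesis, right quotation mark, or further delimiter that immediately follows; the paper does not spell this out, and the character sets and function name below are our own.

```python
# Period, the six selected marks, and end-of-line, in their full-width Chinese forms.
DELIMITERS = set("。？！：；…—\n")
# Closing marks that should stay attached to the unit they end (assumed set).
CLOSERS = set("）”’")


def segment(text):
    """Cut text after each delimiter, absorbing trailing closers and delimiters."""
    units, start, i = [], 0, 0
    while i != len(text):
        if text[i] in DELIMITERS:
            j = i + 1
            # Adjustment: extend the cut over following closers or delimiters.
            while j != len(text) and (text[j] in DELIMITERS or text[j] in CLOSERS):
                j += 1
            unit = text[start:j].strip()
            if unit:
                units.append(unit)
            start = i = j
        else:
            i += 1
    tail = text[start:].strip()
    if tail:
        units.append(tail)
    return units


print(segment("他说：“好！”我们走。"))
# prints ['他说：', '“好！”', '我们走。']
```

Because two-character marks such as the Ellipsis and the Dash are runs of delimiter characters, the same loop keeps them together without a separate rule.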
      <Paragraph position="16"> A pseudo-relation, SAME-UNIT, has been used during annotation to re-tie any discourse segment cut by the segmentation program into fragments.</Paragraph>
    </Section>
    <Section position="3" start_page="44" end_page="44" type="sub_section">
      <SectionTitle>
2.3 Annotation and Validity Control
</SectionTitle>
      <Paragraph position="0"> We use O'Donnell's RSTTool V3.43 as our annotation software. We started from the Extended-RST relation set embedded in the software, gradually added some new relations, and finally arrived at an inventory of 47 relations. We treat the same rhetorical predicate with switched arguments as different relations, for instance,</Paragraph>
    </Section>
  </Section>
</Paper>