File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/04/w04-0213_intro.xml
Size: 2,669 bytes
Last Modified: 2025-10-06 14:02:29
<?xml version="1.0" standalone="yes"?> <Paper uid="W04-0213"> <Title>The Potsdam Commentary Corpus</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> A corpus of German newspaper commentaries has been assembled at Potsdam University and annotated with different kinds of linguistic information, to different degrees. Two aspects of the corpus have been presented in previous papers: (Reitter, Stede 2003) on underspecified rhetorical structure, and (Stede 2003) on the perspective of knowledge-based summarization. This paper, in contrast, provides a comprehensive overview of the data collection effort and its current state.</Paragraph> <Paragraph position="1"> At present, the 'Potsdam Commentary Corpus' (henceforth 'PCC') consists of 170 commentaries from Märkische Allgemeine Zeitung, a German regional daily. The commentary genre was chosen because the key motivations for building the corpus are the investigation of rhetorical structure, its interaction with other aspects of discourse structure, and the prospects for its automatic derivation. Commentaries argue in favor of a specific point of view toward some political issue, often discussing yet dismissing other points of view; therefore, they typically offer a more interesting rhetorical structure than, say, narrative text or other portions of newspapers.</Paragraph> <Paragraph position="2"> The particular newspaper was chosen because the language used in a regional daily is somewhat simpler than that of papers read nationwide. (Again, the goal of automatic analysis was responsible for this decision.) This is manifest in the lexical choices but also in structural features. 
As an indication, in our core corpus we found an average sentence length of 15.8 words and 1.8 verbs per sentence, whereas a random sample of ten commentaries from the national papers Süddeutsche Zeitung and Frankfurter Allgemeine has an average of 19.6 words and 2.1 verbs per sentence. The commentaries in PCC are all of roughly the same length, ranging from 8 to 10 sentences. For illustration, an English translation of one of the commentaries is given in Figure 1.</Paragraph> <Paragraph position="3"> The paper is organized as follows: Section 2 explains the different layers of annotation that have been produced or are being produced. Section 3 discusses the applications that have been completed with PCC, are under way, or are planned for the future. Section 4 draws some conclusions from the present state of the effort.</Paragraph> </Section> </Paper>