File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/05/i05-6010_intro.xml
Size: 5,319 bytes
Last Modified: 2025-10-06 14:03:01
<?xml version="1.0" standalone="yes"?> <Paper uid="I05-6010"> <Title>Some remarks on the Annotation of Quantifying Noun Groups in Treebanks</Title> <Section position="2" start_page="0" end_page="81" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> An increasing number of linguists are interested in large syntactically annotated corpora, so-called treebanks. Treebanks consist of language samples annotated with structural information - centrally, the grammatical structure of the samples, though some resources include categories of information other than grammar sensu stricto. Data contained in treebanks are useful for diverse theoretical and practical purposes. Combining raw language data with linguistic information offers a promising basis for the development of new, efficient and robust natural language processing methods. Real-world texts annotated with different strata of linguistic information can serve as training and test material for stochastic approaches to natural language processing. At the same time, treebanks are a valuable source of data for theoretical linguistic investigations of language use. The data-drivenness of this approach presents a clear advantage over the traditional, idealised notion of competence grammar.</Paragraph> <Paragraph position="1"> According to Skut et al. (1997), treebanks have to meet the following requirements: 1. descriptivity: grammatical phenomena are to be described rather than explained; 2. theory-independence: annotations should not be influenced by theory-specific considerations; nevertheless, different theory-specific representations should be recoverable from the annotation; 3. multi-stratal representations: clear separation of different description levels; and 4.
data-drivenness: the annotation scheme must provide representational means for all phenomena occurring in texts.</Paragraph> <Paragraph position="2"> Currently, the most important treebank for English is the Penn Treebank (cf. Marcus et al., 1994). Many statistical taggers and parsers have been trained on it, e.g. Ramshaw and Marcus (1995), Srinivas (1997) and Alshawi and Carter (1994). Furthermore, context-free and unification-based grammars have been derived from the Penn Treebank (cf. Charniak, 1996, and van Genabith et al., 1999). These parsers, trained or created by means of the treebank, can in turn be applied to enlarge the treebank.</Paragraph> <Paragraph position="3"> For German, the first initiative in the field of treebanks was the NEGRA Corpus (cf. Skut et al., 1998), which contains approximately 20,000 sentences of syntactically interpreted newspaper text. Furthermore, there is the Verbmobil Corpus (cf. Wahlster, 2000), which covers the area of spoken language.</Paragraph> <Paragraph position="4"> TIGER (cf. Brants et al., 2002) is the largest and most exhaustively annotated treebank for German. It consists of more than 35,000 syntactically annotated sentences. The annotation format and scheme are based on the NEGRA corpus. The linguistic annotation of each sentence in TIGER is represented on a number of different levels: part-of-speech (POS) information is encoded in terminal nodes; non-terminal nodes are labelled with phrase categories; the edges of a tree represent syntactic functions; and secondary edges, i.e. labelled directed arcs between arbitrary nodes, are used to encode coordination information.</Paragraph> <Paragraph position="5"> Syntactic structures are rather flat and simple in order to reduce the potential for attachment ambiguities.
The distinction between adjuncts and arguments is not expressed in the constituent structure, but is instead encoded by means of syntactic functions.</Paragraph> <Paragraph position="6"> In this article, we address the question of how to analyze and annotate quantifying noun groups in German in a way that is most informative and intuitive for treebank users, i.e., linguists as well as NLP applications. In the following section, we first give an overview of the different kinds of quantitative specification in German. We then describe the general structure of quantifying noun groups and introduce subclasses, concentrating on measure constructions and count constructions. In the third section, we show how these constructions are currently annotated in TIGER and NEGRA and why the existing annotation is not sufficient for linguistically more demanding tasks. We make several refinement proposals for the annotation scheme. A grammar induction experiment shows the benefit that can be gained from the proposed refinements. Section 5 offers some concluding remarks.</Paragraph> <Paragraph position="7"> The information gained by corpus-based research is used in the framework of SILVA, a finite-state-based symbolic system for parsing and for the extraction of deep linguistic information from unrestricted German text. Since this paper presents research that is part of the agenda for developing SILVA, our analyses are geared to this overall goal. And, since our parsed output should be a reasonable basis for linguistically more demanding extraction tasks, a highly informative annotation is one of our main goals.</Paragraph> </Section> </Paper>