<?xml version="1.0" standalone="yes"?> <Paper uid="W04-1501"> <Title>Dependency and relational structure in treebank annotation</Title> <Section position="3" start_page="0" end_page="3" type="metho"> <SectionTitle> 2 Annotation of the Relational Structure </SectionTitle> <Paragraph position="0"> In practice, all existing treebank schemata implement some form of relational structure (RS). Annotation schemata range from pure (dependency) RS-based approaches to RS-PS combinations (Abeillé, 2003).</Paragraph> <Paragraph position="1"> Some treebanks consider the relational information as the exclusive basis of the annotation. The Prague Dependency Treebank (Hajičová and Ceplová, 2000; Böhmová et al., 2003) implements a three-level annotation scheme in which both the analytical level (surface syntax) and the tectogrammatical level (deep syntax and topic-focus articulation) are dependency-based; the English Dependency Treebank (Rambow et al., 2002) implements a dependency-based mono-stratal analysis which encompasses surface and deep syntax and directly represents the predicate-argument structure. Other projects adopt mixed formalisms where the sentence is split into syntactic subunits (phrases) which are linked by functional or semantic relations, e.g. the Negra Treebank for German (Brants et al., 2003; Skut et al., 1998), the Alpino Treebank for Dutch (van der Beek et al., 2002), and the Lingo Redwood Treebank for English (Oepen et al., 2002). 
Also in the Penn Treebank (Marcus et al., 1993; Marcus et al., 1994) a limited set of relations is placed over the constituency-based annotation in order to make explicit the (morpho-syntactic or semantic) roles that the constituents play.</Paragraph> <Paragraph position="2"> The choice of an RS-based annotation schema can depend on theoretical linguistic motivations (an RS-based schema allows for an explicit, fine-grained representation of several linguistic phenomena), task-dependent motivations (the RS-based schema represents the linguistic information involved in the task(s) at hand), or language-dependent motivations (the relational structure is traditionally considered the most adequate representation of the object language).</Paragraph> <Paragraph position="3"> Theoretical motivations for exploiting representations based on forms of RS were developed in several RS-based theoretical linguistic frameworks (e.g. Lexical Functional Grammar, Relational Grammar and dependency grammar), which allow for capturing information involved at various levels (e.g. syntactic and semantic) in linguistic structures, and grammatical formalisms have been proposed with the aim of capturing the linguistic knowledge represented in these frameworks. Since the most immediate way to build wide-coverage grammars is to extract them directly from linguistic data (i.e.</Paragraph> <Paragraph position="4"> from treebanks), the type of annotation used in the data is a factor of primary importance, i.e.</Paragraph> <Paragraph position="5"> an RS-based annotation allows for the extraction of a more descriptive grammar.</Paragraph> <Paragraph position="7"> See (Mazzei and Lombardo, 2004a) and (Mazzei and Lombardo, 2004b) for experiments on LTAG extraction from TUT.</Paragraph> <Paragraph position="8"> Task-dependent motivations rely on how the annotation of the RS can facilitate some processing aspects of NLP applications. 
The explicit representation of predicative structures allowed by the RS can be a powerful source of disambiguation. In fact, a large amount of ambiguity (such as coordination, Noun-Noun compounds and relative clause attachment) can be resolved using this kind of information, and relations can provide a useful interface between syntax and semantics. Hindle and Rooth (1991) showed the use of dependency in Prepositional Phrase disambiguation, and the experimental results reported in (Hockenmaier, 2003) demonstrate that a language model which encodes a rich notion of predicate argument structure (e.g. including long-range relations arising through coordination) can significantly improve parsing performance. Moreover, the notion of predicate argument structure has been advocated as useful in a number of different large-scale language-processing tasks, and the RS is a convenient intermediate representation in several applications (see (Bosco, 2004) for a survey on this topic). For instance, in Information Extraction relations allow for recognizing the different guises in which an event can appear, regardless of the several different syntactic patterns that can be used to specify it (Palmer et al., 2001). In Question Answering, systems usually use forms of relation-based structured representations of the input texts (i.e. questions and answers) and try to match those representations (see e.g. (Litkowski, 1999), (Buchholz, 2002)).</Paragraph> <Paragraph position="9"> Also the in-depth understanding of the text that is necessary in Machine Translation tasks requires the use of relation-based representations where an accurate predicate argument structure is a critical factor (Han et al., 2000).</Paragraph> <Paragraph position="10"> Language-dependent motivations rely on the fact that dependency-based formalisms have traditionally been considered the most adequate for the representation of free word order languages. 
Various approaches to IE (Collins and Miller, 1997) address this issue by using relational representations, that is, forms of &quot;concept nodes&quot; which specify a trigger word (usually a verb) and also forms of mapping between the syntactic and the semantic relations of the trigger.</Paragraph> <Paragraph position="11"> The system presented in (Han et al., 2000) generates the dependency trees of the source language (Korean) sentences, then directly maps them to the translated (English) sentences.</Paragraph> <Paragraph position="12"> With respect to constituency-based formalisms, free word order languages involve a large number of discontinuous constituents (i.e. constituents whose parts are not contiguous in the linear order of the sentence). In practice, a constituency-based representation has been adopted for languages with rather fixed word order patterns, like English (Penn Treebank), and a dependency representation for languages which allow variable degrees of word order freedom, such as Czech (see Prague Dependency Treebank) or Italian (as we will see later, TUT). Nevertheless, in principle, since the representation of a discontinuous constituent X can be addressed in various ways (e.g. by introducing lexically empty elements co-indexed with the moved parts of X), a certain degree of word order freedom does not necessarily mean that a language has to be annotated according to a relation-based format rather than a constituency-based one. Moreover, free word order languages can present difficulties for dependency-based as well as for constituency-based frameworks (e.g. non-projective structures). 
The development of dependency-based treebanks for English (see English Dependency Treebank), together with the inclusion of relations in constituency-based treebanks (see Penn Treebank), confirms that motivations other than the language-dependent ones strongly prevail.</Paragraph> <Paragraph position="13"> The types of knowledge that many applications actually need are RS-based representations where predicate argument structure and the associated morphological and syntactic information can operate as an interface to a semantic-conceptual representation. All these types of knowledge have in common the fact that they can be described according to the dependency paradigm rather than the constituency paradigm. The many applications (in particular those referring to the Penn Treebank) which use heuristics-based translation schemes from phrase structure to lexical dependencies (&quot;head percolation tables&quot;) (Rambow et al., 2002) show that access to comprehensive and accurate extended dependency-based representations must currently be considered a critical issue for the development of robust and accurate NLP technologies.</Paragraph> <Paragraph position="14"> We now define our proposal for the representation of the RS in treebank annotation.</Paragraph> </Section> </Paper>