<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-2706">
  <Title>Deep Syntactic Annotation: Tectogrammatical Representation and Beyond</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
2 The Extensions
2.1 The Structure: Deep Dependency and Valency
</SectionTitle>
    <Paragraph position="0"> The development of formal theories of grammar has documented that, when going beyond the shallow grammatical structure toward some kind of functional or semantic structure, two notions become of fundamental importance: the notion of the head of the structure and the notion of valency (i.e. of the requirements that heads impose on the structures they head). To adduce just some reflections of this tendency from the American linguistic scene, Fillmore's "case grammar" (with verb frames) and his FrameNet (Fillmore et al., 2003), Bresnan and Kaplan's Lexical Functional Grammar (with the distinction between constituent and functional structure and with an interesting classification of functions) and Starosta's "lexicase grammar" can serve as the earliest examples. To put it in terms of formal syntactic frameworks, the phrase structure models take on at least some traits of the dependency models of language; Robinson has shown that even though Fillmore leaves the issue of formal representation open, the phrase-structure based sentence structure he proposes can be easily and adequately transposed into terms of dependency grammar. The dependency account of sentence structure is deeply rooted in the European linguistic tradition, and it is no wonder then that formal descriptions originating in Europe are dependency-based (see Sgall, Kunze, Hellwig, Hudson, Mel'čuk). We understand it as crucial to use sentence representations "deep" enough to be adequate as an input to a procedure of semantic(-pragmatic) interpretation (i.e. representing function words and endings by indexes of node labels, restoring items which are deleted in the morphemic or phonemic forms of sentences, and distinguishing tens of kinds of syntactic relations), rather than to be satisfied with some kind of "surface" syntax. 
The above-mentioned development of formal frameworks toward an inclusion of valency in some way or another has found its reflection in annotation scenarios that aim at going beyond the shallow structure of sentences. An important support for annotation conceived in this way can be found in schemes that are based on an investigation of the subcategorization of lexical units that function as heads of complex structures: see Fillmore's FrameNet, the PropBank as a further stage of the development of the Penn Treebank (Palmer et al., 2001), and Levin's verb classes (Levin, 1993), on which the LCS Database (Dorr, 2001) is based. There are other systems working with some kind of "deep syntactic" annotation, e.g. the broadly conceived Italian project carried out in Pisa (N. Calzolari, A. Zampolli) or the Taiwanese project MARVS; another related framework is presented by the German project NEGRA, basically surface oriented, whose newly produced subcorpus TIGER contains more information on lexical semantics. Most work that has already been carried out concerns subcategorization frames (valency) of verbs, but this restriction is not necessary: not only verbs but also nouns, adjectives and adverbs may have their "frames" or "grids".</Paragraph>
    <Paragraph position="1"> One of the first complex projects aimed at a deep (underlying) syntactic annotation of a large corpus is the already mentioned Prague Dependency Treebank (Hajič, 1998); it is designed as a complex annotation of Czech texts (taken from the Czech National Corpus); the underlying syntactic dependency relations (called functors) are captured in the tectogrammatical tree structures (TGTS); see (Hajičová, 2000). The set of functors comprises 53 valency types subclassified into (inner) participants (arguments) and (free) modifications (adjuncts). Some of the free modifications are further subcategorized into more subtle classes (constituting mainly the underlying counterparts, or meanings, of prepositions).</Paragraph>
    <Paragraph position="2"> Each verb entry in the lexicon is assigned a valency frame specifying which type of participant or modification can be associated with the given verb; the valency frame also specifies which participant/modification is obligatory and which is optional with the given verb entry (in the underlying representations of sentences), which of them is deletable on the surface, which may or must function as a controller, and so on. Also nouns and adjectives have their valency frames.</Paragraph>
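As a rough illustration of the kind of information a valency frame carries, the following Python sketch records for each slot whether it is obligatory and whether it is surface-deletable. The field names and the frame for arrive (whose Directional is semantically obligatory but often omitted on the surface, see below) are our own assumptions, not the actual PDT-VALLEX format.

```python
# Illustrative sketch (hypothetical field names, NOT the PDT-VALLEX format):
# a valency frame lists, for each slot, whether it is obligatory in the
# underlying representation and whether it is deletable on the surface.

valency_entry = {
    "lemma": "arrive",
    "frame": [
        {"functor": "ACT",  "obligatory": True, "deletable": False},
        # the Directional of 'arrive' is obligatory but surface-deletable
        {"functor": "DIR3", "obligatory": True, "deletable": True},
    ],
}

def obligatory_functors(entry):
    """Return the functors marked obligatory in the given entry."""
    return [slot["functor"] for slot in entry["frame"] if slot["obligatory"]]

print(obligatory_functors(valency_entry))  # ['ACT', 'DIR3']
```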
    <Paragraph position="3"> The shape of TGTSs as well as the repertory and classification of the types of modifications of the verbs is based on the theoretical framework of Functional Generative Description, developed by the Prague research team of theoretical and computational linguistics as an alternative to Chomskyan transformational grammar (Sgall et al., 1986). The first two arguments, though labeled by "semantically descriptive" tags ACT and PAT (Actor and Patient, respectively), correspond to the first and the second argument of a verb (cf.</Paragraph>
    <Paragraph position="4"> Tesnière's (Tesnière, 1959) first and second actant), the other three arguments of the verb being then differentiated (in accordance with semantic considerations) as ADDR(essee), ORIG(in) or EFF(ect); these five functors belong to the set of participants (arguments) and are distinguished from (free) modifications (adjuncts) such as LOC(ative), several types of directional and temporal (e.g. TWHEN) modifications, APP(urtenance), R(e)STR(ictive attribute), DIFF(erence), PREC(eding cotext referred to), etc. on the basis of two basic operational criteria (Panevová, 1974), (Panevová, 1994): (i) can the given type of modification modify in principle every verb? (ii) can the given type of modification occur in the clause more than once? If the answers to (i) and (ii) are yes, the modification is an adjunct; if not, it is an argument.</Paragraph>
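The two operational criteria can be stated as a minimal decision procedure; the Python sketch below is only a schematic restatement of the test (the function name and boolean inputs are our assumptions, with the criterion answers supplied by the linguist).

```python
# Minimal sketch of Panevová's two operational criteria: a modification is
# an adjunct only if it can in principle modify every verb AND can occur
# more than once in a clause; otherwise it is an argument (participant).

def classify_modification(modifies_every_verb, repeatable_in_clause):
    """Return 'adjunct' if both criteria hold, otherwise 'argument'."""
    if modifies_every_verb and repeatable_in_clause:
        return "adjunct"
    return "argument"

# LOC(ative) passes both tests; PAT(ient) fails both.
print(classify_modification(True, True))    # adjunct
print(classify_modification(False, False))  # argument
```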
    <Paragraph position="5"> We assume that the cognitive roles can be determined on the basis of combinations of the functors with the lexical meanings of individual verbs (or other words), e.g.</Paragraph>
    <Paragraph position="6"> the Actor of buy is the buyer, that of sell is the seller, and the Addressee and the Patient of tell are the experiencer and the object of the message, respectively. The valency dictionary created for and used during the annotation of the Prague Dependency Treebank, called PDT-VALLEX, is described in (Hajič et al., 2003). The relation between function and (morphological) form as used in the valency lexicon is described in (Hajič and Urešová, 2003).</Paragraph>
    <Paragraph position="7"> An illustration of this framework is presented in Fig. 1.</Paragraph>
    <Paragraph position="8">  tuzemský výrobce dostal hlavy o čtyři dny později. 'However, the domestic producer got the heads four days later.' Let us adduce two further examples, in which the functors are written in capitals in the otherwise strongly simplified representations, where most of the adjuncts are understood as depending on nouns, whereas the other functors concern the syntactic relations to the verb. Let us note that with the verb arrive the above-mentioned test determines the Directional as a (semantically) obligatory item that can be specified by the hearer according to the given context (basically, as here or there):  (1) Jane changed her house from a shabby cottage into a comfortable home.</Paragraph>
    <Paragraph position="9"> (1') Jane.ACT changed her.APP house.PAT from-a-shabby.RSTR cottage.ORIG into-a-comfortable.RSTR home.EFF.</Paragraph>
    <Paragraph position="10"> (2) Yesterday Jim arrived by car.</Paragraph>
    <Paragraph position="11"> (2') Yesterday.TWHEN Jim.ACT arrived here.DIR3 by-car.MEANS.</Paragraph>
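The simplified representation (2') can be pictured, for illustration only, as a verb with functor-labeled dependents; the structure below is our own assumption, not the actual TGTS data format.

```python
# Hypothetical encoding of example (2'): the verb 'arrive' with its
# functor-labeled dependents; purely illustrative, not the TGTS format.

sentence_2 = {
    "verb": "arrive",
    "dependents": [
        ("yesterday", "TWHEN"),  # temporal adjunct
        ("Jim", "ACT"),          # Actor (first argument)
        ("here", "DIR3"),        # restored obligatory Directional
        ("car", "MEANS"),        # means adjunct
    ],
}

# Collect the functors present in the representation.
functors = [functor for _, functor in sentence_2["dependents"]]
print(functors)  # ['TWHEN', 'ACT', 'DIR3', 'MEANS']
```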
    <Paragraph position="12"> A formulation of an annotation scenario based on well-specified subcategorization criteria helps to compare different schemes and to draw some conclusions from such a comparison. In (Hajičová and Kučerová, 2002) the authors attempt to investigate how different frameworks annotating some kind of deep (underlying) syntactic level (the LCS Database, PropBank and PDT) compare with each other (having in mind also a more practical application, namely a machine translation project whose modules would be "machine-learned" using a procedure based on syntactically annotated parallel corpora). We are convinced that such a pilot study may also contribute to the discussions on the possibility/impossibility of formulating a "theory neutral" syntactic annotation scheme. The idea of a theory-neutral annotation scenario seems to be an unrealistic goal: it is hardly possible to imagine a classification of such a complex subsystem of language as the syntactic relations without a well-motivated theoretical background; moreover, the languages of the annotated texts are of different types, and the theoretical frameworks the authors of the schemes are accustomed to working with differ in the "depth" or abstractness of the classification of the syntactic relations. However, the different annotation schemes seem to be translatable if the distinctions made in them are stated as explicitly as possible, with the use of operational criteria, and supported by larger sentential contexts. The third condition is made realistic by very large text corpora being available electronically; making the first two conditions a realistic goal is fully in the hands of the designers of the schemes.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Topic/Focus Articulation
</SectionTitle>
      <Paragraph position="0"> Another aspect of the sentence structure that has to be taken into account when going beyond the shallow structure of sentences is the communicative function of the sentence, reflected in its information structure. As has been convincingly argued during decades of linguistic discussion (see studies by Rooth, Steedman, and several others, and esp. the argumentation in (Hajičová et al., 1998)), the information structure of the sentence (topic-focus articulation, TFA in the sequel) is semantically relevant and as such belongs to the semantic structure of the sentence. A typical declarative sentence expresses that its focus holds about its topic, and this articulation has its consequences for the truth conditions, especially for the differences between meaning proper, presuppositions and allegations (see (Hajičová, 1993); (Hajičová et al., 1998)).</Paragraph>
      <Paragraph position="1"> TFA is often understood to constitute a level of its own, but this is not necessary, and it would not be simple to determine the relationships between this level and the other layers of language structure. In Functional Generative Description (Sgall et al., 1986), TFA is captured as one of the basic aspects of the underlying structure, namely as the left-to-right dimension of the dependency tree, working with the basic opposition of contextual boundness; the contextually bound (CB) nodes stand to the left of the non-bound (NB) nodes, with the verb as the root of the tree being either contextually bound or non-bound.</Paragraph>
      <Paragraph position="2"> It should be noted that the opposition of NB/CB is the linguistically patterned counterpart of the cognitive (and pre-systemic) opposition of "given" and "new" information. Thus, e.g. in (3) the pronoun him (being NB) in fact constitutes the focus of the sentence.</Paragraph>
      <Paragraph position="3"> (3) (We met a young pair.) My older companion recognized only HIM.</Paragraph>
      <Paragraph position="4"> In the prototypical case, NB items belong to the focus of the sentence, and CB ones constitute its topic; secondary cases concern items which are embedded more deeply than to depend on the main verb of the sentence, cf. the position of older in (3), which may be understood as NB, although it belongs to the topic (being an adjunct of the CB noun companion).</Paragraph>
      <Paragraph position="5"> In the tectogrammatical structures of the PDT annotation scenario, we work with three values of the TFA attribute, namely t (contextually bound node), c (contextually bound contrastive node) and f (contextually non-bound node). 20,000 sentences of the PDT have already been annotated in this way, and the consistency and agreement of the annotators is being evaluated. It seems to be a doable task to annotate and check the whole set of TGTSs (i.e. 55,000 sentences) by the end of 2004. This means that by that time the whole set of 55,000 sentences will be annotated (and checked for consistency) on both aspects of deep syntactic structure. An algorithm the input of which are the TGTSs with their TFA values and the output of which is the division of the whole sentence structure into the (global) topic and the (global) focus is being formulated.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.3 Coreference
</SectionTitle>
      <Paragraph position="0"> The inclusion into the annotation scheme of the two aspects mentioned above in Sect. 2.1 and 2.2, namely the deep syntactic relations and topic-focus articulation, considerably extends the scenario in a desirable way, toward a more complex representation of the meaning of the sentence. The third aspect, the account of coreferential relations, goes beyond linguistic meaning proper toward what can be called the sense of the utterance (Sgall, 1994).</Paragraph>
      <Paragraph position="1"> Two kinds of coreferential relations have to be distinguished: grammatical coreference (i.e. with verbs of control, with reflexive pronouns, with verbal complements and with relative pronouns) and textual (which may cross sentence boundaries), both endophoric and exophoric.</Paragraph>
      <Paragraph position="2"> Several annotation schemes that attempt a representation of coreference relations in continuous texts have been reported at recent conferences (ACL, LREC).</Paragraph>
      <Paragraph position="3"> As an example of an attempt to integrate the treatment of anaphora into a complex deep syntactic scenario, we would like to present here a brief sketch of the scheme realized in the Prague Dependency Treebank. For the time being, we are concerned with coreference relations in their narrower sense, i.e. not covering so-called bridging anaphora (for a possibility to cover the latter phenomenon as well, see (Böhmová, 2004)).</Paragraph>
      <Paragraph position="4"> In the Prague Dependency Treebank, coreference is understood as an asymmetrical binary relation between nodes of a TGTS (not necessarily the same TGTS), or, as the case may be, as a relation between a node and an entity that has no corresponding counterpart in the TGTS(s). The node from which the coreferential link leads is called the anaphor, and the node to which the link leads is called the antecedent.</Paragraph>
      <Paragraph position="5"> The present scenario of the PDT provides three coreferential attributes: coref, cortype and corlemma. The attribute coref contains the identifier of the antecedent; if an anaphor has more than one antecedent, the attribute coref includes a sequence of identifiers of the relevant antecedents; since every node of a TGTS has an identifier of its own, it is a simple programming task to select the specific information on the antecedent. The attribute cortype includes the information on the type of coreference (the possible values are gram for grammatical and text for textual coreference), or a sequence of the types of coreference, where each element of cortype corresponds to an element of coref. The attribute corlemma is used for cases of coreference between a node and an entity that has no corresponding counterpart in the TGTS(s): for the time being, there are two possible values of this attribute, namely segm in the case of a coreferential link to a whole segment of the preceding text (not just a sentence), and exoph in the case of an exophoric relation. Cases of reference that are difficult to identify even when the situation is taken into account are marked by the assignment of unsp as the lemma of the anaphor. This does not mean that a decision is to be made between two or more referents, but that the reference cannot be fully specified even within a broader context.</Paragraph>
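The three attributes can be illustrated with a toy node record; the identifiers and the dictionary layout below are our assumptions for exposition, not the PDT's actual file representation.

```python
# Illustrative record of the three coreferential attributes (coref, cortype,
# corlemma); identifiers and layout are hypothetical, not the PDT format.
# Each element of cortype corresponds to the coref element at the same index.

anaphor_node = {
    "id": "n42",                  # every TGTS node has an identifier of its own
    "coref": ["n17", "n3"],       # sequence of antecedent identifiers
    "cortype": ["gram", "text"],  # one coreference type per antecedent
    "corlemma": None,             # 'segm' or 'exoph' when no TGTS counterpart
}

# Pairing each antecedent with its coreference type is then straightforward.
links = list(zip(anaphor_node["coref"], anaphor_node["cortype"]))
print(links)  # [('n17', 'gram'), ('n3', 'text')]
```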
      <Paragraph position="6"> In order to facilitate the task of the annotators and to make the resulting structures more transparent and telling, the coreference relations are captured by arrows leading from the anaphor to the antecedent, and the types of coreference are distinguished by different colors of the arrows. Certain notational devices are used in cases when the antecedent is not within the co-text (exophoric coreference) or when the link should lead to a whole segment rather than to a particular node. If the anaphor corefers with more than a single node or with a subtree, the link leads to the closest preceding coreferring node (subtree). If there is a choice between a link to an antecedent and a link to a postcedent, the link always leads to the antecedent.</Paragraph>
      <Paragraph position="7">  usnese na ústavním zákonu, pak se to těžko mění. 'If a country accepts a constitutional law, then this is difficult to change.' The manual annotation is made user-friendly by a special module within the TRED editor (Hajič et al., 2001b) which is being used for all three subareas of annotation.</Paragraph>
      <Paragraph position="8"> In the case of coreference, an automatic pre-selection of nodes relevant for annotation is used, making the process faster.</Paragraph>
      <Paragraph position="9"> Until now, about 30,000 sentences have been annotated for the above types of coreference relations. One of the advantages of a corpus-based study of a language phenomenon is that the researchers become aware of subtleties and nuances that are otherwise not apparent. For those who attempt corpus annotation, of course, it is necessary to collect a list of open questions which have a temporary solution but which should be studied more intensively and in greater detail in the future.</Paragraph>
      <Paragraph position="10"> Another issue whose study is significant and can be facilitated by the availability of a semantically annotated corpus is the question of a (finite) mechanism the listener (reader) can use to identify the referents. If the backbone of such a mechanism is seen in the hierarchy (partial ordering) of salience, then it can be understood that this hierarchy typically is modified by the flow of discourse in a way that was specified and illustrated by (Hajičová, 1993), (Hajičová et al., in prep). In the flow of a discourse, prototypically, a new discourse referent emerges as corresponding to a lexical occurrence that carries f; further occurrences carry t or c, their referents being primarily determined by their degrees of salience, although the difference between the lowest degrees of salience reduction is not decisive. It appears to be possible to capture at least certain aspects of this hierarchy by some (still tentative) heuristic rules, which tie the increase/decrease of salience to the position of the given item in the topic or in the focus of the given utterance. It should also be remarked that there are certain permanently salient referents, which may be referred to by items in the topic (as "given" information) without having a referentially identical antecedent in the discourse. We denote them as carrying t or c, but perhaps it would be more adequate to consider them as being always able to be accommodated (i) by the utterance itself, as especially the indexicals (I, you, here, now, yesterday, . . . ), (ii) by the given culture (democracy, Paris, Shakespeare, don Quijote, . . . ) or by universal human experience (sun, sky), or (iii) by the general domain concerned (history, biology, . . . ). 
Since every node in the PDT carries one of the TFA values (t, c or f), from which the appurtenance of the given item to the topic or focus of the whole sentence can be determined, it will be possible to use the PDT data and the above heuristics to start experiments with an automatic assignment of coreferential relations and check them against the data with the manual annotation of coreference.</Paragraph>
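One way the tentative heuristic rules could be made concrete is sketched below. The numeric degrees of salience reduction (0 meaning maximally salient, larger meaning more reduced) and the exact increments are our own illustrative assumptions, chosen only to show how mention in the focus, mention in the topic, and non-mention would drive the updates.

```python
# Tentative sketch of the salience heuristics: degree 0 is maximally
# salient; the degree of reduction grows while a referent goes unmentioned.
# All numeric choices are illustrative assumptions, not a specified rule set.

def update_salience(salience, mentions):
    """salience: dict referent -> degree of salience reduction.
    mentions: dict referent -> 'focus' or 'topic' in the current utterance."""
    updated = {}
    for referent in set(salience) | set(mentions):
        position = mentions.get(referent)
        if position == "focus":
            updated[referent] = 0  # newly established or fully reactivated
        elif position == "topic":
            updated[referent] = min(salience.get(referent, 1), 1)  # kept salient
        else:
            updated[referent] = salience.get(referent, 0) + 2  # fades away
    return updated

# A referent introduced in the focus, then kept in the topic, then unmentioned:
s = update_salience({}, {"pair": "focus"})
s = update_salience(s, {"pair": "topic"})
s = update_salience(s, {})
print(s)  # {'pair': 2}
```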
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.4 Lexical Semantics
</SectionTitle>
      <Paragraph position="0"> The design of the tectogrammatical representation is such that the nodes in the tectogrammatical tree structure represent (almost) only the autosemantic words found in the written or spoken utterance they represent. We believe that it is thus natural to start distinguishing word senses only at this level (and not on a lower level, such as surface syntax or linearized text).</Paragraph>
      <Paragraph position="1"> Moreover, there is a close relation between valency and word senses. We hypothesize that with a suitable set of dependency relations (both inner participants and free modifications, see Sect. 2.1), there is only one valency frame per word sense (even though synonyms or near synonyms might have different valency frames). The opposite is not true: there can be several word senses with an identical valency frame.</Paragraph>
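The hypothesis amounts to saying that the mapping from word senses to valency frames is a function but not an injection. A toy illustration follows; the senses and frames are invented for the example and do not come from any actual lexicon.

```python
# Toy illustration (invented senses and frames) of the hypothesis above:
# each word sense maps to exactly one valency frame, but the mapping is not
# injective, so distinct senses may share an identical frame.

frame_of_sense = {
    "answer#respond":        ("ACT", "ADDR", "PAT"),
    "answer#be-accountable": ("ACT", "PAT"),
    "tell#communicate":      ("ACT", "ADDR", "PAT"),  # same frame, other sense
}

# Senses sharing the frame (ACT, ADDR, PAT):
sharing = sorted(sense for sense, frame in frame_of_sense.items()
                 if frame == ("ACT", "ADDR", "PAT"))
print(sharing)  # ['answer#respond', 'tell#communicate']
```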
      <Paragraph position="2"> Although in the detailed valency lexicon VALLEX (Lopatková, 2003), (Lopatková et al., 2003) an attempt was originally made to link the valency frames to (Czech) EuroWordNet (Pala and Ševeček, 1999) senses to prove this point, this has been abandoned for the time being because of idiosyncrasies in the WordNet design, which do not allow this to be done properly.</Paragraph>
      <Paragraph position="3"> We thus proceed independently with word sense annotation based on the Czech version of WordNet. Currently, we have annotated 10,000 sentences with word senses, for both nouns and verbs. We are now assessing further directions in annotation; due to low inter-annotator agreement, we will probably tend to annotate only over a preselected subset of the WordNet synsets. An approach to building semantic lexicons that is more closely related to our concept of meaning representation is being prepared in the meantime (Holub and Straňák, 2003).</Paragraph>
    </Section>
  </Section>
</Paper>