<?xml version="1.0" standalone="yes"?>
<Paper uid="P01-1036">
  <Title>Abbreviations: CRC - Czech Radio(tele)communications CTV - Czech TV CR - Czech Republic CSF - (CS) Federation</Title>
  <Section position="2" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Outline of the Prague Dependency
Treebank
</SectionTitle>
    <Paragraph position="0"> The Prague Dependency Treebank (PDT) is being built on the basis of the Czech National Corpus (CNC), which is growing rapidly and now comprises hundreds of millions of word occurrences in journalistic and fiction texts. The PDT scenario comprises three layers of annotation:

(i) the morphemic (POS) layer, with about 2,000 tags for the highly inflectional Czech language; the whole CNC has been tagged by a stochastic tagger (Hajič and Hladká 1997; 1998, Böhmová and Hajičová 1999, Hladká 2000) with a success rate of 95%; the tagger is based on a fully automatic morphemic analysis of Czech (Hajič, in press);

(ii) the layer of 'analytic' (&amp;quot;surface&amp;quot;) syntax (see Hajič 1998): ca. 100,000 Czech sentences taken from the CNC, i.e. samples of texts (each randomly chosen sample consisting of 50 sentences of a coherent text), have been assigned dependency tree structures; every word (as well as every punctuation mark) has a node of its own, whose label specifies its analytic function, i.e. Subj, Pred, Obj, Adv, different kinds of function words, etc. (40 values in total); no nodes are added that are absent from the surface shape of the sentence (except for the root of the tree, which carries the identification number of the sentence); the sentences from the CNC are preprocessed by a dependency-based modification of Collins et al.'s (1999) automatic parser (with a success rate of about 80%), followed by a manual tagging procedure supported by a special user-friendly software tool that enables the annotators to work with (i.e. modify) the automatically derived graphic representations of the trees;

(iii) the tectogrammatical (underlying) syntactic layer: tectogrammatical tree structures (TGTSs) are being assigned to a subset of the set tagged according to (ii); so far, the experimental phase has resulted in 20 samples of 50 sentences each. The TGTSs, based on dependency syntax, are much simpler than structural trees based on constituency (minimalist or other), displaying a much lower number of nodes and a more perspicuous patterning. Their basic characteristics are as follows (a more detailed characterization of tectogrammatics and a motivating discussion, which cannot be reproduced here, can be found in Sgall et al. 1986; Hajičová et al. 1998):

(a) only autosemantic (lexical) words have nodes of their own; function words, insofar as they are semantically relevant, are reflected by parts of complex node labels (with the exception of coordinating conjunctions);
(b) nodes are added in case of deletions on the surface level;
(c) the condition of projectivity is met (i.e. no crossing of edges is allowed);
(d) tectogrammatical functions ('functors') such as Actor/Bearer, Patient, Addressee, Origin, Effect, and different kinds of Circumstantials are assigned;
(e) basic features of TFA are introduced;
(f) elementary coreference links (both grammatical and textual) are indicated.</Paragraph>
    <Paragraph position="1"> Thus, a TGTS node label consists of the lexical value of the word, its '(morphological) grammatemes' (i.e. the values of morphological categories), its 'functor' (with a more subtle differentiation of syntactic relations by means of 'syntactic grammatemes', e.g. 'in', 'at', 'on', 'under'), the attribute of Contextual Boundness (see below), and values concerning intersentential links (see below).</Paragraph>
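The composition of a node label described above can be sketched as a small data structure. This is only an illustration: the field names below are invented for the sketch, not the actual PDT attribute names.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class TGTSNodeLabel:
    """Rough sketch of a TGTS node label; field names are illustrative."""
    lexical_value: str                          # lemma of the autosemantic word
    grammatemes: Dict[str, str] = field(default_factory=dict)  # morphological categories
    functor: str = "ACT"                        # e.g. ACT, PAT, ADDR, ORIG, EFF
    syntactic_grammateme: Optional[str] = None  # e.g. 'in', 'at', 'on', 'under'
    tfa: str = "f"                              # Contextual Boundness: 't', 'c', or 'f'
    coref: Optional[str] = None                 # intersentential link (antecedent's lexical value)

# e.g. the object 'dluh' (debt) of example (1) in Section 3:
debt = TGTSNodeLabel("dluh", {"number": "SG"}, "PAT", None, "f")
```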
  </Section>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 From Contextual Boundness to the
Topic and the Focus of the Sentence
</SectionTitle>
    <Paragraph position="0"> The dependency-based TGTSs in PDT allow for a highly perspicuous notation of sentence structure, including an economical representation of TFA, understood as one of the main aspects of (underlying) sentence structure, along with all other kinds of semantically relevant information expressed by grammatical means. TFA is accounted for by one of three values of a specific TFA attribute assigned to every lexical (autosemantic) occurrence: t for 'contextually bound' (prototypically in Topic), c for 'contrastive (part of) Topic', and f for 'non-bound' (typically in Focus). The opposition of contextual boundness is understood as the linguistically structured counterpart of the distinction between &amp;quot;given&amp;quot; and &amp;quot;new&amp;quot; information, rather than in a straightforward etymological way (see Sgall, Hajičová and Panevová 1986, Ch. 3). Our approach to TFA, which uses such operational criteria of empirical adequacy as the question test (with the item corresponding to a question word prototypically constituting the focus of the answer), represents an elaboration of older ideas, discussed especially in Czech linguistics since V. Mathesius and J. Firbas, in the sense of an explicit treatment meeting the methodological requirements of formal syntax.</Paragraph>
    <Paragraph position="1"> The following rules determine the appurtenance of a lexical occurrence to the Topic (T) or to the Focus (F) of the sentence:

(a) the main verb (V) and any of its direct dependents belong to F iff they carry the index f;

(b) every item i that does not depend directly on V and is subordinated to an element of F different from V belongs to F (where &amp;quot;subordinated to&amp;quot; is defined as the irreflexive transitive closure of &amp;quot;depend on&amp;quot;);

(c) iff V and all items k_j directly depending on it carry the index t, then those items k_j to which some items l_m carrying f are subordinated are called 'proxy foci', and the items l_m, together with all items subordinated to one of them, belong to F, where 1 ≤ j, m;

(d) every item not belonging to F according to (a)-(c) belongs to T.</Paragraph>
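Rules (a)-(d) amount to a procedure over the dependency tree. The following is a minimal sketch, assuming nodes carry only a lemma, a TFA value, and their dependents; the class and function names are inventions of this sketch, and rule (c)'s condition is taken literally (V and all its direct dependents carry t).

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple

@dataclass(eq=False)  # identity-based hashing, so nodes can be set members
class Node:
    lemma: str
    tfa: str                              # 't', 'c', or 'f'
    children: List["Node"] = field(default_factory=list)

def descendants(node: Node):
    """All items subordinated to `node`: the irreflexive transitive
    closure of 'depend on' used in rule (b)."""
    for child in node.children:
        yield child
        yield from descendants(child)

def partition(v: Node) -> Tuple[Set[Node], Set[Node]]:
    """Split the tree rooted in the main verb `v` into (Topic, Focus)
    following rules (a)-(d)."""
    focus: Set[Node] = set()
    # (a) V and its direct dependents belong to F iff they carry f
    if v.tfa == "f":
        focus.add(v)
    focus.update(d for d in v.children if d.tfa == "f")
    # (c) if V and all its direct dependents carry t, the f-carrying items
    # below the 'proxy foci' enter F
    if v.tfa == "t" and all(d.tfa == "t" for d in v.children):
        focus.update(n for n in descendants(v) if n.tfa == "f")
    # (b) everything subordinated to a focal element other than V joins F
    for member in list(focus):
        if member is not v:
            focus.update(descendants(member))
    # (d) every remaining item belongs to T
    topic = {n for n in [v, *descendants(v)] if n not in focus}
    return topic, focus
```

Applied to a tree along the lines of example (1), with the verb and its rightward dependents carrying f, the partition yields T = {radiokomunikace, rok} and F = {splatit, rychle, dluh, divák}.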
    <Paragraph position="2"> To illustrate how this approach makes it possible to analyze complex sentences as well with respect to their TFA patterns, with neither T nor F corresponding to a single constituent, let us present the following example, in which (1') is a highly simplified linearized TGTS of (1); every dependent item is enclosed in a pair of parentheses; for the sake of transparency, the syntactic subscripts of the parentheses are left out here, as well as the subscripts indicating morphological values, with the exception of the two which correspond to function words, i.e. Temp and Necess(ity). Fig. 1 presents the respective tree structure, in which three parts of each node label are specified, namely the lexical value, the syntactic function (with ACT for Actor/Bearer, RSTR for Restrictive, MANN for Manner, and OBJ for Objective), and the TFA value: (1) České radiokomunikace musí v tomto roce rychle splatit dluh televizním divákům. 'This year, Czech Radiocommunications must quickly pay their debt to the TV viewers.'</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="2" type="metho">
    <SectionTitle>
4 Degrees of Salience in a Discourse
</SectionTitle>
    <Paragraph position="0"> During the development of a discourse, in the prototypical case, a new discourse referent emerges as corresponding to a lexical occurrence that carries the index f; its further occurrences in the discourse carry t and are primarily guided by the scale of their degrees of salience. This scale, which was discussed by Hajičová and Vrbová (1982), has to be reflected in a description of the semantico-pragmatic layer of the discourse. In this sense, our approach can be viewed as pointing to a useful enrichment of the existing theories of discourse representation (cf. also Kruijffová 1998; Krahmer 1998; Krahmer and Theune 1999).</Paragraph>
    <Paragraph position="1"> In the annotation system of PDT, not only are values of attributes concerning sentence structure assigned, but also values of attributes for coreferential links in the discourse; these capture certain features typical of the linking of sentences to each other and to the context of situation, and allow for a tentative characterization of the discourse pattern as concerns the development of salience degrees during the discourse.</Paragraph>
    <Paragraph position="2"> The following attributes of this kind are applied within a selected part of PDT, called the 'model collection' (for the time being, essentially only pronouns such as 'on' (he), including its zero form, or 'ten' (this) are handled in this way):

COREF: the lexical value of the antecedent;
CORNUM: the serial number of the antecedent;
CORSNT: NIL if the antecedent is in the same sentence, PREVi if it is in the i-th preceding sentence.</Paragraph>
    <Paragraph position="3"> An additional attribute, ANTEC, with its value equal to the functor of the antecedent, is used with the so-called grammatical coreference (relative clauses, pronouns such as 'se' (-self), the relation of control).</Paragraph>
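The four attributes COREF, CORNUM, CORSNT, and ANTEC can be grouped into a single record. The sketch below keeps the attribute names from the text but assumes Python field types of its own (an integer offset for CORSNT, rendered as NIL/PREVi in the text's notation).

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CorefAnnotation:
    """Sketch of one coreference annotation; types are assumptions."""
    coref: str                    # COREF: lexical value of the antecedent
    cornum: int                   # CORNUM: serial number of the antecedent
    corsnt: Optional[int] = None  # None = NIL (same sentence); i = PREVi
    antec: Optional[str] = None   # ANTEC: functor of the antecedent
                                  # (grammatical coreference only)

def corsnt_value(ann: CorefAnnotation) -> str:
    """Render CORSNT in the notation used in the text."""
    return "NIL" if ann.corsnt is None else f"PREV{ann.corsnt}"
```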
    <Paragraph position="4"> On the basis of these attributes (and of further judgments, concerning especially associative links between word occurrences), it is possible to study the referential identity of different word tokens in the flow of the discourse, and thus also the development of salience degrees.</Paragraph>
    <Paragraph position="5"> The following basic rules determining the degrees of salience (in a preliminary formulation) have been designed, with x(r) indicating that the referent r has the salience degree x, and 1 ≤ m, n:

(i) if r is expressed by a weak pronoun (or zero) in a sentence, it retains its salience degree after this sentence is uttered: n(r) --&gt; n(r);

(ii) if r is expressed by a noun (group) carrying f, then n(r) --&gt; 0(r);

(iii) if r is expressed by a noun (group) carrying t or c, then n(r) --&gt; 1(r);

(iv) if n(r) --&gt; m(r) in sentence S, then m+2(q) obtains for every referent q that is not itself referred to in S, but is immediately associated with the item r present here</Paragraph>
    <Paragraph position="7"> an associated object, then n(r) --&gt; n+2(r).</Paragraph>
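Rules (i)-(iii) can be read as a per-sentence update of a map from referents to degrees of salience reduction. The sketch below is an assumption-laden rendering: it additionally fades unmentioned referents by 2, as the fragmentary final rule (n(r) --&gt; n+2(r)) suggests, and it omits rule (iv) on associated referents entirely.

```python
# Degree 0 = maximal salience; larger degrees = stronger salience reduction.
WEAK = "weak"  # weak pronoun or zero form
NOUN = "noun"  # noun (group)

def update_salience(salience, mentions):
    """One sentence step. `salience` maps referent -> current degree;
    `mentions` maps each referent expressed in this sentence to a
    (form, tfa) pair. Implements rules (i)-(iii) plus a fade rule for
    unmentioned referents; rule (iv) is omitted in this sketch."""
    updated = {}
    for r, degree in salience.items():
        if r in mentions:
            form, tfa = mentions[r]
            if form == WEAK:       # (i) weak pronoun/zero: degree retained
                updated[r] = degree
            elif tfa == "f":       # (ii) noun (group) carrying f: n -> 0
                updated[r] = 0
            else:                  # (iii) noun (group) carrying t or c: n -> 1
                updated[r] = 1
        else:                      # unmentioned referent fades: n -> n+2
            updated[r] = degree + 2
    for r, (form, tfa) in mentions.items():   # newly introduced referents
        if r not in updated:
            updated[r] = 0 if tfa == "f" else 1
    return updated
```

On the Jim/Martin example of the next paragraph, Martin (mentioned by a weak pronoun in the second utterance) keeps degree 0 while the unmentioned Jim fades to 3, which is why the object pronoun of the third utterance can only pick out Martin.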
    <Paragraph position="8"> These rules, which have been checked against several pieces of English and Czech text, capture such points as the fact that in the third utterance of Jim met Martin. He immediately started to speak of the old school in Sussex. Jim invited him for lunch, the weak pronoun in the object position can only refer to Martin, whose image has become the most salient referent by being mentioned in the second utterance; on the other hand, the use of such a pronoun also in the subject (He invited him for lunch) would make the reference unclear.</Paragraph>
    <Paragraph position="9"> Since the only fixed point is that of maximal salience, our rules technically determine the degree of salience reduction (indicating 0 as the maximal salience). Whenever an entity has a salience distinctly higher than all competing entities which can be referred to by the given expression, this expression may be used as giving the addressee a sufficiently clear indication of the reference specification.</Paragraph>
  </Section>
</Paper>