<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-0303">
  <Title>Referential Annotations</Title>
  <Section position="3" start_page="0" end_page="14" type="metho">
    <SectionTitle>
2 Referential Relations
</SectionTitle>
    <Paragraph position="0"> This section introduces the inventory of referential relations adopted in the SYN-RA project. We define referential relations as a cover term for all contextually dependent reference relations. The inventory of such relations adopted for SYN-RA is inspired by the annotation scheme first developed in the MATE project (Davies et al., 1998). However, it takes a cautious approach in that it adopts only those referential relations from MATE for which the developers of MATE report a sufficiently high level of inter-annotator agreement (Poesio et al., 1999).</Paragraph>
    <Paragraph position="1"> SYN-RA currently uses the following subset of relations: coreferential, anaphoric, cataphoric, bound, split antecedent, instance, and expletive. The potential markables are definite NPs and personal, relative, reflexive, reciprocal, demonstrative, indefinite, and possessive pronouns.</Paragraph>
    <Paragraph position="2"> There is a second research effort under way at the European Media Laboratory Heidelberg, which also annotates German text corpora and dialog data with referential relations. Since their corpora are not publicly available, it is difficult to verify their inventory of referential relations. Kouchnir (2003) has used their data and describes the relations anaphoric, coreferential, bridging, and none.</Paragraph>
    <Paragraph position="3"> Following van Deemter and Kibble (2000), we define a coreference relation to hold between two NPs just in case they refer to the same extra-linguistic referent in the real world. In the following example, a coreference relation exists between the noun phrases [1] and [2], and an anaphoric relation between the noun phrase [2] and the personal pronoun [3]. Since noun phrases [1] and [2] are coreferential, all three NPs belong to the same coreference chain. In keeping with the MUC-6 annotation standard, we establish the anaphoric relation of a pronoun only to its most recently mentioned antecedent.</Paragraph>
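    <Paragraph> The chain construction described above can be illustrated with a minimal Python sketch. It assumes that each dependent expression stores a single link to its most recently mentioned antecedent; the markable IDs (np1, np2, pron3) and the function name build_chains are hypothetical and are not part of the SYN-RA tooling.

```python
from collections import defaultdict


def build_chains(links):
    """Group markables into chains by following antecedent links.

    `links` maps each dependent markable ID to the ID of its most
    recently mentioned antecedent (hypothetical IDs, for illustration).
    """
    def root(markable):
        # Walk back through antecedents until reaching a first mention.
        seen = set()
        while markable in links and markable not in seen:
            seen.add(markable)
            markable = links[markable]
        return markable

    chains = defaultdict(list)
    for markable in set(links) | set(links.values()):
        chains[root(markable)].append(markable)
    return sorted(sorted(chain) for chain in chains.values())


# Example (1): NP [2] is coreferential with NP [1]; pronoun [3] is
# anaphoric to NP [2], its most recent antecedent.
links = {"np2": "np1", "pron3": "np2"}
print(build_chains(links))
```

Because every link points to the nearest antecedent, the full chain is recovered by following links transitively, so all three expressions of example (1) end up in one chain.</Paragraph>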
    <Section position="1" start_page="13" end_page="14" type="sub_section">
      <SectionTitle>
Stimmen
votes
</SectionTitle>
      <Paragraph position="0"> gewählt.</Paragraph>
      <Paragraph position="1"> elected.</Paragraph>
      <Paragraph position="2"> 'The new chairman of the union of educators and scholars is called Ulli Thöne. He was elected yesterday with 217 of 355 votes.' Cataphoric relations hold between a preceding pronoun and a following antecedent within the same sentence, even if this antecedent has already been mentioned in the preceding text. An example of a cataphoric relation is shown in (2). 'They have already been in Berlin for four weeks, the 200 Albanians from Kosovo.' The relation bound holds between anaphoric expressions and quantified noun phrases as their antecedents (see example (3)).</Paragraph>
      <Paragraph position="3">  The split antecedent relation holds between co-ordinate NPs/plural pronouns and pronouns/definite NPs referring to one member of the plural expression. In example (4), the indefinite pronoun beide enters into two split antecedent relations, with noun phrases 1 and 2.</Paragraph>
      <Paragraph position="4">  'But suddenly, there is a completely implausible and grotesque phone call from the detective to the mother of the victim, they both cry at each other for several minutes, ...' An instance relation exists between a preceding/following pronoun and its NP antecedent when the pronoun refers to a particular instantiation of the class identified by the NP.</Paragraph>
      <Paragraph position="5">  beibringe.</Paragraph>
      <Paragraph position="6"> teaches.</Paragraph>
      <Paragraph position="7"> 'The conservative powers are just waiting to bombard him with sentences like the one about the 16 middle-distance runners who he is teaching the double full-back formation in four weeks.'  In sentence (5), the relation between the two bracketed NPs is an example of such an instance relation since the second NP is a particular instantiation of the referent denoted by the first NP. A third person singular neuter pronoun es is marked as expletive if it has no proper antecedent. This is the case for presentational es in example (6), for es in impersonal passives as in example (7), and for es as the subject of verbs without an agent as in example (8).  ihn.</Paragraph>
      <Paragraph position="8"> him.</Paragraph>
      <Paragraph position="9"> 'He is in a bad way.' Apart from expletive uses of es and anaphoric uses with an NP antecedent, the pronoun es can also be used in cases of event anaphora as in sentence (9). Here es refers to the event of Jochen's winning the lottery. Currently, the annotation in SYN-RA is restricted to NP anaphora, and event anaphors such as the one in sentence (9) therefore remain unannotated.</Paragraph>
      <Paragraph position="10">  'Jochen has won the lottery. But he does not know it yet.' The annotation of such relations is performed manually with the annotation tool MMAX (Müller and Strube, 2003). Its graphical user interface allows for easy selection of the relevant markables and of the relation that holds between the contextually dependent expression and its antecedent.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="14" end_page="16" type="metho">
    <SectionTitle>
3 Automatic Extraction of Markables and
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="14" end_page="16" type="sub_section">
      <SectionTitle>
of Semantic Information
</SectionTitle>
      <Paragraph position="0"> Annotation of referential relations involves two main tasks: the identification of markables, i.e., the class of expressions that can enter into referential relations, and the identification of the particular referential relations that two or more expressions enter into. Identification of markables requires at least partial syntactic annotation of the text. If referential relations need to be annotated from plain text, then markables must be identified either semi-automatically from the output of a chunker or full parser, if available, or completely manually. In both scenarios, identification of markables is a time-consuming process. In the case of semi-automatic annotation, the effort depends on the quality of the parser, but at least some manual post-correction of the parser output will be required.</Paragraph>
      <Paragraph position="1"> Identification of markables is considerably easier for treebank data since treebanks already provide the necessary syntactic information. For German, there are currently two large-scale treebank efforts available: the NEGRA/TIGER treebank (Brants et al., 2002) and the Tübingen treebanks for spoken and written German (Stegmann et al., 2000; Telljohann et al., 2003).</Paragraph>
      <Paragraph position="2"> All the treebanks were annotated with the help of the annotation tool Annotate (Plaehn, 1998). The treebank annotations are available in the Annotate export format (Brants, 1997) and in an XML format.</Paragraph>
      <Paragraph position="3"> The SYN-RA project is based on the Tübingen treebank of written German (TüBa-D/Z). This treebank uses as its data source a collection of articles from the German daily newspaper taz (die tageszeitung).</Paragraph>
      <Paragraph position="4"> The treebank currently comprises approximately 15,000 sentences, with a new release of 7,000 additional sentences scheduled for June of this year.</Paragraph>
      <Paragraph position="5"> Due to its fine-grained syntactic annotation, the TüBa-D/Z treebank data are ideally suited as a basis for the identification of markables and for extracting relevant syntactic and semantic properties for each markable. The TüBa-D/Z annotation scheme distinguishes four levels of syntactic constituency: the lexical level, the phrasal level, the level of topological fields, and the clausal level. The primary ordering principle of a clause is the inventory of topological fields, which characterize the word order regularities among different clause types of German and which are widely accepted among descriptive linguists of German (cf. e.g. Drach, 1937; Höhle, 1986). The TüBa-D/Z annotation relies on a context-free backbone (i.e. proper trees without crossing branches) of phrase structure combined with edge labels that specify the grammatical function of the phrase in question.</Paragraph>
      <Paragraph position="6"> Figure 1 shows an example tree from the TüBa-D/Z treebank for sentence (10). The sentence is divided into two clauses (SIMPX), and each clause is subdivided into topological fields. The main clause is made up of the following fields: VF (mnemonic for: Vorfeld - 'initial field') contains the sentence-initial, topicalized constituent. LK (for: linke Satzklammer - 'left sentence bracket') is occupied by the finite verb. MF (for: Mittelfeld - 'middle field') contains adjuncts and complements of the main verb.</Paragraph>
      <Paragraph position="7"> NF (for: Nachfeld - 'final field') contains extraposed material - in this case an indirect yes/no question. The subordinate clause is in turn divided into three topological fields: C (for: Komplementierer - 'complementizer'), MF, and VC (for: Verbalkomplex - 'verbal complex'). Edge labels are rendered in boxes and indicate grammatical functions. The sentence-initial NX (for: noun phrase) is marked as OA (for: accusative complement), the pronouns sie in the main and subordinate clause as ON (for: nominative complement).</Paragraph>
      <Paragraph position="8">  glaube.</Paragraph>
      <Paragraph position="9"> believes.</Paragraph>
      <Paragraph position="10"> 'They asked their fellow student Cassie Bernall whether she believes in God.' Topological field information and grammatical function information are crucial for anaphora resolution, since binding-theory constraints rely on sentence structure (if the binding theory principles are stated configurationally (Chomsky, 1981)) or on argument obliqueness (if the binding theory principles are stated in terms of argument structure, as in (Pollard and Sag, 1994)). In the case at hand, the subject pronoun of the main clause, sie, cannot be anaphorically related to the object NP Ihre Schulkameradin Cassie Bernall since they are co-arguments of the same verb. However, the possessive pronoun ihre and the subject pronoun sie of the subordinate clause can be and, in fact, are anaphorically related, since they are not co-arguments of the same verb. This can be directly inferred from the treebank annotation, specifically from the sentence structure and the grammatical function information encoded on the edge labels. Most published computational algorithms of anaphora resolution, including (Hobbs, 1978; Lappin and Leass, 1994; Ingria and Stallard, 1989), rely on such binding-constraint filters to minimize the set of potential antecedents for pronouns and reflexives.</Paragraph>
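      <Paragraph> The co-argument restriction discussed above can be sketched in Python as a simple candidate filter. This is a deliberately simplified illustration and not one of the cited algorithms: it covers only non-reflexive pronouns, and the Markable class, the clause IDs "main" and "sub", and the function name coargument_filter are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Markable:
    text: str
    clause: str    # hypothetical ID of the clause containing the markable
    function: str  # edge label such as "ON" or "OA"; "" if not a verb argument


def coargument_filter(pronoun, candidates):
    """Keep only candidates that are not co-arguments of the pronoun's verb."""
    kept = []
    for candidate in candidates:
        if candidate is pronoun:
            continue
        both_arguments = bool(candidate.function) and bool(pronoun.function)
        if both_arguments and candidate.clause == pronoun.clause:
            continue  # co-arguments of the same verb: ruled out
        kept.append(candidate)
    return kept


# Markables of sentence (10), encoded by hand.
object_np = Markable("Ihre Schulkameradin Cassie Bernall", "main", "OA")
ihre = Markable("Ihre", "main", "")  # inside the NP, not an argument of the verb
sie_main = Markable("sie", "main", "ON")
sie_sub = Markable("sie", "sub", "ON")

print([m.text for m in coargument_filter(sie_main, [object_np, ihre])])
print([m.text for m in coargument_filter(sie_sub, [object_np, ihre, sie_main])])
```

In this encoding the filter removes the object NP as a candidate for the main-clause sie but keeps the possessive, which is embedded in the NP and is not itself an argument of the verb. For the subordinate-clause sie, no candidate is removed because all of them belong to a different clause.</Paragraph>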
      <Paragraph position="11"> As already pointed out, the sample sentence contains four markables: one possessive pronoun Ihre, two occurrences of the pronoun sie, and one complex NP Ihre Schulkameradin Cassie Bernall. The latter NP is a good example of SYN-RA's longest-match principle for identifying markables. In the case of complex NPs, the entire NP counts as a markable, but so do its subconstituents - in the case at hand, the possessive pronoun ihre. All of this information can be directly derived from the treebank account. In contrast to other annotation efforts for German, where markables have to be chosen manually (Müller and Strube, 2003), manual annotation in the SYN-RA project can thus be restricted to the selection of the appropriate referential relations between referentially dependent expressions and their nominal antecedents.</Paragraph>
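      <Paragraph> A minimal Python sketch of how markables might be read off a syntax tree is given here for illustration. It rests on several assumptions: the Node class and the function names are hypothetical, the pronoun tags PPER and PPOSAT are taken from the STTS tagset, and the tree is a hand-simplified fragment of sentence (10) that omits the topological fields and the PP an Gott.

```python
from dataclasses import dataclass, field

# STTS tags assumed here for pronouns: PPER (personal), PPOSAT (possessive).
PRONOUN_TAGS = {"PPER", "PPOSAT"}


@dataclass
class Node:
    label: str                      # phrase label ("NX") or POS tag for words
    text: str = ""                  # word form; empty for phrase nodes
    children: list = field(default_factory=list)


def yield_of(node):
    """Return the words a node dominates, left to right."""
    if not node.children:
        return [node.text]
    return [w for child in node.children for w in yield_of(child)]


def markables(node):
    """Collect every NX phrase and every pronoun, outermost first."""
    found = []
    if node.label == "NX" or node.label in PRONOUN_TAGS:
        found.append(" ".join(yield_of(node)))
    for child in node.children:
        found.extend(markables(child))
    return found


# Hand-simplified fragment of the tree for sentence (10).
tree = Node("SIMPX", children=[
    Node("NX", children=[
        Node("PPOSAT", "Ihre"),
        Node("NN", "Schulkameradin"),
        Node("NE", "Cassie"),
        Node("NE", "Bernall"),
    ]),
    Node("VVFIN", "fragten"),
    Node("PPER", "sie"),
    Node("SIMPX", children=[
        Node("KOUS", "ob"),
        Node("PPER", "sie"),
        Node("VVFIN", "glaube"),
    ]),
])

print(markables(tree))
```

The traversal reports the complex NP before the possessive pronoun it contains, so both the longest match and its subconstituent are available as markables.</Paragraph>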
    </Section>
  </Section>
  <Section position="5" start_page="16" end_page="16" type="metho">
    <SectionTitle>
4 The Unified, XML-based Annotation
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="16" end_page="16" type="sub_section">
      <SectionTitle>
Scheme
</SectionTitle>
      <Paragraph position="0"> The annotation of referential expressions is embedded in a unified format which also contains morphological, syntactic, and semantic information.</Paragraph>
      <Paragraph position="1"> The annotation scheme is represented in XML, the widely acknowledged standard for exchanging data, which guarantees portability and re-usability of the data. Each sentence, as well as all words and all nodes in the syntactic structure, are assigned a unique ID. These IDs are used in the annotation of referential relations. The annotation of the treebank sentence 11976 (cf. example (10)) is shown in Figure 2.</Paragraph>
      <Paragraph position="2"> The sentence number is encoded as the ID of the sentence. The first word, Ihre, has an anaphoric relation to a noun phrase in the previous sentence. This relation is marked in the element anaphora, which gives the antecedent as node 517 of sentence 11975, i.e. the previous sentence. The other two anaphoric relations are sentence-internal, the first personal pronoun sie having Ihre (id: s11976w0) as antecedent, the second one the noun phrase Ihre Schulkameradin Cassie Bernall (id: s11976n513). The annotation of the first personal pronoun is an example of the annotation of an anaphoric chain. Ihre and sie belong to the same chain. However, in order to facilitate the extraction of direct relations, such chains are represented in such a way that each anaphoric expression refers to the last occurrence of an antecedent.</Paragraph>
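      <Paragraph> The IDs quoted above suggest a regular shape: the letter s, the sentence number, the letter w (word) or n (node), and an index. The following Python sketch parses IDs under that assumption. The pattern is inferred from the two IDs in the text only, the ID s11975n517 is constructed by analogy for node 517 of sentence 11975, and the function names are hypothetical.

```python
import re

# Pattern inferred from the IDs s11976w0 and s11976n513; an assumption.
ID_PATTERN = re.compile(r"^s(\d+)([wn])(\d+)$")


def parse_id(identifier):
    """Split an ID into (sentence number, kind, index)."""
    match = ID_PATTERN.match(identifier)
    if match is None:
        raise ValueError(f"unrecognized ID: {identifier!r}")
    sentence, kind, index = match.groups()
    return (int(sentence), "word" if kind == "w" else "node", int(index))


def crosses_sentence(anaphor_id, antecedent_id):
    """True if anaphor and antecedent are in different sentences."""
    return parse_id(anaphor_id)[0] != parse_id(antecedent_id)[0]


print(parse_id("s11976w0"))
print(crosses_sentence("s11976w0", "s11975n517"))
```

Since the sentence number is part of every ID, a check of this kind is enough to separate sentence-internal relations from relations that reach into the preceding text.</Paragraph>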
      <Paragraph position="3"> The SYN-RA scheme is very similar to the MUC-6 coreference annotation scheme, but it is more powerful in two respects. First, as described above, the inventory is not restricted to coreference and anaphoric relations; it also covers, for example, instance relations and split antecedent relations. Second, the relational information is encoded as XML elements, and not as attributes of a word or a node. The split antecedent relation is the reason for this choice: if an anaphor enters into a split antecedent relation, it has more than one distinct antecedent. In this case, the element anaphora has two (or more) relations. Such an example is graphically displayed for sentence (4) in Figure 3. The relevant XML representation of the complex entry for the word beide is shown in Figure 4.</Paragraph>
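      <Paragraph> The advantage of elements over attributes for split antecedents can be illustrated with a short Python sketch that builds an anaphora element with one child per antecedent. Only the element name anaphora comes from the text; the child element name relation, the attribute names type and antecedent, the relation label, and the IDs are assumptions made for illustration and need not match the actual SYN-RA format.

```python
import xml.etree.ElementTree as ET


def anaphora_element(relations):
    """Build an `anaphora` element holding one child per relation.

    `relations` is a list of (relation_type, antecedent_id) pairs. The
    child element name `relation` and its attribute names are assumptions.
    """
    anaphora = ET.Element("anaphora")
    for relation_type, antecedent_id in relations:
        ET.SubElement(anaphora, "relation",
                      type=relation_type, antecedent=antecedent_id)
    return anaphora


# Hypothetical IDs standing in for noun phrases 1 and 2 of example (4).
word = ET.Element("word", form="beide")
word.append(anaphora_element([
    ("split_antecedent", "s1n501"),
    ("split_antecedent", "s1n502"),
]))
print(ET.tostring(word, encoding="unicode"))
```

A single attribute on the word could hold only one antecedent ID, whereas repeated child elements accommodate any number of antecedents without changing the schema.</Paragraph>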
    </Section>
  </Section>
</Paper>