<?xml version="1.0" standalone="yes"?> <Paper uid="W01-1612"> <Title>Annotating Anaphoric and Bridging Relations with MMAX</Title> <Section position="4" start_page="2" end_page="2" type="metho"> <SectionTitle> 3 Annotation Scheme </SectionTitle> <Paragraph position="0"> The first step in the development of an annotation scheme is the definition of the relevant markables, i.e.</Paragraph> <Paragraph position="1"> the class of entities in the text between which the relations to be annotated can possibly occur. It is in terms of these markables (with their attributes) and labelled relations between markables that the annotation itself is expressed.</Paragraph> <Paragraph position="2"> In Section 2, we already roughly defined what counts as a markable by stating that anaphoric and bridging relations hold between specifying expressions. To distinguish further among these expressions, we introduce the attribute np form, which allows us to differentiate between the following subclasses: proper noun, definite NP, indefinite NP, personal pronoun, possessive pronoun and demonstrative pronoun. In addition, other grammatical features of markables, like agreement or the grammatical role they play, might also be of interest. We capture these in two respective attributes, for which we specify a closed list of possible values to be assigned during annotation. These possible values are the combination person/number/gender for the first attribute and subject, object and other for the second.</Paragraph> <Paragraph position="3"> In a given pair of expressions, it is the way in which the second expression relates to the first one that determines whether an anaphoric or a bridging relation exists. It is natural, therefore, to represent this information on the second markable and only there. Moreover, this is the only way to allow for the representation of cases in which one markable is antecedent to several others. 
Since we rule out the possibility of one markable being an anaphor or bridging expression to more than one antecedent, this information is easily represented by means of an attribute which identifies the markable as an anaphor or a bridging expression.</Paragraph> <Paragraph position="4"> We add a further attribute for the respective relation's sub-type. For anaphoric expressions, the possible values for this attribute include direct, pronominal and IS-A, and for bridging expressions part-whole, cause-effect and entity-attribute.</Paragraph> <Paragraph position="5"> Finally, the annotation of an anaphoric or bridging markable has to be complemented with information on which markable is its antecedent. This can be accomplished by supplying the markable with a further attribute. However, selecting the correct antecedent from several candidates can require a considerable amount of interpretation on the part of the annotator. This is highly undesirable, because it is likely to force arbitrary decisions, which in turn can introduce error and inconsistency into the annotation. It would be preferable, therefore, if the explicit identification of the antecedent were optional. We achieve this by providing in our annotation scheme a means to represent cospecification. With this additional representation, the annotation of anaphoric relations in our annotation scheme is a two-step process: Upon encountering an anaphoric markable and setting its general attributes, the markable is first annotated as being cospecifying with all markables already in the relevant set of cospecifying expressions. This is the only mandatory annotation, and together with the information that the markable is of the anaphoric type it fully represents the anaphoric relation. The second, optional step consists in the specification of the markable's exact antecedent. 
Separating the annotation of anaphoric relations in this way frees the concept of antecedent to be used only in those cases where it is both relevant and unambiguously decidable. It is important to note that no relevant information appears to be lost here: Provided that the linear order of markables within the text is preserved, it should be possible to establish an antecedent to any anaphoric expression from a set of cospecifying expressions annotated within the scheme described above. Moreover, the important task of evaluating the annotation scheme is not affected either, because common evaluation algorithms for anaphor annotations (Vilain et al., 1995) do not depend on antecedence information, but treat anaphoric expressions as cospecifying equivalence classes.</Paragraph> <Paragraph position="6"> What is even more important is that by the same means we can render optional the explicit specification of bridging antecedents as well. Two cases can be distinguished here: Whenever only a single candidate for antecedence exists, specifying it is trivial. Thus, the only cases where uncertainty as to the correct antecedent of a bridging expression can arise appear to be those in which multiple cospecifying candidates are available. Since bridging (as we define it) is a relation not between lexical items, but between extra-linguistic entities, and since cospecification is a transitive relation, a bridging relation can be sufficiently expressed by specifying any of the candidates. 
The major difference from the annotation of anaphoric relations is that in the case of bridging, the selection of an antecedent is mandatory, but can be made at random, because what is really selected is not the markable but the extra-linguistic entity that it specifies.</Paragraph> </Section> <Section position="5" start_page="2" end_page="7" type="metho"> <SectionTitle> 4 Annotation Tool </SectionTitle> <Paragraph position="0"> This section deals with the question of how the annotation scheme developed in the previous section can be implemented in a real annotation tool. Before presenting our own tool MMAX, we briefly review a selection of existing tools.</Paragraph> <Section position="1" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 4.1 Existing Tools </SectionTitle> <Paragraph position="0"> The Discourse Tagging Tool (DTTool) (Aone & Bennett, 1994) is a Tcl/Tk program for the annotation and display of antecedent-anaphor relations in SGML-encoded multilingual texts. While this field of application makes it a potential candidate for the implementation of our scheme as well, it is not suitable, mainly because the tool lacks the possibility of assigning arbitrary combinations of attributes and possible values to markables, a feature that is clearly needed for the representation of different types of relations.</Paragraph> <Paragraph position="1"> CLinkA is a tool for coreference annotation. In this case, it is a more structural constraint which prevents our annotation scheme from being implemented in this tool. This constraint results from the fact that CLinkA was built to implement the annotation scheme proposed in the</Paragraph> </Section> <Section position="2" start_page="2" end_page="2" type="sub_section"> <SectionTitle> MUC-7 Coreference Task Definition (Hirschman & </SectionTitle> <Paragraph position="0"> Chinchor, 1997). 
In this scheme, cospecification is expressed in terms of antecedence only, a concept which we have shown to be problematic, and which our annotation scheme therefore avoids. Another problem with CLinkA is that it does not seem to support additional user-defined attributes either.</Paragraph> </Section> <Section position="3" start_page="2" end_page="5" type="sub_section"> <SectionTitle> The Alembic Workbench </SectionTitle> <Paragraph position="0"> is an annotation tool which, among other tasks, directly supports cospecification annotation. In contrast to DTTool and CLinkA, it also allows for the extension of the tag set, so that in principle the handling of different coreference phenomena is possible. The tool (like the other two mentioned before) processes SGML files, into which annotation tags are inserted directly during annotation.</Paragraph> <Paragraph position="1"> We regard this approach to annotation as a drawback, because it mixes the basic data (i.e. the texts to be annotated) with the annotation itself. This can give rise to problems, e.g. in cases where alternative annotations of the same data are to be compared.</Paragraph> <Paragraph position="2"> Referee, a Tcl/Tk program for coreference annotation (DeCristofaro et al., 1999), is better in this respect in that it writes the annotations to a separate file, leaving the annotated text itself unaltered. The format of this annotation file, however, is highly idiosyncratic, which makes subsequent analysis of the annotation very difficult. Moreover, this tool also represents cospecification in terms of antecedence only, making it impossible to annotate the former without specifying the latter. On the other hand, Referee directly supports user-definable attributes.</Paragraph> <Paragraph position="3"> Finally, the MATE Workbench is the most ambitious tool that we considered for the implementation of our annotation scheme. 
It has been developed in Java as a highly customizable tool for the XML-based annotation of arbitrary and possibly non-hierarchical levels of linguistic description. From a theoretical point of view, the MATE Workbench would thus be an ideal platform for the implementation of our annotation scheme. In practical terms, however, we found the performance of the program to be rather poor, rendering it practically unusable as soon as a certain corpus size was reached.</Paragraph> </Section> <Section position="4" start_page="5" end_page="7" type="sub_section"> <SectionTitle> 4.2 MMAX, an XML Annotation Tool </SectionTitle> <Paragraph position="0"> Since we found the existing tools that we considered to be insufficient for the task of implementing our annotation scheme, we decided to develop our own tool.</Paragraph> <Paragraph position="1"> MMAX is written in Java for reasons of platform independence. It processes XML-encoded text corpora which make use of standoff annotation (Thompson & McKelvie, 1997). Using this technique allows us to keep the basic data and the annotation apart. XML support in Java is realized by means of the Apache implementations of an XML parser and XSL stylesheet processor.</Paragraph> <Paragraph position="2"> In MMAX, written texts are represented in XML in terms of base-level and supra-base level elements. For each of these element types, Document Type Definitions (DTDs) exist which describe the structure of a well-formed element. In the following, we give DTD fragments and discuss their semantics.</Paragraph> <Paragraph position="3"> A word is the most straightforward base level element for a written text. 
Apart from the representation of the word itself, each element of this type has an ID attribute which serves to uniquely identify the word within the text.</Paragraph> </Section> </Section> <Section position="6" start_page="7" end_page="8" type="metho"> <SectionTitle> <!ELEMENT words (word*)> <!ELEMENT word (#PCDATA)> </SectionTitle> <Paragraph position="0"> <!ATTLIST word id ID #REQUIRED> The sequence of words which as a whole constitutes the complete text can be divided with respect to two criteria, a formal and a pragmatic one: Each word is part of a particular (formally defined) text, which consists of sentences, which in turn may be grouped into paragraphs. Each sentence has an ID and a span attribute which is a pointer to a sequence of word elements. In addition, every text can have an optional headline, which consists of any number of sentences. The formal structure of a text is described by the following DTD fragment: In pragmatic terms, on the other hand, a text can be regarded as a discourse, consisting of a series of discourse segments. Again, each discourse segment has an ID and a span attribute, as well as a function attribute indicative of its communicative function. This pragmatic structure can be translated into a DTD as follows: We use our own attribute here instead of the href attribute as defined in XPointer, because our element differs from the latter both in semantics and implementation. In MMAX, the XML elements representing markables possess a set of attributes which is only partly pre-defined: A closed set of fixed system attributes is complemented by an open set of user-definable attributes which depend on the annotation scheme that is to be implemented.</Paragraph> <Paragraph position="1"> System Attributes. Each markable has an ID attribute which uniquely identifies it. In addition, a span attribute is needed which maps the markable to one or more word elements. 
Finally, we introduce a type attribute the meaning of which will be described in the next subsection. Two additional system attributes serve to express the relations between markables. We argue that two basic relations are sufficient here.</Paragraph> <Paragraph position="2"> The first is an unlabelled and unordered relation between arbitrarily many markables, which can be interpreted as set-membership, i.e. markables standing in this relation to each other are interpreted as constituting a set. Note that the interpretation of this relation is not pre-defined and needs to be specified within the annotation scheme. In order to express a markable's membership in a certain set, a member attribute is introduced which has as its value some string specification. Set membership can thus be established/checked by unifying/comparing the member attribute values of two or more markables.</Paragraph> <Paragraph position="3"> The second is a labelled and ordered relation between two markables, which is interpreted as one markable pointing to the other. Note that here, again, the nature of this pointing is not pre-defined. However, there is a structural constraint imposed on the pointing relation which demands that each markable can point to at most one other markable. Since there is no constraint as to how many different markables can point to another one, n:1 relations can be represented. A pointer attribute is required for the expression of the pointing relation. The range of possible values for this attribute is the range of existing markables' IDs, with the exception of the current markable itself.</Paragraph> <Paragraph position="4"> The DTD fragment for markables and their system attributes looks as follows: User-definable Attributes. It is by means of its user-definable attributes that a markable obtains its semantic interpretation within an annotation scheme. But even within a single scheme, it may be required to discriminate between different types of markables. 
In MMAX, the type attribute is introduced for this purpose. This attribute does not have any pre-defined possible values. Instead, a list of these has to be supplied by the annotation scheme. For each of these values, in turn, a list of relevant attributes and possible values has to be defined by the user. Depending on which of the mutually exclusive values of the type attribute is assigned to a given markable during annotation, only the attributes relevant to this type will be offered in a separate attribute window for further specification.</Paragraph> </Section> <Section position="7" start_page="8" end_page="9" type="metho"> <SectionTitle> 5 Implementation </SectionTitle> <Paragraph position="0"> We utilize the system attribute type to discriminate between the three basic classes of markables, i.e. normal, anaphoric and bridging ones. The respective attributes and possible values for these mutually exclusive markable types can directly be adopted from the annotation scheme. Note that a subset of these is in fact identical for each type (np form, agreement and grammatical role), while other attributes' possible values vary with the type of markable: For anaphoric markables, e.g., the sub-types direct, pronominal and IS-A are relevant, which make no sense for bridging expressions, and vice versa. This is directly supported by the adaptive attribute window. Figure 1 shows the attribute window in response to the selection of a value for the type attribute.</Paragraph> <Paragraph position="1"> Cospecification between two or more markables is expressed by means of an identical member attribute. This value, though at this time realised as a string of the form set XX only, can be interpreted as what has been called universe entity elsewhere. Adding a markable to a set of cospecifying markables is accomplished in two steps: First, the set as a whole is selected by left-clicking any of its members. 
As a result, all members are displayed in a different colour, the selected one in addition being highlighted. The markable to be added is then right-clicked, and the desired action chosen from a popup menu. Figure 2 shows this situation. Note that the attribute window has changed in response to the selection of the value anaphoric for the type attribute. Specifying the antecedent to an anaphoric expression is done as follows: First, the anaphoric markable is selected by left-clicking it. The desired antecedent is then right-clicked. Finally, selecting the appropriate menu item from a popup menu causes the anaphoric markable to point to its antecedent. The antecedent and the anaphoric or bridging expression are displayed in a different colour whenever the latter is selected. Note that by combining the member and pointer attributes, cospecification and bridging can be represented simultaneously, which may be needed in cases of long-distance anaphor and short-distance bridging.</Paragraph> </Section> </Paper>
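<!--
The combination of member and pointer attributes described in Section 5 can be illustrated with a small sketch. This is not MMAX code: the Markable class, its position field, the set labels and all function names below are assumptions made for illustration only. Choosing the nearest preceding member of a cospecification set as the default antecedent is likewise just one plausible reading of the claim that an antecedent can be derived from the linear order of markables.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Markable:
    """Hypothetical in-memory view of one markable element."""
    id: str
    position: int                  # linear order in the text (assumed field)
    type: str                      # "normal", "anaphoric" or "bridging"
    member: Optional[str] = None   # label of the cospecification set, if any
    pointer: Optional[str] = None  # ID of at most one other markable


def check_pointers(markables):
    """Return IDs of markables that point to themselves or to an unknown ID."""
    ids = {m.id for m in markables}
    bad = []
    for m in markables:
        if m.pointer is None:
            continue
        if m.pointer == m.id or m.pointer not in ids:
            bad.append(m.id)
    return bad


def cospecification_sets(markables):
    """Group markable IDs by member value, in linear text order."""
    sets = {}
    for m in sorted(markables, key=lambda m: m.position):
        if m.member is not None:
            sets.setdefault(m.member, []).append(m.id)
    return sets


def default_antecedent(markable, markables):
    """Explicit pointer if set, else nearest preceding member of the same set.

    The nearest-preceding rule is an assumption of this sketch.
    """
    if markable.pointer is not None:
        return markable.pointer
    if markable.member is None:
        return None
    preceding = [m for m in markables
                 if m.member == markable.member
                 and markable.position > m.position]
    if not preceding:
        return None
    return max(preceding, key=lambda m: m.position).id


# Hypothetical example: three cospecifying markables plus one bridging markable.
EXAMPLE = [
    Markable("m1", 0, "normal", member="set_1"),
    Markable("m2", 1, "normal", member="set_2"),
    Markable("m3", 2, "anaphoric", member="set_1"),
    Markable("m4", 3, "bridging", member="set_3", pointer="m2"),
    Markable("m5", 4, "anaphoric", member="set_1", pointer="m1"),
]

if __name__ == "__main__":
    print(check_pointers(EXAMPLE))
    print(cospecification_sets(EXAMPLE))
    for m in EXAMPLE:
        print(m.id, default_antecedent(m, EXAMPLE))
```

In this sketch an explicit pointer always takes precedence, so the optional second annotation step overrides the derived default. For bridging markables, where the scheme makes the antecedent mandatory, the pointer is simply expected to be set.
-->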