<?xml version="1.0" standalone="yes"?> <Paper uid="W06-2718"> <Title>A Standoff Annotation Interface between DELPH-IN Components</Title> <Section position="4" start_page="0" end_page="98" type="metho"> <SectionTitle> 3 Standoff Annotation Framework (SAF) </SectionTitle> <Paragraph position="0"> Our standoff annotation framework borrows heavily from the MAF proposal. The key components of our framework are (i) grounding in primary linguistic data via flexible standoff pointers, (ii) decoration of individual annotations with structured content, (iii) representation of structural ambiguity via a lattice of annotations and (iv) a structure of intra-annotation dependencies. In each case we have generalized heavily in order to apply SAF to a wide domain. The basic unit of SAF is the annotation. An annotation possesses properties as outlined below.</Paragraph> <Paragraph position="1"> Each annotation describes a given span in the raw linguistic data. This span is specified by from and to standoff pointers. In order to cope with different possible data formats (e.g. audio files, XML text, PDF files) we make the pointer scheme a property of each individual SAF object. Annotations with respect to an audio file may therefore use frame offsets, whilst for an XML text file we may use character (or more sophisticated xpoint-based) pointers. When processing XML text files, we have found it easiest to work with a hybrid approach to the standoff pointer scheme. Existing non-XML-aware processing components can often be easily adapted to produce (Unicode) character pointers; for XML-aware components it is easier to work with XML-aware pointing schemes, and here we use an extension of the xpoint scheme described in the XPath specification (footnote 3). 
A mapping between these two sets of points provides interconversion sufficient for our needs.</Paragraph> <Paragraph position="2"> Each annotation possesses a content, providing a (structured) description of the linguistic data covered by the annotation. For example, the content of an annotation describing a token may be the text of the token itself (see fig. 1); the content of an annotation describing a named entity may be a feature structure describing properties of the entity (see fig. 4); the content of an annotation describing the semantics of a sentence may be an RMRS description (see fig. 6). In most cases we describe this content via a simple text string, or a feature structure following the TEI/ISO specification (footnote 4). In some cases, however, other representations are more appropriate (such cases are signalled by the type property on annotations). The content will generally contain meta-information in addition to the pure content itself. The precise specification for the content of different annotation types is a current thrust of development.</Paragraph> <Paragraph position="3"> Each annotation lives in a global lattice. Use of a lattice (consisting of a set of nodes, including a special start node and end node, and a set of edges, each with a source node and a target node) allows us to handle the ambiguity seen in linguistic analyses of natural languages. For example, an automatic speech recognition system may output a word lattice, and a lattice representation can be very useful in other contexts where we do not wish to collapse the space of alternative hypotheses too early.</Paragraph> <Paragraph position="4"> Fig. 2 shows a Norwegian sentence (footnote 5) for which the token lattice is very useful. Here the possessive s clitic may attach to any word, but unlike in English no apostrophe is used. Hence it is not feasible for the tokenizer to resolve this ambiguity in tokenisation. 
The token lattice (produced by a regex-based SAF-aware preprocessor) provides an elegant solution to this problem: between nodes 2 and 4 (and nodes 4 and 6) the lattice provides alternative paths (footnote 6). The parser is able to resolve the ambiguity [...] available to the preprocessor component.</Paragraph> <Paragraph position="5"> See fig. 5 for a simple representation of the token lattice, and fig. 1 for the equivalent SAF XML.</Paragraph> <Paragraph position="6"> Each annotation also lives in a hierarchy of annotation dependencies built over the lattice.</Paragraph> <Paragraph position="7"> For example, sentence splitting may be the lowest level; then from each sentence we obtain a set (lattice) of tokens; for individual tokens (or each set of tokens on a partial path through the lattice) we may obtain an analysis from a named-entity component.</Paragraph> <Paragraph position="8"> A parser may build on top of this, producing perhaps a semantic analysis for certain paths in the lattice. Each such level consists of a set of annotations, each of which may be said to build on a set of lower annotations. This is encoded by means of a depends on property on each annotation. The annotation in fig. 6 exhibits the use of the depends on property to mark its dependency on the annotation shown in fig. 3.</Paragraph> <Paragraph position="9"> A number of well-formedness constraints apply to SAF objects. For example, the ordering of standoff pointers must be consistent with the ordering of annotation elements through all paths in the lattice. Sets of annotations related (directly or indirectly) via the depends on property must lie on a single path through the lattice.</Paragraph> </Section> <Section position="5" start_page="98" end_page="99" type="metho"> <SectionTitle> 4 XML Serialization </SectionTitle> <Paragraph position="0"> Our SAF XML serialization is provided both for inter-component communication and for persistent storage. 
XML provides a clean standards-based framework in which to serialize our SAF objects. Our serialization was heavily influenced by the MAF XML serialization.</Paragraph> <Paragraph position="1"> The SAF XML serialization is contained within the top-level saf XML element. Here the pointer addressing scheme used (e.g. char for character point offsets, xpoint for our xpoint-based scheme) and the location of the primary data are specified as attributes. This element may contain an optional olac element (footnote 7) to specify metadata (e.g. creator), and a single fsm element holds the rest of the object (as shorthand we also allow a sequence of the annot elements defined below in place of the fsm). The fsm element consists of a number of state elements (with attribute id) declaring the available lattice nodes, followed by annot annotation definitions.</Paragraph> <Paragraph position="2"> Each annotation (annot) element possesses the following attributes: from and to give standoff pointers into the primary data, encoded according to the scheme specified by the saf element's addressing attribute; source and target each give a state id (absent if the annotations are listed sequentially outside of an fsm element); deps is a set of idrefs; value is shorthand for a string-valued content; type is shorthand for a particular type of annotation content. 
The annotation content, if not a value string, is represented using the TEI/ISO FSR XML format or the appropriate XML format corresponding to the annotation type.</Paragraph> </Section> <Section position="6" start_page="99" end_page="99" type="metho"> <SectionTitle> 5 Summary </SectionTitle> <Paragraph position="0"> We are in the process of SAF-enabling a number of the DELPH-IN processing components.</Paragraph> <Paragraph position="1"> A SAF-aware sentence splitter produces SAF XML describing the span of each sentence; from this, a SAF-aware (and XML-aware) preprocessor/tokeniser maps raw sentence text into a SAF XML token lattice (with some additional annotation to describe tokens such as digit sequences). External preprocessor components (such as a morphological analyser for Japanese) may also be manipulated in order to provide SAF input to the parser. SAF is integrated into the parser of the LKB grammar development environment (Copestake, 2002) and can also be used with the PET run-time parser (Callmeier, 2000). The MAF XML format (compatible with SAF) is also integrated into the HOG system, and we hope to generalize this to the full SAF framework. (Footnote 7: http://www.language-archives.org/OLAC/metadata.html)</Paragraph> </Section> </Paper>
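<!--
The lattice, standoff pointers and XML layout described in Sections 3 and 4 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the DELPH-IN implementation. The Annot class, the id attribute on annot (implied by deps being a set of idrefs, but not named in the text), the toy English string "Johns car" and the particular reading of the pointer-ordering constraint (an edge must end no later than any edge that directly follows it starts) are all hypothetical. The names saf, fsm, state, annot, addressing, from, to, source, target, deps, value and type follow the text; the attribute giving the location of the primary data is omitted because its name is not stated.

```python
from dataclasses import dataclass
import xml.etree.ElementTree as ET


@dataclass(frozen=True)
class Annot:
    """One SAF-style annotation: a lattice edge plus standoff pointers."""
    id: str           # hypothetical identifier, the target of deps idrefs
    type: str
    start: int        # the 'from' pointer (a character offset here)
    end: int          # the 'to' pointer
    source: str       # id of the lattice node this edge leaves
    target: str       # id of the lattice node this edge enters
    value: str = ""
    deps: tuple = ()  # ids of the annotations this one builds on


def pointers_consistent(annots):
    """Assumed reading of the ordering constraint: whenever edge b directly
    follows edge a in the lattice, a must end no later than b starts."""
    return all(a.end <= b.start
               for a in annots for b in annots
               if a.target == b.source)


def to_saf_xml(annots, addressing="char"):
    """Build a saf / fsm / state / annot element tree."""
    saf = ET.Element("saf", {"addressing": addressing})
    fsm = ET.SubElement(saf, "fsm")
    # Declare every lattice node mentioned by some edge.
    for node in sorted({n for a in annots for n in (a.source, a.target)}):
        ET.SubElement(fsm, "state", {"id": node})
    for a in annots:
        attrs = {"id": a.id, "type": a.type,
                 "from": str(a.start), "to": str(a.end),
                 "source": a.source, "target": a.target,
                 "value": a.value}
        if a.deps:
            attrs["deps"] = " ".join(a.deps)
        ET.SubElement(fsm, "annot", attrs)
    return saf


# Toy lattice over "Johns car": one path keeps "Johns" whole,
# the alternative path splits it into "John" + "s".
TOKENS = [
    Annot("t1", "token", 0, 5, "n0", "n2", "Johns"),
    Annot("t2", "token", 0, 4, "n0", "n1", "John"),
    Annot("t3", "token", 4, 5, "n1", "n2", "s"),
    Annot("t4", "token", 6, 9, "n2", "n3", "car"),
]
```

Calling ET.tostring(to_saf_xml(TOKENS), encoding="unicode") yields a single saf element whose fsm child lists four state elements followed by four annot elements. Keeping both tokenisations as edges between shared nodes is what lets a later component choose between them.
-->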