<?xml version="1.0" standalone="yes"?> <Paper uid="W03-0804"> <Title>International Standard for a Linguistic Annotation Framework</Title> <Section position="4" start_page="2" end_page="2" type="metho"> <SectionTitle> 4 Terms and definitions </SectionTitle> <Paragraph position="0"> The following terms and definitions are used in the discussion that follows: Annotation: The process of adding linguistic information to language data (&quot;annotation of a corpus&quot;) or the linguistic information itself (&quot;an annotation&quot;), independent of its representation. For example, one may annotate a document for syntax using a LISP-like representation, an XML representation, etc.</Paragraph> <Paragraph position="1"> Representation: The format in which the annotation is rendered, e.g., XML, LISP, etc., independent of its content. For example, a phrase structure syntactic annotation and a dependency-based annotation may both be represented using XML, even though the annotation information itself is very different.</Paragraph> <Paragraph position="2"> Types of Annotation: We distinguish two fundamental types of annotation activity: 1. segmentation: delimits linguistic elements that appear in the primary data, including o continuous segments (appear contiguously in the primary data) o super- and sub-segments, where groups of segments comprise the parts of a larger segment (e.g., contiguous word segments typically comprise a sentence segment) o discontinuous segments (linking continuous segments) o landmarks (e.g., time stamps) that note a point in the primary data. In current practice, segmental information may or may not appear in the document containing the primary data itself. Documents considered to be read-only, for example, might be segmented by specifying byte offsets into the primary document where a given segment begins and ends.</Paragraph> <Paragraph position="3"> 2. 
linguistic annotation: provides linguistic information about the segments in the primary data, e.g., a morpho-syntactic annotation in which a part of speech and lemma are associated with each segment in the data. Note that the identification of a segment as a word, sentence, noun phrase, etc. also constitutes linguistic annotation. In current practice, when it is possible to do so, segmentation and identification of the linguistic role or properties of that segment are often combined (e.g., syntactic bracketing, or delimiting each word in the document with an XML tag that identifies the segment as a word, sentence, etc.).</Paragraph> <Paragraph position="4"> Stand-off annotation: Annotations layered over a given primary document and instantiated in a document separate from that containing the primary data. Stand-off annotations refer to specific locations in the primary data, by addressing byte offsets, elements, etc. to which the annotation applies. Multiple stand-off annotation documents for a given type of annotation can refer to the same primary document (e.g., two different part-of-speech annotations for a given text). 
There is no requirement that a single XML-compliant document may be created by merging stand-off annotation documents with the primary data; that is, two annotation documents may specify trees over the primary data that contain overlapping hierarchies.</Paragraph> </Section> <Section position="5" start_page="2" end_page="2" type="metho"> <SectionTitle> 5 Design principles </SectionTitle> <Paragraph position="0"> The following general principles will guide the LAF development: o The data model and document form are distinct but mappable to one another o The data model is parsimonious, general, and formally precise o The data model is built around a clear separation of structure and content o There is an inventory of logical operations supported by the data model, which define its abstract semantics o The document form is largely under user control o The mapping between the flexible document form and data model is via a rigid dump-format o The mapping from document form to the dump format is documented in an XML Schema (or the functional equivalent thereof) associated with the document o Mapping is operationalized either via a schema-based data-binding process or via a schema-derived stylesheet mapping between the user document and the dump-format document.</Paragraph> <Paragraph position="1"> o It must be possible to isolate specific layers of annotation from other annotation layers or the primary (base) data; i.e., it must be possible to create a format using stand-off annotation o The dump format must be designed to enable stream marshalling and unmarshalling. The overall architecture of LAF as dictated by these principles is given in Figure 1. The left side of the diagram represents the user-defined document form, and is labeled &quot;human&quot; to indicate that creation and editing of the resource is accomplished via human interaction with this format. This format should, to the extent possible, be human readable. 
We will support XML for these formats (e.g., by providing style sheets, examples, etc.) but not disallow other formats. The right side represents the dump format, which is machine processable, and may not be human readable as it is intended for use only in processing. This format will be instantiated in XML.</Paragraph> </Section> <Section position="6" start_page="2" end_page="2" type="metho"> <SectionTitle> 6 Practice </SectionTitle> <Paragraph position="0"> The following set of practices will guide the implementation of the LAF: o The data model is essentially a feature structure graph with a moderate admixture of algebra (e.g.</Paragraph> <Paragraph position="1"> disjunction, sets), grounded in n-dimensional regions of primary data and literals.</Paragraph> <Paragraph position="2"> o The dump format is isomorphic to the data model. o Semantic coherence is provided by a registry of features in an XML-compatible format (e.g., RDF), which can be used directly in the user-defined formats and is always used with the dump format.</Paragraph> <Paragraph position="3"> o Resources will be available to support the design and specification of document forms, for example: -XML Schemas in several normal forms based on type definitions and abstract elements that can be exploited via type derivation and/or substitution group; -XPointer design-patterns with standoff semantics; -Schema annotations specifying mapping between document form and data model; -Meta-stylesheet for mapping from annotated XML Schema to mapping stylesheets; -Data-binding stylesheets with language-specific bindings (e.g. Java).</Paragraph> <Paragraph position="4"> o Users may define their own data categories or establish variants of categories in the registry. 
In such cases, the newly defined data categories will be formalized using the same format as definitions available in the registry, and will be associated with the dump format.</Paragraph> <Paragraph position="5"> o The responsibility of converting to the dump format is on the producer of the resource.</Paragraph> <Paragraph position="6"> o The producer is responsible for documenting the mapping from the user format to the data model.</Paragraph> <Paragraph position="7"> examples following these guidelines: -The example format should illustrate use of data model/mapping -The examples will show both the left (human-readable) and right (machine processable) side formats -Examples will be provided that use existing schemes</Paragraph> </Section> <Section position="7" start_page="2" end_page="2" type="metho"> <SectionTitle> 7 Discussion </SectionTitle> <Paragraph position="0"> The framework outlined in the previous section provides for the use of any annotation format consistent with the feature structure-based data model that will be used to define the pivot format. This suggests a future scenario in which annotators may create and edit annotations in a proprietary format, transduce the annotations using available tools to the pivot format for interchange and/or processing, and if desired, transduce the pivot form of the annotations (and/or additional annotation introduced by processing) back into the proprietary format. We anticipate the future development of annotation tools that provide a user-oriented interface for specifying annotation information, and which then generate annotations in the pivot format directly. Thus the pivot format is intended to function in the same way as, for example, Java byte code functions for programmers, as a universal &quot;machine language&quot; that is interpreted by processing software into an internal representation suited to its particular requirements. 
As with Java byte code, users need never see or manipulate the pivot format; it is solely for machine consumption. Information units or data categories provide the semantics of an annotation. Data categories are the most theory- and application-specific part of an annotation scheme. Therefore, LAF includes a Data Category Registry to provide a means to formally define data categories for reference and use in annotation. To make them maximally interoperable and consistent with existing standards, RDF schemas can be used to formalize the properties and relations associated with each data category. The RDF schema ensures that each instantiation of the described objects is recognized as a sub-class of more general classes and inherits the appropriate properties.</Paragraph> <Paragraph position="1"> Annotations will reference the data categories via a URL identifying their instantiations in the Data Category Registry itself. The class and sub-class mechanisms provided in RDFS and its extensions in OWL will also enable creation of an ontology of annotation classes and types.</Paragraph> <Paragraph position="2"> A formally defined set of categories will have several functions: (1) it will provide a precise semantics for annotation categories that can be either used &quot;off the shelf&quot; by annotators or modified to serve specific needs; (2) it will provide a set of reference categories onto which scheme-specific names can be mapped; and (3) it will provide a point of departure for definition of variant or more precise categories. Thus the overall goal of the Data Category Registry is not to impose a specific set of categories, but rather to ensure that the semantics of data categories included in annotations (whether they exist in the Registry or not) are well-defined and understood.</Paragraph> <Paragraph position="3"> The data model that will define the pivot format must be capable of representing all of the information contained in diverse annotation types. 
The model we assume is a feature structure graph for annotation information, capable of referencing n-dimensional regions of primary data as well as other annotations. The choice of this model is indicated by its almost universal use in defining general-purpose annotation formats, including the Generic Modeling Tool (GMT) (Ide &amp; Romary, 2001, 2002) and Annotation Graphs (Bird &amp; Liberman, 2001). The XML-based GMT could serve as a starting point for defining the pivot format; its applicability to diverse annotation types, including terminology, dictionaries and other lexical data (Ide et al., 2000), morphological annotation (Ide &amp; Romary, 2001a; 2003) and syntactic annotation (Ide &amp; Romary, 2001b) demonstrates its generality. As specified by the LAF architecture, the GMT implements a feature structure graph, and exploits the hierarchical structure of XML elements and XML's powerful inter- and intra-document pointing and linkage mechanisms for referencing both &quot;raw&quot; and XML-tagged primary data and its annotations.</Paragraph> <Paragraph position="4"> The provision of development resources, including schemas, design patterns, and stylesheets, will enable annotators and software developers to immediately adapt to LAF. Example mappings, e.g., for XCES-encoded annotations, will also be provided.</Paragraph> </Section> </Paper>