<?xml version="1.0" standalone="yes"?> <Paper uid="W03-1901"> <Title>Outline of the International Standard Linguistic Annotation Framework</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 LAF overview </SectionTitle> <Paragraph position="0"> LAF development has proceeded by first identifying an abstract data model that can formally describe linguistic annotations, distinct from any particular representation (as defined in the previous section). This model has been discussed extensively within the language engineering community and tested on a variety of annotation types (see Ide and Romary, 2001a, 2001b, 2002).</Paragraph> <Paragraph position="1"> The data model forms the core of the framework by serving as the reference point for all annotation representation schemes.</Paragraph> <Paragraph position="2"> The overall design of LAF is illustrated in Figure 1. The fundamental principle is that the user controls the representation format for linguistic annotations, provided that the format is mappable to the data model. This mapping is accomplished via a rigid &quot;dump&quot; format, isomorphic to the data model and intended primarily for machine rather than human use.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Dump format specification </SectionTitle> <Paragraph position="0"> The data model is built around a clear separation between the structure of annotations and their content, that is, the linguistic information the annotation provides. The model therefore combines a structural metamodel, that is, an abstract structure shared by all documents of a given type (e.g. syntactic annotation), and a set of data categories associated with the various components of the structural metamodel. The structural component of the data model is a feature structure graph capable of referencing n-dimensional regions of primary data as well as other annotations. 
The choice of this model is motivated by its almost universal use in defining general-purpose annotation formats, including the Generic Modeling Tool (GMT) (Ide and Romary, 2001, 2002) and Annotation Graphs (Bird and Liberman, 2001). A small inventory of logical operations over annotation structures is specified; these operations define the model's abstract semantics. They allow the following relations among annotation fragments to be expressed: (1) parallelism: two or more annotations refer to the same data object; (2) alternatives: two or more annotations comprise a set of mutually exclusive alternatives (e.g., two possible part-of-speech assignments, before disambiguation); and (3) aggregation: two or more annotations comprise a list (ordered) or set (unordered) that should be taken as a unit.</Paragraph> <Paragraph position="1"> The feature structure graph is a graph of elementary structural nodes to which one or more data category/value pairs are attached, providing the semantics of the annotation. LAF does not provide definitions for data categories. Rather, to ensure semantic coherence, we specify a mechanism for the formal definition of categories and relations, and provide a Data Category Registry of pre-defined categories that can be used directly in annotations. Alternatively, users may define their own data categories or establish variants of categories in the Registry; in such cases, the newly defined data categories will be formalized using the same format as the definitions available in the Registry.</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Implementation </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.1 Dump format </SectionTitle> <Paragraph position="0"> The dump format is instantiated in XML. Structural nodes are represented as XML elements. The XML-based GMT will serve as a starting point for defining the dump format. 
Its applicability to diverse annotation types, including terminology, dictionaries and other lexical data (Ide et al., 2000), morphological annotation (Ide and Romary, 2002) and syntactic annotation (Ide and Romary, 2001b, 2003), demonstrates its generality.</Paragraph> <Paragraph position="1"> As specified by the LAF architecture, the GMT implements a feature structure graph. Structural grouping tags are provided to handle aggregation (grouping) and alternatives (disjunction), as described above. A &lt;feat&gt; element is used to express category/value pairs. All of these elements are recursively nestable. Therefore, hierarchical relations among annotations and annotation components can be expressed through XML element nesting. Other relations, including those among discontiguous elements, rely on XML's inter- and intra-document pointing and linkage mechanisms. Because all annotations are stand-off (i.e., in documents separate from the primary data and other annotations), the same mechanisms are used to associate annotations with both &quot;raw&quot; and XML-tagged primary data and with other annotations.</Paragraph> <Paragraph position="2"> The final XML implementation of the dump format may differ slightly from the GMT, in particular where processing concerns (e.g. ease of processing elements vs. attributes vs. content) and conciseness are taken into account. However, in their general form the mechanisms described above are sufficient to express the information required in LAF. For examples of morphological and syntactic annotation in GMT format, see Ide and Romary (2001a, 2001b, 2003).</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.2 Data Categories </SectionTitle> <Paragraph position="0"> To make data categories maximally interoperable and consistent with existing standards, the properties and relations associated with them can be formalized using RDF schemas. 
Instances of the categories themselves will be represented in RDF.</Paragraph> <Paragraph position="1"> The RDF schema ensures that each instantiation of the described objects is recognized as a sub-class of more general classes and inherits the appropriate properties. Annotations will reference the data categories via a URL identifying their instantiations in the Data Category Registry itself. The class and sub-class mechanisms provided in RDFS and its extensions in OWL will also enable creation of an ontology of annotation classes and types.</Paragraph> <Paragraph position="2"> For example, the syntactic feature defined in the ISLE/MILE format for lexical entries (Calzolari et al., 2003) can be represented in RDF. Once the object is declared in the Data Category Registry, annotations or lexicons can reference it directly. For a full example of the use of RDF-instantiated data categories, see Ide et al., in this volume. Note that RDF descriptions function much like class definitions in an object-oriented programming language: they effectively provide templates that describe how objects may be instantiated, but do not constitute the objects themselves. Thus, in a document containing an actual annotation, several objects of the same type may be instantiated, each with a different value. The RDF schema ensures that each instantiation is recognized as a sub-class of more general classes and inherits the appropriate properties.</Paragraph> <Paragraph position="3"> A formally defined set of categories will have several functions: (1) it will provide a precise semantics for annotation categories that can be either used &quot;off the shelf&quot; by annotators or modified to serve specific needs; (2) it will provide a set of reference categories onto which scheme-specific names can be mapped; and (3) it will provide a point of departure for the definition of variant or more precise categories. 
Thus the overall goal of the Data Category Registry is not to impose a specific set of categories, but rather to ensure that the semantics of data categories included in annotations (whether they exist in the Registry or not) are well-defined and understood.</Paragraph> </Section> </Section> </Paper>