<?xml version="1.0" standalone="yes"?> <Paper uid="W03-0804"> <Title>International Standard for a Linguistic Annotation Framework</Title> <Section position="3" start_page="1" end_page="2" type="intro"> <SectionTitle> 2 Background and rationale </SectionTitle> <Paragraph position="0"> The standardization of principles and methods for the collection, processing and presentation of language resources requires a distinct type of activity. Basic standards must be produced with wide-ranging applications in view. In the area of language resources, these standards should provide various technical committees of ISO, IEC and other standardizing bodies with the groundwork for building more precise standards for language resource management.</Paragraph> <Paragraph position="1"> The need for harmonization of representation formats for different kinds of linguistic information is critical, as resources and information are increasingly merged, compared, or otherwise utilized in common systems. This is perhaps most obvious for processing multimodal information, which must support the fusion of multimodal inputs, represent the combined and integrated contributions of different types of input (e.g., a spoken utterance combined with gesture and facial expression), and enable multimodal output (see, for example, Bunt and Romary, 2002). However, language processing applications of any kind require the integration of varieties of linguistic information, which, in today's environment, come from potentially diverse sources.</Paragraph> <Paragraph position="2"> We can therefore expect the use and integration of, for example, syntactic, morphological, and discourse information for multiple languages, as well as information structures like domain models and ontologies.</Paragraph> <Paragraph position="3"> We are aware that standardization is a difficult business, and that many members of the targeted communities are skeptical about imposing any sort of standards at all.
There are two major arguments against the idea of standardization for language resources. First, the diversity of theoretical approaches to, in particular, the annotation of various linguistic phenomena suggests that standardization is at least impractical, if not impossible. Second, it is feared that vast amounts of existing data and processing software, which may have taken years of effort and considerable funding to develop, will be rendered obsolete by the acceptance of new standards by the community. [Footnote, continued: (Max-Planck-Institut für Psycholinguistik), Fabio Vitali (Università di Bologna), Key-Sun Choi (Korterm), Jean-Michel Borde (Digital Visual), Eric Kow (LORIA).]</Paragraph> <Paragraph position="4"> Recognizing the validity of both of these concerns, WG1-1 does not seek to establish a single, definitive annotation scheme or format. Rather, the goal is to provide a framework for linguistic annotation of language resources that can serve as a reference or pivot for different annotation schemes, and which will enable their merging and/or comparison. To this end, the work of WG1-1 includes the following:
o analysis of the full range of annotation types and existing schemes, to identify the fundamental structural principles and content categories;
o instantiation of an abstract format capable of capturing the structure and content of linguistic annotations, based on the analysis in the first item;
o establishment of a mechanism for formal definition of a set of reference content categories which can be used &quot;off the shelf&quot; or serve as a point of departure for precise definition of new or modified categories;
o provision of both a set of guidelines and principles for developing new annotation schemes and concrete mechanisms for their implementation, for those who wish to use them.</Paragraph> <Paragraph position="5"> By situating all of the standards development squarely in the framework of XML and related standards such as RDF, DAML+OIL, etc., we hope to ensure not only that the standards developed by the committee provide for compatibility with established and widely accepted web-based technologies, but also that transduction from legacy formats into XML formats conformant to the new standards is feasible.</Paragraph> <Paragraph position="6"> 3 General requirements for a linguistic annotation framework
The following general requirements for a linguistic annotation framework were identified by the group of experts at Pont-à-Mousson:
Expressive adequacy: The framework must provide means to represent all varieties of linguistic information (and possibly also other types of information). This includes representing the full range of information from the very general to information at the finest level of granularity.</Paragraph> <Section position="1" start_page="1" end_page="2" type="sub_section"> <SectionTitle> Media independence </SectionTitle> <Paragraph position="0"> The framework must handle all potential media types, including text, audio, video, image, etc., and should, in principle, provide common mechanisms for handling all of them. The framework will rely on existing or developing standards for representing multimedia.</Paragraph> <Paragraph position="1"> Semantic adequacy:
o Representation structures must have a formal semantics, including definitions of logical operations.
o There must exist a centralized way of sharing descriptors and information categories.
Incrementality:
o The framework must provide support for various stages of input interpretation and output generation.
o The framework must provide for the representation of partial/under-specified results and ambiguities, alternatives, etc., and their merging and comparison.
Uniformity: Representations must utilize the same &quot;building blocks&quot; and the same methods for combining them.</Paragraph> <Paragraph position="2"> Openness: The framework must not dictate representations dependent on a single linguistic theory.</Paragraph> <Paragraph position="3"> Extensibility: The framework must provide ways to declare and interchange extensions to the centralized data category registry.
Human readability: Representations must be human readable, at least for creation and editing.</Paragraph> <Paragraph position="4"> Information in an annotation scheme must be explicit; that is, the burden of interpretation should not be left to the processing software.</Paragraph> <Paragraph position="5"> Consistency: Different mechanisms should not be used to indicate the same type of information.</Paragraph> <Paragraph position="6"> To fulfill these requirements, it is necessary to identify a consistent underlying data model for data and its annotations. A data model is a formalized description of the data objects (in terms of composition, attributes, class membership, applicable procedures, etc.) and relations among them, independent of their instantiation in any particular form. A data model capable of capturing the structure and relations in diverse types of data and annotations is a prerequisite for developing a common corpus-handling environment: it impacts the design of annotation schemas, encoding formats and data architectures, and tool architectures.</Paragraph> <Paragraph position="7"> As a starting assumption, we can conceive of an annotation as a one- or two-way link between an annotation object and a point (or a list/set of points) or span (or a list/set of spans) within a base data set. Links may or may not have a semantics (i.e., a type) associated with them.
Points and spans in the base data may themselves be objects, or sets or lists of objects. We make several observations concerning this assumption:
o the model assumes a fundamental linearity of objects in the base, e.g., as a time line (speech); a sequence of characters, words, sentences, etc.; or pixel data representing images (note that this observation applies to the fundamental structure of stored data; because the targets of a relation may be either individual objects, or sets or lists of objects, information with more than one dimension is accommodated);
o the granularity of the data representation and encoding is critical: it must be possible to uniquely point to the smallest possible component (e.g., character, phonetic component, pitch signal, morpheme, word, etc.);
o an annotation scheme must be mappable to the structures defined for annotation objects in the model;
o an encoding scheme must be able to capture the object structure and relations expressed in the model, including class membership and inheritance, which requires a sophisticated means to specify linkage within and between documents;
o it is necessary to consider the logistics of identifying spans by enclosing them in start and end tags (thus enabling hierarchical grouping of objects in the data itself), vs. explicit addressing of start and end points, which is required for read-only data;
o it must be possible to represent objects and relations in some (fairly straightforward) form that prevents information loss;
o ideally, it should be possible to represent the objects and relations in a variety of formats suitable to different tools and applications.</Paragraph> <Paragraph position="8"> ISO TC37/SC 4's goal is to develop a framework for the design and implementation of linguistic resource formats and processes in order to facilitate the exchange of information between language processing modules. A well-defined representational framework for linguistic information will also provide for the specification and comparison of existing application-specific representations and the definition of new ones, while ensuring a level of interoperability between them. The framework should allow for variation in annotation schemes while at the same time enabling comparison and evaluation, merging of different annotations, and development of common tools for creating and using annotated data.</Paragraph> <Paragraph position="9"> For this purpose we envisage a common &quot;pivot&quot; format based on a data model capable of capturing all types of information in linguistic annotations, into and out of which site-specific representation formats can be transduced. This strategy is similar to that adopted in the design of languages intended to be reusable across platforms, such as Java. The pivot format must support the communication among all modules in the system, and be adequate for representing not only the end result of interpretation, but also intermediate results.</Paragraph> </Section> </Section> </Paper>