<?xml version="1.0" standalone="yes"?> <Paper uid="W03-1901"> <Title>Outline of the International Standard Linguistic Annotation Framework</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 2 Terms and definitions </SectionTitle> <Paragraph position="0"> The following terms and definitions are used in the discussion that follows: Annotation: The process of adding linguistic information to language data (&quot;annotation of a corpus&quot;) or the linguistic information itself (&quot;an annotation&quot;), independent of its representation. For example, one may annotate a document for syntax using a LISP-like representation, an XML representation, etc.</Paragraph> <Paragraph position="1"> Representation: The format in which the annotation is rendered, e.g., XML, LISP, etc., independent of its content. For example, a phrase structure syntactic annotation and a dependency-based annotation may both be represented using XML, even though the annotation information itself is very different.</Paragraph> <Paragraph position="2"> Types of Annotation: We distinguish two fundamental types of annotation activity: 1. Segmentation: delimits linguistic elements that appear in the primary data, including: continuous segments (which appear contiguously in the primary data); super- and sub-segments, where groups of segments form the parts of a larger segment (e.g., contiguous word segments typically make up a sentence segment); discontinuous segments (linked continuous segments); and landmarks (e.g., time stamps) that note a point in the primary data. In current practice, segmental information may or may not appear in the document containing the primary data itself. Documents considered to be read-only, for example, might be segmented by specifying byte offsets into the primary document where a given segment begins and ends.</Paragraph> <Paragraph position="3"> 2.
Linguistic annotation: provides linguistic information about the segments in the primary data, e.g., a morpho-syntactic annotation in which a part of speech and lemma are associated with each segment in the data. Note that the identification of a segment as a word, sentence, noun phrase, etc. also constitutes linguistic annotation. In current practice, when it is possible to do so, segmentation and identification of the linguistic role or properties of a segment are often combined (e.g., syntactic bracketing, or delimiting each word in the document with an XML tag that identifies the segment as a word, sentence, etc.).</Paragraph> <Paragraph position="4"> Stand-off annotation: Annotations layered over a given primary document and instantiated in a document separate from that containing the primary data. Stand-off annotations refer to specific locations in the primary data by addressing the byte offsets, elements, etc. to which the annotation applies. Multiple stand-off annotation documents for a given type of annotation can refer to the same primary document (e.g., two different part-of-speech annotations for a given text). There is no requirement that merging stand-off annotation documents with the primary data yield a single XML-compliant document; that is, two annotation documents may specify trees over the primary data that contain overlapping hierarchies.</Paragraph> </Section></Paper>