File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/abstr/98/w98-1102_abstr.xml

Size: 4,895 bytes

Last Modified: 2025-10-06 13:49:32

<?xml version="1.0" standalone="yes"?>
<Paper uid="W98-1102">
  <Title>Encoding Linguistic Corpora</Title>
  <Section position="1" start_page="0" end_page="0" type="abstr">
    <SectionTitle>
Abstract
</SectionTitle>
    <Paragraph position="0"> This paper describes the motivation and design of the Corpus Encoding Standard (CES) (Ide et al., 1996; Ide, 1998), an encoding standard for linguistic corpora intended to meet the need for the development of standardized encoding practices for linguistic corpora. The CES identifies a minimal encoding level that corpora must achieve to be considered standardized in terms of descriptive representation (marking of structural and linguistic information). It also provides encoding conventions for more extensive encoding and for linguistic annotation, as well as a general architecture for representing corpora annotated for linguistic features. The CES has been developed taking into account several practical realities surrounding the encoding of corpora intended for use in language engineering research and applications. Full documentation of the standard is available on the World Wide Web at http://www.cs.vassar.edu/CES/.</Paragraph>
    <Paragraph position="1"> Introduction Today, corpora are considered to be indispensable to NLP work: they provide information for the creation of other resources (e.g., lexicons), enable the gathering of statistics on real-language use to inform theories and algorithms, and provide the raw materials for testing and training. Their importance is widely acknowledged: the creation of the Linguistic Data Consortium (LDC) in the United States and the European Language Resources Association (ELRA) in Europe shows the commitment of funding agencies on both sides of the Atlantic to gathering and distributing corpora for research use.</Paragraph>
    <Paragraph position="2"> In addition to creating large-scale corpora, it is also necessary to develop standards for their encoding, in order to ensure their usability and, most importantly, reusability in corpus-based NLP work. Many freely available tools exist for language-related tasks such as segmentation and part-of-speech tagging, and even more in-house tools exist in labs and research centers. Input and output formats for these tools are rarely, if ever, compatible with each other or with the encoding formats in available corpora. Translation among formats is not a matter of simple transduction: sometimes the information needed by a tool does not exist in the data; sometimes it is not unambiguously translatable; sometimes the tool cannot retain information present in the original data and it is lost in processing. As a result, enormous amounts of research time and effort are currently spent massaging data and tools for compatibility. This in itself motivates establishing common encoding formats, to avoid redundant effort. This need has been acknowledged in Europe for several years, through efforts such as EAGLES. Recently, recognizing the amount of time and effort involved in creating and annotating corpora, this need has gained the attention of North American researchers and funders as well (see, in particular, the conclusions of an NSF-sponsored international workshop on the future directions of NLP research [Hovy and Ide, 1998]).</Paragraph>
    <Paragraph position="3"> Designing a coherent encoding scheme is by no means trivial. It demands, first, the development of a sound model of the data to be represented and all its relevant features and attributes, as well as their structural, logical, linguistic, and other relationships, together with consideration of processing needs. The format should provide for incremental encoding, allowing for enhancement of data with various kinds of annotation. Very few encoding formats have been designed with such considerations in view, resulting in the proliferation of a variety of encoding schemes (even within a common SGML/XML framework) which are, all too often, poorly designed and ultimately unsuitable for extensive use.</Paragraph>
    <Paragraph position="4"> This paper describes the motivation and design of the Corpus Encoding Standard (CES) (Ide et al., 1996; Ide, 1998), an encoding standard for linguistic corpora intended to meet the need for the principled development of standardized encoding practices for linguistic corpora. The CES was initiated within the European projects EAGLES (in particular, the EAGLES Text Representation subgroup) and Multext (EU-LRE), together with the Vassar/CNRS collaboration (supported by the U.S. National Science Foundation). The CES has so far been used in several pan-European corpus encoding projects, including PAROLE and TELRI, as well as numerous smaller projects in both Europe and North America, and it has recently been adopted as a basis for the TIPSTER document attributes and annotations.</Paragraph>
  </Section>
</Paper>