<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-0604">
  <Title>The semantics of markup: Mapping legacy markup schemas to a common semantics</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
2 Background
</SectionTitle>
    <Paragraph position="0"> The work reported in this paper was carried out as part of the Electronic Metastructure for Endangered Language Data (EMELD) project [emeld.org] (NSF grant 0094934) and the Data-Driven Linguistic Ontology project (NSF grant 0411348). One of the objectives of the EMELD project is the &quot;formulation and promulgation of best practice in linguistic markup of texts and lexicon.&quot; Underlying this objective is the goal of ensuring that the digital language documentation produced by linguists will be truly portable in the sense of Bird and Simons (2003): that it will transcend computer environments, scholarly communities, domains of application, and the passage of time. The project was undertaken on the basis of the following principles: (1) XML markup provides the best format for the interchange and archiving of endangered language description and documentation.</Paragraph>
    <Paragraph position="1"> (2) No single schema or set of schemas for XML markup can be imposed on all language resources.</Paragraph>
    <Paragraph position="2"> (3) The resources must nevertheless be comparable for searching, drawing inferences, etc. Simons (2003) points out the conflict between the second and third principles, and describes the following set of actions for reconciling them.</Paragraph>
    <Paragraph position="3"> (1) Develop a community consensus on shared ontologies of linguistic concepts that can serve as the basis for interoperation.</Paragraph>
    <Paragraph position="4"> (2) Define the semantics of any particular markup schema by mapping its elements and attributes to the concepts in the shared ontology that they represent.</Paragraph>
    <Paragraph position="5"> (3) Map each individual language resource onto its (partial) semantic interpretation by applying the mapping of its markup schema.</Paragraph>
    <Paragraph position="6"> (4) Perform queries and other knowledge-based operations across resources over these semantic interpretations rather than the original XML documents.</Paragraph>
    <Paragraph position="7"> The EMELD project has already begun work on the first of these action items, the creation of a sharable ontology for language documentation and description, a General Ontology for Linguistic Description (GOLD) [emeld.org/gold] (Farrar and Langendoen, 2003), which is intended to be grounded in a suitable upper ontology such as SUMO (Niles and Pease, 2001) or DOLCE (Masolo et al., 2002). GOLD is itself being written in OWL, the Web Ontology Language (McGuinness and van Harmelen, 2004), for use in Semantic Web applications. Simons (2003, 2004) also provides a 'proof of concept' for an implementation of the remaining three action items as follows.</Paragraph>
    <Paragraph position="8"> Beginning with three dictionaries that used similar but distinct markup based on the Text Encoding Initiative (TEI) guidelines (Sperberg-McQueen and Burnard, 2002), Simons created mappings from their different markup schemas to a common semantics as defined by an RDF Schema (Brickley and Guha, 2004). Such a semantic schema provides a &quot;formal definition ... of the concepts in a particular domain, including types of resources that exist, the properties that can relate pairs of resources, and the properties that can describe a single resource in terms of literal values&quot; (Simons, 2004). This mapping he called a metaschema, a formal definition of how the elements and attributes of a markup schema are to be interpreted in terms of the concepts of the semantic schema. He called the 'language' for writing metaschemas (defined via an XML DTD) a Semantic Interpretation Language (SIL).</Paragraph>
    <Paragraph position="9"> Simons performed the semantic interpretation operation in a two-step process using XSLT, first to create an interpreter for a particular metaschema and then to apply it against a source document to yield the RDF document (repository) that is its semantic interpretation. Simons then loaded the RDF repositories into a Prolog system to create a merged database of RDF triples and used Prolog's inference engine to query the semantic interpretations. Simons (2003) describes this implementation as providing a semantics of markup, rather than as devising yet another markup language for semantics. As such, it is in the spirit of efforts such as Sperberg-McQueen et al. (2000), who define the meaning of markup as the set of inferences licensed by it. However, their model does not provide for the general comparison of documents. It is also in the spirit of the proposal for a Linguistic Annotation Framework (LAF) under development by Working Group 1-1 of ISO TC 37 SC 4 [www.tc37sc4.org] (Ide and Romary, 2003; Ide, Romary and de la Clergerie, 2003), but differs from it in some significant ways. For example, our strategy does not require that the source annotations be mapped to an XML 'pivot format'. On the other hand, the LAF does not require that the source annotations be in XML to begin with. The 'data categories' of the LAF correspond to the concepts in GOLD; however, the &quot;creation of an ontology of annotation classes and types&quot; is not yet part of the LAF (Ide, Romary and de la Clergerie, 2003). Moreover, the LAF data model is confined to feature structures, whereas GOLD plans to offer feature structures as one of several data structuring alternatives. Finally, through its connection with an upper ontology, GOLD will also be related to the 'rest of the world', whereas the LAF ontology is apparently intended for linguistic structure only.</Paragraph>
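    <Paragraph position="10"> As an illustration only, the two-step interpretation and the merged query described above can be sketched in Python, which is swapped in here for the XSLT and Prolog that Simons used. The sketch is not SIL: the metaschema layout, the tag names, the ex: concept names, the helper functions, and the sample entries are all hypothetical. It shows how two differently marked-up dictionaries might be interpreted into one set of triples and queried together.

```python
import xml.etree.ElementTree as ET

# Hypothetical metaschemas: each maps one dictionary's markup to shared
# concepts. Tag names and "ex:" concept names are invented for illustration.
METASCHEMA_A = {
    "entry_tag": "entry",
    "resource_type": "ex:LexicalEntry",
    "properties": {"orth": "ex:hasForm", "pos": "ex:hasPartOfSpeech"},
}
METASCHEMA_B = {
    "entry_tag": "lexeme",
    "resource_type": "ex:LexicalEntry",
    "properties": {"hw": "ex:hasForm", "ps": "ex:hasPartOfSpeech"},
}


def make_interpreter(metaschema):
    """Step 1: turn a metaschema into an interpreter function."""
    def interpret(root, source_name):
        """Step 2: apply the interpreter to a source document, yielding triples."""
        triples = set()
        entries = root.iter(metaschema["entry_tag"])
        for number, entry in enumerate(entries, start=1):
            subject = f"{source_name}#{number}"
            triples.add((subject, "rdf:type", metaschema["resource_type"]))
            for child in entry:
                prop = metaschema["properties"].get(child.tag)
                if prop is not None and child.text:
                    triples.add((subject, prop, child.text.strip()))
        return triples
    return interpret


def build_document(root_tag, entry_tag, form_tag, pos_tag, rows):
    """Build a small source document in memory (stands in for a real file)."""
    root = ET.Element(root_tag)
    for form, pos in rows:
        entry = ET.SubElement(root, entry_tag)
        ET.SubElement(entry, form_tag).text = form
        ET.SubElement(entry, pos_tag).text = pos
    return root


def query(triples, subject=None, predicate=None, obj=None):
    """Match a triple pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if subject in (None, t[0])
        and predicate in (None, t[1])
        and obj in (None, t[2])
    )


# Invented sample data: two dictionaries with different markup.
doc_a = build_document("dictionary", "entry", "orth", "pos",
                       [("kata", "noun"), ("mira", "verb")])
doc_b = build_document("lexicon", "lexeme", "hw", "ps", [("tolo", "noun")])

# Merge the two interpretations and query over the shared concepts.
merged = (make_interpreter(METASCHEMA_A)(doc_a, "dictA")
          | make_interpreter(METASCHEMA_B)(doc_b, "dictB"))
nouns = query(merged, predicate="ex:hasPartOfSpeech", obj="noun")
print(nouns)
```

The query names only the shared property, never the source tags, so it finds noun entries from both dictionaries even though one uses pos and the other ps. A pattern match over a set of tuples is far weaker than Prolog's inference engine and stands in for it only to keep the sketch short.</Paragraph>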
  </Section>
</Paper>