<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-0310">
  <Title>Semantically Rich Human-Aided Machine Annotation</Title>
  <Section position="2" start_page="0" end_page="68" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Corpus tagging is a prerequisite for many machine learning methods in NLP but has the drawbacks of high cost, inter-annotator inconsistency and the insufficient treatment of meaning. A tagging approach that strives to ameliorate all of these drawbacks is semantically rich, human-aided machine annotation (HAMA), implemented in the OntoSem (Ontological Semantics) environment using a toolset called DEKADE: the Development, Evaluation, Knowledge Acquisition and Demonstration Environment of OntoSem.</Paragraph>
    <Paragraph position="1"> In brief, the OntoSem text analyzer takes as input open text and outputs a text-meaning representation (TMR) that represents its meaning using an ontologically grounded, language-independent metalanguage (see Nirenburg and Raskin 2004).</Paragraph>
    <Paragraph position="2"> Since the processing leading up to the production of TMR includes, in addition to semantic analysis proper, preprocessing (roughly, segmentation, treatment of named entities and morphology) and syntactic analysis, the overall annotation of text in this approach includes tags relating to all of the above levels. Since the typical input for analysis in our practice is genuine sentences, which are on average 25 words long and contain all manner of complex phenomena, it is not uncommon for the automatically generated TMRs to contain errors.</Paragraph>
    <Paragraph position="3"> These errors--which can occur at the level of preprocessing, syntactic analysis or semantic analysis--can be corrected manually using the DEKADE environment, yielding &quot;gold standard&quot; output. Making a human the final arbiter in the process means that such long-term complexities as treatment of metaphor, metonymy, PP-attachment, difficult cases of reference resolution and others can be resolved locally while we work on fundamental, implementable automatic solutions.</Paragraph>
    <Paragraph position="4"> In this paper we describe the OntoSem/DEKADE environment for the creation of gold standard TMRs, which supports the first ever annotation effort that:
* produces structures that can be used as input for both text generators and general reasoning systems: semantically rich representations of the meaning of text written in a language-independent metalanguage; these representations cover entities, propositions, relations, attributes, speaker attitudes, modalities, polarity, discourse relations, time, reference relations, and more;
* produces semantic tagging of text largely automatically, thus making more realistic and affordable the tagging of large amounts of text in finite time;
* almost fully circumvents the pitfalls of manual tagging, including human tagger errors and inconsistencies;
* produces richer semantic annotations than manual tagging realistically could, since manipulating large and complex static knowledge sources would be impossible for humans if starting from scratch (i.e., our methodology effectively turns an essay question into a multiple choice one, with most of the correct answers already provided);
* incorporates humans as final arbiters for output of three stages of text analysis (preprocessing, syntactic analysis and semantic analysis), thus maximally leveraging the automated capacity of the system but not requiring of it blanket coverage at this point in its development;
* promises to reduce, over time, the dependence on human input because an important side effect of the operation of the human-assisted machine annotation approach is enhancement of the static knowledge resources - the lexicon and the ontology - underlying the OntoSem analyzer, so that the quality of automatic text analysis will grow as the HAMA system operates, leading to an ever improving quality of raw, unedited TMRs;
* (as a corollary to the previous point) becomes more cost-efficient over time; and
* can be cost-effectively extended to other languages (including less commonly taught languages), with much less work than was required for the first language since many of the necessary resources are language-independent.
Our approach to text analysis is a hybrid of knowledge-based and corpus-based, stochastic methods.</Paragraph>
    <Paragraph position="5"> In the remainder of the paper we will briefly describe the lay of the land in text annotation (Section 2), the OntoSem environment (Section 3), the DEKADE environment for creating gold-standard TMRs from automatically generated ones (Section 4), the portability of OntoSem to other languages (Section 5), and the broader implications of this R&amp;D effort (Section 6).</Paragraph>
  </Section>
</Paper>