<?xml version="1.0" standalone="yes"?> <Paper uid="W05-0310"> <Title>Semantically Rich Human-Aided Machine Annotation</Title> <Section position="3" start_page="68" end_page="69" type="metho"> <SectionTitle> 2 The Lay of the Land in Annotation </SectionTitle> <Paragraph position="0"> In addition to the well-known bottlenecks of cost and inconsistency, it is widely assumed that low-level (only syntactic or &quot;light semantic&quot;) tagging is either sufficient or is inevitable given the complexity of semantic tagging. Past and ongoing tagging efforts share this point of departure.</Paragraph> <Paragraph position="1"> Numerous projects have striven to achieve text annotation via a simpler task, like translation, sometimes assuming that one language has already been tagged (e.g., Pianta and Bentivogli 2003, and references therein). But the results of such efforts are of low quality or of light semantic depth, or remain to be reported. Of significant interest is the porting of annotations across languages: for example, Yarowsky et al. 2001 present a method for automatic tagging of English and the projection of the tags to other languages; however, these tags do not include semantics.</Paragraph> <Paragraph position="2"> Post-editing of automatic annotation has been pursued in various projects (e.g., Brants 2000 and Marcus et al. 1993). The latter group carried out an early experiment in which they found that &quot;manual tagging took about twice as long as correcting [automated tagging], with about twice the inter-annotator disagreement rate and an error rate that was about 50% higher&quot; (Marcus et al. 1993). This conclusion supports the pursuit of automated tagging methods. The difference between our work and the work in the above projects, however, is that, for us, syntax is only a step in the progression toward semantics.</Paragraph> <Paragraph position="3"> Interesting time- and cost-related observations are provided in Brants 2000 with respect to the manual correction of automated POS and syntactic tagging of a German corpus (semantics is not addressed). Although the tagging itself took approximately 50 seconds per sentence (sentences averaged 17.5 tokens), the actual cost in time and money comes to about 10 minutes per sentence by the time two taggers have carried out the task, their results have been compared, difficult issues have been resolved, and the taggers have been trained in the first place. Notably, however, this effort used students as taggers, not professionals. We, by contrast, use professionals to check and correct TMRs and thus reduce to practically zero the training time, the need for multiple annotators (provided the size of a typical annotation task is commensurate with those in current projects), and the costly correction of errors.</Paragraph> <Paragraph position="4"> Among past projects that have addressed semantic annotation are the following: 1. Gildea and Jurafsky (2002) created a stochastic system that labels case roles of predicates with either abstract (e.g., AGENT, THEME) or domain-specific (e.g., MESSAGE, TOPIC) roles. The system trained on 50,000 words of hand-annotated text (produced by the FrameNet project). When tasked to segment constituents and identify their semantic roles (with fillers being undisambiguated textual strings), the system scored in the 60s in precision and recall. 
Limitations of the system include its reliance on hand-annotated data and on prior knowledge of the predicate frame type (i.e., it lacks the capacity to disambiguate productively).</Paragraph> <Paragraph position="5"> Semantics in this project is limited to case roles.</Paragraph> <Paragraph position="6"> 2. The goal of the &quot;Interlingual Annotation of Multilingual Text Corpora&quot; project (http://aitc.aitcnet.org/nsf/iamtc/) is to create a methodology for syntactic and semantic annotation representation and to test it on seven languages (English, Spanish, French, Arabic, Japanese, Korean, and Hindi). The semantic representation, however, is restricted to those aspects of syntax and semantics that the developers believe can be consistently handled well by hand annotators for many languages. The current stage of development includes only syntax and light semantics - essentially, thematic roles.</Paragraph> </Section> <Section position="4" start_page="69" end_page="69" type="metho"> <SectionTitle> 3. In the ACE project </SectionTitle> <Paragraph position="0"> (http://www.ldc.upenn.edu/Projects/ACE/intro.html), annotators carry out manual semantic annotation of texts in English, Chinese and Arabic to create training and test data for research task evaluations.</Paragraph> <Paragraph position="1"> The downside of this effort is that the inventory of semantic entities, relations and events is very small, and the resulting semantic representations are therefore coarse-grained: e.g., there are only five event types. The project description promises more fine-grained descriptors and relations among events in the future.</Paragraph> <Paragraph position="2"> 4. Another response to the insufficiency of syntax-only tagging is offered by the developers of PropBank, the semantic extension of the Penn Treebank.</Paragraph> <Paragraph position="3"> Kingsbury et al. 2002 report: &quot;It was agreed that the highest priority, and the most feasible type of semantic annotation, is coreference and predicate argument structure for verbs, participial modifiers and nominalizations&quot;, and this is what is included in PropBank.</Paragraph> <Paragraph position="4"> To summarize, previous tagging efforts that have addressed semantics at all have covered only a relatively small subset of semantic phenomena.</Paragraph> <Paragraph position="5"> OntoSem, by contrast, produces a far richer annotation, carried out largely automatically, within an environment that will improve over time and with use.</Paragraph> </Section> <Section position="5" start_page="69" end_page="70" type="metho"> <SectionTitle> 3 A Snapshot of OntoSem </SectionTitle> <Paragraph position="0"> OntoSem is a text-processing environment that takes as input unrestricted raw text and carries out preprocessing, morphological analysis, syntactic analysis, and semantic analysis, with the results of semantic analysis represented as formal text-meaning representations (TMRs) that can then be used as the basis for many applications (for details, see, e.g., Nirenburg and Raskin 2004, Beale et al. 2003).</Paragraph>
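Schematically, the flow just described - from raw text through preprocessing and syntactic analysis to a semantic analysis that yields a TMR - can be pictured as follows. This is a minimal illustrative sketch only: every function, class, and field name is an invented stand-in for exposition, not the actual OntoSem API.

    # Illustrative sketch of the pipeline described above; all names here are
    # invented stand-ins, not the real OntoSem API.
    from dataclasses import dataclass, field
    from typing import Dict, List, Tuple

    @dataclass
    class TMR:
        # Concept instances and their properties,
        # e.g. {"ACCEPT-1": {"AGENT": "GOVERNMENT-1"}}
        instances: Dict[str, Dict[str, object]] = field(default_factory=dict)

    def preprocess(text: str) -> List[Tuple[str, str]]:
        # Stub: the real stage finds root words, POS tags, morphological
        # features, sentence boundaries, named entities, dates, and numbers.
        return [(token, "unknown-POS") for token in text.split()]

    def parse(tokens: List[Tuple[str, str]]) -> dict:
        # Stub: OntoSem builds a lexicalized grammar on the fly per sentence
        # instead of using one large, monolithic grammar.
        return {"root": tokens[0][0] if tokens else None, "tokens": tokens}

    def semantic_analysis(tree: dict) -> TMR:
        # Stub: the real stage maps words to ontological concepts via the
        # lexicon and the ontology, yielding a basic TMR.
        return TMR(instances={"EVENT-1": {"THEME": tree["root"]}})

    def analyze(text: str) -> TMR:
        return semantic_analysis(parse(preprocess(text)))

    tmr = analyze("The government agreed to the visit.")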
<Paragraph position="1"> Text analysis relies on: * The OntoSem language-independent ontology, which is written using a metalanguage of description and currently contains around 6,000 concepts, each of which is described by an average of 16 properties.</Paragraph> <Paragraph position="2"> * An OntoSem lexicon for each language processed, which contains syntactic and semantic zones (linked using variables) as well as calls to procedural semantic routines when necessary.</Paragraph> <Paragraph position="3"> The semantic zone most frequently refers to ontological concepts, either directly or with property-based modifications, but can also describe word meaning extra-ontologically, for example, in terms of modality, aspect, time, etc. The current English lexicon contains approximately 25,000 senses, including most closed-class items and many of the most frequent and polysemous verbs, as targeted by corpus analysis. (An extensive description of the lexicon, formatted as a tutorial, can be found at http://ilit.umbc.edu.) * The OntoSem text analyzers, which cover preprocessing, syntactic analysis, semantic analysis, and the creation of TMRs. Instead of using a large, monolithic grammar of a language, which leads to ambiguity and inefficiency, we use a special lexicalized grammar created on the fly for each input sentence (Beale et al. 2003). Syntactic rules are generated from the lexicon entries of each of the words in the sentence, and are supplemented by a small inventory of generalized rules. We augment this basic grammar with transformations triggered by words or features present in the input sentence.</Paragraph> <Paragraph position="4"> * The TMR language, which is the metalanguage for representing text meaning.</Paragraph> <Paragraph position="5"> Creating gold standard TMRs involves running text through the OntoSem processors and checking/correcting the output after three stages of analysis: preprocessing, syntactic analysis, and semantic analysis. These outputs can be viewed and edited as text or as visual representations through the DEKADE interface. Although the gold standard TMR itself does not reflect the results of preprocessing or syntactic analysis, the gold standard results of those stages of processing are stored in the system and can be converted into a more traditional annotation format.</Paragraph> </Section> <Section position="6" start_page="70" end_page="72" type="metho"> <SectionTitle> 4 TMRs in DEKADE </SectionTitle> <Paragraph position="0"> TMRs represent propositions connected by discourse relations (since space permits only the briefest of descriptions, interested readers are directed to Nirenburg and Raskin 2004, Chapter 6 for details). Propositions are headed by instances of ontological concepts, parameterized for modality, aspect, proposition time, overall TMR time, and style. Each proposition is related to other instantiated concepts using ontologically defined relations (which include case roles and many others) and attributes. Coreference links form an additional layer of linking between instantiated concepts. OntoSem microtheories devoted to modality, aspect, time, style, reference, etc., undergo iterative extensions and improvements in response to system needs as diagnosed during the processing of actual texts.</Paragraph>
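To make this structure concrete, the following hand-built fragment, written as a Python dictionary, sketches what such a representation might look like for a proposition of the kind discussed below. All concept and property names here are schematic inventions - they are not output of the system and not entries in the real ontology.

    # Hand-built, simplified sketch (invented names; not real OntoSem output)
    # of a TMR fragment: instances of ontological concepts head propositions;
    # case roles and other properties relate instances; modality, aspect, and
    # time parameterize the proposition; coreference links form a separate
    # layer over the instances.
    tmr = {
        "ACCEPT-1": {                         # instance of concept ACCEPT
            "AGENT": "GOVERNMENT-1",          # case role filled by an instance
            "THEME": "VISIT-1",
            "TIME": "before speech-time",     # proposition time, schematically
            "MODALITY": {"type": "epistemic", "value": 1.0},
            "ASPECT": {"phase": "begin"},
        },
        "GOVERNMENT-1": {"LOCATION": "NATION-1"},
        "VISIT-1": {"AGENT": "HUMAN-1", "DESTINATION": "NATION-2"},
        "coreference": [("NATION-1", "NATION-2")],  # instances that co-refer
    }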
<Paragraph position="1"> We use the following sentence to walk through the processes of automatically generating TMRs and viewing/editing those TMRs to create a gold-standard annotated corpus.</Paragraph> <Paragraph position="2"> The Iraqi government has agreed to let U.S. Representative Tony Hall visit the country to assess the humanitarian crisis.</Paragraph> <Paragraph position="3"> Preprocessor. The preprocessor identifies the root word, part of speech and morphological features of each word; recognizes sentence boundaries, named entities, dates, times and numbers; and, for named entities, determines the ontological type (e.g., HUMAN, PLACE, ORGANIZATION) of the entity as well as its subparts (e.g., the first, last, and middle names of a person). For the semi-automatic creation of gold standard TMRs, much ambiguity can be removed at small cost by allowing people to correct spurious part-of-speech tags, number and date boundaries, etc., through the DEKADE environment at the preprocessor stage (see Figure 1). Clicking on w+ permits the addition of a new POS tag/analysis, and clicking on w-, the more common action, removes a spurious analysis. Preprocessor correction is a conceptually simple and logistically fast task that can be carried out by less trained, and therefore less expensive, annotators. Syntax. Syntax output can be viewed and edited in text or graphic form. The graphic viewer/editor presents the sentence using the traditional metaphor of color-coded labeled arcs.</Paragraph> <Paragraph position="4"> Mouse clicks show the components of arcs, permit arcs to be deleted along with the orphans they would leave, allow the edges of arcs to be moved, etc. (no graphics of the syntax or semantics browsers/editors are provided due to space constraints). One common error in syntax output is spurious parses due to contextually incorrect POS or feature analysis. As shown above, this can be fixed from the outset by correcting the preprocessor output. However, since the preprocessor output will always contain spurious analyses that can usually be removed automatically by the syntactic analyzer, it is not necessarily the most time-efficient strategy to always start with preprocessor editing. A more difficult, long-term research issue is genuine ambiguity caused, for example, by PP-attachment. While such issues are not likely to be solved computationally in the short term, they can be easily resolved when humans are used as the final arbiters in the creation of gold standard TMRs.</Paragraph> <Paragraph position="5"> When the correct parse is not included in the syntactic output, either the necessary lexical knowledge is lacking (i.e., there is an unknown word or word sense) or an unknown grammatical construction has been used. While the syntax-editing interface permits spot-correction of the problem by the addition of the necessary arc(s), a more fundamental knowledge-building approach is generally preferred - except when the input is nonstandard, in which case systemic modifications are avoided.</Paragraph> <Paragraph position="6"> Semantics. Within the OntoSem environment, there are two stages of text-meaning representations (TMRs): basic and extended. The basic TMR shows the basic ontological mappings and dependency structure, whereas the extended TMR shows the results of procedural semantics, including reference resolution, reasoning about time relations, etc. The basic and extended stages of TMR creation can be viewed and edited separately within DEKADE.</Paragraph> <Paragraph position="7"> TMRs can be viewed and edited in text format or graphically. In the latter view, concepts are shown as nodes and properties as the lines connecting them. A pretty-printed view of the textual extended TMR for our sample sentence, repeated for convenience, is as follows (concept names are in small caps; instance numbers are appended to them).</Paragraph> <Paragraph position="8"> The Iraqi government has agreed to let U.S. Representative Tony Hall visit the country to assess the humanitarian crisis.</Paragraph> <Paragraph position="9"> [TMR listing not preserved in the extracted text.]</Paragraph>
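Although the original listing is unavailable here, the relationship between the basic and extended stages described above can still be illustrated with a toy sketch: a deliberately simplified dictionary format with invented names, showing procedural reference resolution of the kind the extended stage performs. This is an illustration of the idea only, not the system's actual algorithm or data format.

    # Toy sketch (invented names and format) of one operation of the extended
    # stage: reference resolution that merges coreferent concept instances in
    # a TMR represented as a dict, in the style of the fragment sketched above.
    def extend(basic: dict) -> dict:
        """Merge coreferent instances and re-point role fillers to them."""
        tmr = {k: dict(v) for k, v in basic.items() if k != "coreference"}
        for kept, dropped in basic.get("coreference", []):
            tmr[kept].update(tmr.pop(dropped, {}))    # merge properties
            for frame in tmr.values():                # re-point role fillers
                for role, filler in frame.items():
                    if filler == dropped:
                        frame[role] = kept
        return tmr

    basic_tmr = {
        "VISIT-1": {"AGENT": "HUMAN-1", "DESTINATION": "NATION-2"},
        "NATION-1": {"HAS-NAME": "Iraq"},
        "NATION-2": {},
        "coreference": [("NATION-1", "NATION-2")],    # "Iraq"/"the country"
    }
    extended_tmr = extend(basic_tmr)  # VISIT-1's DESTINATION becomes NATION-1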
<Paragraph position="10"> Within the graphical browser, clicking on concept names or properties permits them to be deleted or edited, and permits new ones to be added. It also shows the expansion of any concept in text format. Evaluating and editing the semantic output is the most challenging aspect of creating gold standard TMRs, since creating formal semantic representations is arguably one of the most difficult tasks in all of NLP. If a knowledge engineer determines that some aspect of the semantic representation is incorrect, the problem can be corrected locally or by editing the knowledge resources and rerunning the analyzer. Local corrections are used, for example, in cases of metaphor and metonymy, which we do not record in our knowledge resources (we are working on a microtheory of tropes, but it is not yet implemented). In all other cases, resource supplementation is preferred; it can be carried out either immediately, or the problem can be fixed locally and a request sent to a knowledge acquirer to carry out the necessary resource enhancements.</Paragraph> <Paragraph position="11"> Striking the balance between short-term goals (a gold standard TMR for the given text) and long-term goals (better analysis of any text in the future) is always a challenge. For example, if a text contained the word grass in the sense of 'marijuana', and if the lexicon lacked the word 'grass' altogether, we would want to acquire the meaning 'green lawn cover' as well; however, doing this without constraint could mean getting bogged down in knowledge acquisition (as with the dozens of idiomatic uses of 'have') at the expense of actually producing gold-standard TMRs. There are also cases in which a local solution to semantic representation is very easy whereas a fundamental, machine-reproducible solution is very difficult.</Paragraph> <Paragraph position="12"> Consider the case of relative expressions, like respective and respectively, as used in Smith and Matthews pleaded innocent and guilty, respectively. Manually editing a TMR such that the appropriate properties are linked to their heads is quite simple, whereas writing a program for this non-trivial case of reference resolution is not.</Paragraph> <Paragraph position="13"> Thus, in some cases we push through gold standard TMR production while keeping track of - and developing as time permits - the more difficult aspects of text processing that will enhance TMR output in the future.</Paragraph> <Paragraph position="14"> The gold standard TMR for the sentence discussed at length here was produced with only a few manual corrections: changing two part-of-speech tags and selecting the correct sense for one word. The work took less than the 10 minutes per sentence that Brants 2000 reports for non-semantic tagging.</Paragraph> </Section> <Section position="7" start_page="72" end_page="72" type="metho"> <SectionTitle> 5 Porting to Other Languages </SectionTitle> <Paragraph position="0"> Recently, the need for tagged corpora for less commonly taught languages has received much attention. 
While our group is not currently pursuing such languages, it has in the past: TMRs have been automatically generated for languages such as Chinese, Georgian, Arabic and Persian. We take a short tangent to explain how OntoSem/DEKADE can be extended, at relatively low cost, to the annotation of other languages - showing yet another way in which this approach to annotation reaches beyond the results for any given text or corpus.</Paragraph> <Paragraph position="1"> While it is typical to assume that lexicons are language-specific whereas ontologies are language-independent, most aspects of the semantic structures (sem-strucs) of OntoSem lexicon entries are actually language-independent, apart from the linking of specific variables to their counterparts in the syntactic structure. Stated differently, if we consider sem-strucs - no matter what lexicon they originate from - to be building blocks of the representation of word meaning (as opposed to concept meaning, as is done in the ontology), then we understand why building a large OntoSem lexicon for English holds excellent promise for future porting to other languages: most of the work is already done. This conception of cross-linguistic lexicon development derives in large part from the Principle of Practical Effability (Nirenburg and Raskin 2004), which states that what can be expressed in one language can somehow be expressed in all other languages, be it by a word, a phrase, etc. (Of course, it is not necessary that every nuanced meaning be represented in the lexicon of every language and, as such, there will be some differences in the lexical stock of each language: e.g., whereas German has a word for white horse which will be listed in its lexicon, English will not have such a lexical entry, the collocation white horse being treated compositionally.) We do not intend to trivialize the fact that creating a new lexicon is a lot of work. It is, however, compelling to consider that a new lexicon of the same quality as our OntoSem English one could be created with little more work than would be required to build a typical translation dictionary. In fact, we recently carried out an experiment on porting the English lexicon to Polish and found that (a) much of it could be done semi-automatically and (b) the manual work for a second language is considerably less than for the first language (for further discussion, see McShane et al. 2004).</Paragraph> <Paragraph position="2"> To sum up, the OntoSem ontology and the DEKADE environment are equally suited to any language, and the OntoSem English lexicon and analyzer can be configured to new languages with much less work than their initial development required. In short, semantically rich tagging through TMR creation could be a realistic option for languages other than English.</Paragraph>
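As an illustration of why sem-struc reuse makes porting cheap, the following schematic sketch shows a Polish entry reusing the sem-struc of an English entry wholesale, with only the head word and the language-specific syntactic zone supplied anew. The entry format, words, and names are simplified inventions loosely modeled on the description above - they are not the real OntoSem lexicon format.

    # Invented, simplified entry format: the syn-struc and the variable
    # linking are language-specific, while the sem-struc (the mapping to
    # ontological concepts) is language-neutral and thus reusable.
    english_agree = {
        "word": "agree",
        "syn-struc": {"subject": "$var1", "comp": "$var2"},  # English syntax
        "sem-struc": {"ACCEPT": {"AGENT": "^$var1",          # reusable part
                                 "THEME": "^$var2"}},
    }

    def port_entry(entry: dict, new_word: str, new_syn: dict) -> dict:
        # Reuse the language-neutral sem-struc; only the head word and the
        # language-specific syn-struc need to be supplied for the new language.
        return {"word": new_word, "syn-struc": new_syn,
                "sem-struc": entry["sem-struc"]}

    polish_agree = port_entry(english_agree, "zgadzac sie",
                              {"subject": "$var1", "clause": "$var2"})

</Section> </Paper>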