<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-2709">
<Title>Interlingual Annotation of Multilingual Text Corpora</Title>
<Section position="8" start_page="3" end_page="3" type="evalu">
<SectionTitle> 7 Evaluation </SectionTitle>
<Paragraph position="0"> The evaluation criteria and metrics are still in the early stages of formation and implementation, and they continue to evolve. Several possible courses exist for evaluating the annotations and the resulting structures. In the first of these, the annotations are measured according to inter-annotator agreement. For this purpose, data are collected on the annotations selected, the Omega nodes selected and the theta roles assigned. Inter-coder agreement is then measured by a straightforward match, with agreement calculated by a Kappa measure (Carletta, 1993) and a Wood standard similarity (Habash and Dorr, 2002). This is done for three agreement points: annotations, Omega selection and theta role selection.</Paragraph>
<Paragraph position="1"> At this time, the Kappa statistic's expected agreement is defined as 1/(N+1), where N is the number of choices at a given data point. In the case of Omega nodes, this means the number of matched Omega nodes (by string match) plus one for the possibility of the annotator traversing up or down the hierarchy. Multiple measures are used because it is important to have a mechanism for evaluating inter-coder consistency in the use of the IL representation language that does not depend on the assumption that there is a single correct annotation of a given text. The tools for evaluation have been modified from previous use (Habash and Dorr, 2002).</Paragraph>
<Paragraph position="2"> Second, the accuracy of the annotation is measured.</Paragraph>
<Paragraph position="3"> Here accuracy is defined as correspondence to a predefined baseline. In the initial development phase, all sites annotated the same texts, and many of the variations were discussed at that time, permitting the development of a baseline annotation. While not a useful long-term strategy, this produced a consensus baseline for the purpose of measuring the annotators' task and the solidity of the annotation standard.</Paragraph>
<Paragraph position="4"> The final measurement technique derives from the ultimate goal of using the IL representation for MT; we therefore measure the ability to generate accurate surface texts from the IL representation as annotated. At this stage, we are using an available generator, HALogen (Knight and Langkilde, 2000). A tool to convert the representation into the input format HALogen requires is being built. Following the conversion, surface forms will be generated and then compared with the originals through a variety of standard MT metrics (ISLE, 2003).</Paragraph>
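To make the first, agreement-based evaluation concrete, the following is a minimal sketch of the Kappa computation with the expected agreement of 1/(N+1) defined above. The function names and the pair-based input layout are illustrative assumptions, not the project's actual tooling (which adapts the tools of Habash and Dorr, 2002):

    def kappa(observed_agreement, n_choices):
        # Expected agreement is 1/(N+1), where N is the number of choices
        # at a data point; the extra option covers the annotator moving
        # up or down the Omega hierarchy.
        p_e = 1.0 / (n_choices + 1)
        return (observed_agreement - p_e) / (1.0 - p_e)

    def agreement_rate(pairs):
        # Fraction of annotation pairs that agree by straightforward
        # string match (hypothetical layout: one (a, b) tuple per data
        # point, one element per annotator).
        return sum(a == b for a, b in pairs) / len(pairs)

    # Example: Omega node selections from two annotators at four data
    # points, each offering N = 3 matched nodes, so p_e = 1/4.
    pairs = [("buy", "buy"), ("acquire", "buy"),
             ("person", "person"), ("agent", "agent")]
    print(kappa(agreement_rate(pairs), n_choices=3))  # (0.75 - 0.25) / 0.75 = 2/3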
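The third, generation-based evaluation can likewise be sketched as a round-trip harness. The converter and the generator invocation below are placeholders for components that are still being built, and BLEU (via NLTK) stands in for whichever of the standard MT metrics (ISLE, 2003) is ultimately applied; none of these names comes from the project itself:

    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    def convert_to_halogen(il_annotation):
        # Placeholder for the conversion tool under construction, which
        # maps the annotated IL representation to HALogen's input format.
        raise NotImplementedError

    def generate_surface(halogen_input):
        # Placeholder for invoking the HALogen generator on the
        # converted representation and returning a surface string.
        raise NotImplementedError

    def round_trip_score(il_annotation, original_sentence):
        # Generate a surface form from the annotation and score it
        # against the original; BLEU is one representative metric,
        # used here purely for illustration.
        hypothesis = generate_surface(convert_to_halogen(il_annotation))
        return sentence_bleu([original_sentence.split()], hypothesis.split(),
                             smoothing_function=SmoothingFunction().method1)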
</Section>
<Section position="9" start_page="3" end_page="3" type="evalu">
<SectionTitle> 8 Accomplishments and Issues </SectionTitle>
<Paragraph position="0"> In a short amount of time, we have identified languages and collected corpora with translations. We have selected representation elements, from parser outputs to ontologies, and have developed an understanding of how their component elements fit together. A core markup vocabulary (e.g., entity-types, event-types and participant relations) was selected. An initial version of the annotator's toolkit (Tiamat) has been developed and has gone through alpha testing. The multi-layered approach to annotation that was decided upon reduces the burden on the annotators for any given text, since the annotations build upon one another. In addition to the individual tools, an infrastructure now exists for carrying out a multi-site annotation project.</Paragraph>
<Paragraph position="1"> In the coming months we will be fleshing out the current procedures for evaluating the accuracy of an annotation and measuring inter-coder consistency.</Paragraph>
<Paragraph position="2"> From this, a multi-site evaluation will be produced and its results reported. Regression testing over the intermediate stages and representations will then be possible. Finally, a growing corpus of annotated texts will become available.</Paragraph>
<Paragraph position="3"> In addition to the issues discussed throughout the paper, a few others deserve mention. From a content standpoint, IL treatment of time and location should draw on existing work in personal name, temporal and spatial annotation (e.g., Ferro et al., 2001). An ideal IL representation would also account for causality, co-reference, aspectual content, modality, speech acts, etc. At the same time, as these items are incorporated, vagueness and redundancy must be eliminated from the annotation language. Many inter-event relations would need to be captured, such as entity reference, time reference, place reference, causal relationships and associative relationships. Finally, incorporating these requires handling cross-sentence phenomena, which remain a challenge.</Paragraph>
<Paragraph position="4"> From an MT perspective, one issue is evaluating consistency in the use of an annotation language given that any source text can result in multiple, different, legitimate translations (see Farwell and Helmreich, 2003, for discussion of evaluation in this light). Along these lines, there is the problem of annotating texts for translation without including in the annotations inferences drawn from the source text.</Paragraph>
</Section>
</Paper>