File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/06/w06-2703_intro.xml
Size: 3,084 bytes
Last Modified: 2025-10-06 14:04:05
<?xml version="1.0" standalone="yes"?> <Paper uid="W06-2703"> <Title>Tools to Address the Interdependence between Tokenisation and Standoff Annotation</Title> <Section position="2" start_page="0" end_page="19" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> A primary consideration when designing an annotation tool for tasks such as Named Entity (NE) annotation is to provide an interface that makes it easy for the annotator to select contiguous stretches of text for labelling (Carletta et al., 2003; Carletta et al., in press). This can be accomplished by enabling actions such as clicking and snapping to the ends of word tokens. Not only do such features make the task easier for annotators, they also help to reduce certain kinds of annotator error that can occur with interfaces which require the annotator to sweep out an area of text: without the safeguard of requiring annotations to span entire tokens, it is easy to sweep out too little or too much text and create an annotation which takes in too few or too many characters. Thus the tokenisation of the text should achieve an optimal balance between increasing annotation speed and reducing the annotation error rate. In Section 2 we describe a recently implemented XML-based annotation tool which we have used to create an NE-annotated corpus in the biomedical domain. This tool uses standoff annotation in a similar way to the NXT annotation tool (Carletta et al., 2003; Carletta et al., in press), though the annotations are recorded in the same file, rather than in a separate file.</Paragraph> <Paragraph position="1"> To perform annotation with this tool, it is necessary first to tokenise the text and identify sentence and word tokens. 
We have found, however, that conflicts can arise between the segmentation that the tokeniser creates and the segmentation that the annotator needs, especially in scientific text, where many details of correct tokenisation are not apparent in advance to a non-expert in the domain.</Paragraph> <Paragraph position="2"> We discuss this problem in Section 3 and illustrate it with examples from two domains, biomedicine and astrophysics.</Paragraph> <Paragraph position="3"> In order to meet the requirements of both the annotation tool and the tokenisation needs of the annotators, we have extended our tool to allow the annotator to override the initial tokenisation where necessary, and we have developed a method of recording the result of overriding in the XML mark-up. This allows us to keep a record of the optimal annotation and ensures that it will not be necessary to take the expensive step of having data reannotated in the event that the tokenisation needs to be redone. As improved tokenisation procedures become available, we can retokenise both the annotated material and the remaining unannotated data using a program which we have developed for this task. We describe the extension to the annotation tool, the XML representation of conflict and the retokenisation program in Section 4.</Paragraph> </Section> </Paper>