File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/concl/06/w06-2703_concl.xml
Size: 1,799 bytes
Last Modified: 2025-10-06 13:55:47
<?xml version="1.0" standalone="yes"?> <Paper uid="W06-2703"> <Title>Tools to Address the Interdependence between Tokenisation and Standoff Annotation</Title> <Section position="7" start_page="24" end_page="25" type="concl"> <SectionTitle> 6 Conclusion </SectionTitle> <Paragraph position="0"> In this paper we have discussed the fact that tokenisation, especially of scientific text, is not necessarily a component that can be got right first time. In the context of annotation tools, especially where the tool makes reference to the tokenisation layer as with XML standoff, there is an interdependence between tokenisation and annotation. It is not practical to have annotators revisit their work every time the tokenisation component changes, and so we have developed a tool that allows annotators to override tokenisation where necessary.</Paragraph> <Paragraph position="1"> The annotators' actions are recorded in the XML format in such a way that we can retokenise the corpus and still faithfully reproduce the original annotation. We have provided very specific motivation for our approach from our annotation of the astronomy and biomedical domains, but we hope that this method might be taken up as a standard elsewhere, as it would provide benefits when sharing corpora: a corpus annotated in this way can be used by a third party and possibly retokenised by them to suit their needs. We also looked at the interdependence between the tokenisation used for annotation and the tokenisation requirements of POS taggers and NER taggers. We showed that it is important to provide a consistent tokenisation throughout and that experimentation is required before the optimal balance can be found. Our retokenisation tools support just this kind of experimentation.</Paragraph> </Section> </Paper>