<?xml version="1.0" standalone="yes"?> <Paper uid="W06-2703"> <Title>Tools to Address the Interdependence between Tokenisation and Standoff Annotation</Title> <Section position="3" start_page="19" end_page="19" type="metho"> <SectionTitle> 2 An XML-based Standoff Annotation Tool </SectionTitle> <Paragraph position="0"> In a number of recent projects we have explored the use of machine learning techniques for Named Entity Recognition (NER) and have worked with data from a number of different domains, including data from biomedicine (Finkel et al., in press; Dingare et al., 2004), law reports (Grover et al., 2004), social science (Nissim et al., 2004), and astronomy and astrophysics (Becker et al., 2005; Hachey et al., 2005). We have worked with a number of XML-based annotation tools, including the NXT annotation tool (Carletta et al., 2003; Carletta et al., in press). Since we are interested only in written text and focus on annotation for Information Extraction (IE), much of the complexity offered by the NXT tool is not required and we have therefore recently implemented our own IE-specific tool. This has much in common with NXT; in particular, annotations are encoded as standoff with pointers to the indices of the word tokens. A screenshot of the tool being used for NE annotation of biomedical text is shown in Figure 1. Figure 2 contains a fragment of the XML underlying the annotation for the excerpt 'glutamic acid in the BH3 domain of tBid (tBidG94E) was principally used'. Note that the standoff annotation is stored at the bottom of the annotated file, not in a separate file.</Paragraph> <Paragraph position="1"> This is principally to simplify file handling issues which might arise if the annotations were stored separately. Word tokens are wrapped in w elements and are assigned unique ids in the id attribute. The tokenisation is created using significantly improved upgrades of the XML tools described in Thompson et al. (1997) and Grover et al. (2000). The ents element contains all the entities that the annotator has marked, and the link between the ent elements and the words is encoded with the sw and ew attributes (start word and end word), which point at word ids. For example, the protein fragment entity with id e7 starts at the first character of the word with id w630 and ends at the last character of the word with id w644.</Paragraph> <Paragraph position="2"> Our annotation tool and the format for storing annotations that we have chosen are just one instance of a wide range of possible tools and formats for the NE annotation task. There are a number of decision points involved in the development of such tools, some of which come down to a matter of preference and some of which are consequences of other choices. Examples of annotation methods which are not primarily based on XML are GATE (Cunningham et al., 2002) and the annotation graph model of Bird and Liberman (2001).</Paragraph> <Paragraph position="3"> The GATE system organises annotations in graphs where the start and end nodes have pointers into the source document character offsets. This is an adaptation of the TIPSTER architecture (Grishman, 1997). (The UIMA system from IBM (Ferrucci and Lally, 2004) also stores annotations in a TIPSTER-style format.) The annotation graph model encodes annotations as a directed graph with fielded records on the arcs and optional time references on the nodes. This is broadly compatible with our standoff XML representation and with the TIPSTER architecture.</Paragraph>
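To make the standoff format described above concrete, a fragment in the spirit of Figure 2 is sketched below. This is an illustration of the shape of the format only: the word values shown are invented, and the exact value of the type attribute is an assumption.

    <!-- running text: one w element per token, with unique ids -->
    <w id="w630">glutamic</w> <w id="w631">acid</w> ... <w id="w644">tBid</w>
    <!-- standoff annotations, stored at the bottom of the same file -->
    <ents>
      <!-- e7 spans from the first character of w630 to the last of w644 -->
      <ent id="e7" type="prot-frag" sw="w630" ew="w644"/>
    </ents>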
<Paragraph position="4"> Our decision to use an annotation tool which has an underlying XML representation is partly for compatibility with our NLP processing methodology, where a document is passed through a pipeline of XML-based components. A second motivation is the wish to ensure quality of annotation by imposing the constraint that annotations span complete XML elements. As explained above and described in more detail in Section 4, the consequence of this approach has been that we have had to develop a method for recording cases where the tokenisation is inconsistent with an annotator's desired action, so that subsequent retokenisation does not require reannotation.</Paragraph> </Section> <Section position="4" start_page="19" end_page="22" type="metho"> <SectionTitle> 3 Tokenisation Issues </SectionTitle> <Paragraph position="0"> The most widely known examples of the NER task are the MUC competitions (Chinchor, 1998) and the CoNLL 2002 and 2003 shared tasks (Sang, 2002; Sang and De Meulder, 2003). In both cases the domain is newspaper text and the entities are general ones such as person, location, organisation, etc. For this kind of data there are unlikely to be conflicts between tokenisation and entity mark-up, and a vanilla tokenisation that splits at whitespace and punctuation is adequate. When dealing with scientific text and entities which refer to technical concepts, on the other hand, much more care needs to be taken with tokenisation.</Paragraph> <Paragraph position="1"> In the SEER project we collected a corpus of abstracts of radio astronomical papers taken from the NASA Astrophysics Data System archive, a digital library for physics, astrophysics, and instrumentation. We annotated the data for the following four entity types: Instrument-name Names of telescopes and other measurement instruments, e.g. Superconducting Tunnel Junction (STJ) camera.</Paragraph> <Paragraph position="2"> Source-name Names of celestial objects, e.g. J104433.04-012502.2, PC0953+4749.</Paragraph> <Paragraph position="3"> Source-type Types of objects, e.g. Type II Supernovae (SNe II), radio-loud quasar, type 2 QSO, starburst galaxies, low-luminosity AGNs.</Paragraph> <Paragraph position="4"> Spectral-feature Features that can be pointed to on a spectrum, e.g. Mg II emission, broad emission lines, radio continuum emission at 1.47 GHz, CO ladder from (2-1) up to (7-6), non-LTE line.</Paragraph> <Paragraph position="5"> In the Text Mining programme (TXM) we have collected a corpus of abstracts and full texts of biomedical papers taken from PubMed Central, the U.S. National Institutes of Health (NIH) free digital archive of biomedical and life sciences journal literature. We have begun to annotate the data for the following four entity types: Protein Proteins, both full names and acronyms, e.g. p70 S6 protein kinase, Kap-1, p130(Cas).</Paragraph> <Paragraph position="6"> Protein Fragment Fragments or parts of proteins, e.g. a domain of Bub1, nup53-Δ405-430.</Paragraph> <Paragraph position="7"> Protein Complex Complexes of two or more proteins, e.g. Kap95p/Kap60, DOCK2-ELMO1, RENT complex. Note that nesting of protein entities inside complexes may occur.</Paragraph> <Paragraph position="8"> Fusion Protein Fusions of two proteins or protein fragments, e.g. β-catenin-Lef1, GFP-tubulin, GFP-EB1.
Note that nesting of protein entities inside fusions may occur.</Paragraph> <Paragraph position="9"> In both the astronomy and biomedical domains, there is a high density of technical and formulaic language. The result is that the vanilla tokenisation style that was previously adequate for MUC-style NE annotation in generic newspaper text is no longer guaranteed to be a good basis for standoff NE annotation, because there will inevitably be conflicts between the way the tokenisation segments the text and the strings that the annotators want to select. In the remainder of this section we illustrate this point with examples from both domains.</Paragraph> <Section position="3" start_page="21" end_page="21" type="sub_section"> <SectionTitle> 3.1 Tokenisation of Astronomy Texts </SectionTitle> <Paragraph position="0"> In our tokenisation of the astronomy data, we initially assumed a vanilla MUC-style tokenisation which gives strong weight to whitespace as a token delimiter. This resulted in 'words' such as Si[I]λ0.4 and I([OIII]) being treated as single tokens. Retokenisation was required because the annotators wanted to highlight Si[I] and [OIII] as entities of type Spectral-feature. We also initially adopted the practice of treating hyphenated words as single tokens, so that examples such as AGN-dominated in the Source-type entity AGN-dominated NELGs were treated as one token. In this case the annotator wanted to mark AGN as an embedded Source-type entity but was unable to do so. A similar problem occurred with the Spectral-feature BAL embedded in the Source-type entity mini-BAL quasar.</Paragraph> <Paragraph position="1"> Examples such as these required us to retokenise the astronomy corpus. We then performed a one-off, ad hoc merger of the annotations that had already been created with the newly tokenised version, and asked the annotators to revisit the examples that they had previously been unable to annotate correctly.</Paragraph> </Section> <Section position="4" start_page="21" end_page="22" type="sub_section"> <SectionTitle> 3.2 Tokenisation of Biomedical Texts </SectionTitle> <Paragraph position="0"> Our starting point for tokenisation of biomedical text was to use the finer-grained tokenisation that we had developed for the astronomy data, in preference to a vanilla MUC-style tokenisation. For the most part this resulted in a useful tokenisation; for example, rules to split at hyphens and slashes resulted in a proper tokenisation of protein complexes such as Kap95p/Kap60 and DOCK2-ELMO1, which allowed for the correct annotation of both the complexes and the proteins embedded within them. However, a slash did not always cause a token split, and in cases such as ERK 1/2 the 1/2 was treated as one token, which prevented the annotator from marking up ERK 1. Our treatment of Greek characters meant that sequences containing Greek characters became single tokens when sometimes they should have been split: for example, in the string PKCδ K380R the annotator wanted to mark PKC as a protein. Material in parentheses was not split off when not preceded by white space, so that in examples such as coilin(C214) and Cdt1(193-447) the annotators were not able to mark up just the material before the left parenthesis.</Paragraph>
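The kind of rule-based splitting at issue here can be pictured with a small sketch (illustrative Python rather than the XML-based tools we actually use; the rule set is deliberately minimal):

    import re

    # Minimal sketch of splitting at hyphens and slashes (illustrative
    # only; the real tokeniser has a much richer rule set).
    def tokenise(text):
        tokens = []
        for chunk in text.split():
            # Keep hyphens and slashes as separate tokens so that the
            # parts of complexes can be annotated individually.
            tokens.extend(t for t in re.split(r"([-/])", chunk) if t)
        return tokens

    print(tokenise("Kap95p/Kap60"))  # ['Kap95p', '/', 'Kap60']
    print(tokenise("ERK 1/2"))       # ['ERK', '1', '/', '2']; a rule that
                                     # keeps fraction-like '1/2' whole would
                                     # instead block annotation of the '1'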
<Paragraph position="1"> Sequences of numeric and (possibly mixed-case) alphabetic characters were treated as single tokens, e.g. tBidG94E (see Figure 2), GAL4AD, p53TAD; in these cases the annotators wanted to mark up an initial subpart (tBid, GAL4, p53).</Paragraph> </Section> </Section> <Section position="5" start_page="22" end_page="23" type="metho"> <SectionTitle> 4 Representing Conflict in XML and Retokenisation </SectionTitle> <Paragraph position="0"> Some of the tokenisation problems highlighted in the previous section arose because the NLP specialist implementing the tokenisation rules was not an expert in either of the two domains. Many initial problems could have been avoided by a phase of consultation with the astronomy and biomedical domain experts. However, because they are not NLP experts, it would have been time-consuming to explain the NLP issues to them.</Paragraph> <Paragraph position="1"> Another way in which many of the problems could have been avoided might have been to use an extremely fine-grained tokenisation, perhaps splitting tokens at every change in character type. This would provide a strong degree of harmony between tokenisation and annotation but would be inadvisable for two reasons: firstly, segmentation into many small tokens would be likely to slow annotation down as well as give rise to more accidental mis-annotations, because the annotators would need to drag across more tokens; secondly, while larger numbers of smaller tokens may be useful for annotation, they are not necessarily appropriate for many subsequent layers of linguistic processing (see Section 5).</Paragraph> <Paragraph position="2"> The practical reality is that the answer to the question of what is the 'right' tokenisation is far from obvious, and that what is right for one level of processing may be wrong for another. We anticipate that we might tune the tokenisation component a number of times before it becomes fixed in its final state, and we need a framework that permits us this degree of freedom to experiment without jeopardising the annotation work that has already been completed.</Paragraph> <Paragraph position="3"> Our response to the conflict between tokenisation and annotation is to extend our XML-based standoff annotation tool so that it can be used by the annotators to record the places where the current tokenisation does not allow them to select a string that they want to annotate. In these cases they can override the default behaviour of the annotation tool and select exactly the string they are interested in. When this happens, the standoff annotation points to the word where the entity starts and the word where it ends as usual, but it also records start and end character offsets which show exactly which characters the annotator included as part of the entity. The protein entity e10 in the example in Figure 2 illustrates this technique: the start and end word attributes sw and ew indicate that the entity encompasses the single token tBidG94E, but the attribute eo (end offset) indicates that the annotator selected only the string tBid. Note that the annotator also correctly annotated the entire string tBidG94E as a protein fragment.</Paragraph>
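The sketch below shows how such character offsets can be used, first to resolve an entity's exact extent and then to project it onto a new tokenisation. This is illustrative Python under assumed representations (the token spans, ids, and the use of 'so'/'eo' as the start/end override attributes are inventions for the example), not the retokenisation program described below:

    # Tokens are represented as a dict: token id -> (start, end) character
    # offsets over the source text.
    def char_span(entity, tokens):
        """Absolute character span of an entity, honouring overrides."""
        start = entity.get("so", tokens[entity["sw"]][0])
        end = entity.get("eo", tokens[entity["ew"]][1])
        return start, end

    def project(entity, old_tokens, new_tokens):
        """Re-express an entity against a new tokenisation of the text."""
        start, end = char_span(entity, old_tokens)
        # Find the new tokens containing the start and end characters.
        sw = next(i for i, (s, e) in new_tokens.items() if s <= start < e)
        ew = next(i for i, (s, e) in new_tokens.items() if s < end <= e)
        out = {"sw": sw, "ew": ew}
        # Keep explicit offsets only where the new tokenisation still
        # fails to line up with the annotated string.
        if new_tokens[sw][0] != start:
            out["so"] = start
        if new_tokens[ew][1] != end:
            out["eo"] = end
        return out

    # 'tBidG94E' as one token: selecting just 'tBid' needs an end offset.
    old = {"w640": (0, 8)}                       # tBidG94E
    ent = {"sw": "w640", "ew": "w640", "eo": 4}  # annotator selected tBid
    new = {"w650": (0, 4), "w651": (4, 8)}       # retokenised: tBid | G94E
    print(project(ent, old, new))                # {'sw': 'w650', 'ew': 'w650'}

Note that after projection onto the finer tokenisation the override disappears, which is exactly the drop in offset-attribute incidence reported below.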
<Paragraph position="4"> The start and end character offset notation provides a subset of the range descriptions defined in the XPointer draft specification.</Paragraph> <Paragraph position="5"> With this method of storing the annotators' decisions, it is now possible to update the tokenisation component and retokenise the data at any point during the annotation cycle without risk of losing completed annotation and without needing to ask annotators to revisit previous work. We have developed a program which takes as input the original annotated document plus a newly tokenised but unannotated version of it, and which causes the correct annotation to be recorded in the retokenised version. Where the retokenisation accords with the annotators' needs, there will be a decrease in the incidence of start and end offset attributes. Figure 3 shows the output of retokenisation on our example. The current version of the TXM project corpus contains 38,403 sentences which have been annotated for the four protein named entities described above (50,049 entity annotations). With the initial tokenisation (Tok1) there are 1,106,279 tokens, and for 719 of the entities the annotators have used start and/or end offsets to override the tokenisation. We have defined a second, finer-grained tokenisation (Tok2) and used our retokenisation program to retokenise the corpus. This second version of the corpus contains 1,185,845 tokens, and the number of entity annotations which conflict with the new tokenisation is reduced to 99. Some of these remaining cases reflect annotator errors, while some are a consequence of the retokenisation still not being fine-grained enough. When using the annotations for training or testing, we still need a strategy for dealing with the annotations that are not consistent with our final automatic tokenisation routine (in our case, the 99 entities). We can systematically ignore the annotations or adjust them to the nearest token boundary. The important point is that we have recorded the mismatch between the tokenisation and the desired annotation and we have options for dealing with the discrepancy.</Paragraph> </Section> <Section position="6" start_page="23" end_page="24" type="metho"> <SectionTitle> 5 Tokenisation for Multiple Components </SectionTitle> <Paragraph position="0"> So far we have discussed the problem of finding the correct level of granularity of tokenisation purely in terms of obtaining the optimal basis for NER annotation. However, the reason for obtaining annotated data is to provide training material for NLP components which will be put together in a processing pipeline to perform information extraction. Given that statistically trained components such as part-of-speech (POS) taggers and NER taggers use word tokens as the fundamental unit over which they operate, their needs must be taken into consideration when deciding on an appropriate granularity for tokenisation. The implicit assumption here is that there can only be one layer of tokenisation available to all components and that this is the same layer as is used at the annotation stage. Thus, if annotation requires the tokenisation to be relatively fine-grained, this will have implications for POS and NER tagging. For example, a POS tagger trained on a more conventionally tokenised dataset might have no problem assigning a proper-noun tag to Met-tRNA/eIF2·GTP in ...
and facilitates loading of the Met-tRNA/eIF2·GTP ternary complex ...</Paragraph> <Paragraph position="1"> However, it would find it harder to assign tags to members of the 10-token sequence M et - t RNA / e IF 2 ·.</Paragraph> <Paragraph position="2"> Similarly, a statistical NER tagger typically uses information about left and right context, looking at a number of tokens (typically one or two) on either side. With a very fine-grained tokenisation, this representation of context will potentially be less informative as it might contain less actual context. For example, in the excerpt ... using a Tet-on LMP1 HNE2 cell line ...</Paragraph> <Paragraph position="3"> assuming a fine-grained tokenisation, the pair of tokens LMP and 1 make up a protein entity. The left context would be the sequence using a Tet - on and the right context would be HNE 2 cell line. Depending on the size of window used to capture context, this may or may not provide useful information.</Paragraph> <Paragraph position="4"> To demonstrate the effect that a finer-grained tokenisation can have on POS and NER tagging, we performed a series of experiments on the NER-annotated data provided for the Coling BioNLP evaluation (Kim et al., 2004), which was derived from the GENIA corpus (Kim et al., 2003). (The BioNLP data is annotated with five entities: protein, DNA, RNA, cell type and cell line.) We trained the C&C maximum entropy tagger (Curran and Clark, 2003) using default settings to obtain NER models of the BioNLP corpus for the original tokenisation (Orig), a retokenisation using the first TXM tokeniser (Tok1) and a retokenisation using the finer-grained second TXM tokeniser (Tok2) (see Section 4). In all experiments we discarded the original POS tags and performed POS tagging using the C&C tagger trained on the MedPost data (Smith et al., 2004). Table 1 shows precision, recall and f-score for the NER tagger trained and tested on these three tokenisations, and it can be seen that performance drops as tokenisation becomes more fine-grained.</Paragraph> <Paragraph position="5"> The results of these experiments indicate that care needs to be taken to achieve a sensible balance between the needs of the annotation and the needs of NLP modules. We do not believe, however, that the results demonstrate that the less fine-grained original tokenisation is necessarily the best. The experiments are a measure of the combined performance of the POS tagger and the NER tagger, and the tokenisation expectations of the POS tagger must also have an impact. We used a POS tagger trained on material whose own tokenisation most closely resembles the tokenisation of Orig (hyphenated words are not split in the MedPost training data), and it is likely that the low results for Tok1 and Tok2 are partly due to the tokenisation mismatch between training and testing material for the POS tagger. In addition, the NER tagger was used with default settings for all runs, where the left and right context is at most two tokens. We might expect an improvement in performance for Tok1 and Tok2 if the NER tagger were run with larger context windows.</Paragraph>
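To see why window size interacts with token granularity, the schematic below contrasts what a two-token window captures around the LMP1 entity under the two tokenisation styles (illustrative Python, not the C&C tagger's actual feature extraction):

    # Left/right context of size `size` around the entity tokens[i:j].
    def window(tokens, i, j, size=2):
        return tokens[max(0, i - size):i], tokens[j:j + size]

    fine = ["using", "a", "Tet", "-", "on", "LMP", "1", "HNE", "2",
            "cell", "line"]
    left, right = window(fine, 5, 7)    # entity tokens: 'LMP', '1'
    print(left, right)                  # ['-', 'on'] ['HNE', '2']

    coarse = ["using", "a", "Tet-on", "LMP1", "HNE2", "cell", "line"]
    left, right = window(coarse, 3, 4)  # entity token: 'LMP1'
    print(left, right)                  # ['a', 'Tet-on'] ['HNE2', 'cell']

With the fine-grained tokens, the two-token window is filled with fragments of a single neighbouring word; with the coarse tokens, the same window reaches two real context words.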
<Paragraph position="6"> The overall message here, therefore, is that the needs of all processors must be taken into account when searching for an optimal tokenisation, and developers should beware of bolting together components which have different expectations of the tokenisation: ideally, each should be tuned to the same tokenisation.</Paragraph> <Paragraph position="7"> There is a further reason why the original tokenisation of the BioNLP data works so well.</Paragraph> <Paragraph position="8"> During our experiments with the original data we observed that splitting at hyphens was normally not done (e.g. monocyte-specific is one token), but wherever an entity was part of a hyphenated word it was split (e.g. IL-2 -independent, where IL-2 is marked as a protein). The context of a following word which begins with a hyphen is thus a very clear indicator of entityhood. Although this will improve scores where the training and testing data are marked in the same way, it gives an unrealistic estimate of actual performance on unseen data, where we would not expect the hyphenation strategy of an automatic tokeniser to be dependent on prior knowledge of where the entities are. To demonstrate that the Orig NER model does not perform well on differently tokenised data, we tested it on the Tok1-tokenised evaluation set and obtained an f-score of 55.64%.</Paragraph> </Section> </Paper>