File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/04/w04-3111_metho.xml
Size: 20,452 bytes
Last Modified: 2025-10-06 14:09:29
<?xml version="1.0" standalone="yes"?> <Paper uid="W04-3111"> <Title>Integrated Annotation for Biomedical Information Extraction</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Guidelines for Entity Annotation </SectionTitle> <Paragraph position="0"> Annotation has been proceeding for both the oncology and the inhibition domains. Here we give a summary of the main features of the annotation guidelines that have been developed. We have been influenced in this by previous work in annotation for biomedical information extraction (Ohta et al., 2002; Gaizauskas et al., 2003). However, we differ in the domains we are annotating and the design philosophy for the entity guidelines. For example, we have been concentrating on explicit concepts for entities like genes rather than developing a wide-range ontology for the various physical instantiations.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Oncology Domain </SectionTitle> <Paragraph position="0"> Gene Entity For the sake of this project the definition for &quot;Gene Entity&quot; has two significant characteristics. First, &quot;Gene&quot; refers to a composite entity as opposed to the strict biological definition. As has been noted by others, there are often ambiguities in the usage of the entity names. For example, it is sometimes unclear as to whether it is the gene or protein being referenced, or the same name might refer to the gene or the protein at different locations in the same document. Our approach to this problem is influenced by the named entity annotation in the Automatic Content Extraction (ACE) project (Consortium, 2002), in which &quot;geopolitical&quot; entities can have different roles, such as &quot;location&quot; or &quot;organization&quot;. 
Analogously, we consider a &quot;gene&quot; to be a composite entity that can have different roles throughout a document.</Paragraph> <Paragraph position="1"> Standardization of &quot;Gene&quot; references between different texts and between gene synonyms is handled by externally referencing each instance to a standard ontology (Ashburner et al., 2000).</Paragraph> <Paragraph position="2"> In the context of this project, &quot;Gene&quot; refers to a conceptual entity as opposed to the specific manifestation of a gene (i.e. an allele or nucleotide sequence). Therefore, we consider genes to be abstract concepts identifying genomic regions often associated with a function, such as MYC or TrkB; we do not consider actual instances of such genes within the gene-entity domain. Since we are interested in the association between Gene-entities and malignancies, for this project genes are of interest to us when they have an associated variation event. Therefore, the combination of Gene entities and Variation events provides us with an evoked entity representing the specific instance of a gene.</Paragraph> <Paragraph position="3"> Variation Events as Relations Variations comprise a relationship between the following entities: Type (e.g.</Paragraph> <Paragraph position="4"> point mutation, translocation, or inversion), Location (e.g. codon 14, 1p36.1, or base pair 278), Original-State (e.g. Alanine), and Altered-State (e.g. Thymine). These four components represent the key elements necessary to describe any genomic variation event. Variations are often underspecified in the literature, frequently having only two or three of these specifications. 
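As a minimal illustration of a relation whose components may be missing, a variation could be held in a record with four optional slots. This is only a sketch under our own assumptions: the class name Variation, the field names, and the helper specified are hypothetical and are not the project's annotation format.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class Variation:
    """Hypothetical record for one variation event (names are illustrative)."""
    var_type: Optional[str] = None        # e.g. "point mutation"
    location: Optional[str] = None        # e.g. "codon 14"
    original_state: Optional[str] = None  # state before the variation
    altered_state: Optional[str] = None   # state after the variation

    def specified(self):
        """Return the names of the components that the text actually supplies."""
        return [f.name for f in fields(self) if getattr(self, f.name) is not None]


# An underspecified mention: only the type and the location are given.
mention = Variation(var_type="point mutation", location="codon 14")
print(mention.specified())  # ['var_type', 'location']
```

Leaving every slot optional is what lets a single structure cover both fully specified and underspecified mentions.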
Characterizing individual variations as a relation among such components provides us with a great deal of flexibility: 1) it allows us to capture the complete variation event even when specific components are broadly spaced in the text, often spanning multiple sentences or even paragraphs; 2) it provides us with a convenient means of tracking anaphora between detailed descriptions (e.g. a point mutation at codon 14) and summary references (e.g. this variation); and 3) it provides a single structure capable of capturing the breadth of variation specifications (e.g. A→T point mutation at base pair 47, A48→G or t(11;14)(q13;32)).</Paragraph> <Paragraph position="5"> Malignancy The guidelines for malignancy annotation are under development. We are planning to define it in a manner analogous to variation, whereby a Malignancy is composed of various attribute types (such as developmental stage, behavior, topographic site, and morphology).</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 CYP Domain </SectionTitle> <Paragraph position="0"> In the CYP Inhibition annotation task we are tagging three types of entities: (1) CYP450 enzymes (cyp), (2) other substances (subst), and (3) quantitative measurements (quant). Each category has its own questions and uncertainties. Names like CYP2C19 and cytochrome P450 enzymes proclaim their membership, but there are many aliases and synonyms that do not proclaim themselves, such as 17,20-lyase. We are compiling a list of such names.</Paragraph> <Paragraph position="1"> Other substances is a potentially huge and vaguely delimited set, which in the current corpus includes grapefruit juice and red wine as well as more obviously biochemical entities like polyunsaturated fatty acids and erythromycin. The quantitative measurements we are most interested in are those directly related to inhibition, such as IC50 and K(i). 
We tag the name of the measurement, the numerical value, and the unit. For example, in the phrase ...was inhibited by troleandomycin (ED50 = 1 microM), ED50 is the name, 1 the value, and microM the unit. We are also tagging other measurements, since it is easy to do and may provide valuable information for future IE work.</Paragraph> </Section> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Integrated Annotation </SectionTitle> <Paragraph position="0"> As has been noted in the literature on biomedical IE (e.g., (Pustejovsky et al., 2002; Yakushiji et al., 2001)), the same relation can take a number of syntactic forms. For example, the family of words based on inhibit occurs commonly in MEDLINE abstracts about CYP enzymes (as in the example in the introduction) in patterns like A inhibited B, A inhibited the catalytic activity of B, inhibition of B by A, etc.</Paragraph> <Paragraph position="1"> Such alternations have led to the use of pattern-matching rules (often hand-written) to match all the relevant configurations and fill in template slots based on the resulting pattern matches. As discussed in the introduction, dealing with such complications in patterns can take much time and effort.</Paragraph> <Paragraph position="2"> Our approach instead is to build an annotated corpus in which the predicate-argument information is annotated on top of the parsing annotations in the Treebank, the resulting corpus being called a &quot;proposition bank&quot; or Propbank. 
This newly annotated corpus is then used for training processors that will automatically extract such structures from new examples.</Paragraph> <Paragraph position="3"> In a Propbank for biomedical text, the types of inhibit examples listed above would consistently have their compounds labeled as Arg0 and their enzymes labeled as Arg1, for nominalized forms such as A is an inhibitor of B, A caused inhibition of B, inhibition of B by A, as well as the standard A inhibits B. We would also be able to label adjuncts consistently, such as the &quot;with&quot; prepositional phrase in CYP3A4 activity was decreased by L, S and F with IC(50) values of about 200 mM. In accordance with other Calibratable verbs such as rise, fall, decline, etc., this phrase would be labeled as an Arg2-EXTENT, regardless of its syntactic role.</Paragraph> <Paragraph position="4"> A Propbank has been built on top of the Penn Treebank, and has been used to train &quot;semantic taggers&quot;, for extracting argument roles for the predicates of interest, regardless of the particular syntactic context.1 Such semantic taggers have been developed by using machine learning techniques trained on the Penn Propbank (Surdeanu et al., 2003; Gildea and Palmer, 2002; Kingsbury and Palmer, 2002). However, the Penn Treebank and Propbank involve the annotation of Wall Street Journal text. This text, being from the financial domain, differs in significant ways from the biomedical text, and so it is necessary for this approach to have a corpus of biomedical texts such as MEDLINE articles annotated for both syntactic structure (Treebanking) and shallow semantic structure (Propbanking).</Paragraph> <Paragraph position="5"> [...] of nominal predicate structure. This is particularly relevant for the biomedical domain, given the heavy use of nominals such as mutation and inhibition.</Paragraph> <Paragraph position="6"> In this project, the syntactic and semantic annotation is being done on a corpus which is also being annotated for entities, as described in Section 2. Since semantic taggers of the sort described above result in semantic roles assigned to syntactic tree constituents, it is desirable to have the entities correspond to syntactic constituents so that the semantic roles are assigned to entities. The entity information can function as type information and be taken advantage of by learning algorithms to help characterize the properties of the terms filling specified roles in a given predicate.</Paragraph> <Paragraph position="7"> This integration of these three different annotation levels, including the entities, is being done for the first time2, and we discuss here three main challenges to this correspondence between entities and constituents: (1) entities that are large enough to cut across multiple constituents, (2) entities within prenominal modifiers, and (3) coordination.3 One concern is the possibility of entities that contain more than one syntactic constituent and do not match any node in the syntax tree. For example, as discussed in Section 2, a variation event includes material on a variation's type, location, and state, and can cut not only across constituents, but even sentences and paragraphs. 
A simple example is point mutations at codon 12, containing both the nominal (the type of mutation) and the following NP (the location).</Paragraph> <Paragraph position="8"> Note that while in isolation this could also be considered one syntactic constituent, the NP and PP together, the actual context is ...point mutations at codon 12 in duodenal lavage fluid.... Since all PPs are attached at the same level, at codon 12 and in duodenal lavage fluid are sisters, and so there is no constituent consisting of just point mutations at codon 12.</Paragraph> <Paragraph position="9"> Casting the variation event as a relation between different component entities allows the component entities to correspond to tree constituents, while retaining the capacity to annotate and search for more complex events.</Paragraph> <Paragraph position="10"> 2An influential precursor to this integration is the system described in (Miller et al., 1996). Our work is in much the same spirit, although the representation of the predicate-argument structure via Propbank and the linkage to the entities are quite different, as well as of course the domain of annotation.</Paragraph> <Paragraph position="11"> 3There are cases where the entities are so minimal that they are contained within a NP, not including the determiner, such as CpG site in the NP a CpG site. We are not as concerned about these cases since we expect that such entity information properly contained within a base NP can be associated with the full base NP.</Paragraph> <Paragraph position="12"> In this case, one component entity, point mutations, corresponds to a (base) NP node, and at codon 12 corresponds to the PP node that is the NP's sister. 
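The correspondence between entity spans and tree nodes can be checked mechanically. The following is a rough sketch, assuming a nested-tuple encoding of the tree and token-offset spans; the function name constituent_spans and the encoding are our own illustrative choices, not the project's tools.

```python
def constituent_spans(tree, start=0):
    """Return (end, spans); spans maps (start, end) token offsets to a node label."""
    label, children = tree[0], tree[1:]
    spans = {}
    pos = start
    for child in children:
        if isinstance(child, str):   # a leaf token advances the offset by one
            pos += 1
        else:                        # a nested constituent is visited recursively
            pos, sub = constituent_spans(child, pos)
            spans.update(sub)
    spans[(start, pos)] = label
    return pos, spans


# Flat PP attachment: both PPs are sisters of the base NP.
tree = ("NP",
        ("NP", "point", "mutations"),
        ("PP", "at", ("NP", "codon", "12")),
        ("PP", "in", ("NP", "duodenal", "lavage", "fluid")))
_, spans = constituent_spans(tree)

print(spans.get((0, 2)))  # NP: "point mutations" matches a node
print(spans.get((2, 5)))  # PP: "at codon 12" matches a node
print(spans.get((0, 5)))  # None: "point mutations at codon 12" matches no node
```

In this encoding the two component spans each map to a node, while the combined span does not, which is the situation the relation-based treatment is meant to handle.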
At the same time, the relation annotation contains the information relating these two constituents.</Paragraph> <Paragraph position="13"> Similarly, while the malignancy entity definition is currently under development, as mentioned in Section 2.1, a guiding principle is that it will also be treated as a relation and broken down into component entities. While this also has conceptual benefits for the annotation guidelines, it has the fortunate effect of making such otherwise syntax-unfriendly malignancies as colorectal adenomas containing early cancer and acute myelomonocytic leukemia in remission amenable to mapping the component parts to syntactic nodes.</Paragraph> <Paragraph position="14"> Entities within Prenominal Modifiers While we are for the most part following the Penn Treebank guidelines (Bies et al., 1995), we are modifying them in two important respects. One concerns the prenominal modifiers, which in the Penn Treebank were left flat, with no structure, but in this biomedical domain contain much of the information - e.g., cancer-associated autoimmune antigen. Not only would this have had no annotation for structure, but even more bizarrely, cancer-associated would have been a single token in the Penn Treebank, thus making it impossible to capture the information as to what is associated with what. We have developed new guidelines to assign structure to prenominal entities such as breast cancer, and have changed the tokenization guidelines to break up tokens such as cancer-associated.</Paragraph> <Paragraph position="15"> Coordination We have also modified the treebank annotation to account for the well-known problem of entities that are discontinuous within a coordination structure - e.g., K- and H-ras, where the entities are K-ras and H-ras. 
Our annotation tool allows for discontinuous entities, so that both K-ras and H-ras are annotated as genes.</Paragraph> <Paragraph position="16"> Under standard Penn Treebank guidelines for tokenization and syntactic structure, this would receive the flat structure</Paragraph> <Paragraph position="18"> in which there is no way to directly associate the entity K-ras with a constituent node.</Paragraph> <Paragraph position="19"> We have modified the treebank guidelines so that K-ras and H-ras are both constituents, with the ras part of K-ras represented with an empty category co-indexed with ras</Paragraph> <Paragraph position="21"/> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Annotation Process </SectionTitle> <Paragraph position="0"> We are currently annotating MEDLINE abstracts for both the oncology and CYP domains. The flowchart for the annotation process is shown in Figure 1. Tokenization, POS-tagging, entity annotation (both domains), and treebanking are in full production. Propbank annotation and the merging of the entities and treebanking remain to be integrated into the current workflow. The table in Figure 2 shows the number of abstracts completed for each annotation area.</Paragraph> <Paragraph position="1"> The annotation sequence begins with tokenization and part-of-speech annotating. While both aspects are similar to those used for the Penn Treebank, there are some differences, partly alluded to in Section 3. Tokens are somewhat more fine-grained than in the Penn Treebank, so that H-ras, e.g., would consist of three tokens: H, -, and ras.</Paragraph> <Paragraph position="2"> Tokenized and part-of-speech annotated files are then sent to the entity annotators, either for oncology or CYP, depending on which domain the abstract has been chosen for. The entities described in Section 2 are annotated at this step. 
We are using WordFreak, a Java-based linguistic annotation tool5, for annotation of tokenization, POS, and entities. Figure 3 is a screen shot of the oncology domain annotation, here showing a variation relation being created out of component entities for type and location.</Paragraph> <Paragraph position="3"> In parallel with the entity annotation, a file is treebanked - i.e., annotated for its syntactic structure. Note that this is done independently of the entity annotation.</Paragraph> <Paragraph position="4"> This is because the treebanking guidelines are relatively stable (once they were adjusted for the biomedical domain as described in Section 3), while the entity definitions can require a significant period of study before stabilizing, and with the parallel treatment the treebanking can proceed without waiting for the entity annotation.</Paragraph> <Paragraph position="5"> However, this does mean that to produce the desired integrated annotation, the entity and treebanking annotations need to be merged into one representation. The consideration of the issues described in Section 3 has been carried out for the purpose of allowing this integration of the treebanking and entity annotation. This has been completed for some pilot documents, but the full merging remains to be integrated into the workflow system.</Paragraph> <Paragraph position="6"> As mentioned in the introduction, statistical taggers are being developed in parallel with the annotation effort. While such taggers are part of the final goal of the project, providing the building blocks for extracting entities and relations, they are also useful in the annotation process itself, so that the annotators only need to perform correction of automatically tagged data, instead of starting from scratch.</Paragraph> <Paragraph position="7"> Until recently (Feb. 
10), the part-of-speech annotation was done by hand-correcting the results of tagging the data with a part-of-speech tagger trained on a modified form of the Penn Treebank.6 The tagger is a maximum-entropy model utilizing the opennlp package available at http://www.sf.net/projects/opennlp. It has now been retrained using 315 files (122 from the oncology domain, 193 from the cyp domain). Figure 4 shows the improvement of the new vs. the old POS tagger on the same 294 files that have been hand-corrected. These results are based on testing files that have already been tokenized, and thus are an evaluation only of the POS tagger and not the tokenizer. While not directly comparable to results such as (Tateisi and Tsujii, 2004), due to the different tag sets and tokenization, they are in the same general range.7 The oncology and cyp entity annotation, as well as the treebanking, are still being done fully manually, although that will change in the near future. Initial results for a tagger to identify the various components of a variation relation are promising, although it is not yet integrated into the annotation process. The tagger is based on the implementation of Conditional Random Fields (Lafferty et al., 2001) in the Mallet toolkit (McCallum, 2002). Briefly, Conditional Random Fields are log-linear models that rely on weighted features to make predictions on the input. Features used by our system include standard pattern matching and word features as well as some expert-created regular expression features8. Using 10-fold cross-validation on 264 labelled abstracts containing 551 types, 1064 lo[...] 6Roughly, Penn Treebank tokens were split at hyphens, with the individual components then sent through a Penn Treebank-trained POS tagger, to create training data for another POS tagger. For example (JJ York-based) is treated as (NNP York) (HYPH -) (JJ based). 
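The hyphen splitting just described can be sketched in a few lines. This is a simplified illustration, assuming that every hyphen is a split point; the function name split_at_hyphens is hypothetical, and the actual tokenization guidelines may well include exceptions that are not modeled here.

```python
import re


def split_at_hyphens(token):
    """Split a token at hyphens, keeping each hyphen as its own token."""
    # The capturing group makes re.split keep the hyphens in the result;
    # empty strings (e.g. from a trailing hyphen) are dropped.
    return [piece for piece in re.split(r"(-)", token) if piece]


print(split_at_hyphens("York-based"))  # ['York', '-', 'based']
print(split_at_hyphens("H-ras"))       # ['H', '-', 'ras']
print(split_at_hyphens("K-"))          # ['K', '-']
```

The last call shows that a dangling prefix, as in the first conjunct of K- and H-ras, still yields a separate hyphen token.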
While this works reasonably well for tokenization, the POS tagger suffered severely from being trained on a corpus with such different properties. 7The tokenizer has also been retrained and the new tokenizer is being used for annotation, although we do not have the evaluation results here.</Paragraph> <Paragraph position="8"> An entity is considered correctly identified if and only if it matches the human labeling by both category (type, location or state) and span (from position a to position b). At this stage we have not distinguished between initial and final states.</Paragraph> <Paragraph position="9"> While it is difficult to compare taggers that tag different types of entities (e.g., (Friedman et al., 2001; Gaizauskas et al., 2003)), CRFs have been utilized for state-of-the-art results in NP-chunking and gene and protein tagging (Sha and Pereira, 2003; McDonald and Pereira, 2004). Currently, we are beginning to investigate methods to identify relations over the variation components that are extracted using the entity tagger.</Paragraph> </Section> </Paper>