<?xml version="1.0" standalone="yes"?> <Paper uid="W04-3101"> <Title>A resource for constructing customized test suites for molecular biology entity identification systems</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> This paper describes a methodology and data for the testing of molecular biology entity identification (EI) systems by developers and end users. Molecular biology EI systems find names of genes and gene products in free text. Several years' publication history has established precision, recall, and F-score as the de facto standards for evaluating EI systems for molecular biology texts at the publication stage and in competitions like BioCreative (www.mitre.org/public/biocreative). These measures provide important indices of a system's overall output quality. What they do not provide is the detailed sort of information about system performance that is useful for the system developer who is attempting to assess the strengths and weaknesses of a work in progress, nor do they provide detailed information to the potential consumer who would like to compare two systems against each other. Hirschman and Mani (2003) point out that different evaluation methods are useful at different points in the software life-cycle. In particular, what they refer to as feature-based evaluation via test suites is useful at two points: in the development phase, and for acceptance testing. We describe here a methodology and a set of data for constructing customized feature-based test suites for EI in the molecular biology domain. The data consists of two sets. 
One is a set of names and symbols of entities as that term is most commonly understood in the molecular biology domain--genes and gene products.</Paragraph> <Paragraph position="1"> (Sophisticated ontologies such as GENIA (Ohta et al. 2002) include other kinds of entities relevant to molecular biology as well, such as cell lines.)</Paragraph> <Paragraph position="2"> The names and symbols exemplify a wide range of the features that characterize entities in this domain--case variation, presence or absence of numbers, presence or absence of hyphenation, etc. The other is a set of sentences that exemplify a range of sentential contexts in which the entities can appear, varying with respect to the position of the entity in the sentence (initial, medial, or final), the presence of keywords like gene and protein, tokenization issues, etc. Both the entities and the sentential contexts are classified in terms of a taxonomy of features that are relevant to this domain in particular and to natural language processing and EI in general.</Paragraph> <Paragraph position="3"> The methodology consists of generating customized test suites that address specific performance issues by combining sets of entities that have particular characteristics with sets of contexts that have particular characteristics. Logical combination of subsets of characteristics of entities and contexts allows the developer to assess the effect of specific characteristics on performance, and allows the user to assess the performance of the system on types of inputs that are of particular interest to them.</Paragraph> <Paragraph position="4">
HLT-NAACL 2004 Workshop: Biolink 2004, Linking Biological Literature, Ontologies and Databases, pp. 1-8. Association for Computational Linguistics.
For example, if the developer or end-user wants to assess the ability of a system to recognize gene symbols with a particular combination of letter case, hyphenation, and presence or absence of numerals, the data and associated code that we provide can be used to generate a test suite consisting of symbols with and without that combination of features in a variety of sentential contexts.</Paragraph> <Paragraph position="5"> Inspiration for this work comes on the one hand from standard principles of software engineering and software testing, and on the other hand from descriptive linguistics (Harris 1951, Samarin 1967). In Hirschman and Mani's taxonomy of evaluation techniques, our methodology is referred to as feature-based, in that it is based on the principle of classifying the inputs to the system in terms of some set of features that are relevant to the application of interest. It is designed to provide the developer or user with detailed information about the performance of her EI system. We apply it to five molecular biology EI and information extraction systems: ABGene (Tanabe and Wilbur 2002a, Tanabe and Wilbur 2002b); KeX/PROPER (Fukuda et al. 1997); Yapex (Franzen et al. 2002); the stochastic POS tagging-based system described in Cohen et al. (in submission); and the entity identification component of Ono et al.'s information extraction system (Ono et al. 2001), and we show how it gives detailed, useful information about each that is not apparent from the standard metrics and that is not documented in the cited publications.</Paragraph> <Paragraph position="6"> (Since we are not interested in punishing system developers for graciously making their work available by pointing out their flaws, we do not refer to the various systems by name in the remainder of this paper.)</Paragraph> <Paragraph position="7"> Software testing techniques can be grouped into structured (Beizer 1990), heuristic (Kaner et al. 2002), and random categories.
Testing an EI system by running it on a corpus of texts and calculating precision, recall, and F-score for the results falls into the category of random testing. Random testing is a powerful technique, in that it is successful in finding bugs. When done for the purpose of evaluation, as distinct from testing (see Hirschman and Thompson 1997 for the distinction between the two, referred to there as performance evaluation and diagnostic evaluation), it is also widely accepted as the relevant index of performance for publication. However, its output lacks important information that is useful to a system developer (or consumer): it tells you how often the system fails, but not what it fails at; it tells you how often the system succeeds, but not where its strengths are.</Paragraph> <Paragraph position="8"> For the developer or the user, a structured test suite offers a number of advantages in answering these sorts of questions. The utility of such test suites in general software testing is well-accepted. Oepen et al. (1998) list a number of advantages of test suites vs. naturalistic corpora for testing natural language processing software in particular:</Paragraph> <Paragraph position="9"> * Control over test data: test suites allow for &quot;focussed and fine-grained diagnosis of system performance&quot; (15). This is important to the developer who wants to know exactly what problems need to be fixed to improve performance, and to the end user who wants to know that performance is adequate on exactly the data that they are interested in.</Paragraph> <Paragraph position="10"> * Systematic coverage: test suites can allow for systematic evaluation of variations in a particular feature of interest.
For example, the developer might want to evaluate how performance varies as a function of name length, or case, or the presence or absence of hyphenation within gene symbols.</Paragraph> <Paragraph position="11"> The alternative to using a structured test suite is to use a corpus, and then search through it for the relevant inputs and hope that they are actually attested.</Paragraph> <Paragraph position="12"> * Control of redundancy: while redundancy in a corpus is representative of actual redundancy in inputs, test suites allow for reduction of redundancy when it obscures the situation, or for increasing it when it is important to test handling of a feature whose importance is greater than its frequency in naturally occurring data. For example, names of genes that are similar to names of inherited diseases might make up only a small proportion of the gene names that occur in PubMed abstracts, but the user whose interests lie in curating OMIM might want to be able to assure herself that coverage of such names is adequate, beyond the level that corpus data allows.</Paragraph> <Paragraph position="13"> * Inclusion of negative data: in the molecular biology domain, a test suite can allow for systematic evaluation of potential false positives.</Paragraph> <Paragraph position="14"> * Coherent annotation: even the richest metadata is rarely adequate or exactly appropriate for the questions that one wants to ask of a corpus.</Paragraph> <Paragraph position="15"> Generation of structured, feature-based test suites obviates the need to search through corpora for the entities and contexts of interest, and allows instead the structuring of contexts and labeling of examples that is most useful to the developer.</Paragraph> <Paragraph position="16"> The goal of this paper is to describe a methodology and publicly available data set for constructing customized and refinable test suites in the molecular biology domain quickly and easily.
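The generation methodology just described--selecting entities and sentential contexts by feature, then inserting the entities into the contexts' slots--can be sketched as follows. The record layout and feature names here are simplified, hypothetical stand-ins for the distributed data, not its actual schema.

```python
# Sketch of feature-based test-suite generation: pick entities matching one
# feature profile, contexts matching another, and combine them.
# All field names and records below are illustrative, not the real data files.

def select(items, **features):
    """Return items whose feature dictionary matches every requested value."""
    return [it for it in items
            if all(it.get(k) == v for k, v in features.items())]

def generate(entities, contexts):
    """Insert each entity into the slot marker '<>' of each context sentence."""
    return [ctx["slots"].replace("<>", ent["data"])
            for ent in entities for ctx in contexts]

entities = [
    {"data": "ACOX2", "contains_hyphen": 0, "contains_number": 1},
    {"data": "bcl-2", "contains_hyphen": 1, "contains_number": 1},
]
contexts = [
    {"slots": "<> is required for pulmonary homeostasis during hyperoxia.",
     "position": "initial"},
    {"slots": "Expression of <> was elevated in all samples.",
     "position": "medial"},
]

# A customized suite: hyphenated symbols in sentence-initial position only.
suite = generate(select(entities, contains_hyphen=1),
                 select(contexts, position="initial"))
print(suite)  # ['bcl-2 is required for pulmonary homeostasis during hyperoxia.']
```

Relaxing or tightening the keyword arguments to `select` is what produces suites of differing scope from the same underlying data.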
A crucial difference between similar work that simply documents a distributable test suite (e.g. Oepen (1998) and Volk (1998)) and the work reported in this paper is that we are distributing not a static test suite, but rather data for generating test suites--data that is structured and classified in such a way as to allow software developers and end users to easily generate test suites that are customized to their own assessment needs and development questions. We build this methodology and data on basic principles of software engineering and of linguistic analysis. The first such principle involves making use of the software testing notion of the catalogue.</Paragraph> <Paragraph position="17"> A catalogue is a list of test conditions, or qualities of particular test inputs (Marick 1997). It corresponds to the features of feature-based testing, discussed in Hirschman and Mani (2003), and to the schedule (Samarin 1967:108-112) of descriptive linguistic technique. For instance, one might construct a catalogue of test conditions for numbers. Note that such a catalogue includes both &quot;clean&quot; conditions and &quot;dirty&quot; ones. This approach to software testing has been highly successful, and indeed the best-selling book on software testing (Kaner et al. 1999) can fairly be described as a collection of catalogues of various types.</Paragraph> <Paragraph position="18"> The contributions of descriptive linguistics include guiding our thinking about what the relevant features, conditions, or categories are for our domain of interest. In this domain, that will include the questions of what features may occur in names and what features may occur in sentences--particularly features in the one that might interact with features in the other.
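To make the catalogue idea concrete, a catalogue of test conditions for numbers might be sketched as below. These particular conditions are our own illustrative choices, not the inventory from the original catalogue.

```python
# A hypothetical catalogue of test conditions for numeric inputs.
# "clean" conditions are well-formed inputs; "dirty" conditions probe
# error handling. Each condition name is paired with an example value.
number_catalogue = {
    "clean": {
        "small positive integer": "7",
        "zero": "0",
        "large value": "10000000",
    },
    "dirty": {
        "negative where disallowed": "-3",
        "leading zeros": "007",
        "embedded whitespace": "1 000",
        "non-numeric characters": "12a",
    },
}

# A test run draws cases from both halves of the catalogue.
all_conditions = [(kind, name, value)
                  for kind, conds in number_catalogue.items()
                  for name, value in conds.items()]
print(len(all_conditions))  # 3 clean + 4 dirty = 7
```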
Descriptive linguistic methodology is described in detail in, e.g., Harris (1951) and Samarin (1967); in the interests of brevity, we focus on the software engineering perspective here, but the thought process is very similar.</Paragraph> <Paragraph position="19"> The software engineering equivalent of the descriptive linguist's hypothesis is the fault model (Binder 1999)--an explicit hypothesis about a potential source of error based on &quot;relationships and components of the system under test&quot; (p. 1088). For instance, knowing that some EI systems make use of POS tag information, we might hypothesize that the presence of some parts of speech within a gene name might be mistaken for term boundaries (e.g. the of in bag of marbles, LocusID 43038). Catalogues are used to develop a set of test cases that satisfies the various qualities. (They can also be used post-hoc to group the inputs in a random test bed into equivalence classes, although a strong motivation for using them in the first place is to obviate this sort of search-based post-hoc analysis.) The size of the space of all possible test cases can be estimated from the Cartesian product of all catalogues; the art of software testing (and linguistic fieldwork) consists, then, of selecting the highest-yielding subset of this often enormous space that can be run and evaluated in the time available for testing.</Paragraph> <Paragraph position="20"> At least three kinds of catalogues are relevant to testing an EI system. They fall into one of two very broad categories: syntagmatic, having to do with combinatory properties, and paradigmatic, having to do with varieties of content. The three kinds of catalogues are: 1. A catalogue of environments in which gene names can appear. This is syntagmatic.</Paragraph> <Paragraph position="21"> 2. A catalogue of types of gene names. This is paradigmatic.</Paragraph> <Paragraph position="22"> 3. A catalogue of false positives.
This is both syntagmatic and paradigmatic.</Paragraph> <Paragraph position="23"> The catalogue of environments would include, for example, elements related to sentence position, such as sentence-initial, sentence-medial, and sentence-final; elements related to list position, such as a single gene name, a name in a comma-separated list, or a name in a conjoined noun phrase; and elements related to typographic context, such as location within parentheses (or not), having attached punctuation (e.g. a sentence-final period) (or not), etc. The catalogue of types of names would include, for example, names that are common English words (or not); names that are words versus &quot;names&quot; that are symbols; single-word versus multi-word names; and so on. The second category also includes typographic features of gene names, e.g. containing numbers (or not), consisting of all caps (or not), etc. We determined candidate features for inclusion in the catalogues through standard structuralist techniques such as examining public-domain databases containing information about genes, including FlyBase, LocusLink, and HUGO, and by examining corpora of scientific writing about genes, and also by the software engineering techniques of &quot;common sense, experience, suspicion, analysis, [and] experiment&quot; (Binder 1999). The catalogues then suggested the features by which we classified and varied the entities and sentences in the data.</Paragraph> <Paragraph position="24">
General format of the data
The entities and sentences are distributed in XML format and are available at a supplemental web site (compbio.uchsc.edu/Hunter_lab/testing_ei). A plaintext version is also available. A representative entity is illustrated in Figure 1 below, and a representative sentence is illustrated in Figure 2. All data in the current version is restricted to the ASCII character set.
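The three kinds of catalogues enumerated earlier, together with the Cartesian-product estimate of the test-case space, can be sketched as follows. The catalogue entries are small illustrative samples drawn from the descriptions above, not the full inventories.

```python
from itertools import product

# Illustrative (partial) versions of the three catalogues for EI testing.
environments = ["sentence-initial", "sentence-medial", "sentence-final",
                "in comma-separated list", "within parentheses"]      # syntagmatic
name_types   = ["common English word", "symbol", "multi-word name"]   # paradigmatic
typography   = ["contains numbers", "all caps", "hyphenated"]

# The space of candidate test cases is bounded by the Cartesian product of
# the catalogues; the tester's job is to pick a high-yield subset of it.
space = list(product(environments, name_types, typography))
print(len(space))  # 5 * 3 * 3 = 45
```

Even with these toy catalogues the space grows multiplicatively, which is why selecting a runnable subset, rather than enumerating everything, is the practical core of the method.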
Test suite generation
Data sets are produced by selecting sets of entity features and sets of sentential context features and inserting the entities into slots in the sentences. This can be accomplished with the user's own tools, or using applications available at the supplemental web site.</Paragraph> <Paragraph position="25"> The provided applications produce two files: a file containing raw data for use as test inputs, and a file containing the corresponding gold standard data marked up in an SGML-like format. For example, if the raw data file contains the sentence ACOX2 polymorphisms may be correlated with an increased risk of larynx cancer, then the gold standard file will contain the corresponding sentence <gp>ACOX2</gp> polymorphisms may be correlated with an increased risk of larynx cancer. Not all users will necessarily agree on what counts as the &quot;right&quot; gold standard--see Olsson et al. (2002) and the BioCreative site for some of the issues. Users can enforce their own notions of correctness by using our data as input to their own generation code, or by post-processing the output of our applications.</Paragraph> <Paragraph position="26">
contains_punctuation: 1
contains_hyphen: 1
contains_forward_slash:
<several punctuation-related features omitted>
contains_function_word:
function_word_position:
contains_past_participle: 1
past_participle_position: i
contains_present_participle:
present_participle_position:
source_authority: HGNC ID: 2681 &quot;Approved Gene Name&quot; field
original_form_in_source: death-associated protein 6
data: death-associated protein 6
Figure 1 A representative entry from the entity data file. A number of null-valued features are omitted for brevity--see the full entry at the supplemental web site.
The data field (last line of the figure) is what is output by the generation software.</Paragraph> <Paragraph position="27">
ID: 25
type: tp
total_number_of_names: 1
list_context:
position: I
typographic_context:
appositive:
source_id: PMID: 14702106
source_type: title
original_form_in_source: Stat-3 is required for pulmonary homeostasis during hyperoxia.</Paragraph> <Paragraph position="28">
slots: <> is required for pulmonary homeostasis during hyperoxia.</Paragraph> <Paragraph position="29">
Figure 2 A representative entry from the sentences file. Features and values are explained in section 2.2, Feature set for sentential contexts, below. The slots field (last line of the figure) shows where an entity would be inserted when generating test data.</Paragraph> </Section></Paper>