<?xml version="1.0" standalone="yes"?> <Paper uid="W04-3110"> <Title>A Large Scale Terminology Resource for Biomedical Text Processing</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Architecture </SectionTitle> <Paragraph position="0"> Termino consists of two components: a database holding terminological information and a compiler for generating term recognizers from the contents of the database. These two components are discussed in the following two subsections.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> STRINGS </SectionTitle> <Paragraph position="0"> [Figure 2 residue: table STRINGS with columns string and str id.]</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Terminological Database </SectionTitle> <Paragraph position="0"> The terminological database is designed to meet three requirements. First, it must be capable of storing large numbers of terms. As we have seen, the UMLS Metathesaurus contains over 2 million distinct terms. Since UMLS is just one of many resources whose terms may need to be stored, the database may have to hold many millions of terms in total. Second, Termino's database must be flexible enough to hold a variety of information about terms, including information of a morpho-syntactic nature, such as part of speech and morphological class; information of a semantic nature, such as quasi-logical form and links to concepts in ontologies; and provenance information, such as the sources of the information in the database. The database will also contain links to connect synonyms and morphological and orthographic variants to one another and to connect abbreviations and acronyms to their full forms. 
Finally, the database must be organized in such a way that it allows for fast and efficient recognition of terms in text.</Paragraph> <Paragraph position="1"> As mentioned above, the information in Termino's database is either imported from existing, outside knowledge sources or induced from text corpora. Since these sources are heterogeneous in both information content and format, Termino's database is &quot;extensional&quot;: it stores strings and information about strings. Higher-order concepts such as &quot;term&quot; emerge as the result of interconnections between strings and information in the database. The database is organized as a set of relational tables, each storing one of the types of information mentioned above. In this way, new information can easily be included in the database without any global changes to the structure of the database.</Paragraph> <Paragraph position="2"> Terminological information about any given string is usually gathered from multiple sources. As information about a string accumulates in the database, we must make sure that co-dependencies between various pieces of information about the string are preserved. This consideration leads to the fundamental element of the terminological database, a termoid. A termoid consists of a string together with associated information of various kinds about the string. Information in one termoid holds conjunctively for the termoid's string, while multiple termoids for the same string express disjunctive alternatives.</Paragraph> <Paragraph position="3"> For instance, taking an example from UMLS, we may learn from one source that the string cold as an adjective refers to a temperature, whereas another source may tell us that cold as a noun refers to a disease. 
This information is stored in the database as two termoids: abstractly, 'cold, adjective, temperature' and 'cold, noun, disease'.</Paragraph> <Paragraph position="4"> A single termoid 'cold, adjective, noun, temperature, disease' would not capture the co-dependency between the part of speech and the &quot;meaning&quot; of cold.[1] This example illustrates that a string can be in more than one termoid. Each termoid, however, has one and only one string.</Paragraph> <Paragraph position="5"> [1] Note that the UMLS Metathesaurus has no mechanism for storing this co-dependency between grammatical and semantic information.</Paragraph> <Paragraph position="7"> Figure 2 provides a detailed example of part of the structure of the terminological database. In the table STRINGS every unique string is assigned a string identifier (str id). In the table TERMOID STRINGS each string identifier is associated with one or more termoid identifiers (trm id). These termoid identifiers then serve as keys into the tables holding terminological information.</Paragraph> <Paragraph position="8"> Thus, in this particular example, the database includes the information that in the Gene Ontology the string neurofibromin has been assigned the terms with identifiers GO:0004857 and GO:0008285. Furthermore, in the UMLS Metathesaurus version 2003AC, the string mammectomy has been assigned the concept-unique identifier C0024881 (CUI), the lemma-unique identifier L0024669 (LUI), and the string-unique identifier S0059711 (SUI).</Paragraph> <Paragraph position="9"> Connections between termoids such as those arising from synonymy and orthographic variation are recorded in another set of tables. For example, the table SYNONYMY in Figure 2 indicates that termoids 278 and 627 are synonymous, since they have the same synonymy class identifier (scl id).[2] The synonymy identifier (syn id) identifies the assignment of a termoid to a particular synonymy class. 
This identifier is used to record the source on which the assignment is based. The source can be a reference to a knowledge source from which synonymy information has been imported into Termino, or a reference to both the algorithm by which and the corpus from which synonyms have been extracted. Similarly, there are tables containing provenance information for strings, indexed by str id, and termoids, indexed by trm id. These tables are not shown in the example.</Paragraph> <Paragraph position="10"> With regard to the first requirement for the design of the terminological database mentioned at the beginning of this section, scalability, an implementation of Termino in MySQL has been loaded with 427,000 termoids for 363,000 strings (see section 4 for more details). In this implementation the largest table, STRINGS, measures just 16MB, far below the default limit of 4GB that MySQL imposes on the size of tables. Hence, storing a large number of terms in Termino poses no problem in terms of size. The second requirement, flexibility of the database, is met by distributing terminological information over a set of relatively small tables and linking the contents of these tables to strings via termoid identifiers. In this way we avoid the strictures of any one fixed representational scheme, thus making it possible for the database to hold information from disparate sources. 
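The table layout described in this section can be sketched as follows. This is only an illustration under stated assumptions: it uses an in-memory SQLite database rather than MySQL, and the underscore spellings of the table and column names, the pos and semantics tables, and the helper functions add_termoid and termoids_for are hypothetical. Only the cold example and the general roles of the STRINGS and TERMOID STRINGS tables come from the text.

```python
import sqlite3

# Illustrative schema only: names are assumptions, not Termino's actual DDL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE strings (str_id INTEGER PRIMARY KEY, string TEXT UNIQUE NOT NULL);
    CREATE TABLE termoid_strings (trm_id INTEGER PRIMARY KEY,
                                  str_id INTEGER NOT NULL REFERENCES strings(str_id));
    CREATE TABLE pos (trm_id INTEGER REFERENCES termoid_strings(trm_id), pos TEXT);
    CREATE TABLE semantics (trm_id INTEGER REFERENCES termoid_strings(trm_id), meaning TEXT);
""")


def add_termoid(string, pos, meaning):
    """Store one termoid: a string plus information that holds conjunctively for it."""
    # Each unique string is stored once; a repeated string is silently skipped.
    conn.execute("INSERT OR IGNORE INTO strings (string) VALUES (?)", (string,))
    (str_id,) = conn.execute(
        "SELECT str_id FROM strings WHERE string = ?", (string,)).fetchone()
    # A new termoid identifier is allocated for every call.
    trm_id = conn.execute(
        "INSERT INTO termoid_strings (str_id) VALUES (?)", (str_id,)).lastrowid
    conn.execute("INSERT INTO pos VALUES (?, ?)", (trm_id, pos))
    conn.execute("INSERT INTO semantics VALUES (?, ?)", (trm_id, meaning))
    return trm_id


def termoids_for(string):
    """Return the disjunctive alternatives (pos, meaning) recorded for a string."""
    rows = conn.execute("""
        SELECT p.pos, m.meaning
        FROM strings s
        JOIN termoid_strings t ON t.str_id = s.str_id
        JOIN pos p ON p.trm_id = t.trm_id
        JOIN semantics m ON m.trm_id = t.trm_id
        WHERE s.string = ?
        ORDER BY t.trm_id
    """, (string,))
    return rows.fetchall()


add_termoid("cold", "adjective", "temperature")
add_termoid("cold", "noun", "disease")
print(termoids_for("cold"))
```

Because part of speech and meaning are joined through the same termoid identifier, the pairing of adjective with temperature and of noun with disease is preserved, while the string cold itself is stored only once.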
The third requirement on the design of the database, efficient recognition of terms, will be addressed in the next section.</Paragraph> <Paragraph position="11"> [2] The function of synonymy class identifiers in Termino is similar to the function of CUIs in the UMLS Metathesaurus. However, since we are not bound to a classification into UMLS CUIs, we can assert synonymy between terms coming from arbitrary sources.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Term Recognition </SectionTitle> <Paragraph position="0"> To ensure fast term recognition with Termino's vast terminological database, the system comes equipped with a compiler for generating finite state machines from the strings in the terminological database discussed in the previous section. Direct look-up of strings in the database is not an option, because it is unknown in advance at which positions in a text terms will start and end. In order to be complete, one would have to look up all sequences of words or tokens in the text, which is very inefficient.</Paragraph> <Paragraph position="1"> Compilation of a finite state recognizer proceeds in the following way. First, each string in the database is broken into tokens, where a token is either a contiguous sequence of alpha-numeric characters or a punctuation symbol. Next, starting from a single initial state, a path through the machine is constructed, using the tokens of the string to label transitions. For example, for the string Graves' disease the machine will include a path with transitions on Graves, ', and disease. New states are only created when necessary. 
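The compilation and look-up procedure described in this section can be sketched as follows. This is a minimal illustration, not Termino's compiler: the tokenizer pattern (ASCII letters and digits only), the list-of-dictionaries state representation, and the function names tokenize, compile_recognizer, and recognize are assumptions. The strings and termoid identifiers mirror the thyroid dysfunction and Graves' disease example given in the text.

```python
import re

# A token is a run of alphanumeric characters or a single punctuation symbol.
TOKEN = re.compile(r"[A-Za-z0-9]+|[^A-Za-z0-9\s]")


def tokenize(text):
    return TOKEN.findall(text)


def compile_recognizer(termoids):
    """Build a token-level trie from a mapping of termoid identifier to string."""
    states = [{}]   # states[i] maps a token to the number of the next state
    final = {}      # maps a final state number to its list of termoid identifiers
    for trm_id, string in termoids.items():
        state = 0
        for token in tokenize(string):
            if token not in states[state]:   # create a state only when necessary
                states.append({})
                states[state][token] = len(states) - 1
            state = states[state][token]
        final.setdefault(state, []).append(trm_id)
    return states, final


def recognize(text, states, final):
    """Start from the initial state at every token and report each final state reached."""
    tokens = tokenize(text)
    matches = []
    for start in range(len(tokens)):
        state = 0
        for end in range(start, len(tokens)):
            state = states[state].get(tokens[end])
            if state is None:                # no transition: abandon this start position
                break
            for trm_id in final.get(state, []):
                matches.append((tokens[start:end + 1], trm_id))
    return matches


termoids = {"trm1": "thyroid", "trm2": "thyroid",
            "trm3": "thyroid dysfunction", "trm4": "Graves' disease"}
states, final = compile_recognizer(termoids)
for tokens, trm_id in recognize("thyroid dysfunction, such as Graves' disease", states, final):
    print(trm_id, tokens)
```

Restarting at every token is what lets the sketch report embedded and overlapping strings, such as thyroid inside thyroid dysfunction, without looking up every token sequence in the database.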
The state reached on the final token of a string will be labeled final and is associated with the identifiers of the termoids for that string.</Paragraph> <Paragraph position="2"> To recognize terms in text, the text is tokenized and the finite state machine is run over the text, starting from the initial state at each token in the text. For each sequence of tokens leading to a final state, the termoid identifiers associated with that state are returned. These identifiers are then used to access the terminological database and retrieve the information contained in the termoids. Where appropriate the machine will produce multiple termoid identifiers for strings. It will also recognize overlapping and embedded strings.</Paragraph> <Paragraph position="3"> Figure 3 shows a small terminological database and a finite state recognizer derived from it. Running this recognizer over the phrase . . . thyroid dysfunction, such as Graves' disease . . . produces four annotations: thyroid is assigned the termoid identifiers trm1 and trm2; thyroid dysfunction, trm3; and Graves' disease, trm4.</Paragraph> <Paragraph position="4"> It should be emphasised at this point that term recognition as performed by Termino is in fact term look-up and not the end point of term processing. Term look-up might return multiple possible terms for a given string, or for overlapping strings, and subsequent processes may apply to filter these alternatives down to the single option that seems most likely to be correct in the given context.</Paragraph> <Paragraph position="5"> Furthermore, more flexible processes of term recognition might apply over the results of look-up. 
For example, a term grammar might be provided for a given domain, allowing longer terms to be built from shorter terms that have been identified by term look-up.</Paragraph> <Paragraph position="6"> The compiler can be parameterized to produce finite state machines that match exact strings only, or that abstract away from morphological and orthographic variation. At the moment, morphological information about strings is supplied by a component outside Termino. In our current term recognition system, this component applies to a text before the recognition process and associates all verbs and nouns with their base form. Similarly, the morphological component applies to the strings in the terminological database before the compilation process.</Paragraph> <Paragraph position="7"> The set-up in which term recognizers are compiled from the contents of the terminological database turns Termino into a general terminological resource which is not restricted to any single domain or application. The database can be loaded with terms from multiple domains and compilation can be restricted to particular subsets of strings by selecting termoids from the database based on their source, for example. In this way one can produce term recognizers that are tailored towards specific domains or specific applications within domains.</Paragraph> </Section> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Implementation &amp; Performance </SectionTitle> <Paragraph position="0"> A first version of Termino has been implemented. It uses a database implemented in MySQL and currently contains over 427,000 termoids for around 363,000 strings.</Paragraph> <Paragraph position="1"> Content has been imported from various sources by means of source-specific scripts for extracting relevant information from sources and a general script for loading this extracted information into Termino. 
More specifically, to support information extraction from patient records, we have included in Termino strings from the UMLS Metathesaurus falling under the following semantic types: pharmacologic substances, anatomical structures, therapeutic procedures, diagnostic procedures, and several others. We have also loaded into Termino a list of human proteins and their assignments to the Gene Ontology as produced by the European Bioinformatics Institute (http://www.ebi.ac.uk/GOA/). Furthermore, we have included several gazetteer lists containing terms in the fields of molecular biology and pharmacology that were assembled for previous information extraction projects in our NLP group. A web services (SOAP) API to the database is under development. We plan to make the resource available to researchers as a web service or in downloadable form.[3] The compiler to construct finite state recognizers from the database is fully implemented, tested, and integrated into AMBIT. The compiled recognizer for the 363,000 strings of Termino has 1.2 million states and an on-disk size of around 80MB. Loading the matcher from disk into memory requires about 70 seconds (on an UltraSparc 900MHz), but once it is loaded, recognition is fast. We have been able to annotate a corpus of 114,200 documents, each approximately 1kB of text and drawn from electronic patient records from the Royal Marsden NHS Trust in London, in approximately 44 hours, an average rate of 1.4 seconds per document, or 42 documents per minute. On average, about 30 terms falling under the UMLS 'clinical' semantic types mentioned above were recognized in each document. We are currently annotating a benchmark corpus in order to obtain precision and recall figures. We are also planning to compile recognizers for differently sized subsets of the terminological database and measure their recognition speed over a given collection of texts. 
This will provide some indication as to the scalability of the system.</Paragraph> <Paragraph position="2"> Since Termino currently contains many terms imported from the UMLS Metathesaurus, it is interesting to compare its term recognition performance against the performance of MetaMap. MetaMap is a program available from the National Library of Medicine (the developers of UMLS) that is specifically designed to discover UMLS Metathesaurus concepts referred to in text (Aronson, 2001). An impressionistic comparison of the performance of Termino and MetaMap on the CLEF patient records shows that the results differ in two ways. First, MetaMap recognizes more terms than Termino. This is simply because MetaMap draws on a comprehensive version of UMLS, whereas Termino just contains a selected subset of the strings in the Metathesaurus. Secondly, MetaMap is able to recognize variants of terms, e.g. it will map the verb to treat and its inflectional forms onto the term treatment, whereas Termino currently does not do this. To recognize term variants MetaMap relies on UMLS's SPECIALIST lexicon, which provides syntactic, morphological, and orthographic information for many of the terms occurring in the Metathesaurus.</Paragraph> <Paragraph position="3"> [3] Users may have to sign license agreements with third parties in order to be able to use restricted resources that have been integrated into Termino.</Paragraph> <Paragraph position="4"> While the performance of the two systems differs in favor of MetaMap, it is important to note that the source of these differences is unrelated to the actual design of Termino's terminological database or Termino's use of finite state machines to do term recognition. Rather, the divergence in performance follows from a difference in the breadth of content of the two systems at the moment. With regard to practical matters, the comparison showed that term recognition with Termino is much faster than with MetaMap. 
Also, compiling a finite state recognizer from the terminological database in Termino is a matter of minutes, whereas setting up MetaMap can take several hours.</Paragraph> <Paragraph position="5"> However, since MetaMap's processing is more involved than Termino's (for example, MetaMap parses the input first) and hence requires more resources, these remarks should be backed up with a more rigorous comparison between Termino and MetaMap, which is currently underway.</Paragraph> <Paragraph position="6"> The advantage of term recognition with Termino over MetaMap and UMLS, or any other recognizer with a single source, is that it provides immediate entry points into a variety of outside ontologies and other knowledge sources, making the information in these sources available to processing steps subsequent to term recognition. For example, for a gene or protein name recognized in a text, Termino will return the database identifiers of this term in the HUGO Nomenclature database (Wain et al., 2002) and the OMIM database (Online Mendelian Inheritance in Man, OMIM (TM), 2000). These identifiers give access to the information stored in these databases about the gene or protein, including alternative names, gene map locus, related disorders, and references to relevant papers.</Paragraph> </Section> </Paper>