<?xml version="1.0" standalone="yes"?> <Paper uid="W04-3110"> <Title>A Large Scale Terminology Resource for Biomedical Text Processing</Title> <Section position="7" start_page="0" end_page="0" type="concl"> <SectionTitle> 5 Conclusions &amp; Future Work </SectionTitle> <Paragraph position="0"> Dealing with terminology is an essential step in natural language processing in technical domains. In this paper we have described the design, implementation, and use of Termino, a large-scale terminology resource for biomedical language processing.</Paragraph> <Paragraph position="1"> Termino includes a relational database designed to store a large number of terms together with complex, heterogeneous information about these terms, such as morpho-syntactic information, links to concepts in ontologies, and other kinds of annotations. The database is also designed to be extensible: it is easy to include terms and information about terms found in outside biological databases and ontologies. Term look-up in text is done via finite state machines that are compiled from the contents of the database. This approach allows the database to be very rich without sacrificing speed at look-up time. These three features (rich term storage, extensibility, and fast look-up) make Termino a flexible tool for inclusion in a biomedical text processing system. As noted in section 3.2, Termino has not been designed to be used as a stand-alone term recognition system but rather as the first component, the lexical look-up component, in a multi-component term processing system.</Paragraph> <Paragraph position="2"> First, since Termino may return multiple terms for a given string, or for overlapping strings, some post-filtering of these alternatives is necessary. Secondly, it is likely that better term recognition performance will be obtained by supplementing Termino look-up with a term parser which uses a grammar to give a term recognizer the generative capacity to recognize previously unseen terms. 
For example, many terms for chemical compounds conform to grammars that allow complex terms to be built out of simpler terms prefixed or suffixed with numerals separated from the simpler term with hyphens. It does not make sense to attempt to store all of these variants in Termino. Termino provides a firm basis on which to build large-scale biomedical text processing applications. However, there are a number of directions in which further work can be done. First, as noted in section 3.2, morphological information is currently not held in Termino, but rather resides in an external morphological analyzer. We are working to extend the Termino data model to enable information about morphological variation to be stored in Termino, so that Termino serves as a single source of information for the terms it contains. Secondly, we are working to build term induction modules to allow Termino content to be automatically acquired from corpora, in addition to deriving it from manually created resources such as UMLS. Finally, while we have already incorporated Termino into the AMBIT system, where it collaborates with a term parser to perform more complete term recognition, more work can be done with respect to this integration. For example, probabilities could be incorporated into Termino to assist with probabilistic parsing of terms. Alternatively, the trade-off between what should be in the term lexicon and what should be in the term grammar could be further explored by identifying which compound terms in the lexicon contain other terms as substrings and attempting to abstract away from these to grammar rules. For instance, in the thyroid disfunction example above, both thyroid and disfunction are terms, the first of class 'body part', the second of class 'problem'. Their combination thyroid disfunction is a term of class 'problem', suggesting a rule of the form 'problem' → 'body part' 'problem'.</Paragraph> </Section> </Paper>