<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-0308">
  <Title>Unsupervised, corpus-based method for extending a biomedical terminology</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
2 Background
</SectionTitle>
    <Paragraph position="0"> Terms play a major role in a variety of natural language processing (NLP) applications, including machine translation, text understanding, automatic indexing, and information retrieval. Taking advantage of the availability of large corpora, automatic terminology acquisition methods were developed, for example, by Bourigault and Jacquemin (1999).</Paragraph>
    <Paragraph position="1"> Word affinities generally play a central role in these methods. Grefenstette (1994) defines three orders of word affinities: &quot;First order affinities describe collocates of words, second-order affinities show similarly used words, and third-order affinities create semantic groupings among similar words&quot;. In term extraction, this analysis is often applied to modifiers in order to establish groups of terms modified by a given modifier or the list of all possible modifiers for a given term.</Paragraph>
    <Paragraph position="2"> Association for Computational Linguistics. Proceedings of the Workshop on Natural Language Processing in the Biomedical Domain, Philadelphia, July 2002, pp. 53-60.</Paragraph>
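    <Paragraph> The modifier analysis mentioned above can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that the last token of a term is its head and that every preceding token is a modifier; the function name group_by_modifier and the sample terms are hypothetical and are not taken from the methods cited here.

```python
from collections import defaultdict


def group_by_modifier(terms):
    """Index terms both ways: modifier -> heads and head -> modifiers.

    Assumption (illustrative only): the last token of a term is its head
    and every preceding token is a modifier.
    """
    heads_by_modifier = defaultdict(set)
    modifiers_by_head = defaultdict(set)
    for term in terms:
        tokens = term.lower().split()
        if len(tokens) in (0, 1):
            # Empty or single-word terms carry no modifier.
            continue
        head = tokens[-1]
        for modifier in tokens[:-1]:
            heads_by_modifier[modifier].add(head)
            modifiers_by_head[head].add(modifier)
    return heads_by_modifier, modifiers_by_head


# Hypothetical sample terms, not drawn from MEDLINE or the UMLS.
terms = ["chronic hepatitis", "chronic pancreatitis",
         "acute pancreatitis", "hepatitis"]
heads_by_modifier, modifiers_by_head = group_by_modifier(terms)
print(sorted(heads_by_modifier["chronic"]))
print(sorted(modifiers_by_head["pancreatitis"]))
```

The first index gives the group of terms modified by a given modifier, and the second gives the list of modifiers observed for a given head. A real system would rely on syntactic analysis rather than token position.</Paragraph>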
    <Paragraph position="3"> Hersh et al. (1996) demonstrated the feasibility of applying natural language processing techniques to a corpus of clinical narratives from an electronic medical record (EMR) system. Although the extracted terms were compared to existing terms in the UMLS, the goal of this study was vocabulary discovery, not the automatic integration of newly discovered terms into the terminology.</Paragraph>
    <Paragraph position="4"> The automatic extension of an existing resource based on a corpus has also been studied. For example, Habert et al. (1998) propose a method for extending an existing specialized semantic lexicon.</Paragraph>
    <Paragraph position="5"> Although related to these studies, our objective is to automatically extend downwards an existing biomedical terminology using a corpus and a combination of lexical, syntactic, and terminological knowledge.</Paragraph>
    <Paragraph position="6"> In this study, the textual source, or corpus, is MEDLINE®, the U.S. National Library of Medicine's (NLM) premier bibliographic database.</Paragraph>
    <Paragraph position="7"> MEDLINE contains over eleven million references to articles from more than 4,600 worldwide journals in life sciences with a concentration on biomedicine. We use the Unified Medical Language System® (UMLS®) Metathesaurus® as the terminology to be extended. The Metathesaurus, also developed by NLM, is organized by concept or meaning. A concept is defined as a cluster of terms representing the same meaning (synonyms, lexical variants, acronyms, translations). For example, names for the disease multiple sclerosis include multiple sclerosis, MS, 'multiple sclerosis, NOS', disseminated sclerosis, and sclerose en plaques. The 13th edition (2002) of the UMLS Metathesaurus contains over  In order to address the large size of the Metathesaurus, we limited our study to a significant subdomain of clinical medicine: disorders and procedures (currently about 615,000 unique terms, corresponding to some 157,000 disorder concepts and 95,000 medical procedure concepts).</Paragraph>
    <Paragraph position="8"> In the UMLS, each concept is categorized by semantic types (ST) from the semantic network.</Paragraph>
    <Paragraph position="9"> McCray et al. (2001) designed groupings of STs that provide a partition of the Metathesaurus and, therefore, can be used to extract consistent sets of concepts corresponding to a subdomain, such as disorders or procedures.</Paragraph>
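    <Paragraph> As a rough sketch of how such groupings can be used to extract a subdomain, the Python fragment below models a concept as a cluster of terms carrying semantic types and keeps the concepts whose types fall in the wanted groups. The Concept class, the select_subdomain function, the concept identifiers, and the three-entry mapping are assumptions made for illustration; the mapping is a tiny excerpt, not the groupings designed by McCray et al. (2001).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Concept:
    concept_id: str        # hypothetical identifier, not a real UMLS CUI
    terms: tuple           # names that share one meaning
    semantic_types: tuple  # semantic types (STs) assigned to the concept


# Illustrative excerpt of an ST -> group mapping, assumed for this sketch.
# Each ST maps to exactly one group, which is what makes it a partition.
GROUP_OF_ST = {
    "Disease or Syndrome": "Disorders",
    "Therapeutic or Preventive Procedure": "Procedures",
    "Pharmacologic Substance": "Chemicals and Drugs",
}


def select_subdomain(concepts, groups):
    """Keep concepts with at least one semantic type in the wanted groups."""
    return [
        c for c in concepts
        if any(GROUP_OF_ST.get(st) in groups for st in c.semantic_types)
    ]


concepts = [
    Concept("X1", ("multiple sclerosis", "MS", "disseminated sclerosis"),
            ("Disease or Syndrome",)),
    Concept("X2", ("appendectomy",), ("Therapeutic or Preventive Procedure",)),
    Concept("X3", ("aspirin",), ("Pharmacologic Substance",)),
]
selected = select_subdomain(concepts, {"Disorders", "Procedures"})
print([c.concept_id for c in selected])
```

Storing the mapping as a dictionary keyed by semantic type guarantees that no type belongs to two groups, so the selected sets for different groups do not overlap at the type level.</Paragraph>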
    <Paragraph position="10"> Disorder and procedure terms were restricted to terms suitable for natural language processing, excluding, for example, such terms as abdominal injury, NOS. The notation &quot;NOS&quot;, meaning &quot;not otherwise specified&quot;, is a marker for underspecification often found in terminological resources. When identified in the Metathesaurus, obsolete and truncated terms were also excluded. 477,491 unique terms were selected for further processing.</Paragraph>
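    <Paragraph> The exclusion step can be sketched as a simple filter. The fragment below is a minimal Python illustration, assuming that underspecified terms can be recognized by a trailing NOS token and that obsolete or truncated terms arrive already flagged; the function is_suitable, the flags, and the sample candidates are hypothetical, since the way these terms were identified in the Metathesaurus is not detailed here.

```python
import re

# Matches a trailing "NOS" marker, as in "abdominal injury, NOS".
NOS_PATTERN = re.compile(r"\bNOS$")


def is_suitable(term, obsolete=False, truncated=False):
    """Return True if a term passes the illustrative exclusion rules."""
    if obsolete or truncated:
        return False
    return NOS_PATTERN.search(term.strip()) is None


# Hypothetical candidates: (term, obsolete flag, truncated flag).
candidates = [
    ("abdominal injury", False, False),
    ("abdominal injury, NOS", False, False),
    ("multiple sclerosis", False, False),
    ("some retired term", True, False),
]
selected_terms = [t for t, obs, trunc in candidates if is_suitable(t, obs, trunc)]
print(selected_terms)
```

The word boundary in the pattern prevents a term that merely ends in the letters NOS from being discarded by mistake.</Paragraph>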
  </Section>
</Paper>