<?xml version="1.0" standalone="yes"?> <Paper uid="W02-1403"> <Title>Lexically-Based Terminology Structuring: Some Inherent Limits</Title> <Section position="4" start_page="0" end_page="0" type="intro"> <SectionTitle> 2 The MeSH biomedical thesaurus, and associated morphological knowledge </SectionTitle> <Paragraph position="0"> We first present the existing hierarchically structured thesaurus, a 'stop word' list and morphological knowledge involved in the present work.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 The MeSH biomedical thesaurus </SectionTitle> <Paragraph position="0"> The Medical Subject Headings (MeSH, NLM (2001a)) is one of the main international medical terminologies (see, e.g., Cimino (1996) for a presentation of medical terminologies). It is a thesaurus specifically designed for information retrieval in the biomedical domain. The MeSH is used to index the international biomedical literature in the Medline bibliographic database. The French version of the MeSH (INSERM, 2000) contains a translation of these terms (19,638 terms) plus synonyms. It happens to be written in unaccented, uppercase letters. Both the American and French MeSH can be found in the UMLS Metathesaurus (NLM, 2001b), which can be obtained through an agreement with the National Library of Medicine.</Paragraph> <Paragraph position="1"> The concept names (main headings) which the MeSH contains have been designed to reflect their broad meanings and to facilitate their use by human indexers and librarians. In this, they follow a tradition in information sciences, and are not necessarily the expressions used in naturally occurring biomedical documents. 
The MeSH can be considered a fine-grained thesaurus: concepts are chosen to ensure good coverage of the biomedical domain (Zweigenbaum, 1999).</Paragraph> <Paragraph position="2"> Like many other medical terminologies, the MeSH has a hierarchical structure: 'narrower' concepts (children) are related to 'broader' concepts (parents). This covers both the usual is-a relation and partitive relations (part-of, conceptual-part-of and process-of). The MeSH also includes see-also relations, which we do not take into account in the present experiments. This structure has also been designed with the aim of being intellectually accessible to users: an indexer must be able to assign a given concept to an article and a clinician must be able to find a given concept in the tree hierarchy (Nelson et al., 2001). In short, the MeSH team aims to organize the thesaurus in a clear and intuitive manner, both for concept naming and concept placement.</Paragraph> <Paragraph position="3"> The version of the French MeSH we used in these experiments contains 19,638 terms, 26,094 direct child-to-parent links and (under transitive closure) 95,815 direct or indirect child-to-ancestor links.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Stop word list </SectionTitle> <Paragraph position="0"> The aim of using a 'stop word' list is to remove from term comparison very frequent words which are considered not to be content-bearing, hence 'nonsignificant' for terminology structuring. In this experiment, we used a short stop word list (15 word forms). 
It contains the few frequent grammatical words, such as articles and prepositions, that occur in MeSH terms.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.3 Morphological knowledge </SectionTitle> <Paragraph position="0"> The morphological knowledge involved consists of lemma/derived-word or lemma/inflected-form pairs where the first is the 'normalized' form and the second a 'variant' form.</Paragraph> <Paragraph position="1"> Inflection produces the various forms of a given word such as plural, feminine or the multiple forms of a verb according to person, tense, etc.: intervention - interventions, acid - acids. We perform the reverse process (lemmatization), reducing an inflected form to its lemma (canonical form).</Paragraph> <Paragraph position="2"> We worked with two alternative lexicons. The first one is based on a general French lexicon (ABU, abu.cnam.fr/DICO) which we have augmented with pairs obtained from medical corpora processed through a tagger/lemmatizer (in cardiology, hematology, intensive care, and drug monographs): it totals 219,759 pairs (where the inflected form is different from the lemma). The second lexicon, more specialized and tuned to the vocabulary in medical terminologies, is the result of applying rules acquired in previous work from two other medical terminologies (ICD-10 and SNOMED) to the vocabulary in the MeSH, ICD-10 and SNOMED (total: 2,889 pairs).</Paragraph> <Paragraph position="3"> Derivation produces, e.g., the adjectival form of a noun (noun aorta → adjective aortic), the nominal form of a verb (verb intervene → noun intervention), or the adverbial form of an adjective (adjective human → adverb humanely). We perform linguistically-motivated stemming to reduce a derived word to its base word. 
For derivation, we also used resources acquired in previous work which, once combined with inflection pairs, result in 4,517 pairs.</Paragraph> <Paragraph position="4"> Compounding, which combines several radicals, often of Greek or Latin origin, to obtain complex words (e.g., aorta + coronary yields aortocoronary), has not been used because we do not have a reliable procedure to segment a compound into its component morphemes.</Paragraph> </Section> </Section> </Paper>
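
The resources described in Sections 2.2 and 2.3 (a stop word list, inflection pairs and derivation pairs) could plausibly be combined as in the minimal sketch below. It is only an illustration under stated assumptions, not the authors' implementation: the function name `normalize_term`, the dictionary layout (variant form mapped to normalized form) and the toy English entries are hypothetical. The stop words `of` and `the` are invented for the example, and the word pairs are the illustrative ones quoted in the text rather than entries from the actual French lexicons.

```python
# Hypothetical toy resources. The paper's real stop word list (15 word forms)
# and its lexicons (219,759 / 2,889 / 4,517 pairs) are not reproduced here.
STOP_WORDS = {"of", "the"}

# Inflection pairs, stored as variant form -> lemma (assumed layout).
INFLECTION = {"interventions": "intervention", "acids": "acid"}

# Derivation pairs, stored as derived word -> base word (assumed layout).
DERIVATION = {"aortic": "aorta", "intervention": "intervene"}


def normalize_term(term, stop_words=STOP_WORDS,
                   inflection=INFLECTION, derivation=DERIVATION):
    """Return the normalized content words of a term, in order.

    Each word is lowercased, dropped if it is a stop word, lemmatized
    through the inflection pairs, then reduced to its base word through
    the derivation pairs. Unknown words are kept unchanged.
    """
    normalized = []
    for word in term.lower().split():
        if word in stop_words:
            continue  # stop words are treated as not content-bearing
        lemma = inflection.get(word, word)   # reverse inflection
        base = derivation.get(lemma, lemma)  # reverse derivation
        normalized.append(base)
    return normalized


if __name__ == "__main__":
    print(normalize_term("INTERVENTIONS OF THE AORTIC ACIDS"))
```

Applying lemmatization before derivational stemming lets a form such as `interventions` be reduced in two steps; whether the original work chained the two resources in this order is an assumption of the sketch. Compounding is deliberately left out, in line with the text.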