<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-0308">
  <Title>Unsupervised, corpus-based method for extending a biomedical terminology</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Methods
</SectionTitle>
    <Paragraph position="0"> The approach we propose for discovering candidates for Metathesaurus concepts is to compare phrases extracted from MEDLINE to current UMLS phrases. We capitalize on differences in modification structure between the MEDLINE phrase and the UMLS phrase to determine candidates for inclusion in the Metathesaurus. The crucial difference is between a phrase containing adjectival modification and a similar phrase &quot;demodified&quot; by removing its adjectives.</Paragraph>
    <Paragraph position="1"> A phrase from MEDLINE becomes a candidate term in the Metathesaurus if the following two requirements are met: 1) a demodified term created from this phrase is found in the terminology and 2) similarly modified terms exist in the terminology, for a given semantic category. For example, the phrase pancreatic bronchogenic cyst is a candidate term for a disorder in the Metathesaurus because bronchogenic cyst exists in the Metathesaurus (concept: C0006281) and other Metathesaurus disorder terms are modified by the same adjective pancreatic (e.g., pancreatic hemorrhage).</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Processing phrases from MEDLINE
</SectionTitle>
      <Paragraph position="0"> Recently, Srinivasan et al. (2002) performed a shallow syntactic analysis on the entire MEDLINE collection, using only titles and abstracts in English. Although their goal was to find Metathesaurus concepts in MEDLINE citations, an interesting side-effect of their analysis was the production of some 175 million noun phrase types that are available for further research.</Paragraph>
      <Paragraph position="1"> From these phrases, we selected the subset of &quot;simple&quot; phrases, i.e., noun phrases excluding prepositional modification or any other complex feature. Examples of simple MEDLINE noun phrases include abdominal aneurysmal aortitis and radical aggressive tumor resection. Out of some forty million simple noun phrases, we randomly selected a subset of three million phrases to be used as our corpus, representative of the noun phrases found in MEDLINE.</Paragraph>
      <Paragraph position="2"> The phrases in our sample were then submitted to an underspecified syntactic analysis described by Rindflesch et al. (2000) that draws on a stochastic tagger (see Cutting et al. (1992) for details) as well as the SPECIALIST Lexicon, a large syntactic lexicon of both general and medical English that is distributed with the UMLS. Although not perfect, this combination of resources effectively addresses the phenomenon of part-of-speech ambiguity in English.</Paragraph>
      <Paragraph position="3"> The resulting syntactic structure identifies the head and modifiers for the noun phrase analyzed.</Paragraph>
      <Paragraph position="4"> Each modifier is also labeled as being adjectival, adverbial, or nominal. Although all types of modification in the simple English noun phrase were labeled, only adjectives and nouns were selected for further analysis in this study. For example, the term catastrophic cervical spinal cord injuries was analyzed as: [[mod([catastrophic,adj]), mod([cervical,adj]), mod([spinal,adj]), mod([cord,noun]), head([injuries,noun])]] A similar analysis was performed on UMLS terms for disorders and procedures.</Paragraph>
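As an illustration only, the bracketed analysis shown above can be read into a simple data structure. The sketch below is not part of the original tool chain: the function name `parse_analysis` and the regular expression are assumptions made here to show how the head and the labeled modifiers might be recovered from that notation.

```python
import re

# Hypothetical pattern for items such as mod([cervical,adj]) or head([injuries,noun]).
ITEM = re.compile(r"(mod|head)\(\[([^,\]]+),([^,\]]+)\]\)")

def parse_analysis(text):
    """Return (role, token, part_of_speech) tuples in phrase order."""
    return [(role, token.strip(), pos.strip())
            for role, token, pos in ITEM.findall(text)]

analysis = ("[[mod([catastrophic,adj]), mod([cervical,adj]), mod([spinal,adj]), "
            "mod([cord,noun]), head([injuries,noun])]]")
parsed = parse_analysis(analysis)

# Adjectival modifiers only, as used in the later steps.
adjectives = [token for role, token, pos in parsed
              if role == "mod" and pos == "adj"]
print(adjectives)
```

Keeping the role and the part of speech separate makes it straightforward to select only adjectives and nouns, as the study does.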
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Comparing MEDLINE phrases to UMLS phrases
</SectionTitle>
      <Paragraph position="0"> The method we use can be summarized as follows. Starting with a random subset of three million simple noun phrases from MEDLINE, we excluded those that were already present in the UMLS by mapping them to the Metathesaurus. We then performed a shallow syntactic analysis of the phrases in order to select those consisting of one or more modifiers followed by a head noun.</Paragraph>
      <Paragraph position="1"> Demodified terms were created by removing every possible combination of adjectival modifiers in the terms. The same process was applied to disorder and procedure terms in the Metathesaurus in order to obtain a list of allowable adjectival modifiers for these two categories. Such modifiers in the Metathesaurus serve as a filter for MEDLINE phrases, since finding a similarly modified term in the UMLS is one of the two requirements for candidate terms. Demodified terms created from accidental arterial perforations include arterial perforations, accidental perforations, and perforations. Demodified terms derived from MEDLINE phrases whose modifiers are all allowable are then mapped to the Metathesaurus. In this example, both accidental and arterial are adjectives found in the Metathesaurus in disorder or procedure terms. The second requirement for candidate terms is that at least one associated demodified term be mapped to a concept in the Metathesaurus. Two terms from our example map to Metathesaurus concepts: arterial perforations and perforations. The term accidental perforations does not map to any concept and is therefore eliminated from further processing. The last step ensures that, when several demodified terms map to concepts, the finest-grained one is selected.</Paragraph>
      <Paragraph position="2"> Arterial perforations is selected over perforations for this reason.</Paragraph>
      <Paragraph position="3"> Figure 1 illustrates the sequence of methods used in the study and the interactions between processing MEDLINE phrases and Metathesaurus terms. It also presents the number of MEDLINE phrases and Metathesaurus terms present before and after each of the six steps detailed below.</Paragraph>
      <Paragraph position="4"> Step 1. Mapping phrases to the UMLS. In order to identify MEDLINE phrases that already exist in the Metathesaurus, all MEDLINE phrases in our sample were mapped to the UMLS by first attempting an exact match between input term and Metathesaurus concept. If an exact match failed, normalization was then attempted. This process makes the input and target terms potentially compatible by eliminating such inessential differences as inflection, case and hyphen variation, as well as word order variation. Duplicate names were removed from each set prior to mapping to the UMLS.</Paragraph>
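The two-stage lookup of Step 1 (exact match, then normalized match) can be sketched as follows. This is a toy approximation under stated assumptions: the `normalize` function below only lowercases, splits hyphens, strips a final plural "s", and sorts the words, whereas the actual UMLS normalization handles inflection far more carefully. The function names and the one-entry terminology are hypothetical; the concept identifier is the one cited earlier for bronchogenic cyst.

```python
def normalize(term):
    """Toy normalization: lowercase, split hyphens, strip a plural 's', sort words."""
    words = term.lower().replace("-", " ").split()
    stems = [w[:-1] if w.endswith("s") and not w.endswith("ss") and len(w) > 3 else w
             for w in words]
    return " ".join(sorted(stems))

def map_term(term, terminology):
    """Try an exact match first, then fall back to a normalized match."""
    if term in terminology:
        return terminology[term]
    index = {normalize(name): cui for name, cui in terminology.items()}
    return index.get(normalize(term))

# Illustrative terminology with a single entry.
toy_terminology = {"bronchogenic cyst": "C0006281"}

print(map_term("bronchogenic cyst", toy_terminology))    # exact match
print(map_term("Bronchogenic Cysts", toy_terminology))   # normalized match
print(map_term("pancreatic hemorrhage", toy_terminology))  # no match
```

Sorting the words is a deliberately crude way to neutralize word order variation; it is used here only because it keeps the sketch short.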
      <Paragraph position="5"> Step 2. Identifying (adj+, noun*, head) phrases. Since this method is based on adjectival modification, the syntactic analysis was used to restrict the original sets of MEDLINE phrases and Metathesaurus terms to phrases and terms having the following structure: (adj+, noun*, head).</Paragraph>
      <Paragraph position="6"> The phrase is required to start with an adjectival modifier, possibly followed by other adjectives, and to end with a head noun, possibly preceded by other nouns. This specification excludes both simple terms (e.g., one isolated noun) and complex terms, which are not suitable for our analysis.</Paragraph>
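The (adj+, noun*, head) constraint lends itself to a regular expression over the sequence of labels. The following is a minimal sketch, assuming each phrase has already been reduced to a list of modifier labels followed by the label "head"; the function name and label strings are chosen here for illustration.

```python
import re

# One or more adjectives, then zero or more nouns, then the head.
PATTERN = re.compile(r"(adj )+(noun )*head")

def has_adj_noun_head_structure(labels):
    """labels: modifier labels in phrase order, ending with 'head'."""
    return PATTERN.fullmatch(" ".join(labels)) is not None

# catastrophic cervical spinal cord injuries
print(has_adj_noun_head_structure(["adj", "adj", "adj", "noun", "head"]))
# an isolated noun
print(has_adj_noun_head_structure(["head"]))
```

A phrase in which an adjective follows a nominal modifier, or which contains an adverbial modifier, is rejected by this pattern.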
      <Paragraph position="7"> Step 3. Creating demodified terms. When adjectival modifiers are identified in a term O, a set of demodified terms {T1, T2,...,Tn} is created by removing from term O any combination of the adjectival modifiers found in it. While the structure of the demodified terms remains syntactically correct, the semantics of some terms may be anomalous, especially when adjectives other than the leftmost are removed. Since most of them are semantically valid, we found it convenient to keep all demodified terms for further analysis. Demodified terms with incorrect semantics will be filtered out later in the experiment, since they will not map to an existing concept.</Paragraph>
      <Paragraph position="8"> The number of demodified terms is 2^m - 1, m being the number of adjectival modifiers. For example, the term chronic sciatic constriction injury starts with the two adjectival modifiers chronic and sciatic, so that the following three demodified terms are generated: sciatic constriction injury, chronic constriction injury, and constriction injury. Although there is no need to demodify UMLS terms in this study, the removal of adjectival modifiers was used to establish a list of adjectives occurring in disorder and procedure terms. These adjectives constitute the list of allowable modifiers for the two categories of terms studied.</Paragraph>
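The generation of demodified terms amounts to enumerating every non-empty subset of the adjectival modifiers and removing it. A minimal sketch, assuming the adjectives and the remainder of the phrase (nominal modifiers and head) have already been separated; the function name is hypothetical:

```python
from itertools import combinations

def demodified_terms(adjectives, rest):
    """Remove every non-empty subset of adjectives; 'rest' is kept unchanged."""
    results = []
    m = len(adjectives)
    for k in range(1, m + 1):                      # number of adjectives removed
        for removed in combinations(range(m), k):  # positions removed
            kept = [a for i, a in enumerate(adjectives) if i not in removed]
            results.append(" ".join(kept + rest))
    return results

terms = demodified_terms(["chronic", "sciatic"], ["constriction", "injury"])
print(terms)
```

With two adjectives the function returns three terms, and with three adjectives it returns seven, in line with the count of 2 to the power m, minus 1.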
      <Paragraph position="9"> Step 4. Searching for similarly modified terms in the Metathesaurus. In this study, one requirement for candidate terms is that a similarly modified term be present in the terminology. The list of allowable modifiers computed from Metathesaurus terms at the previous step provides a simple way to implement this constraint. For a given category, an allowable modifier indicates that some terms from this category are modified by this modifier, i.e., that a similarly modified term exists in the Metathesaurus.</Paragraph>
      <Paragraph position="10"> Practically, MEDLINE phrases whose adjectival modifiers do not all belong to the list of allowable modifiers are excluded from further analysis, because, by definition, there will be no similarly modified term in the Metathesaurus.</Paragraph>
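The filter of Step 4 can be sketched as a set-membership test. The code below is one possible reading, not a description of the original implementation: it assumes one list of allowable modifiers per category and keeps a phrase when all of its adjectives are allowable for at least one category. The toy data, including the assignment of adjectives to categories, is illustrative only.

```python
def build_allowable(category_terms):
    """Collect, per category, every adjective seen in that category's terms."""
    return {category: {adj for adjectives in terms for adj in adjectives}
            for category, terms in category_terms.items()}

def passes_modifier_filter(adjectives, allowable):
    """True if all adjectives are allowable for at least one category."""
    return any(set(adjectives).issubset(mods) for mods in allowable.values())

# Toy input: adjective lists of a few terms per category (made up for this sketch).
toy_terms = {
    "disorder": [["pancreatic"], ["accidental"], ["arterial"]],
    "procedure": [["radical"]],
}
allowable = build_allowable(toy_terms)

print(passes_modifier_filter(["accidental", "arterial"], allowable))
print(passes_modifier_filter(["accidental", "radical"], allowable))
```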
      <Paragraph position="11"> Step 5. Searching for demodified terms in the</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Metathesaurus
</SectionTitle>
      <Paragraph position="0"> The second requirement for a MEDLINE phrase to become a candidate term is that a demodified term created from this phrase be found in the terminology. Using only MEDLINE phrases whose adjectival modifiers all belong to the list of allowable modifiers, the demodified terms created from these phrases are mapped to the UMLS using the procedure previously described. MEDLINE phrases with no demodified term mapped to a UMLS concept are definitely excluded. Demodified terms mapping to concepts in categories other than disorders or procedures are also eliminated.</Paragraph>
      <Paragraph position="1"> As explained earlier, the compatibility of the modifiers of the candidate terms is checked against the list of allowable modifiers for the category of the Metathesaurus concept(s) to which a demodified term mapped. In some cases, a candidate term is eliminated because the demodified term maps to a disorder concept, while its modifiers are compatible with procedures (or the other way around).</Paragraph>
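The category check described above can be sketched as follows. This is an illustration under assumptions: the mapping from demodified terms to concept categories is taken as given, the category labels and function name are hypothetical, and the allowable lists are toy data.

```python
TARGET_CATEGORIES = {"disorder", "procedure"}

def keep_compatible(adjectives, mappings, allowable):
    """mappings: {demodified term: category of the concept it mapped to}."""
    kept = {}
    for term, category in mappings.items():
        if category not in TARGET_CATEGORIES:
            continue  # concept outside the two categories studied
        if set(adjectives).issubset(allowable.get(category, set())):
            kept[term] = category  # modifiers are allowable for this category
    return kept

# Toy allowable lists (made up for this sketch).
toy_allowable = {"disorder": {"accidental", "arterial"}, "procedure": {"radical"}}

kept = keep_compatible(
    ["accidental", "arterial"],
    {"arterial perforations": "disorder", "perforations": "disorder"},
    toy_allowable,
)
print(kept)
```

A phrase whose adjectives are allowable only for procedures, but whose demodified term maps to a disorder concept, yields an empty result and is therefore dropped.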
      <Paragraph position="2"> Step 6. Hooking candidate terms to the terminology. The remaining step consists of finding the appropriate hook in the terminology for the candidate term. Based on the fact that modification is normally associated with a hyponymic relation, tentative parents for the candidate term will be those that map to the demodified terms generated from this term.</Paragraph>
      <Paragraph position="3"> When only one demodified term maps to a Metathesaurus concept, this concept is selected as the tentative parent for the candidate term. When several demodified terms map to Metathesaurus concepts, the preference is given to the concept that is likely to be closest to the term. As a surrogate for closeness, we use the following heuristics: 1) the fewer modifiers removed, the closer the terms, and 2) the candidate term and the demodified term are closer if the modifier removed is the leftmost modifier. In the rare cases where several demodified terms are deemed equally close to the candidate term, they are all selected as tentative parents.</Paragraph>
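The two closeness heuristics can be expressed as a sort key. The sketch below is one interpretation and rests on assumptions: each mapped demodified term is described by the positions of the adjectives removed (0 being the leftmost), and the second heuristic is treated as a yes/no preference for removals that start from the left, so that ties remain possible.

```python
def closeness_key(removed):
    """Smaller is closer: fewer modifiers removed, then leftmost removal preferred."""
    leftmost = set(removed) == set(range(len(removed)))
    return (len(removed), 0 if leftmost else 1)

def select_parents(mapped):
    """mapped: {demodified term: positions of removed adjectives (0 = leftmost)}."""
    if not mapped:
        return []
    best = min(closeness_key(r) for r in mapped.values())
    return sorted(t for t, r in mapped.items() if closeness_key(r) == best)

# accidental arterial perforations: two demodified terms map to concepts.
parents = select_parents({"arterial perforations": (0,), "perforations": (0, 1)})
print(parents)
```

For the running example this selects arterial perforations over perforations; when two demodified terms receive the same key, both are returned as tentative parents.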
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Evaluation
</SectionTitle>
    <Paragraph position="0"> A subset of 1000 candidate terms was randomly selected to evaluate this method. The existence of a hyponymic relationship between the candidate term and the Metathesaurus concept(s) selected as valid mappings for the demodified terms created from the candidate term was evaluated by a manual review performed by the authors. A secondary objective of this evaluation was to gain insights about how these methods could be tuned in order to prevent inaccurate mappings and select the most useful candidate terms.</Paragraph>
    <Paragraph position="1"> The following classification was used to describe the quality of the hyponymic relationship between the candidate term and the Metathesaurus concept(s) selected: &quot;relevant&quot; means that the hooking of the candidate term to the terminology was relevant, even if a more specific concept was available; &quot;non relevant&quot; means that none of the Metathesaurus concepts selected was a correct hook for the candidate term; &quot;more or less relevant&quot; means that the Metathesaurus concepts selected were not irrelevant as hooks, but were distant ancestors, i.e., too general for the relationship to be fully informative. Finally, for polysemous candidate terms, it was not possible to evaluate the quality of the relationship with certainty.</Paragraph>
  </Section>
</Paper>