<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-2083">
  <Title>Sydney, July 2006. ©2006 Association for Computational Linguistics. A Term Recognition Approach to Acronym Recognition</Title>
  <Section position="3" start_page="0" end_page="643" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> In the biomedical literature, the number of terms (names of genes, proteins, chemical compounds, drugs, organisms, etc.) is increasing at an astounding rate. Existing terminological resources and scientific databases (such as Swiss-Prot, SGD, FlyBase, and UniProt) cannot keep up to date with the growth of neologisms (Pustejovsky et al., 2001). Although curation teams maintain terminological resources, integrating neologisms is very difficult unless it is based on systematic extraction and collection of terminology from the literature. Term identification in the literature is one of the major bottlenecks in processing information in biology, as it faces many challenges (Ananiadou and Nenadic, 2006; Friedman et al., 2001; Bodenreider, 2004).</Paragraph>
    <Paragraph position="1"> The major challenges are due to term variation, e.g., spelling, morphological, syntactic, and semantic variations (one term having different termforms), as well as term synonymy and homonymy, all of which are central concerns of any term management system.</Paragraph>
    <Paragraph position="2"> Acronyms are among the most productive types of term variation. Acronyms (e.g., RARA) are compressed forms of terms, and are used as substitutes for the fully expanded termforms (e.g., retinoic acid receptor alpha). Chang and Schütze (2006) reported that, in MEDLINE abstracts, 64,242 new acronyms were introduced in 2004, with the total number estimated at 800,000.</Paragraph>
    <Paragraph position="3"> Wren et al. (2005) reported that 5,477 documents could be retrieved by using the acronym JNK while only 3,773 documents could be retrieved by using its full term, c-jun N-terminal kinase.</Paragraph>
    <Paragraph position="4"> In practice, there are no rules or exact patterns for the creation of acronyms. Moreover, acronyms are ambiguous, i.e., the same acronym may refer to different concepts (GR abbreviates both glucocorticoid receptor and glutathione reductase).</Paragraph>
    <Paragraph position="5"> Acronyms also have variant forms (e.g., NF kappa B, NF kB, NF-KB, NF-kappaB, NFKB factor for nuclear factor-kappa B). Ambiguity and variation present a challenge for any text mining system: acronyms must not only be recognised, but their variants must also be linked to the same canonical form and disambiguated.</Paragraph>
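    As a rough illustration of the variant problem described above, the following Python sketch collapses a few surface variants of one acronym to a shared key. The function name `normalize_acronym` and the small Greek-letter table are assumptions made for this example only; they are not part of the method presented in this paper, and variants that add extra words (e.g., NFKB factor) would need more than string normalisation.

```python
import re

# Hypothetical table: spelled-out Greek letters mapped to a one-letter code.
GREEK = {"alpha": "a", "beta": "b", "kappa": "k"}

def normalize_acronym(form):
    """Collapse case, spacing, hyphens and spelled-out Greek letters."""
    key = form.lower()
    for name, letter in GREEK.items():
        key = key.replace(name, letter)
    # Drop everything except lowercase letters and digits.
    return re.sub(r"[^a-z0-9]", "", key)

variants = ["NF kappa B", "NF kB", "NF-KB", "NF-kappaB"]
# All four variants should end up under one shared key.
keys = {normalize_acronym(v) for v in variants}
```

    Grouping by such a key only addresses variation; it does nothing for ambiguity, since an acronym such as GR would still map to a single key for both of its meanings.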
    <Paragraph position="6"> Thus, discovering acronyms and relating them to their expanded forms is important for terminology management. In this paper, we present a term recognition approach to construct an acronym dictionary from a large text collection. The proposed method focuses on terms appearing frequently in the proximity of an acronym and scores the likelihood that such terms are the expanded forms of the acronym. We also describe an algorithm to combine the proposed method with a conventional letter-based method for acronym recognition.</Paragraph>
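    To make the idea of "terms appearing frequently in the proximity of an acronym" more concrete, here is a minimal Python sketch that counts word sequences ending immediately before a parenthesised acronym. It is an assumption-laden illustration: the function `candidate_counts`, the `max_words` limit, the parenthesis heuristic, and the example sentences are all hypothetical, and the raw count is only a stand-in for the likelihood score, which this section does not define.

```python
import re
from collections import Counter

def candidate_counts(sentences, acronym, max_words=5):
    """Count word sequences that end right before '(ACRONYM)'."""
    counts = Counter()
    marker = "(" + acronym + ")"
    for sentence in sentences:
        pos = sentence.find(marker)
        if pos == -1:
            continue  # acronym not mentioned in this sentence
        words = re.findall(r"[A-Za-z0-9-]+", sentence[:pos])
        # Every suffix of the left context (up to max_words) is a candidate.
        for n in range(1, min(max_words, len(words)) + 1):
            counts[" ".join(words[-n:]).lower()] += 1
    return counts

# Made-up example sentences, not taken from MEDLINE.
sentences = [
    "We studied retinoic acid receptor alpha (RARA) in mice.",
    "Expression of retinoic acid receptor alpha (RARA) was reduced.",
    "The receptor alpha (RARA) gene was cloned.",
]
counts = candidate_counts(sentences, "RARA")
```

    In this toy input the nested candidate "receptor alpha" is counted more often than the full form "retinoic acid receptor alpha", which suggests why raw frequency alone would not be enough to pick the expanded form.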
  </Section>
</Paper>