<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-1302">
  <Title>Adaptive String Similarity Metrics for Biomedical Reference Resolution</Title>
  <Section position="3" start_page="9" end_page="10" type="intro">
    <SectionTitle>
2 Background
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="9" end_page="9" type="sub_section">
      <SectionTitle>
2.1 Entity Extraction and Reference Resolution in the Biomedical Domain
</SectionTitle>
      <Paragraph position="0"> Most of the work related to reference resolution in this domain has been done in the following areas: a) intra-document reference resolution, e.g., (Castaño et al., 2002; Lin and Liang, 2004); b) intra-document named entity recognition (e.g., Biocreative Task 1A (Blaschke et al., 2003), and others), also called classification of biological names (Torii et al., 2004); c) intra-document alias extraction; d) cross-document acronym-expansion extraction, e.g., (Pustejovsky et al., 2001); e) resolution of protein names against database entries in SwissProt (protein name grounding) in the context of a relation extraction task (Kim and Park, 2004). One constraint of these approaches is that they use several patterns for the string matching problem. The results reported for protein name grounding are 59% precision and 40% recall.</Paragraph>
      <Paragraph position="1"> Biocreative Task 1B challenged systems to ground entities found in article abstracts that contain mentions of genes in the Fly, Mouse and Yeast databases. A central component of this task was resolving ambiguity, as many gene names refer to multiple genes.</Paragraph>
    </Section>
    <Section position="2" start_page="9" end_page="10" type="sub_section">
      <SectionTitle>
2.2 String Similarity and Ambiguity
</SectionTitle>
      <Paragraph position="0"> In this subsection we consider the string similarity issues that are present in the biology domain in particular. The task we consider is to associate a string with an existing entity, represented by a set of known strings. Although the issue of ambiguity is present in the examples we give, it cannot be resolved by string similarity methods alone; it requires methods that take into account the context in which those strings occur.</Paragraph>
      <Paragraph position="1"> The protein name p21 is ambiguous between at least two entities, mentioned as p21-ras and p21/Waf in the literature. A biologist can look at a set of descriptions and decide whether the strings are ambiguous or correspond to either of these two entities (or to some other entity).</Paragraph>
      <Paragraph position="2"> The following is an example of such a mapping, where R corresponds to p21-ras, W to p21(Waf) and G to another entity (the gene). It can also be noticed that some of the mappings include subcases (e.g.,  If we want to use an external knowledge source to produce such a mapping, we can try to map it to concepts in the UMLS Metathesaurus and entries in the SwissProt database.</Paragraph>
      <Paragraph position="3"> These two entities correspond to the concepts C0029007 (p21-Ras) and C0288472 (p21-Waf) in the UMLS Metathesaurus. There are 27 strings or names in the UMLS that map to C0288472 (Table  It can be observed that there is only one exact match: p21 in C0288472 and Table 1. It should be noted that p21 is not present in the UMLS as a possible string for C0029007. There are other close matches like p21(Waf1/Cip1) (which seems very frequent) and p21(waf1-cip1).</Paragraph>
      <Paragraph position="4"> An expression like The inhibitor of cyclin-dependent kinases WAF1 gene product p21 has a high similarity with Cyclin-Dependent Kinase Inhibitor 1 A, and The cyclin-dependent kinase-I p21(Waf-1) partially matches Cyclin-Dependent Kinase. However, there are other mappings which look quite difficult unless some context is given to provide additional clues (e.g., v-p21).</Paragraph>
      <Paragraph position="5"> The SwissProt entries CDN1A_FELCA, CDN1A_HUMAN and CDN1A_MOUSE are related to p21(Waf). They have the following set of common description names: Cyclin-dependent kinase inhibitor 1, p21, CDK-interacting protein 1. There is only one entry in SwissProt related to p21-ras, Q9PSS8_PLAFE, with the description name P21-ras protein and a related gene name, Ki-ras. It should be noted that SwissProt classifies proteins that belong to different organisms as different entities. The UMLS Metathesaurus, on the other hand, does not make this distinction, nor is this distinction always present in the literature.</Paragraph>
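      <Paragraph position="6"> As a rough illustration of the task described in this subsection (associating a string with an entity represented by a set of known strings), the following Python sketch scores a mention against each entity's known strings using token-level Jaccard overlap. The metric, the tokenizer, and the helper names (tokens, jaccard, best_entity) are assumptions made for illustration only and are not the adaptive metrics studied in this paper; the string lists are a small hand-picked subset of the names quoted above, not the full UMLS entries.

```python
import re


def tokens(name):
    """Lowercase a name and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", name.lower()))


def jaccard(a, b):
    """Token-set Jaccard similarity between two names (0.0 to 1.0)."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta.intersection(tb)) / len(ta.union(tb))


def best_entity(mention, entities):
    """Return (entity_id, score) for the entity whose known strings match best."""
    scored = [
        (max(jaccard(mention, s) for s in strings), entity_id)
        for entity_id, strings in entities.items()
    ]
    score, entity_id = max(scored)
    return entity_id, score


# Hypothetical subset of known strings per concept (illustration only).
KNOWN = {
    "C0029007": ["p21-Ras"],
    "C0288472": ["p21", "Cyclin-Dependent Kinase Inhibitor 1 A"],
}

if __name__ == "__main__":
    for mention in ["p21", "The cyclin-dependent kinase-I p21(Waf-1)"]:
        print(mention, best_entity(mention, KNOWN))
```

 With these toy lists the bare mention p21 is assigned to C0288472 with a perfect score, even though the same name is also used for p21-ras in the literature. This is the kind of ambiguity that string similarity alone cannot resolve.</Paragraph>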
    </Section>
  </Section>
</Paper>