<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-0303">
  <Title>Contrast And Variability In Gene Names</Title>
  <Section position="5" start_page="864" end_page="864" type="concl">
    <SectionTitle>
6 Conclusions
</SectionTitle>
    <Paragraph position="0"> Entity identification is a difficult task whose success depends in part on performance in other tasks, including disambiguation and information retrieval. Disambiguation of the actual referent of an apparent gene or protein name is even more important than one might expect. Hatzivassiloglou et al. (2001) point out the benefits and the difficulties inherent in distinguishing between genes, proteins, and RNA; we found that it was also important to differentiate genes, proteins, and RNA from receptors, promoters, antagonists, domains, and binding sites, as well as from diseases, syndromes, conditions, phenotypes, and mutants, since our subject-matter expert noted all of these as sources of false positives. Good information retrieval is clearly also a prerequisite for high-precision entity identification: in some cases, false positives arose when (abstracts of) irrelevant documents were used as input.</Paragraph>
    <Paragraph position="1"> Heuristics can be useful tools for increasing recall in entity identification, as well as for helping to ensure that we are performing true entity identification, as opposed to entity location. Tanabe and Wilbur (in press) point out the value of combining knowledge sources in the entity identification task; our heuristics seem especially promising in part because they are based on a combination of two sources: (1) the expertise of NLP application developers about the sorts of variability that need to be dealt with in NLP systems (e.g. in text normalization), and (2) empirical data about variability in the names themselves. Future work should concentrate on three areas. The first is extending our study of variability to include other dimensions of contrast, such as those that we noted our study ignored, so that we can increase the inventory of heuristics. The second is integrating our heuristics with a system that identifies weak matches with gene names, i.e. candidates for application of the heuristics. The third is elucidating the place of orthographic variability among all causes of pattern match failure.</Paragraph>
  </Section>
</Paper>