<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-0302">
  <Title>Tagging Gene and Protein Names in Full Text Articles</Title>
  <Section position="4" start_page="0" end_page="0" type="relat">
    <SectionTitle>
3 Experiment and Results
</SectionTitle>
    <Paragraph position="0"> We evaluated the performance of ABGene on 2600 PMC sentences from 13 score levels ranging from -8 to 60+. No attempt was made to narrow the set using query terms. The sentences were selected as follows: half of the test set consists of the first 100 sentences from each score level, and the other half consists of 100 sentences selected at random from each score level. Precision and recall results are shown for each individual score range in Table 2, and cumulative results are shown in Table 3.</Paragraph>
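    <Paragraph> As a rough illustration of how per-level and cumulative figures of this kind can be derived, the sketch below computes precision and recall from true positive, false positive and false negative counts. The counts, the three score levels and the direction of accumulation (highest score level first) are assumptions made for the example; they are not the values reported in Table 2 or Table 3.

```python
from itertools import accumulate


def precision_recall(tp, fp, fn):
    """Return (precision, recall); 0.0 is used when a denominator is zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def cumulative_counts(counts):
    """Running sums of (tp, fp, fn) tuples, in the order given."""
    return list(accumulate(counts, lambda a, b: tuple(x + y for x, y in zip(a, b))))


# Made-up counts for three hypothetical score levels, highest level first.
counts = [(8, 2, 2), (6, 4, 4), (1, 9, 3)]

per_level = [precision_recall(*c) for c in counts]
cumulative = [precision_recall(*c) for c in cumulative_counts(counts)]

for level, (p, r) in enumerate(cumulative):
    print(f"levels 0..{level}: precision={p:.3f} recall={r:.3f}")
```
    </Paragraph>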
    <Paragraph position="1"> The number of words tested varies for each score level because longer sentences tend to have higher scores. Also, sentences with scores near zero tend to be table or figure entries, with only a few words each.</Paragraph>
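    <Paragraph> The selection scheme described above (the first 100 sentences of each score level plus 100 chosen at random) can be sketched as follows. This is a minimal sketch under stated assumptions: the function name, the integer level labels and the placeholder sentences are hypothetical, and the random half is assumed to be drawn from the sentences after the first 100 so that the two halves do not overlap, which the text does not specify.

```python
import random


def select_test_sentences(sentences_by_level, n_first=100, n_random=100, seed=0):
    """For each score level, keep the first n_first sentences and add
    n_random more drawn at random from the remaining sentences."""
    rng = random.Random(seed)
    selected = {}
    for level, sentences in sentences_by_level.items():
        first = sentences[:n_first]
        rest = sentences[n_first:]
        # Assumption: sample without replacement, never more than is available.
        k = min(n_random, len(rest))
        selected[level] = first + rng.sample(rest, k)
    return selected


# Toy data: 13 hypothetical score levels with 500 placeholder sentences each.
levels = {i: [f"level{i}-sent{j}" for j in range(500)] for i in range(13)}
picked = select_test_sentences(levels)
print(sum(len(v) for v in picked.values()))  # 2600
```
    </Paragraph>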
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Problematic Areas in Full Text
</SectionTitle>
      <Paragraph position="0"> The false positive gene/protein names found in the PMC articles reveal new difficulties for the basic task of identifying gene and protein names in biomedical text. For example, in abstracts, entities like restriction enzyme sites, laboratory protocol kits, primers, vectors, molecular biology supply companies and chemical reagents are usually scarce. However, in the methods section of a full document, they appear regularly, adding to the morphological, syntactic and semantic ambiguities previously mentioned.</Paragraph>
      <Paragraph position="1"> Illustrative examples of such false positives include bio-rad, centricon30 spin, xbai sites, mg2, geneamp and pgem3z. A significant source of false negatives is the tables and figures in full text, which completely lack contextual cues and indicator words. Both problems can be addressed by excluding materials and methods sections, tables and figures from processing. Another significant source of false negatives is an artifact of the PMC format. For example, the letter beta is rendered as [beta], so a name like beta1 integrin becomes [beta]1 integrin in PMC. This is easily addressed by removing the PMC formatting prior to processing, a step that has already been implemented for future work on PMC articles.</Paragraph>
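      <Paragraph> The removal of PMC formatting mentioned above might look like the following sketch. It is not the procedure actually used: the function name and the list of letter names are assumptions, since the text gives only [beta] as an example, and a real list would need to cover every bracketed form that PMC emits.

```python
import re

# Hypothetical, partial list of letter names; only [beta] appears in the text.
GREEK_NAMES = ("alpha", "beta", "gamma", "delta", "kappa")

# Matches a bracketed letter name such as [beta] and captures the bare name.
_BRACKETED = re.compile(r"\[(" + "|".join(GREEK_NAMES) + r")\]")


def strip_pmc_brackets(text):
    """Replace bracketed letter names such as [beta] with the bare name."""
    return _BRACKETED.sub(r"\1", text)


print(strip_pmc_brackets("[beta]1 integrin"))  # beta1 integrin
```

 Restricting the pattern to a fixed list of names leaves other bracketed text untouched, at the cost of missing any letter name that is not listed.
      </Paragraph>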
    </Section>
  </Section>
</Paper>