<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-1301">
  <Title>Gene Name Extraction Using FlyBase Resources</Title>
  <Section position="1" start_page="0" end_page="5" type="abstr">
    <SectionTitle>
Abstract
</SectionTitle>
    <Paragraph position="0"> Machine-learning based entity extraction requires a large corpus of annotated training data to achieve acceptable results. However, the cost of expert annotation of relevant data, coupled with issues of inter-annotator variability, makes it expensive and time-consuming to create the necessary corpora. We report here on a simple method for the automatic creation of large quantities of imperfect training data for a biological entity (gene or protein) extraction system. We used resources available in the FlyBase model organism database; these resources include curated lists of genes and the articles from which the entries were drawn, together with a synonym lexicon. We applied simple pattern matching to identify gene names in the associated abstracts and filtered these entities using the list of curated entries for the article. This process created a data set that could be used to train a simple Hidden Markov Model (HMM) entity tagger. The results from the HMM tagger were comparable to those reported by other groups (F-measure of 0.75). This method has the advantage of being rapidly transferable to new domains that have similar existing resources.</Paragraph>
    <Paragraph position="1"> Introduction: Biological Databases. There is currently an information explosion in biomedical research. The growth of literature is roughly exponential, as can be seen in Figure 1, which shows the number of literature references in  This growth of literature makes it daunting for researchers to keep track of the information, even in very small subfields of biology.</Paragraph>
    <Paragraph position="2"> Increasingly, biological databases serve to collect and organize published experimental results. A wide range of biological databases exist, including model organism databases (e.g., for mouse and yeast) as well as various protein databases (e.g.,  (PIR) or SWISStor), a model organism for genetics research: http://www.flybase.org.</Paragraph>
  </Section>
</Paper>