<?xml version="1.0" standalone="yes"?> <Paper uid="W03-1301"> <Title>Gene Name Extraction Using FlyBase Resources</Title> <Section position="3" start_page="16" end_page="16" type="concl"> <SectionTitle> 6 Conclusion and Future Directions </SectionTitle> <Paragraph position="0"> We have demonstrated that we can automatically produce large quantities of relatively high-quality training data; these data were good enough to train an HMM-based tagger to identify gene mentions with an F-measure of 75% (precision of 78% and recall of 71%), evaluated on our small development test set of 86 abstracts. This compares favorably with other reported results described in Section 2 and, as discussed below, we believe that we can improve upon these results in several ways.</Paragraph> <Paragraph position="1"> These results are still considerably below the results from [Gaizauskas03] and may be too low to be useful as a building block for further automated processing, such as relation extraction. However, in the absence of any shared benchmark evaluation sets, cross-system performance cannot be compared directly, since the task definitions and evaluation corpora differ from system to system.</Paragraph> <Paragraph position="2"> We plan to take this work in several directions.</Paragraph> <Paragraph position="3"> First, we believe that we can improve the quality of the underlying automatically generated data and, with it, the quality of the entity tagging. Several things could be improved.</Paragraph> <Paragraph position="4"> A morphological analyzer trained for biological text would eliminate some of the tokenization errors and perhaps capture some of the underlying regularities, such as the addition of Greek letters or numbers (with or without a preceding hyphen) to specify sub-types within a gene family. Gene names and their formatting can also carry considerable semantic content. 
For example, many Drosophila genes are distinguished from the genes of other organisms by a prepended &quot;d&quot; or &quot;D&quot;, as in &quot;dToll&quot;. Gene names can also be explicit descriptions of their chromosomal location or even function (e.g. Dopamine receptor).</Paragraph> <Paragraph position="5"> The problem of matching abbreviations has been tackled by a number of researchers [e.g. Pustejovsky02 and Liu03]. As mentioned above, it seems that ambiguity in &quot;short forms&quot; of gene names could be partially resolved by detecting local definitions of abbreviations. It should also be possible to apply part-of-speech tagging and corpus statistics to avoid mis-tagging common words, such as &quot;to&quot; or &quot;and&quot;.</Paragraph> <Paragraph position="6"> In the longer term, this methodology provides an opportunity to go beyond gene name tagging for Drosophila. It can be extended to other domains that have comparable resources (e.g. other model organism genome databases, other biological entities), and entity tagging itself provides the foundation for more complex tasks, such as relation extraction (e.g. using the BIND database) or attribute extraction (e.g. using FlyBase to identify attributes, such as RNA transcript length, associated with protein-coding genes).</Paragraph> <Paragraph position="7"> Second, the existence of a synonym lexicon with unique identifiers provides data for term normalization, a task of potentially greater utility to biologists than the tagging of every mention in an article. There are currently few corpora annotated for term normalization; the methodology outlined here makes it possible to produce large quantities of normalized data. 
The identification and characterization of abbreviations and other transformations would be particularly important in normalization.</Paragraph> <Paragraph position="8"> By exploiting the rich set of biological resources that already exist, it should be possible to generate many kinds of corpora useful for training high-quality information extraction and text mining components.</Paragraph> </Section> </Paper>