<?xml version="1.0" standalone="yes"?> <Paper uid="W06-3306"> <Title>Human Gene Name Normalization using Text Matching with Automatically Extracted Synonym Dictionaries</Title> <Section position="7" start_page="46" end_page="47" type="concl"> <SectionTitle> 7 Conclusions and Future Work </SectionTitle> <Paragraph position="0"> In this article we present a gene normalization system that is intended for use in human gene NER, but that can also be readily adapted to other biomedical normalization tasks. When optimized for human gene normalization, our system achieved an F-measure of 0.783 at the mention level.</Paragraph> <Paragraph position="1"> Choosing the proper normalization steps depends on several factors, such as (for genes) the organism of interest, the entity class, the accuracy of identifying gene mentions, and the reliability of the underlying dictionary. While the results of our normalizer compare favorably with previous efforts, considerable future work could further improve the performance of our system, including: 1. Performance of identifying gene mentions.</Paragraph> <Paragraph position="2"> Only approximately 50 percent of gene mentions identified by our tagger were normalizable. Although this is mostly because the tagger identifies gene classes that cannot be normalized to a gene instance, a significant subset of gene instance mentions is not being normalized.</Paragraph> <Paragraph position="3"> 2. Reliability of the dictionary. Though we have investigated a sizable number of gene identifier sources, the four representative sources used for compiling our gene dictionary are incomplete and often imprecise for individual terms. Some text mentions were not normalizable due to the incompleteness of our dictionary, which limited recall.</Paragraph> <Paragraph position="4"> 3. Disambiguation. A small portion (typically 7%-10%) of the matches were ambiguous. Successful development of disambiguation tools could improve performance. 4.
Machine-learning. It is likely that optimized rules can be used as probabilistic features for a machine-learning-based version of our normalizer.</Paragraph> <Paragraph position="5"> Gene normalization has several potential applications, such as biomedical information extraction and database curation, and it can serve as a prerequisite for relation extraction. Given a proper synonym dictionary, our normalization program can be generalized to other organisms, and it has already proven successful in our group for other entity normalization tasks. An interesting future study would be to determine accuracy on BioCreAtIvE data once mouse, Drosophila, and yeast vocabularies are incorporated into our system.</Paragraph> </Section></Paper>