File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/concl/03/w03-1316_concl.xml
Size: 2,610 bytes
Last Modified: 2025-10-06 13:53:45
<?xml version="1.0" standalone="yes"?> <Paper uid="W03-1316"> <Title>Selecting Text Features for Gene Name Classification: from Documents to Terms</Title> <Section position="6" start_page="3" end_page="3" type="concl"> <SectionTitle> 5 Conclusion </SectionTitle> <Paragraph position="0"> Due to the enormous number of terms and the complex and inconsistent structure of biomedical terminology, manual updates of knowledge repositories are prone to be both inefficient and inconsistent (Nenadic et al., 2002b; Stapley et al., 2002). Therefore, automatic text-based classification of biological entities (such as gene and protein names) is essential for efficient knowledge management and for a systematic approach that can cope with the huge volume of biomedical literature. Furthermore, classified terms have a positive impact on the results of IE/IR, knowledge acquisition, document classification and terminology management (Blaschke et al., 2002).</Paragraph> <Paragraph position="1"> In this paper we have examined procedures for engineering text-based features at various levels of linguistic pre-processing, and considered their impact on the performance of an SVM-based gene name classifier. The experiments have shown that simple linguistic pre-processing (such as lemmatisation and stemming) does not have a significant influence on performance, i.e. there is no need to pre-process documents in this way. Also, reducing the feature space by selecting only features that appear in more documents does not decrease performance, but can significantly reduce the time needed for training. 
PMID-based classification has shown very good performance, but a PMID-based classifier can be applied only to documents in the training set.</Paragraph> <Paragraph position="2"> The experiments have also shown that using semantic indicators (represented by dynamically extracted domain-specific terms) can improve performance compared to the standard bag-of-words approach, in particular at lower recall points and for rare classes. This means that terms can be used as reliable features for classifying genes with higher confidence, and for under-represented classes. However, terminological analysis requires considerable pre-processing time.</Paragraph> <Paragraph position="3"> Our further research will focus on generating a biological interpretation and justification of the classification results by using the terms that served as key distinguishing features for classification as semantic indicators of the corresponding classes.</Paragraph> </Section></Paper>