File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/03/w03-1316_intro.xml
Size: 4,145 bytes
Last Modified: 2025-10-06 14:02:00
<?xml version="1.0" standalone="yes"?> <Paper uid="W03-1316"> <Title>Selecting Text Features for Gene Name Classification: from Documents to Terms</Title> <Section position="3" start_page="3" end_page="3" type="intro"> <SectionTitle> 2 Related work </SectionTitle> <Paragraph position="0"> An SVM is a binary classification method that combines statistical learning and optimisation techniques with kernel mapping (Vapnik, 1995). The main idea of the method is to automatically learn a separation hyperplane from a set of training examples, which splits classified entities into two subsets according to a certain classification property. The optimisation part is used to maximise the distance (called the margin) of each of the two subsets from the hyperplane.</Paragraph> <Paragraph position="1"> The SVM approach has been used for different classification tasks quite successfully, in particular for document classification, where the method out-performed many alternative approaches (Joachims, 1998). Similarly, SVMs have been used for term classification. For example, a bag-of-simple-words approach with idf-like weights was used to learn a multi-class SVM classifier for protein cellular location classification (Stapley et al., 2002). Proteins were represented by feature vectors consisting of simple words co-occurring with them in a set of relevant Medline abstracts. The precision of the method was better than that of a classification method based on experimental data, and similar to a rule-based classifier.</Paragraph> <Paragraph position="2"> Unlike many other classification methods that have difficulties coping with huge dimensions, one of the main advantages of the SVM approach is that its performance does not depend on the dimensionality of the space where the hyperplane separation takes place. This fact has been exploited in the way that many authors have suggested that &quot;there are few irrelevant features&quot; and that &quot;SVMs eliminate the need for feature selection&quot; (Joachims, 1998). It has been shown that even the removal of stop-words is not necessary (Leopold and Kindermann, 2002).</Paragraph> <Paragraph position="3"> Few approaches have been undertaken only recently to tune the original SVM approach by selecting different features, or by using different feature weights and kernels, mostly for the document classification task. For example, Leopold and Kindermann (2002) have discussed the impact of different feature weights on the performance of SVMs in the case of document classification in English and German. They have reported that an entropy-like weight was generally performing better than idf, in particular for larger documents.</Paragraph> <Paragraph position="4"> Also, they suggested that, if using single words as features, the lemmatisation was not necessary, as it had no significant impact on the performance.</Paragraph> <Paragraph position="5"> Lodhi et al. (2002) have experimented with different kernels for document classification. They have shown that a string kernel (which generates all sub-sequences of a certain number of characters) could be an effective alternative to linear kernel SVMs, in particular in the sense of efficiency. In the case of term classification, Kazama et al.</Paragraph> <Paragraph position="6"> (2002) used a more exhaustive feature set containing lexical information, POS tags, affixes and their combinations in order to recognise and classify terms into a set of general biological classes used within the GENIA project (GENIA, 2003). They investigated the influence of these features on the performance. For example, they claimed that suffix information was helpful, while POS and prefix features did not have clear or stable influence.</Paragraph> <Paragraph position="7"> While each of these studies used some kind of orthographical and/or lexical indicators to generate relevant features, we wanted to investigate the usage of semantic indicators (such as domain-specific terms) as classification features, and to compare their performance with the classic lexically-based features.</Paragraph> </Section> class="xml-element"></Paper>