<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-3303">
  <Title>Using the Gene Ontology for Subcellular Localization Prediction</Title>
  <Section position="3" start_page="17" end_page="18" type="relat">
    <SectionTitle>
2 Related Work
</SectionTitle>
    <Paragraph position="0"> Several different learning algorithms have been explored for text classification (Dumais et al., 1998), and support vector machines (SVMs) (Vapnik, 1995) were found to be the most computationally efficient and to have the highest precision/recall break-even point (BEP, the point where precision equals recall). Joachims (1998) performed a very thorough evaluation of the suitability of SVMs for text classification. He states that SVMs are well suited to textual data because text classification produces sparse training instances in a very high-dimensional space.</Paragraph>
    <Paragraph position="1"> Soon after Joachims' survey, researchers started using SVMs to classify biological journal abstracts.</Paragraph>
    <Paragraph position="2"> Stapley et al. (2002) used SVMs to predict the subcellular localization of yeast proteins. They created a data set by mining Medline for abstracts containing a yeast gene name, and their classifier achieved F-measures in the range [0.31, 0.80]. F-measure is defined as</Paragraph>
    <Paragraph position="4"> F = 2pr/(p + r), where p is precision and r is recall. They expanded their training data to include extra biological information about each protein, in the form of amino acid content, and raised their F-measure by as much as 0.05. These results are modest, but before Stapley et al. most localization classification systems were built using text rules or were sequence-based. Theirs was one of the first applications of SVMs to biological journal abstracts, and it showed that text and amino acid composition together yield better results than either alone.</Paragraph>
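As a concrete illustration (not part of the original paper), the balanced F-measure defined above can be computed as follows; the function name `f_measure` is our own:

```python
def f_measure(p: float, r: float) -> float:
    """Balanced F-measure: the harmonic mean of precision p and recall r.

    Returns 0.0 when both p and r are zero, since 2pr/(p+r) is
    undefined in that case.
    """
    if p + r == 0:
        return 0.0
    return 2 * p * r / (p + r)
```

For example, a classifier with precision 1.0 but recall 0.5 scores only 2/3, reflecting how the harmonic mean penalizes an imbalance between the two quantities.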
    <Paragraph position="5"> Properties of the proteins themselves were again used to improve text categorization for animal, plant and fungi subcellular localization data sets (Höglund et al., 2006). The authors' text classifiers were based on the most distinguishing terms of documents, and they included the output of four protein sequence classifiers in their training data. They measured the performance of their classifiers using what they call sensitivity and specificity, though the formulas cited are the standard definitions of recall and precision. Their text-only classifier for the animal MultiLoc data set had recall (sensitivity) in the range [0.51, 0.93] and precision (specificity) in the range [0.32, 0.91]. The MultiLocText classifiers, which include sequence-based classifications, have recall [0.82, 0.93] and precision [0.55, 0.95]. Their overall and average accuracy increased by 16.2% and 9.0%, to 86.4% and 94.5% respectively, on the PLOC animal data set when text was augmented with additional sequence-based information.</Paragraph>
    <Paragraph position="6"> Our method is motivated by the improvements that Stapley et al. and Höglund et al. saw when they included additional biological information. However, our technique uses knowledge of a textual nature to improve text classification; it uses no information from the amino acid sequence. Thus, our approach can be used in conjunction with techniques that use properties of the protein sequence.</Paragraph>
    <Paragraph position="7"> In non-biological domains, external knowledge has already been used to improve text categorization (Gabrilovich and Markovitch, 2005). In their  research, text categorization is applied to news documents, newsgroup archives and movie reviews. The authors use the Open Directory Project (ODP) as a source of world knowledge to help alleviate problems of polysemy and synonymy. The ODP is a hierarchy of concepts where each concept node has links to related web pages. The authors mined these web pages to collect characteristic words for each concept. Then a new document was mapped, based on document similarity, to the closest matching ODP concept and features were generated from that concept's meaningful words. The generated features, along with the original document, were fed into an SVM text classifier. This technique yielded BEP as high as 0.695 and improvements of up to 0.254.</Paragraph>
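To make the feature-generation pipeline above concrete, here is a minimal sketch in the spirit of Gabrilovich and Markovitch's method, not their actual implementation: a document is mapped by bag-of-words cosine similarity to its closest concept, and that concept's characteristic words are appended as extra features. The function names (`cosine`, `augment_with_concept`) and the toy concept data are our own illustrative assumptions.

```python
from collections import Counter
from math import sqrt


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def augment_with_concept(doc: str, concepts: dict) -> list:
    """Map doc to the closest concept by cosine similarity, then
    return the document's tokens plus that concept's characteristic
    words as additional features for a downstream classifier."""
    doc_bow = Counter(doc.lower().split())
    best = max(concepts, key=lambda c: cosine(doc_bow, Counter(concepts[c])))
    return doc.lower().split() + concepts[best]
```

In the full method, the augmented feature vector (original tokens plus generated concept features) is what gets fed to the SVM; the concept lexicon here stands in for the word lists mined from ODP web pages.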
    <Paragraph position="8"> We use Gabrilovich and Markovitch's (2005) idea of employing an external knowledge hierarchy, in our case the GO, as a source of information. It has been shown that GO molecular function annotations in Swiss-Prot are indicative of subcellular localization annotations (Lu and Hunter, 2005), and that GO node names made up about 6% of a sample Medline corpus (Verspoor et al., 2003). Some consider GO terms too rare to be of use (Rice et al., 2005); however, we will show that although the presence of GO terms is slight, the terms are powerful enough to improve text classification. Our technique's success may be due to the fact that we include the synonyms of GO node names, which increases the number of GO terms found in the documents.</Paragraph>
    <Paragraph position="9"> We use the GO hierarchy in a different way than Gabrilovich and Markovitch use the ODP. Unlike their approach, we do not extract additional features from all articles associated with a node of the GO hierarchy.</Paragraph>
    <Paragraph position="10"> Instead, we use synonyms of nodes and the names of ancestor nodes. This is a simpler approach, as it does not require retrieving all abstracts for all proteins of a GO node. Nonetheless, we will show that our approach is still effective.</Paragraph>
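The synonym-and-ancestor expansion described above can be sketched as follows. This is a minimal illustration under strong simplifying assumptions: the toy GO fragment, the single-token node names, and the function name `expand_go_terms` are all hypothetical (the real GO uses multi-word names, identifiers, and OBO files).

```python
# Toy fragment of the GO hierarchy (hypothetical data):
# node name -> (list of synonyms, parent node name or None).
GO = {
    "mitochondrion": (["mitochondria"], "intracellular organelle"),
    "intracellular organelle": ([], "cellular component"),
    "cellular component": ([], None),
}


def expand_go_terms(tokens: list) -> list:
    """For each token that names a GO node, append its synonyms and
    the names of all its ancestors as additional text features."""
    extra = []
    for tok in tokens:
        if tok not in GO:
            continue
        syns, parent = GO[tok]
        extra.extend(syns)
        # Walk up the hierarchy, collecting every ancestor's name.
        while parent is not None:
            extra.append(parent)
            parent = GO[parent][1] if parent in GO else None
    return tokens + extra
```

A single mention of "mitochondrion" thus contributes its synonym and two ancestor names as features, which is how even a slight presence of GO terms can noticeably enrich a document's representation.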
  </Section>
</Paper>