<?xml version="1.0" standalone="yes"?> <Paper uid="W06-3304"> <Title>Integrating Ontological Knowledge and Textual Evidence in Estimating Gene and Gene Product Similarity</Title> <Section position="4" start_page="27" end_page="28" type="metho"> <SectionTitle> 3 Textual Evidence Selection </SectionTitle> <Paragraph position="0"> Our first step in integrating textual evidence into the XOA algorithm is to select salient information from biomedical literature germane to the problem.</Paragraph> <Paragraph position="1"> Several approaches can be used to carry out this prerequisite. For example, one possibility is to collect documents relevant to the task at hand, e.g.</Paragraph> <Paragraph position="2"> through PubMed queries, and use feature weighting and selection techniques from the Information Retrieval literature[?]e.g. tf*idf (Buckley 1985) and Information Gain (e.g. Yang and Pedersen 1997)[?]to distill the most relevant information. Another possibility is to use Information Extraction algorithms tailored to the biomedical domain such as Medstract (http://www.medstract.org, Pustejovsky et al. 2002) to extract entity-relationship structures of relevance. Yet another possibility is to use specialized tools such as GoPubMed (Doms and Schroeder 2005) where traditional keyword-based capabilities are coupled with term extraction and ontological annotation techniques.</Paragraph> <Paragraph position="3"> In our study, we opted for the latter solution, using generic Information Retrieval techniques to normalize and weigh the textual evidence extracted. The main advantage of this choice is that tools such as GoPubMed provide very high quality term extraction at no cost. Less appealing is the fact that the textual evidence provided is GO-based and therefore does not offer information which is orthogonal to the gene ontology. It is reasonable to expect better results than those reported in this paper if more GO-independent textual evidence were brought to bear. We are currently working on using Medstract as a source of additional textual evidence. null GoPubMed is a web server which allows users to explore PubMed search results using the Gene Ontology for categorization and navigation purposes (available at http://www.gopubmed.org). As shown in Figure 1 below, the system offers the following functionality: results which map to GO categories (e.g. highlighted terms other than &quot;Rab5&quot; in the middle windowpane of Figure 1).</Paragraph> <Paragraph position="4"> In integrating textual evidence with the XOA algorithm, we utilized the last functionality (automatic extraction of terms) as an Information Extraction capability. Details about the term extraction algorithm used in GoPubMed are given in Delfs et al. (2004). In short, the GoPubMed term extraction algorithm uses word alignment strategies in combination with stemming to match word sequences from PubMed abstracts with GO terms.</Paragraph> <Paragraph position="5"> In doing so, partial and discontinuous matches are allowed. Partial and discontinuous matches are weighted according to closeness of fit. This is indicated by the accuracy percentages associated with GO in Figure 1 (right side). 
<Paragraph position="6"> Figure 1: The GoPubMed interface after the user issues the protein query and then selects the GO term &quot;late endosome&quot; (bottom left) as the discriminating parameter.</Paragraph> <Paragraph position="7"> Our data set consists of 2360 human protein pairs containing 1783 distinct human proteins. This data set was obtained as a 1% random sample of the human proteins used in the benchmark study of Posse et al. (2006); see Table 1.</Paragraph> <Paragraph position="8"> For each of the 1783 human proteins, we issued a GoPubMed query and retrieved up to 100 abstracts. We then collected all the terms extracted by GoPubMed for each protein across the retrieved abstracts. Table 2 provides an example of the output of this process; for one protein, the extracted terms include: nutrient, uptake, carbohydrate, metabolism, affecting, cathepsin, activity, protein, lipid, growth, rate, habitually, signal, transduction, fat, protein, cadherin, chromosomal, responses, exogenous, lactating, exchanges, affects, mammary, gland, ....</Paragraph> <Paragraph position="9"> We chose such a small sample to facilitate the collection of evidence from GoPubMed, which is not yet fully automated. Our XOA approach is very scalable, and we do not anticipate any problem running the full protein data set of 255,502 pairs once we fully automate the GoPubMed extraction process.</Paragraph> </Section> <Section position="5" start_page="28" end_page="29" type="metho"> <SectionTitle> 4 Integrating Textual Evidence in XOA </SectionTitle> <Paragraph position="0"> Using the output of the GoPubMed term extraction process, we created vector-based signatures for each of the 1783 proteins, where
* features are obtained by stemming the terms provided by GoPubMed, and
* the value of each feature is its tf*idf weight.</Paragraph> <Paragraph position="1"> We then calculated the similarity between each of the 2360 protein pairs as the cosine of the two vector-based signatures associated with the pair; a sketch of this computation is given below.</Paragraph>
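The following minimal sketch illustrates the signature-and-cosine computation. The Porter stemmer, scikit-learn's tf*idf weighting, and the protein term lists below (modeled loosely on the Table 2 excerpt) are illustrative assumptions rather than details taken from the paper.

# Sketch of vector-based protein signatures: features are stemmed
# GoPubMed terms, values are tf*idf weights, and pair similarity is
# the cosine of the two signature vectors.
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

stemmer = PorterStemmer()

def stem_terms(term_list):
    """Split a comma-separated term list and stem each term."""
    return [stemmer.stem(t.strip()) for t in term_list.lower().split(",")]

# Hypothetical GoPubMed term lists for two proteins (cf. Table 2).
protein_terms = {
    "protein_A": "nutrient, uptake, carbohydrate, metabolism, cathepsin, activity",
    "protein_B": "signal, transduction, metabolism, mammary, gland, activity",
}

vectorizer = TfidfVectorizer(tokenizer=stem_terms, lowercase=False)
signatures = vectorizer.fit_transform(protein_terms.values())

# Cosine similarity between the two protein signatures.
print(cosine_similarity(signatures[0], signatures[1])[0, 0])

In the paper's setting, each protein's term list would be accumulated across the (up to 100) abstracts retrieved for that protein.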
<Paragraph position="2"> We tried two different strategies to augment the XOA score for protein similarity using the protein similarity values obtained as the cosine of the GoPubMed term-based signatures. The first strategy adopts a fusion approach in which the two similarity measures are first normalized to be commensurable and then combined to provide an interpretable integrated model. A simple normalization is obtained by observing that Resnik's information content measure is commensurable with the log of the text-based cosine (LC). This leads us to the fusion model shown in (5) for the XOA score based on Resnik's semantic similarity measure (XOA_R). The three XOA variants (XOA_R, XOA_L, and XOA_JC) are highly correlated (correlations exceed 0.95 on the large benchmarking dataset discussed in Section 2; see Table 1). This suggests the fusion model shown in (6), where the averages of the XOA scores are computed from the benchmarking data set.</Paragraph> <Paragraph position="3"> The second strategy consists of building a prediction model for the BLAST bit score (BBS) using the XOA score and the log-cosine LC as predictors, without the constraint of remaining interpretable. As in the previous strategy, a different model was sought for each of the three XOA variants. In each case, we restrict ourselves to cubic polynomial regression models, as such models are quite efficient at capturing complex nonlinear relationships between target and predictors (e.g. Weisberg 2005). More precisely, for each of the semantic similarity measures, we fit the regression model for BBS shown in (7), where the subscript x denotes either R, L, or JC, and the coefficients a to h are found by maximizing the Spearman rank-order correlation between BBS and the regression model's predictions. This maximization is carried out automatically using a random-walk optimization approach (Romeijn 1992); a sketch of this procedure is given below. The coefficients used in this study for each semantic similarity measure are shown in Table 3.</Paragraph>
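To illustrate the coefficient search, here is a minimal sketch. Since equation (7) itself is not reproduced above, the eight-term polynomial below (powers of XOA_x and LC up to degree three plus one interaction term, matching the count of coefficients a to h) is an assumed form, and the function names are illustrative; the random walk simply perturbs the coefficients and keeps a move only when the Spearman correlation improves.

# Sketch of fitting the cubic regression by maximizing the Spearman
# rank-order correlation via a random-walk search (after Romeijn 1992).
# NOTE: the eight-term polynomial is an ASSUMED form of equation (7).
import numpy as np
from scipy.stats import spearmanr

def predict(coef, xoa, lc):
    """Cubic polynomial in XOA_x and LC with one interaction term."""
    a, b, c, d, e, f, g, h = coef
    return (a + b * xoa + c * xoa**2 + d * xoa**3
            + e * lc + f * lc**2 + g * lc**3 + h * xoa * lc)

def fit_random_walk(bbs, xoa, lc, steps=20000, scale=0.05, seed=0):
    """Perturb the coefficients at random; keep a move only if it
    improves the Spearman correlation between BBS and the predictions."""
    rng = np.random.default_rng(seed)
    coef = rng.normal(0.0, 1.0, size=8)
    best, _ = spearmanr(bbs, predict(coef, xoa, lc))
    if np.isnan(best):
        best = -1.0
    for _ in range(steps):
        cand = coef + rng.normal(0.0, scale, size=8)
        rho, _ = spearmanr(bbs, predict(cand, xoa, lc))
        if rho > best:
            coef, best = cand, rho
    return coef, best

</Section> </Paper>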