
<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-1307">
  <Title>Using Biomedical Literature Mining to Consolidate the Set of Known Human Protein-Protein Interactions</Title>
  <Section position="4" start_page="47" end_page="48" type="metho">
    <SectionTitle>
4 Framework for Mining Protein-Protein Interactions
</SectionTitle>
      <Paragraph position="0"> The extraction of interacting proteins from Medline abstracts proceeds in two separate steps:  1. First, we automatically identify protein names  using a CRF system trained on a set of 750 abstracts manually annotated for proteins (see Section 5 for details).</Paragraph>
      <Paragraph position="1"> 2. Based on the output of the CRF tagger, we filter out less confident extractions and then try to detect which pairs of the remaining extracted protein names are interaction pairs.</Paragraph>
      <Paragraph position="2"> For the second step, we investigate two general methods: AF Use co-citation analysis to score each pair of proteins based on the assumption that proteins co-occurring in a large number of abstracts tend to be interacting proteins. Out of the resulting protein pairs we keep only those that co-occur in abstracts likely to discuss interactions, based on a Naive Bayes classifier (see Section 6 for details).</Paragraph>
      <Paragraph position="3"> AF Given that we already have a set of 230 Medline abstracts manually tagged for both proteins and interactions, we can use it to train an interaction extractor. In Section 7 we discuss two different methods for learning this interaction extractor.</Paragraph>
  </Section>
  <Section position="5" start_page="48" end_page="49" type="metho">
    <SectionTitle>
5 A CRF Tagger for Protein Names
</SectionTitle>
    <Paragraph position="0"> The task of identifying protein names is made difficult by the fact that unlike other organisms, such as yeast or E. coli, the human genes have no standardized naming convention, and thus present one of the hardest sets of gene/protein names to extract. For example, human proteins may be named with typical English words, such as &amp;quot;light&amp;quot;, &amp;quot;map&amp;quot;, &amp;quot;complement&amp;quot;, and &amp;quot;Sonic Hedgehog&amp;quot;. It is therefore necessary that an information extraction algorithm be specifically trained to extract gene and protein names accurately.</Paragraph>
    <Paragraph position="1"> We have previously described (Bunescu et al., 2005) effective protein and gene name tagging using a Maximum Entropy based algorithm. Conditional Random Fields (CRF) (Lafferty et al., 2001) are new types of probabilistic models that preserve all the advantages of Maximum Entropy models and at the same time avoid the label bias problem by allowing a sequence of tagging decisions to compete against each other in a global probabilistic model.</Paragraph>
    <Paragraph position="2"> In both training and testing the CRF protein-name tagger, the corresponding Medline abstracts were processed as follows. Text was tokenized using white-space as delimiters and treating all punctuation marks as separate tokens. The text was segmented into sentences, and part-of-speech tags were assigned to each token using Brill's tagger (Brill, 1995). For each token in each sentence, a vector of binary features was generated using the feature templates employed by the Maximum Entropy approach described in (Bunescu et al., 2005). Generally, these features make use of the words occurring before and after the current position in the text, their POS tags and capitalization patterns. Each feature occurring in the training data is associated with a parameter in the CRF model. We used the CRF implementation from (McCallum, 2002). To train the CRF's parameters, we used 750 Medline abstracts manually annotated for protein names (Bunescu et al., 2005). We then used the trained system to tag protein and gene names in the entire set of 753,459 Medline abstracts citing the word &amp;quot;human&amp;quot;.</Paragraph>
    <Paragraph position="3"> In Figure 2 we compare the performance of the CRF tagger with that of the Maximum Entropy tagger from (Bunescu et al., 2005), using the same set of features, by doing 10-fold cross-validation on Yapex - a smaller dataset of 200 manually annotated abstracts (Franzen et al., 2002). Each model assigns to each extracted protein name a normalized confidence value. The precision-recall curves from Figure 2 are obtained by varying a threshold on the minimum accepted confidence. We also plot the precision and recall obtained by simply matching textual phrases against entries from a protein dictionary.</Paragraph>
    <Paragraph position="4">  The dictionary of human protein names was assembled from the LocusLink and Swissprot databases by manually curating the gene names and synonyms (87,723 synonyms between 18,879 unique gene names) to remove genes that were referred to as 'hypothetical' or 'probable' and also to omit entries that referred to more than one protein identifier.</Paragraph>
  </Section>
  <Section position="6" start_page="49" end_page="50" type="metho">
    <SectionTitle>
6 Co-citation Analysis and Bayesian
Classification
</SectionTitle>
    <Paragraph position="0"> In order to establish which interactions occurred between the proteins identified in the Medline abstracts, we used a 2-step strategy: measure co-citation of protein names, then enrich these pairs for physical interactions using a Bayesian filter. First, we counted the number of abstracts citing a pair of proteins, and then calculated the probability of co-citation under a random model based on the hypergeometric distribution (Lee et al., 2004; Jenssen et al., 2001) as:</Paragraph>
    <Paragraph position="2"> where C6 equals the total number of abstracts, D2 of which cite the first protein, D1 cite the second protein, and CZ cite both.</Paragraph>
    <Paragraph position="3"> Empirically, we find the co-citation probability has a hyperbolic relationship with the accuracy on the functional annotation benchmark from Section 3, with protein pairs co-cited with low random probability scoring high on the benchmark.</Paragraph>
    <Paragraph position="4"> With a threshold on the estimated extraction confidence of 80% (as computed by the CRF model) in the protein name identification, close to 15,000 interactions are extracted with the co-citation approach that score comparable or better on the functional benchmark than the manually extracted interactions from HPRD, which serves to establish a minimal threshold for our mined interactions.</Paragraph>
    <Paragraph position="5"> However, it is clear that proteins are co-cited for many reasons other than physical interactions. We therefore tried to enrich specifically for physical interactions by applying a secondary filter. We applied a Bayesian classifier (Marcotte et al., 2001) to measure the likelihood of the abstracts citing the protein pairs to discuss physical protein-protein interactions. The classifier scores each of the co-citing abstracts according to the usage frequency of discriminating words relevant to physical protein interactions. For a co-cited protein pair, we calculated the average score of co-citing Medline abstracts and used this to re-rank the top-scoring 15,000 co-cited protein pairs.</Paragraph>
    <Paragraph position="6"> Interactions extracted by co-citation and filtered using the Bayesian estimator compare favorably with the other interaction data sets on the functional annotation benchmark (Figure 3). Testing the accuracy of these extracted protein pairs on the physical interaction benchmark (Figure 4) reveals that the co-cited proteins scored high by this classifier are indeed strongly enriched for physical interactions.</Paragraph>
    <Paragraph position="7">  Keeping all the interactions that score better than HPRD, our co-citation / Bayesian classifier analysis yields 6,580 interactions between 3,737 proteins. By combining these interactions with the 26,280 interactions from the other sources, we obtained a fi- null nal set of 31,609 interactions between 7,748 human proteins.</Paragraph>
  </Section>
  <Section position="7" start_page="50" end_page="51" type="metho">
    <SectionTitle>
7 Learning Interaction Extractors
</SectionTitle>
    <Paragraph position="0"> In (Bunescu et al., 2005) we described a dataset of 230 Medline abstracts manually annotated for proteins and their interactions. This can be used as a training dataset for a method that learns interaction extractors. Such a method simply classifies a sentence containing two protein names as positive or negative, where positive means that the sentence asserts an interaction between the two proteins. However a sentence in the training data may contain more than two proteins and more than one pair of interacting proteins. In order to extract the interacting pairs, we replicate the sentences having D2 proteins</Paragraph>
    <Paragraph position="2"> sentences such that each one has exactly two of the proteins tagged, with the rest of the protein tags omitted. If the tagged proteins interact, then the replicated sentence is added to the set of positive sentences, otherwise it is added to the set of negative sentences. During testing, a sentence having D2 proteins (D2 AL BE) is again replicated into</Paragraph>
    <Paragraph position="4"> sentences in a similar way.</Paragraph>
    <Section position="1" start_page="50" end_page="50" type="sub_section">
      <SectionTitle>
7.1 Extraction using Longest Common
Subsequences (ELCS)
</SectionTitle>
      <Paragraph position="0"> Blaschke et al. (Blaschke and Valencia, 2001; Blaschke and Valencia, 2002) manually developed rules for extracting interacting proteins. Each of their rules (or frames) is a sequence of words (or POS tags) and two protein-name tokens. Between every two adjacent words is a number indicating the maximum number of intervening words allowed when matching the rule to a sentence. In (Bunescu et al., 2005) we described a new method ELCS (Extraction using Longest Common Subsequences) that automatically learns such rules. ELCS' rule representation is similar to that in (Blaschke and Valencia, 2001; Blaschke and Valencia, 2002), except that it currently does not use POS tags, but allows disjunctions of words. Figure 5 shows an example of a rule learned by ELCS. Words in square brackets separated by 'CY' indicate disjunctive lexical constraints, i.e. one of the given words must match the sentence at that position. The numbers in parentheses between adjacent constraints indicate the maximum number of unconstrained words allowed between the two (called a word gap). The protein names are denoted here with PROT. A sentence matches the rule if and only if it satisfies the word constraints in the given order and respects the respective word gaps.</Paragraph>
    </Section>
    <Section position="2" start_page="50" end_page="51" type="sub_section">
      <SectionTitle>
7.2 Extraction using a Relation Kernel (ERK)
</SectionTitle>
      <Paragraph position="0"> Both Blaschke and ELCS do interaction extraction based on a limited set of matching rules, where a rule is simply a sparse (gappy) subsequence of words (or POS tags) anchored on the two protein-name tokens.</Paragraph>
      <Paragraph position="1"> Therefore, the two methods share a common limitation: either through manual selection (Blaschke), or as a result of the greedy learning procedure (ELCS), they end up using only a subset of all possible anchored sparse subsequences. Ideally, we would want to use all such anchored sparse subsequences as features, with weights reflecting their relative accuracy. However explicitly creating for each sentence a vector with a position for each such feature is infeasible, due to the high dimensionality of the feature space. Here we can exploit an idea used before in string kernels (Lodhi et al., 2002): computing the dot-product between two such vectors amounts to calculating the number of common anchored sub-sequences between the two sentences. This can be done very efficiently by modifying the dynamic programming algorithm from (Lodhi et al., 2002) to account only for anchored subsequences i.e. sparse subsequences which contain the two protein-name tokens. Besides restricting the word subsequences to be anchored on the two protein tokens, we can further prune down the feature space by utilizing the following property of natural language statements: whenever a sentence asserts a relationship between two entity mentions, it generally does this using one of the following three patterns: AF [FI] Fore-Inter: words before and between the two entity mentions are simultaneously used to express the relationship. Examples: 'interac- null AF [I] Inter: only words between the two entity mentions are essential for asserting the relationship. Examples: 'CWC8  Another useful observation is that all these patterns use at most 4 words to express the relationship (not counting the two entities). Consequently, when computing the relation kernel, we restrict the counting of common anchored subsequences only to those having one of the three types described above, with a maximum word-length of 4. This type of feature selection leads not only to a faster kernel computation, but also to less overfitting, which results in increased accuracy (we omit showing here comparative results supporting this claim, due to lack of space).</Paragraph>
      <Paragraph position="2"> We used this kernel in conjunction with Support Vector Machines (Vapnik, 1998) learning in order to find a decision hyperplane that best separates the positive examples from negative examples. We modified the libsvm package for SVM learning by plugging in the kernel described above.</Paragraph>
    </Section>
    <Section position="3" start_page="51" end_page="51" type="sub_section">
      <SectionTitle>
7.3 Preliminary experimental results
</SectionTitle>
      <Paragraph position="0"> We compare the following three systems on the task of retrieving protein interactions from the dataset of 230 Medline abstracts (assuming gold standard proteins): null AF [Manual]: We report the performance of the rule-based system of (Blaschke and Valencia, 2001; Blaschke and Valencia, 2002).</Paragraph>
      <Paragraph position="1"> AF [ELCS]: We report the 10-fold cross-validated results from (Bunescu et al., 2005) as a precision-recall graph.</Paragraph>
      <Paragraph position="2"> AF [ERK]: Based on the same splits as those used by ELCS, we compute the corresponding precision-recall graph.</Paragraph>
      <Paragraph position="3"> The results, summarized in Figure 6, show that the relation kernel outperforms both ELCS and the manually written rules. In future work, we intend to analyze the complete Medline with ERK and integrate the extracted interactions into a larger composite set.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>