<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-3307">
  <Title>Integrating Co-occurrence Statistics with Information Extraction for Robust Retrieval of Protein Interactions from Medline</Title>
  <Section position="4" start_page="50" end_page="50" type="metho">
    <SectionTitle>
3 Co-occurrence statistics
</SectionTitle>
    <Paragraph position="0"> Given two entities with multiple mentions in a large corpus, another approach to detecting whether a relationship holds between them is to use statistics over their occurrences in textual patterns that are indicative of that relation. Various measures such as pointwise mutual information (PMI), chi-square ($\chi^2$), or log-likelihood ratio (LLR) (Manning and Schütze, 1999) use the two entities' occurrence statistics to detect whether their co-occurrence is due to chance or to an underlying relationship.</Paragraph>
    <Paragraph position="1"> A recent example is the co-citation approach from (Ramani et al., 2005), which does not try to find specific assertions of interactions in text, but rather exploits the idea that if many different abstracts reference both protein $p_1$ and protein $p_2$, then $p_1$ and $p_2$ are likely to interact.</Paragraph>
    <Paragraph position="3"> In particular, if the two proteins are co-cited significantly more often than one would expect if they were cited independently at random, then it is likely that they interact. The model used to compute the probability of random co-citation is based on the hypergeometric distribution (Lee et al., 2004; Jenssen et al., 2001). Thus, if $N$ is the total number of abstracts, $n$ of which cite the first protein, $m$ cite the second protein, and $k$ cite both, then the probability of co-citation under a random model is: $$P(k \mid N, m, n) = \frac{\binom{n}{k}\binom{N-n}{m-k}}{\binom{N}{m}} \qquad (1)$$</Paragraph>
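    <Paragraph> As a minimal sketch of the random co-citation model, the following Python functions evaluate the hypergeometric probability from the counts defined above ($N$ abstracts in total, $n$ citing the first protein, $m$ citing the second, $k$ citing both). The function names are illustrative, and the tail sum over "$k$ or more" co-citations is an added assumption about how such a probability might be turned into a significance score; it is not taken from the text.

```python
from math import comb


def cocitation_probability(k: int, N: int, m: int, n: int) -> float:
    """Hypergeometric probability that exactly k abstracts cite both proteins,
    given N abstracts in total, n citing the first protein and m the second."""
    return comb(n, k) * comb(N - n, m - k) / comb(N, m)


def cocitation_tail(k: int, N: int, m: int, n: int) -> float:
    """Probability of k or more co-citations under the random model
    (an assumed significance score, not part of the original description)."""
    upper = min(m, n)  # no more co-citations than either protein has citations
    return sum(cocitation_probability(i, N, m, n) for i in range(k, upper + 1))


if __name__ == "__main__":
    # 10 abstracts, 4 cite the first protein, 3 cite the second, 1 cites both.
    print(cocitation_probability(1, 10, 3, 4))  # 0.5
```

A small tail probability for the observed $k$ would indicate that the two proteins are co-cited more often than the random model predicts.</Paragraph>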
    <Paragraph position="5"> The approach that we take in this paper is to constrain the two proteins to be mentioned in the same sentence, based on the assumption that if there is a reason for two protein names to co-occur in the same sentence, then in most cases that is caused by their interaction. To compute the "degree of interaction" between two proteins $p_1$ and $p_2$, we use the information-theoretic measure of pointwise mutual information (Church and Hanks, 1990; Manning and Schütze, 1999), which is computed based on the following quantities:</Paragraph>
    <Paragraph position="7"> 1. $N$: the total number of protein pairs co-occurring in the same sentence in the corpus.</Paragraph>
    <Paragraph position="9"> 2. $P(p_1, p_2) \approx n_{12}/N$: the probability that $p_1$ and $p_2$ co-occur in the same sentence; $n_{12}$ = the number of sentences mentioning both $p_1$ and $p_2$.</Paragraph>
  </Section>
  <Section position="5" start_page="50" end_page="50" type="metho">
    <SectionTitle>
</SectionTitle>
    <Paragraph position="3"> The PMI is then defined as in Equation 2 below: $$PMI(p_1, p_2) = \log \frac{P(p_1, p_2)}{P(p_1) \cdot P(p_2)} \approx \log \left( N \cdot \frac{n_{12}}{n_1 \cdot n_2} \right) \qquad (2)$$ where $n_1$ and $n_2$ denote the number of sentences in which $p_1$ and $p_2$, respectively, co-occur with another protein. Given that the PMI will be used only for ranking pairs of potentially interacting proteins, the constant factor $N$ and the $\log$ operator can be ignored. For the sake of simplicity, we use the simpler formula $$sPMI(p_1, p_2) = \frac{n_{12}}{n_1 \cdot n_2}$$ and assume that a sentence-level relation extractor is available, with the capability of computing normalized confidence values for all extractions. Then one way of using the extraction confidence is to have each co-occurrence weighted by its confidence, i.e. replace the constant 1 in $n_{12} = \sum_{i=1}^{n_{12}} 1$ with the normalized scores $P(s_i)$ of the sentences $s_i$ in which the two proteins co-occur, as illustrated in Equation 5: $$wPMI(p_1, p_2) = \frac{1}{n_1 \cdot n_2} \sum_{i=1}^{n_{12}} P(s_i) = sPMI(p_1, p_2) \cdot \Gamma_{avg} \qquad (5)$$</Paragraph>
    <Paragraph position="5"> This results in a new formula $wPMI$ (weighted PMI), which is equal to the product between $sPMI$ and the average aggregation operator $\Gamma_{avg}$. The operator $\Gamma_{avg}$ can be replaced with any other aggregation operator from Table 1. As argued in Section 2, we consider $\max$ to be the most appropriate operator for our task; therefore, the integrated model is based on the weighted PMI product illustrated below: $$wPMI(p_1, p_2) = sPMI(p_1, p_2) \cdot \Gamma_{max} = \frac{n_{12}}{n_1 \cdot n_2} \cdot \max_{i} P(s_i)$$</Paragraph>
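    <Paragraph> The ranking scores discussed in this section can be sketched in Python as follows. The sketch assumes that `n1` and `n2` are the individual sentence-level co-occurrence counts of the two proteins, that `n12` counts the sentences mentioning both, and that the extractor confidences are already normalized to the interval [0, 1]. All function names are illustrative rather than taken from the paper.

```python
from math import log
from typing import Callable, Sequence


def pmi(n12: int, n1: int, n2: int, N: int) -> float:
    """PMI estimated from counts: log(N * n12 / (n1 * n2))."""
    return log(N * n12 / (n1 * n2))


def spmi(n12: int, n1: int, n2: int) -> float:
    """Simplified PMI for ranking: drops the constant N and the log."""
    return n12 / (n1 * n2)


def average(scores: Sequence[float]) -> float:
    """Average aggregation operator over per-sentence confidences."""
    return sum(scores) / len(scores)


def wpmi(confidences: Sequence[float], n1: int, n2: int,
         aggregate: Callable[[Sequence[float]], float] = max) -> float:
    """Weighted PMI: sPMI multiplied by an aggregate of the confidences.

    `confidences` holds one extractor score per sentence in which the two
    proteins co-occur, so len(confidences) plays the role of n12.
    """
    if not confidences:
        return 0.0  # the pair never co-occurs, so there is nothing to rank
    return spmi(len(confidences), n1, n2) * aggregate(confidences)


if __name__ == "__main__":
    scores = [0.2, 0.8]  # hypothetical confidences for two shared sentences
    print(wpmi(scores, n1=4, n2=5))                     # max aggregation
    print(wpmi(scores, n1=4, n2=5, aggregate=average))  # average aggregation
```

With the average operator the score reduces to the sum of confidences divided by `n1 * n2`, which matches replacing each constant 1 in the co-occurrence count with a confidence value.</Paragraph>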
  </Section>
  <Section position="6" start_page="50" end_page="52" type="metho">
    <SectionTitle>
5 Evaluation Corpus
</SectionTitle>
    <Paragraph position="0"> Contrasting the performance of the integrated model against the sentence-level extractor or the PMI-based ranking requires an evaluation dataset that provides two types of annotations:  1. The complete list of interactions reported in the corpus (Section 5.1).</Paragraph>
    <Paragraph position="1"> 2. Annotation of mentions of genes and proteins,  together with their corresponding gene identifiers (Section 5.2).</Paragraph>
    <Paragraph position="2"> We do not differentiate between genes and their protein products, mapping them to the same gene identifiers. Also, even though proteins may participate in different types of interactions, we are concerned only with detecting whether they interact in the general sense of the word.</Paragraph>
    <Section position="1" start_page="50" end_page="52" type="sub_section">
      <SectionTitle>
5.1 Medline Abstracts and Interactions
</SectionTitle>
      <Paragraph position="0"> In order to compile an evaluation corpus and an associated comprehensive list of interactions, we exploited information contained in the HPRD (Peri et al., 2004) database. Every interaction listed in HPRD is linked to a set of Medline articles where the corresponding experiment is reported. More exactly, each interaction is specified in the database as a tuple that contains the LocusLink (now EntrezGene) identifiers of all genes involved and the PubMed identifiers of the corresponding articles (as illustrated in Table 2). The abstract sentence shown in Table 2 reads: "We found that this protein binds to three other Z-disc proteins; therefore, we have named it FATZ, gamma-filamin, alpha-actinin and telethonin binding protein of the Z-disc."</Paragraph>
      <Paragraph position="2"> The evaluation corpus (henceforth referred to as the HPRD corpus) is created by collecting the Medline abstracts corresponding to interactions between human proteins, as specified in HPRD. In total, 5,617 abstracts are included in this corpus, with an associated list of 7,785 interactions. This list is comprehensive - the HPRD database is based on an annotation process in which the human annotators report all interactions described in a Medline article.</Paragraph>
      <Paragraph position="3"> On the other hand, the fact that only abstracts are included in the corpus (as opposed to including the full article) means that the list may contain interactions that are not actually reported in the HPRD corpus. Nevertheless, if the abstracts were annotated with gene mentions and corresponding gene identifiers (GIDs), then a "quasi-exact" interaction list could be computed based on the following heuristic: [H] If two genes with identifiers $gid_1$ and $gid_2$ are mentioned in the same sentence of an abstract with PubMed identifier $pmid$, and they are participants in an interaction that is linked to $pmid$ in HPRD, then consider that the abstract (and consequently the entire HPRD corpus) reports the interaction between $gid_1$ and $gid_2$.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="52" end_page="52" type="metho">
    <SectionTitle>
</SectionTitle>
    <Paragraph position="0"> An application of the above heuristic is shown at the bottom of Table 2. The HPRD record at the top of the table specifies that the Medline article with ID 10984498 reports an interaction between the proteins FATZ (with ID 58529) and gamma-filamin (with ID 2318). The two protein names are mentioned in a sentence in the abstract for 10984498; therefore, by [H], we consider that the HPRD corpus reports this interaction.</Paragraph>
    <Paragraph position="1"> This is very similar to the procedure used in (Craven, 1999) for creating a "weakly-labeled" dataset of subcellular-localization relations. [H] is a strong heuristic - it is already known that the full article reports an interaction between the two genes. Finding the two genes collocated in the same sentence in the abstract is very likely to be due to the fact that the abstract discusses their interaction. The heuristic can be made even more accurate if a pair of genes is considered as interacting only if they co-occur in a (predefined) minimum number of sentences in the entire corpus - with the evaluation modified accordingly, as described later in Section 6.</Paragraph>
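    <Paragraph> A minimal Python sketch of heuristic [H], including the optional minimum number of co-occurrence sentences, is given below. The data layout is an assumption made for illustration: `hprd` maps a PubMed identifier to the gene-identifier pairs that HPRD links to that article, and `sentences` lists, for every sentence in the corpus, the PubMed identifier of its abstract and the gene identifiers mentioned in it.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, Set, Tuple

Pair = FrozenSet[str]


def quasi_exact_interactions(
    hprd: Dict[str, Set[Pair]],
    sentences: Iterable[Tuple[str, Set[str]]],
    min_sentences: int = 1,
) -> Set[Pair]:
    """Return the gene pairs that the corpus is considered to report."""
    cooccurrences: Dict[Pair, int] = {}  # sentence counts over the whole corpus
    confirmed: Set[Pair] = set()         # pairs satisfying [H] at least once
    for pmid, gene_ids in sentences:
        linked = hprd.get(pmid, set())
        for g1, g2 in combinations(sorted(gene_ids), 2):
            pair = frozenset((g1, g2))
            cooccurrences[pair] = cooccurrences.get(pair, 0) + 1
            if pair in linked:
                confirmed.add(pair)
    return {p for p in confirmed if cooccurrences[p] >= min_sentences}


if __name__ == "__main__":
    # The FATZ / gamma-filamin record discussed in the text.
    record = {"10984498": {frozenset(("58529", "2318"))}}
    corpus = [("10984498", {"58529", "2318"})]
    print(quasi_exact_interactions(record, corpus))
```

Raising `min_sentences` trades recall of the interaction list for higher confidence that each retained pair is actually discussed in the abstracts.</Paragraph>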
    <Section position="1" start_page="52" end_page="52" type="sub_section">
      <SectionTitle>
5.2 Gene Name Annotation and Normalization
</SectionTitle>
      <Paragraph position="0"> For the annotation of gene names and their normalization, we use a dictionary-based approach similar to (Cohen, 2005). NCBI (http://www.ncbi.nih.gov) provides a comprehensive dictionary of human genes, where each gene is specified by its unique identifier, and qualified with an official name, a description, synonym names and one or more protein names, as illustrated in Table 2.</Paragraph>
      <Paragraph position="1"> All of these names, including the description, are considered as potential referential expressions for the gene entity. Each name string is reduced to a normal form by: replacing dashes with spaces, introducing spaces between sequences of letters and sequences of digits, replacing Greek letters with their Latin counterparts (capitalized), substituting Roman numerals with Arabic numerals, and decapitalizing the first word if capitalized. All names are further tokenized, and checked against a dictionary of close to 100K English nouns. Names that are found in this dictionary are simply filtered out. We also ignore all ambiguous names (i.e. names corresponding to more than one gene identifier). The remaining non-ambiguous names are added to the final gene dictionary, which is implemented as a trie-like structure in order to allow a fast lookup of gene IDs based on the associated normalized sequences of tokens.</Paragraph>
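      <Paragraph> The normalization steps listed above can be sketched in Python as follows. The lookup tables are small illustrative samples, and several details are assumptions rather than the authors' implementation: Greek letters are recognized only as whole tokens (spelled out or as symbols), Roman numerals only as upper-case tokens, and a word counts as "capitalized" when only its first letter is upper case.

```python
import re
from typing import List

# Illustrative, incomplete lookup tables (assumptions, not the authors' lists).
GREEK = {"alpha": "A", "beta": "B", "gamma": "G", "delta": "D",
         "α": "A", "β": "B", "γ": "G", "δ": "D"}
ROMAN = {"I": "1", "II": "2", "III": "3", "IV": "4", "V": "5"}


def normalize_gene_name(name: str) -> List[str]:
    """Reduce a gene name to a normalized sequence of tokens."""
    text = name.replace("-", " ")                      # dashes become spaces
    text = re.sub(r"([A-Za-z])(\d)", r"\1 \2", text)   # split letter|digit
    text = re.sub(r"(\d)([A-Za-z])", r"\1 \2", text)   # split digit|letter
    tokens = text.split()
    tokens = [GREEK.get(t.lower(), t) for t in tokens]  # Greek to Latin capital
    tokens = [ROMAN.get(t, t) for t in tokens]          # Roman to Arabic
    if tokens:
        first = tokens[0]
        # Decapitalize only words such as "Interleukin", not acronyms like "FATZ".
        if first[0].isupper() and first[1:].islower():
            tokens[0] = first[0].lower() + first[1:]
    return tokens


if __name__ == "__main__":
    print(normalize_gene_name("Gamma-filamin"))           # ['G', 'filamin']
    print(normalize_gene_name("Interleukin-2 receptor"))  # ['interleukin', '2', 'receptor']
```

Filtering against the English noun dictionary and removing ambiguous names would then operate on these token sequences before they are stored in the gene dictionary.</Paragraph>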
      <Paragraph position="2"> Each abstract from the HPRD corpus is tokenized and segmented into sentences using the OpenNLP package. The resulting sentences are then annotated by traversing them from left to right and finding the longest token sequences whose normal forms match entries from the gene dictionary.</Paragraph>
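      <Paragraph> The left-to-right, longest-match annotation over a trie-like dictionary can be sketched as below. This is an illustration under stated assumptions, not the authors' code: the trie is a nested Python dictionary, `END` is a hypothetical marker key that stores the gene ID of a complete entry, and the input tokens are assumed to be already in normal form.

```python
from typing import Dict, List, Sequence, Tuple

END = "__gene_id__"  # hypothetical marker key for a complete dictionary entry


def build_trie(entries: Dict[Tuple[str, ...], str]) -> dict:
    """Build a nested-dict trie from normalized token sequences to gene IDs."""
    root: dict = {}
    for tokens, gene_id in entries.items():
        node = root
        for token in tokens:
            node = node.setdefault(token, {})
        node[END] = gene_id
    return root


def annotate(tokens: Sequence[str], trie: dict) -> List[Tuple[int, int, str]]:
    """Scan left to right and return (start, end, gene_id) for the longest
    dictionary match starting at each position; `end` is exclusive."""
    mentions = []
    i = 0
    while i != len(tokens):
        node, best, j = trie, None, i
        while j != len(tokens) and tokens[j] in node:
            node = node[tokens[j]]
            j += 1
            if END in node:
                best = (j, node[END])  # remember the longest match so far
        if best is None:
            i += 1                     # no gene name starts here
        else:
            mentions.append((i, best[0], best[1]))
            i = best[0]                # continue after the matched name
    return mentions


if __name__ == "__main__":
    trie = build_trie({("FATZ",): "58529", ("G", "filamin"): "2318"})
    print(annotate(["FATZ", "binds", "G", "filamin"], trie))
```

Because the scan resumes after each match, overlapping shorter names inside a longer matched name are not reported separately.</Paragraph>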
    </Section>
  </Section>
  <Section position="8" start_page="52" end_page="52" type="metho">
    <SectionTitle>
6 Experimental Evaluation
</SectionTitle>
    <Paragraph position="0"> The main purpose of the experiments in this section is to compare the performance of the following four methods on the task of corpus-level relation extraction: 1. Sentence-level relation extraction followed by the application of an aggregation operator that assembles corpus-level results (SSK.Max).</Paragraph>
    <Paragraph position="1"> 2. Pointwise Mutual Information (PMI).</Paragraph>
    <Paragraph position="2"> 3. The integrated model, a product of the two base models (PMI.SSK.Max).</Paragraph>
    <Paragraph position="3"> 4. The hypergeometric co-citation method (HG).</Paragraph>
  </Section>
  <Section position="9" start_page="52" end_page="53" type="metho">
    <SectionTitle>
7 Experimental Methodology
</SectionTitle>
    <Paragraph position="0"> All abstracts, either from the HPRD corpus, or from the entire Medline, are annotated using the dictionary-based approach described in Section 5.2.</Paragraph>
    <Paragraph position="1"> The sentence-level extraction is done with the subsequence kernel (SSK) approach from (Bunescu and Mooney, 2005), which was shown to give good results on extracting interactions from biomedical abstracts. The subsequence kernel was trained on a set of 225 Medline abstracts which were manually annotated with protein names and their interactions. It is known that PMI gives undue importance to low-frequency events (Dunning, 1993); therefore, the evaluation considers only pairs of genes that occur at least 5 times in the whole corpus.</Paragraph>
    <Paragraph position="2"> When evaluating corpus-level extraction on HPRD, because the &amp;quot;quasi-exact&amp;quot; list of interactions is known, we report the precision-recall (PR) graphs, where the precision (P) and recall (R) are computed as follows:</Paragraph>
  </Section>
  <Section position="10" start_page="53" end_page="53" type="metho">
    <SectionTitle>
</SectionTitle>
    <Paragraph position="0"> $$P = \frac{\#\,\text{true interactions extracted}}{\#\,\text{total interactions extracted}}$$</Paragraph>
  </Section>
  <Section position="11" start_page="53" end_page="53" type="metho">
    <SectionTitle>
</SectionTitle>
    <Paragraph position="0"> $$R = \frac{\#\,\text{true interactions extracted}}{\#\,\text{true interactions}}$$ All pairs of proteins are ranked based on each scoring method, and precision-recall points are computed by considering the top $N$ pairs, where $N$ varies from 1 to the total number of pairs.</Paragraph>
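    <Paragraph> The computation of precision-recall points over a ranked list can be sketched as follows. The function name and the representation of a protein pair as any hashable value are assumptions made for illustration; the ranking itself is taken as given, sorted from the highest to the lowest score.

```python
from typing import Hashable, List, Sequence, Set, Tuple


def precision_recall_points(
    ranked_pairs: Sequence[Hashable],
    true_pairs: Set[Hashable],
) -> List[Tuple[float, float]]:
    """Return (precision, recall) for the top-N pairs, for N = 1 .. len(ranked_pairs)."""
    if not true_pairs:
        raise ValueError("the list of true interactions must not be empty")
    points = []
    hits = 0  # true interactions among the top-N pairs seen so far
    for n, pair in enumerate(ranked_pairs, start=1):
        if pair in true_pairs:
            hits += 1
        points.append((hits / n, hits / len(true_pairs)))
    return points


if __name__ == "__main__":
    ranking = ["a", "b", "c"]      # hypothetical pairs, best score first
    gold = {"a", "c", "d"}         # hypothetical list of true interactions
    for precision, recall in precision_recall_points(ranking, gold):
        print(round(precision, 3), round(recall, 3))
```

Plotting the returned points for each scoring method gives the precision-recall graphs used to compare the methods.</Paragraph>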
    <Paragraph position="1"> When evaluating on the entire Medline, we used the shared protein function benchmark described in (Ramani et al., 2005). Given the set of interacting pairs recovered at each recall level, this benchmark calculates the extent to which interaction partners in a data set share functional annotation, a measure previously shown to correlate with the accuracy of functional genomics data sets (Lee et al., 2004). The KEGG (Kanehisa et al., 2004) and Gene Ontology (Ashburner et al., 2000) databases provide specific pathway and biological process annotations for approximately 7,500 human genes, assigning human genes into 155 KEGG pathways (at the lowest level of KEGG) and 1,356 GO pathways (at level 8 of the GO biological process annotation).</Paragraph>
    <Paragraph position="2"> The scoring scheme for measuring interaction set accuracy is in the form of a log odds ratio of gene pairs sharing functional annotations. To evaluate a data set, a log likelihood ratio (LLR) is calculated as follows: $$LLR = \log \frac{P(D \mid I)}{P(D \mid \neg I)} = \log \frac{P(I \mid D) \cdot P(\neg I)}{P(\neg I \mid D) \cdot P(I)}$$ where $P(D \mid I)$ and $P(D \mid \neg I)$ are the probabilities of observing the data $D$ conditioned on the genes sharing benchmark associations ($I$) and not sharing benchmark associations ($\neg I$). In its expanded form (obtained by Bayes' theorem), $P(I \mid D)$ and $P(\neg I \mid D)$ are estimated using the frequencies of interactions observed in the given data set $D$ between annotated genes sharing benchmark associations and not sharing associations, respectively, while the priors $P(I)$ and $P(\neg I)$ are estimated based on the total frequencies of all benchmark genes sharing the same associations and not sharing associations, respectively. A score of zero indicates that interaction partners in the data set being tested are no more likely than random to belong to the same pathway or to interact; higher scores indicate a more accurate data set.</Paragraph>
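    <Paragraph> As a rough sketch of this benchmark score, the LLR can be computed directly from four pair counts. The parameter names are hypothetical, the natural logarithm is an assumption (the base only rescales the score), and all counts are assumed to be positive.

```python
from math import log


def llr(shared_in_data: int, unshared_in_data: int,
        shared_in_benchmark: int, unshared_in_benchmark: int) -> float:
    """Log likelihood ratio of a data set against the annotation benchmark.

    shared_in_data / unshared_in_data: annotated gene pairs in the data set
    that do / do not share a benchmark association.
    shared_in_benchmark / unshared_in_benchmark: the same counts over all
    pairs of benchmark genes, used to estimate the priors.
    """
    posterior_odds = shared_in_data / unshared_in_data        # P(I|D) / P(not I|D)
    prior_odds = shared_in_benchmark / unshared_in_benchmark  # P(I) / P(not I)
    return log(posterior_odds / prior_odds)


if __name__ == "__main__":
    # A data set whose pairs share annotations exactly as often as random pairs.
    print(llr(10, 10, 100, 100))  # 0.0
```

The ratio of posterior odds to prior odds equals the likelihood ratio by Bayes' theorem, so a data set enriched in pairs with shared annotations receives a positive score.</Paragraph>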
  </Section>
</Paper>