
<?xml version="1.0" standalone="yes"?>
<Paper uid="H05-1092">
  <Title>Multi-way Relation Classification: Application to Protein-Protein Interactions</Title>
  <Section position="4" start_page="732" end_page="734" type="metho">
    <SectionTitle>
3 Data
</SectionTitle>
    <Paragraph position="0"> We use the information from a domain-specific database to gather labeled data for the task of classifying the interactions between proteins in text. The manually-curated HIV-1 Human Protein Interaction Database provides a summary of documented interactions between HIV-1 proteins and host cell proteins, other HIV-1 proteins, or proteins from disease organisms associated with HIV or AIDS. We use this database also because it contains information about the type of interactions, as opposed to other protein interaction databases (BIND, MINT, DIP, for example3) that list the protein pairs interacting, without 3DIP lists only the protein pairs, BIND has only some information about the method used to provide evidence for the interaction, and MIND does have interaction type information but the vast majority of the entries (99.9% of the 47,000 pairs)  interactions of the HIV-1 database, after removing the distinction in directionality and the triples with more than one interaction.</Paragraph>
    <Paragraph position="1"> specifying the type of interactions.</Paragraph>
    <Paragraph position="2"> In this database, the definitions of the interactions depend on the proteins involved and the articles describing the interactions; thus there are several definitions for each interaction type. For the interaction bind and the proteins ANT and Vpr, we find (among others) the definition &amp;quot;Interaction of HIV-1 Vpr with human adenine nucleotide translocator (ANT) is presumed based on a specific binding interaction between Vpr and rat ANT.&amp;quot; The database contains 65 types of interactions and 809 proteins for which there is interaction information, for a total of 2224 pairs of interacting proteins. For each documented protein-protein interaction the database includes information about:  a0 A pair of proteins (PP), a0 The interaction type(s) between them (I), and a0 PubMed identification numbers of the journal  article(s) describing the interaction(s) (A).</Paragraph>
    <Paragraph position="3"> A protein pair a1a2a1 can have multiple interactions (for example, AIP1 binds to HIV-1 p6 and also is incorporated into it) for an average of 1.9 interactions per a1a2a1 and a maximum of 23 interactions for the pair CDK9 and tat p14.</Paragraph>
    <Paragraph position="4"> We refer to the combination of a protein pair a1a2a1 and an article a3 as a &amp;quot;triple.&amp;quot; Our goal is to automatically associate to each triple an interaction have been assigned the same type of interaction (aggregation). These databases are all manually curated.</Paragraph>
    <Paragraph position="5">  type. For the example above, the triple &amp;quot;AIP1 HIV-1-p6 14519844&amp;quot; is assigned the interaction binds (14519844 being the PubMed number of the paper providing evidence for this interaction)4.</Paragraph>
    <Paragraph position="6"> Journal articles can contain evidence for multiple interactions: there are 984 journal articles in the database and on average each article is reported to contain evidence for 5.9 triples (with a maximum number of 90 triples).</Paragraph>
    <Paragraph position="7"> In some cases the database reports multiple different interactions for a given triple. There are 5369 unique triples in the database and of these 414 (7.7%) have multiple interactions. We exclude these triples from our analysis; however, we do include articles and a1a2a1 s with multiple interactions. In other words, we tackle cases such as the example above of the pair AIP1, HIV-1-p6 (that can both bind and incorporate) as long as the evidence for the different interactions is given by two different articles.</Paragraph>
    <Paragraph position="8"> Some of the interactions differ only in the directionality (e.g., regulates and regulated by, inhibits and inhibited by, etc.); we collapsed these pairs of related interactions into one5. Table 1 shows the list of the 25 interactions of the HIV-1 database for which there are more than 10 triples.</Paragraph>
    <Paragraph position="9"> For these interactions and for a random subset of the protein pairs a1a2a1 (around 45% of the total pairs in the database), we downloaded the corresponding full-text papers. From these, we extracted all and only those sentences that contain both proteins from the indicated protein pair. We assigned each of these sentences the corresponding interaction a4 from the database (&amp;quot;papers&amp;quot;).</Paragraph>
    <Paragraph position="10"> Nakov et al. (2004) argue that the sentences surrounding citations to related work, or citances, are a useful resource for bioNLP. Building on that work, we use citances as an additional form of evidence to determine protein-protein interaction types. For a given database entry containing PubMed article a3 , 4To be precise, there are for this a5a6a5 (as there are often) multiple articles (three in this case) describing the interaction binds, thus we have the following three triples to which we associate binds: &amp;quot;AIP1 HIV-1-p6 14519844,&amp;quot; &amp;quot;AIP1 HIV-1-p6 14505570&amp;quot; and &amp;quot;AIP1 HIV-1-p6 14505569.&amp;quot; 5We collapsed these pairs because the directionality of the interactions was not always reliable in the database. This implies that for some interactions, we are not able to infer the different roles of the two proteins; we considered only the pair &amp;quot;prot1 prot2&amp;quot; or &amp;quot;prot2 prot1,&amp;quot; not both. However, our algorithm can detect which proteins are involved in the interactions. protein pair a1a2a1 , and interaction type a4 , we downloaded a subset of the papers that cite a3 . From these citing papers, we extracted all and only those sentences that mention a3 explicitly; we further filtered these to include all and only the sentences that contain a1a7a1 . We labeled each of these sentences with interaction type a4 (&amp;quot;citances&amp;quot;).</Paragraph>
    <Paragraph position="11"> There are often many different names for the same protein. We use LocusLink6 protein identification numbers and synonym names for each protein, and extract the sentences that contain an exact match for (some synonym of) each protein. By being conservative with protein name matching, and by not doing co-reference analysis, we miss many candidate sentences; however this method is very precise.</Paragraph>
    <Paragraph position="12"> On average, for &amp;quot;papers,&amp;quot; we extracted 0.5 sentences per triple (maximum of 79) and 50.6 sentences per interaction (maximum of 119); for &amp;quot;citances&amp;quot; we extracted 0.4 sentences per triple (with a maximum of 105) and 49.2 sentences per interaction (162 maximum). We required a minimum number (40) of sentences for each interaction type for both &amp;quot;papers&amp;quot; and &amp;quot;citances&amp;quot;; the 10 interactions of Table 2 met this requirement. We used these sentences to train and test the models described below7.</Paragraph>
    <Paragraph position="13"> Since all the sentences extracted from one triple are assigned the same interaction, we ensured that sentences from the same triple did not appear in both the testing and the training sets. Roughly 75% of the data were used for training and the rest for testing.</Paragraph>
    <Paragraph position="14"> As mentioned above the goal is to automatically associate to each triple an interaction type. The task tackled here is actually slightly more difficult: given some sentences extracted from article a3 , assign to a3 an interaction type a4 and extract the proteins a1a2a1 involved. In other words, for the purpose of classification, we act as if we do not have information about the proteins that interact. However, given the way the sentence extraction was done, all the sentences extracted from a3 contain the a1a2a1 .</Paragraph>
    <Paragraph position="15">  extracted the sentence containing the a5a8a5 along with the previous and the following sentences, and the three consecutive sentences that contained the a5a8a5 (the proteins could appear in any of the sentences). However, the results obtained by using these larger chunks were consistently worse.</Paragraph>
    <Paragraph position="16">  tein interaction classification (and role extraction).</Paragraph>
    <Paragraph position="17"> A hand-assessment of the individual sentences shows that not every sentence that mentions the target proteins a1a2a1 actually describes the interaction a4 (see Section 5.4). Thus the evaluation on the test set is done at the document level (to determine if the algorithm can predict the interaction that a curator would assign to a document as a whole given the protein pair).</Paragraph>
    <Paragraph position="18"> Note that we assume here that the papers that provide the evidence for the interactions are given - an assumption not usually true in practice.</Paragraph>
  </Section>
  <Section position="5" start_page="734" end_page="735" type="metho">
    <SectionTitle>
4 Models
</SectionTitle>
    <Paragraph position="0"> For assigning interactions, we used two generative graphical models and a discriminative model. Figure 1 shows the generative dynamic model, based on previous work on role and relation extraction (Rosario and Hearst, 2004) where the task was to extract the entities TREATMENT and DISEASE and the relationships between them. The nodes labeled &amp;quot;Role&amp;quot; represent the entities (in this case the choices are PROTEIN and NULL); the children of the role nodes are the words (which act as features), thus there are as many role states as there are words in the sentence; this model consists of a Markov sequence of states where each state generates one or multiple observations. This model makes the additional assumption that there is an interaction present in the sentence (represented by the node &amp;quot;Inter.&amp;quot;) that generates the role sequence and the observations. (We assume here that there is a single interaction for each sentence.) The &amp;quot;Role&amp;quot; nodes can be observed or hidden. The results reported here were obtained using only the words as features (i.e., in the dynamic model of Figure 1 there is only one feature node per role) and with the &amp;quot;Role&amp;quot; nodes hidden (i.e., we had no information regarding which proteins were involved). Inference is performed with the junction tree algorithm8.</Paragraph>
    <Paragraph position="1"> We used a second type of graphical model, a simple Naive Bayes, in which the node representing the interaction generates the observable features (all the words in the sentence). We did not include role information in this model.</Paragraph>
    <Paragraph position="2"> We defined joint probability distributions over these models, estimated using maximum likelihood on the training set with a simple absolute discounting smoothing method. We performed 10-fold cross validation on the training set and we chose the smoothing parameters for which we obtained the best classification accuracies (averaged over the ten runs) on the training data; the results reported here were obtained using these parameters on the held-out test sets9.</Paragraph>
    <Paragraph position="3"> In addition to these two generative models, we also used a discriminative model, a neural network.</Paragraph>
    <Paragraph position="4"> We used the Matlab package to train a feed-forward network with conjugate gradient descent. The network has one hidden layer, with a hyperbolic tangent function, and an output layer representing the relationships. A logistic sigmoid function is used in the output layer. The network was trained for several choices of numbers of hidden units; we chose the best-performing networks based on training set error. We then tested these networks on held-out testing data. The features were words, the same as those used for the graphical models.</Paragraph>
    <Paragraph position="5">  protein-protein interactions of Table 2. DM: dynamic model, NB: Naive Bayes, NN: neural network. Baselines: Key: trigger word approach, KeyB: trigger word with backoff, Base: the accuracy of choosing the most frequent interaction.</Paragraph>
    <Paragraph position="6"> The task is the following: given a triple consisting of a a1a2a1 and an article, extract the sentences from the article that contain both proteins. Then, predict for the entire document one of the interactions of Table 2 given the sentences extracted for that triple. This is a 10-way classification problem, which is significantly more complex than much of the related work in which the task is to make the binary prediction (see Section 2).</Paragraph>
  </Section>
class="xml-element"></Paper>