<?xml version="1.0" standalone="yes"?> <Paper uid="H05-1092"> <Title>Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (HLT/EMNLP), pages 732-739, Vancouver, October 2005. c(c)2005 Association for Computational Linguistics Multi-way Relation Classification: Application to Protein-Protein Interactions</Title> <Section position="6" start_page="735" end_page="737" type="evalu"> <SectionTitle> 5 Results </SectionTitle> <Paragraph position="0"> The evaluation was done on a document-bydocument basis. During testing, we choose the interaction using the following aggregate measures that use the constraint that all sentences coming from the same triple are assigned the same interaction.</Paragraph> <Paragraph position="1"> a0 Mj: For each triple, for each sentence of the triple, find the interaction that maximizes the posterior probability of the interaction given the features; then assign to all sentences of this triple the most frequent interaction among those predicted for the individual sentences.</Paragraph> <Paragraph position="2"> a0 Cf: Retain all the conditional probabilities (do not choose an interaction per sentence), then, for each triple, choose the interaction that maximizes the sum over all the triple's sentences.</Paragraph> <Paragraph position="3"> Table 3 reports the results in terms of classification accuracies averaged across all interactions, for the cases &quot;all&quot; (sentences from &quot;papers&quot; and &quot;citances&quot; together), only &quot;papers&quot; and only &quot;citances&quot;. The accuracies are quite high; the dynamic model achieves around 60% for &quot;all,&quot; 58% for &quot;papers&quot; and 54% for &quot;citances.&quot; The neural net achieves the best results for &quot;all&quot; with around 64% accuracy. From these results we can make the following observations: all models greatly outperform the baselines; the performances of the dynamic model DM, the Naive Bayes NB and the NN are very similar; for &quot;papers&quot; the best results were obtained with the graphical models; for &quot;all&quot; and &quot;citances&quot; the neural net did best. The use of &quot;citances&quot; allowed the gathering of additional data (and therefore a larger training set) that lead to higher accuracies (see &quot;papers&quot; versus &quot;all&quot;).</Paragraph> <Paragraph position="4"> In the confusion matrix in Table 5 we can see the accuracies for the individual interactions for the dynamic model DM, using &quot;all&quot; and &quot;Mj.&quot; For three interactions this model achieves perfect accuracy.</Paragraph> <Section position="1" start_page="735" end_page="735" type="sub_section"> <SectionTitle> 5.1 Hiding the protein names </SectionTitle> <Paragraph position="0"> In order to ensure that the algorithm was not over-fitting on the protein names, we ran an experiment in which we replaced the protein names in all sentences with the token &quot;PROT NAME.&quot; For example, the sentence: &quot;Selective CXCR4 antagonism by Tat&quot; became: &quot;Selective PROT NAME2 antagonism by PROT NAME1.&quot; Table 5.1 shows the results of running the models on this data. For &quot;papers&quot; and &quot;citances&quot; there is always a decrease in the classification accuracy when we remove the protein names, showing that the protein names do help the classification. 
<Paragraph position="2"> The differences in accuracy between the two settings (with and without protein names) are much smaller when using &quot;citances&quot; than when using &quot;papers,&quot; at least for the graphical models. This suggests that citation sentences may be more robust for some language processing tasks and that the models that use &quot;citances&quot; learn the linguistic context of the interactions better. Note how in this case the graphical models always outperform the neural network.</Paragraph>
<Paragraph position="3"> (Table 4: classification accuracies with the protein names removed. Columns marked Diff show the difference in accuracy, in percentages, with respect to the original case of Table 3, averaged over all evaluation methods.)</Paragraph> </Section>
<Section position="2" start_page="735" end_page="736" type="sub_section"> <SectionTitle> 5.2 Using a &quot;trigger word&quot; approach </SectionTitle>
<Paragraph position="0"> As mentioned above, much of the related work in this field makes use of &quot;trigger words&quot; or &quot;interaction words&quot; (see Section 2). In order to (roughly) compare our work with it and to build a more realistic baseline, we created a list of 70 keywords that are representative of the 10 interactions. For example, for the interaction degrade some of the keywords are &quot;degradation&quot; and &quot;degrade,&quot; while for inhibit we have &quot;inhibited,&quot; &quot;inhibitor,&quot; &quot;inhibitory&quot; and others. We then checked whether a sentence contained such keywords; if it did, we assigned to the sentence the corresponding interaction. If the sentence contained keywords for two interactions, one of them the generic interact with and the other a more specific one, we assigned the more specific interaction; if the two predicted interactions were both specific ones (neither being interact with), we did not assign an interaction, since we would not know how to choose between them. Similarly, we assigned no interaction if there were more than two predicted interactions or no keywords present in the sentence. The results are shown in the rows labeled &quot;Key&quot; and &quot;KeyB&quot; in Table 3. Case &quot;KeyB&quot; is the &quot;Key&quot; method with back-off: when no interaction was predicted, we assigned to the sentence the most frequent interaction in the training data. As before, we calculated the accuracy when we force all the sentences from one triple to be assigned the most frequent interaction among those predicted for the individual sentences. A minimal sketch of this keyword baseline is given at the end of this subsection.</Paragraph>
<Paragraph position="1"> KeyB is more accurate than Key, and although the KeyB accuracies are higher than those of the other baselines, they are significantly lower than those obtained with the trained models. The low accuracies of the trigger-word-based methods show that the relation classification task is nontrivial, in the sense that not all the sentences contain the most obvious word for the interactions, and suggest that the trigger-word approach alone is insufficient.</Paragraph>
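<Paragraph position="2"> The following sketch (not from the original paper) spells out the Key/KeyB decision rules described above. The keyword lists are abbreviated placeholders for the 70 keywords mentioned in the text, and the back-off interaction passed in is assumed to be the most frequent one in the training data.

# Hypothetical, abbreviated keyword lists; the paper used 70 keywords for 10 interactions.
KEYWORDS = {
    "degrade": ["degradation", "degrade"],
    "inhibit": ["inhibited", "inhibitor", "inhibitory"],
    "interact with": ["interacts", "interaction"],
}

def key_baseline(sentence, backoff=None):
    """Assign an interaction from trigger words; return None when undecidable (Key).

    With backoff set to the most frequent training interaction, this becomes KeyB.
    """
    text = sentence.lower()
    predicted = {rel for rel, words in KEYWORDS.items()
                 if any(w in text for w in words)}
    if len(predicted) == 1:
        return predicted.pop()
    if len(predicted) == 2 and "interact with" in predicted:
        # generic "interact with" plus one specific interaction: keep the specific one
        predicted.discard("interact with")
        return predicted.pop()
    # no keyword, two competing specific interactions, or more than two predictions
    return backoff

print(key_baseline("The Tat protein inhibited CXCR4", backoff="interact with"))
# inhibit</Paragraph>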
</Section>
<Section position="3" start_page="736" end_page="737" type="sub_section"> <SectionTitle> 5.3 Protein extraction </SectionTitle>
<Paragraph position="0"> The dynamic model of Figure 1 has the appealing property of simultaneously performing interaction recognition and protein name tagging (also known as role extraction), the latter task consisting of identifying all the proteins present in the sentence, given a sequence of words. We assessed a slightly different task: the identification of all (and only) the proteins present in the sentence that are involved in the interaction. The F-measure achieved by this model for this task is 0.79 for &quot;all,&quot; 0.67 for &quot;papers&quot; and 0.79 for &quot;citances&quot; (the F-measure is a weighted combination of precision P and recall R; here they are given equal weight, that is, F = 2PR/(P + R)). Again, the model parameters were chosen with cross validation on the training set, and &quot;citances&quot; had superior performance. Note that we did not use a dictionary: the system learned to recognize the protein names using only the training data.</Paragraph>
<Paragraph position="1"> Moreover, our role evaluation is quite strict: every token is assessed and we do not assign partial credit for constituents for which only some of the words are correctly labeled. We also did not use the information that all the sentences extracted from one triple contain the same proteins.</Paragraph>
<Paragraph position="2"> Given these strong results (both F-measure and classification accuracies), we believe that the dynamic model of Figure 1 is a good model for performing both name tagging and interaction classification simultaneously, or either of these tasks alone.</Paragraph> </Section>
<Section position="4" start_page="737" end_page="737" type="sub_section"> <SectionTitle> 5.4 Sentence-level evaluation </SectionTitle>
<Paragraph position="0"> In addition to assigning interactions to protein pairs, we are interested in sentence-level semantics, that is, in determining the interactions that are actually expressed in the sentence. To test whether the information assigned to the entire document by the HIV-1 database record can be used to infer information at the sentence level, an annotator with biological expertise hand-annotated the sentences from the experiments. The annotator was instructed to assign to each sentence one of the interactions of Table 2, &quot;not interacting,&quot; or &quot;other&quot; (if the interaction between the two proteins was not one of Table 2).</Paragraph>
<Paragraph position="1"> Of the 2114 sentences that were hand-labeled, 68.3% disagreed with the HIV-1 database label, 28.4% agreed with the database label, and 3.3% were found to contain multiple interactions between the proteins. Among the 68.3% of sentences for which the labels did not agree, 17.4% had the vague interact with relation, 7.4% did not contain any interaction and 43.5% had an interaction different from that specified by the triple (for 28% of the triples, none of the sentences extracted from the target paper were found by the annotator to contain the interaction given by the database; we read four of these papers and found sentences containing that interaction, but our system had failed to extract them). In Table 6 we report the mismatch between the two sets of labels. The total accuracy of 38.9% provides a useful baseline for using a database for labeling at the sentence level. It may be the case that certain interactions tend to be biologically related and thus tend to co-occur (upregulate and stimulate, or inactivate and inhibit, for example).</Paragraph>
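<Paragraph position="2"> As a rough illustration (not part of the original paper), the database-as-labeler baseline discussed above amounts to projecting each triple's database interaction onto all of its sentences and scoring it against the hand labels. The field names below are hypothetical, and the paper's 38.9% figure may have been computed with different conventions; this is only a minimal sketch of the idea.

from collections import Counter

def database_baseline_accuracy(sentences):
    """Score the document-level database label as a sentence-level prediction.

    sentences: iterable of dicts with hypothetical keys "db_label"
    (interaction from the HIV-1 database triple) and "hand_label"
    (the annotator's sentence-level interaction).
    """
    outcomes = Counter(
        "agree" if s["db_label"] == s["hand_label"] else "disagree"
        for s in sentences
    )
    total = sum(outcomes.values())
    return outcomes["agree"] / total if total else 0.0

# Toy rows; the paper reports a 38.9% baseline accuracy from its Table 6 comparison.
rows = [
    {"db_label": "stimulate", "hand_label": "stimulate"},
    {"db_label": "stimulate", "hand_label": "inhibit"},
    {"db_label": "require",   "hand_label": "interact with"},
]
print(database_baseline_accuracy(rows))</Paragraph>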
<Paragraph position="3"> We investigated a few of the cases in which the labels were &quot;suspiciously&quot; different, for example a case in which the database interaction was stimulate but the annotator found the same proteins to be related by inhibit as well. It turned out that the authors of the article to which stimulate had been assigned found little evidence for this interaction (and evidence in favor of inhibit instead), suggesting an error in the database. In another case the database interaction was require, but the authors of the article, while supporting this, found that under certain conditions (when a protein is too abundant) the interaction changes to inhibit. Thus we were able to find controversial facts about protein interactions just by looking at the confusion matrix of Table 6.</Paragraph>
<Paragraph position="4"> We trained the models using these hand-labeled sentences in order to determine the interaction expressed in each sentence (as opposed to in each document). This is a difficult task; for some sentences it took the annotator several minutes to understand them and decide which interaction applied.</Paragraph>
<Paragraph position="5"> Table 7 shows the results of running the classification models on the six interactions for which there were more than 40 examples in the training sets. Again, the sentences from &quot;papers&quot; are especially difficult to classify; the best result for &quot;papers&quot; is 36.7% accuracy versus 63.2% for &quot;citances.&quot; In this case the difference in performance between &quot;papers&quot; and &quot;citances&quot; is larger than for the previous task of document-level relation classification.</Paragraph> </Section> </Section> </Paper>