<?xml version="1.0" standalone="yes"?> <Paper uid="E06-1051"> <Title>Exploiting Shallow Linguistic Information for Relation Extraction from Biomedical Literature</Title> <Section position="4" start_page="402" end_page="403" type="metho"> <SectionTitle> 3 Kernel Methods for Relation Extraction </SectionTitle> <Paragraph position="0"> The basic idea behind kernel methods is to embed the input data into a suitable feature space F via a mapping function phi : X -> F, and then use a linear algorithm for discovering nonlinear patterns. Instead of using the explicit mapping phi, we can use a kernel function K : X x X -> R that corresponds to the inner product in a feature space which is, in general, different from the input space.</Paragraph> <Paragraph position="1"> Kernel methods allow us to design a modular system, in which the kernel function acts as an interface between the data and the learning algorithm. Thus the kernel function is the only domain-specific module of the system, while the learning algorithm is a general-purpose component. Potentially, any kernel function can work with any kernel-based algorithm. In our approach we use Support Vector Machines (Vapnik, 1998).</Paragraph> <Paragraph position="2"> In order to implement the approach based on shallow linguistic information, we employed a linear combination of kernels. Several works (Gliozzo et al., 2005; Zhao and Grishman, 2005; Culotta and Sorensen, 2004) empirically demonstrate the effectiveness of combining kernels in this way, showing that the combined kernel always improves the performance of the individual ones.</Paragraph> <Paragraph position="3"> In addition, this formulation allows us to evaluate the individual contribution of each information source. We designed two families of kernels: Global Context kernels and Local Context kernels, in which each single kernel is explicitly calculated as follows</Paragraph> <Paragraph position="4"> K(x1, x2) = phi(x1) . phi(x2) / (||phi(x1)|| ||phi(x2)||)   (1) </Paragraph> <Paragraph position="5"> where phi(.) is the embedding vector and ||.|| is the 2-norm. 
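As an illustration (our own sketch, not the authors' code), Equation 1 can be computed over sparse embedding vectors; the dictionary representation and the function names below are assumptions:

```python
import math

def dot(u, v):
    # inner product of two sparse embedding vectors stored as
    # {feature: weight} dicts; iterate over the smaller one
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(f, 0.0) for f, w in u.items())

def normalized_kernel(u, v):
    # Equation 1: the inner product divided by the product of the 2-norms
    denom = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / denom if denom else 0.0
```

With this normalization, normalized_kernel(x, x) is 1 for any non-zero x, which is what makes kernels computed over heterogeneous feature spaces comparable before they are summed.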
The kernel is normalized (divided) by the product of the norms of the embedding vectors. The normalization factor plays an important role in allowing us to integrate information from heterogeneous feature spaces. Even though the resulting feature space has high dimensionality, an efficient computation of Equation 1 can be carried out explicitly since the input representations defined below are extremely sparse.</Paragraph> <Section position="1" start_page="402" end_page="403" type="sub_section"> <SectionTitle> 3.1 Global Context Kernel </SectionTitle> <Paragraph position="0"> In (Bunescu and Mooney, 2005b), the authors observed that a relation between two entities is generally expressed using only words that appear simultaneously in one of the following three patterns: Fore-Between: tokens before and between the two candidate interacting entities. For instance: binding of [P1] to [P2], interaction involving [P1] and [P2], association of [P1] by [P2].</Paragraph> <Paragraph position="1"> Between: only tokens between the two candidate interacting entities. For instance: [P1] associates with [P2], [P1] binding to [P2], [P1], inhibitor of [P2].</Paragraph> <Paragraph position="2"> Between-After: tokens between and after the two candidate interacting entities. For instance: [P1] - [P2] association, [P1] and [P2] interact, [P1] has influence on [P2] binding.</Paragraph> <Paragraph position="3"> Our global context kernels operate on the patterns above, where each pattern is represented using a bag-of-words instead of sparse subsequences of words, PoS tags, entity and chunk types, or WordNet synsets as in (Bunescu and Mooney, 2005b). More formally, given a relation example R, we represent a pattern P as a row vector</Paragraph> <Paragraph position="4"> phi_P(R) = (tf(t1, P), tf(t2, P), ..., tf(tl, P)) </Paragraph> <Paragraph position="5"> where the function tf(ti, P) records how many times a particular token ti is used in P. 
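A minimal sketch (our own illustration, not the authors' implementation) of the three pattern contexts, the token-frequency embedding phi_P extended with n-grams, and the Global Context kernel as the sum of one n-gram kernel per pattern; the exact window boundaries and function names are assumptions:

```python
import math
from collections import Counter

def patterns(tokens, i, j):
    # the three pattern contexts for candidate entities at positions i
    # and j (i before j, the entity tokens themselves excluded)
    fore_between = tokens[:i] + tokens[i + 1:j]
    between = tokens[i + 1:j]
    between_after = tokens[i + 1:j] + tokens[j + 1:]
    return fore_between, between, between_after

def ngram_embed(tokens, n_max=3):
    # phi_P extended with contiguous n-grams up to n_max, as sparse tf counts
    feats = Counter()
    for n in range(1, n_max + 1):
        for k in range(len(tokens) - n + 1):
            feats[tuple(tokens[k:k + n])] += 1
    return feats

def k_norm(u, v):
    # Equation 1 on two sparse vectors
    dot = lambda a, b: sum(w * b.get(f, 0) for f, w in a.items())
    denom = math.sqrt(dot(u, u) * dot(v, v))
    return dot(u, v) / denom if denom else 0.0

def k_gc(r1, r2):
    # Global Context kernel: one n-gram kernel per pattern, summed
    return sum(k_norm(ngram_embed(p1), ngram_embed(p2))
               for p1, p2 in zip(r1, r2))
```

Note that punctuation and stop words are deliberately kept in the token lists, matching the representation described in the text.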
Note that this approach differs from the standard bag-of-words as punctuation and stop words are included in phi_P, while the entities (with attributes CANDIDATE and OTHER) are not. To improve the classification performance, we have further extended phi_P to embed n-grams of (contiguous) tokens (up to n = 3). By substituting phi_P into Equation 1, we obtain the n-gram kernel Kn, which counts the uni-grams, bi-grams, ..., n-grams that two patterns have in common. The Global Context kernel KGC(R1, R2) is then defined as</Paragraph> <Paragraph position="6"> KGC(R1, R2) = KFB(R1, R2) + KB(R1, R2) + KBA(R1, R2) </Paragraph> <Paragraph position="7"> where KFB, KB and KBA are n-gram kernels that operate on the Fore-Between, Between and Between-After patterns respectively.</Paragraph> </Section> <Section position="2" start_page="403" end_page="403" type="sub_section"> <SectionTitle> 3.2 Local Context Kernel </SectionTitle> <Paragraph position="0"> The type of the candidate interacting entities can provide useful clues for detecting the agent and target of the relation, as well as the presence of the relation itself. As the type is not known, we use the information provided by the two local contexts of the candidate interacting entities, called the left and right local context respectively. 
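As an illustration (not the authors' code), each local context can be embedded as binary positional features and the two context kernels summed; the concrete feature functions below (a lowercased stand-in for the lemma, a capitalization flag for the orthographic class) are our own assumptions:

```python
import math

def embed_context(tokens, w=2):
    # binary feature functions f_i over a window of +-w tokens: each
    # active (position, feature_type, value) triple gets weight 1
    feats = {}
    for pos, tok in enumerate(tokens[:2 * w + 1]):
        feats[(pos, "token", tok)] = 1.0
        feats[(pos, "lower", tok.lower())] = 1.0    # crude lemma stand-in
        feats[(pos, "cap", tok[:1].isupper())] = 1.0  # orthographic class
    return feats

def k_context(c1, c2):
    # Equation 1 on two context embeddings
    dot = lambda a, b: sum(v * b.get(f, 0.0) for f, v in a.items())
    denom = math.sqrt(dot(c1, c1) * dot(c2, c2))
    return dot(c1, c2) / denom if denom else 0.0

def k_lc(r1, r2):
    # Local Context kernel: left context kernel plus right context kernel;
    # each r is a (left_context_tokens, right_context_tokens) pair
    return (k_context(embed_context(r1[0]), embed_context(r2[0]))
            + k_context(embed_context(r1[1]), embed_context(r2[1])))
```

Because the position is part of each feature key, token ordering matters here, unlike in the bag-of-words global context representation.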
As typically done in entity recognition, we represent each local context by using the following basic features: Token The token itself.</Paragraph> <Paragraph position="1"> Lemma The lemma of the token.</Paragraph> <Paragraph position="2"> PoS The PoS tag of the token.</Paragraph> <Paragraph position="3"> Orthographic This feature maps each token into equivalence classes that encode attributes such as capitalization, punctuation, numerals and so on.</Paragraph> <Paragraph position="4"> Formally, given a relation example R, a local context L = t-w, ..., t-1, t0, t+1, ..., t+w is represented as a row vector</Paragraph> <Paragraph position="5"> phi_L(R) = (f1(L), f2(L), ..., fm(L)) </Paragraph> <Paragraph position="6"> where fi is a feature function that returns 1 if it is active in the specified position of L, 0 otherwise; we consider a window of +-2 tokens around the candidate entity.</Paragraph> <Paragraph position="7"> The Local Context kernel KLC(R1, R2) is defined as</Paragraph> <Paragraph position="8"> KLC(R1, R2) = Kleft(R1, R2) + Kright(R1, R2) </Paragraph> <Paragraph position="9"> where Kleft and Kright are defined by substituting the embedding of the left and right local context into Equation 1 respectively.</Paragraph> <Paragraph position="10"> Notice that KLC differs substantially from KGC as it considers the ordering of the tokens and the feature space is enriched with PoS, lemma and orthographic features.</Paragraph> </Section> <Section position="3" start_page="403" end_page="403" type="sub_section"> <SectionTitle> 3.3 Shallow Linguistic Kernel </SectionTitle> <Paragraph position="0"> Finally, the Shallow Linguistic kernel KSL(R1, R2) is defined as</Paragraph> <Paragraph position="1"> KSL(R1, R2) = KGC(R1, R2) + KLC(R1, R2) </Paragraph> <Paragraph position="2"> It follows directly from the explicit construction of the feature space and from closure properties of kernels that KSL is a valid kernel.</Paragraph> </Section> </Section> <Section position="5" start_page="403" end_page="404" type="metho"> <SectionTitle> 4 Data sets </SectionTitle> <Paragraph position="0"> The two data sets used for the experiments concern the same domain (i.e. 
gene/protein interactions).</Paragraph> <Paragraph position="1"> However, they present a crucial difference which makes it worthwhile to show the experimental results on both of them. In one case (AImed) interactions are considered symmetric, while in the other (LLL challenge) agents and targets of genic interactions have to be identified.</Paragraph> <Section position="1" start_page="403" end_page="403" type="sub_section"> <SectionTitle> 4.1 AImed corpus </SectionTitle> <Paragraph position="0"> The first data set used in the experiments is the AImed corpus, previously used for training protein interaction extraction systems in (Bunescu et al., 2005; Bunescu and Mooney, 2005b). It consists of 225 Medline abstracts: 200 are known to describe interactions between human proteins, while the other 25 do not refer to any interaction.</Paragraph> <Paragraph position="1"> There are 4,084 protein references and around 1,000 tagged interactions in this data set. In this data set there is no distinction between genes and proteins, and the relations are symmetric.</Paragraph> </Section> <Section position="2" start_page="403" end_page="404" type="sub_section"> <SectionTitle> 4.2 LLL Challenge </SectionTitle> <Paragraph position="0"> This data set was used in the Learning Language in Logic (LLL) challenge on Genic Interaction extraction (Ned'ellec, 2005). The objective of the challenge was to evaluate the performance of systems based on machine learning techniques in identifying gene/protein interactions and their roles, agent or target. The data set was collected by querying Medline on Bacillus subtilis transcription and sporulation. It is divided into a training set (80 sentences describing 271 interactions) and a test set (87 sentences describing 106 interactions). Unlike the training set, the test set contains sentences without interactions. The data set is decomposed into two subsets of increasing difficulty. 
The first subset does not include coreferences, while the second one includes simple cases of coreference, mainly appositions. Both subsets are available with different kinds of annotation: basic and enriched. The former includes word and sentence segmentation. The latter also includes manually checked information, such as lemma and syntactic dependencies. A dictionary of named entities (including typographical variants and synonyms) is associated with the data set.</Paragraph> </Section> </Section> <Section position="6" start_page="404" end_page="406" type="metho"> <SectionTitle> 5 Experiments </SectionTitle> <Paragraph position="0"> Before describing the results of the experiments, a note concerning the evaluation methodology is in order.</Paragraph> <Paragraph position="1"> There are different ways of evaluating performance in extracting information, as noted in (Lavelli et al., 2004) for the extraction of slot fillers in the Seminar Announcement and the Job Posting data sets. Adapting the proposed classification to relation extraction, the following two cases can be identified: * One Answer per Occurrence in the Document - OAOD (each individual occurrence of a protein interaction has to be extracted from the document); * One Answer per Relation in a given Document - OARD (where two occurrences of the same protein interaction are considered one correct answer).</Paragraph> <Paragraph position="2"> Figure 3 shows a fragment of tagged text drawn from the AImed corpus. It contains three different interactions between pairs of proteins, for a total of seven occurrences of interactions. For example, there are three occurrences of the interaction between IGF-IR and p52Shc (i.e. numbers 1, 3 and 7). If we adopt the OAOD methodology, all seven occurrences have to be extracted to achieve the maximum score. 
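The two counting schemes (OAOD per occurrence, OARD per distinct relation) can be sketched as follows; this is our own illustration, and representing interactions as protein-name pairs is an assumption:

```python
def count_correct(extracted, gold_occurrences):
    # gold_occurrences lists one (protein_a, protein_b) pair per tagged
    # occurrence in the document; extracted is the set of predicted pairs.
    # OAOD credits every matched occurrence; OARD credits each distinct
    # interaction at most once.
    oaod = sum(1 for pair in gold_occurrences if pair in extracted)
    oard = sum(1 for pair in set(gold_occurrences) if pair in extracted)
    return oaod, oard
```

On the Figure 3 example, a relation tagged three times contributes three units of credit under OAOD but only one under OARD.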
On the other hand, if we use the OARD methodology, only one occurrence for each interaction has to be extracted to maximize the score.</Paragraph> <Paragraph position="3"> On the AImed data set both evaluations were performed, while on the LLL challenge only the OAOD evaluation methodology was used, because this is the only one provided by the evaluation server of the challenge.</Paragraph> <Paragraph position="4"> Figure 3: A fragment of text from the AImed corpus with proteins and their interactions tagged. The protein names are highlighted in bold face, and matching subscript numbers indicate an interaction between the proteins.</Paragraph> <Section position="1" start_page="404" end_page="404" type="sub_section"> <SectionTitle> 5.1 Implementation Details </SectionTitle> <Paragraph position="0"> All the experiments were performed using the SVM package LIBSVM customized to embed our own kernel. For the LLL challenge submission, we optimized the regularization parameter C by 10-fold cross validation, while we used its default value for the AImed experiment. In both experiments, we set the cost-factor Wi to be the ratio between the number of negative and positive examples.</Paragraph> </Section> <Section position="2" start_page="404" end_page="405" type="sub_section"> <SectionTitle> 5.2 Results on AImed </SectionTitle> <Paragraph position="0"> KSL performance was first evaluated on the AImed data set (Section 4.1). We first give an evaluation of the kernel combination, and then we compare our results with the Subsequence Kernel for Relation Extraction (ERK) described in (Bunescu and Mooney, 2005b). 
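The cost-factor setting described in the implementation details above can be sketched as follows (our own helper; LIBSVM exposes per-class weights through its class-weight options, and the +1/-1 label convention here is an assumption):

```python
def cost_factor(labels):
    # W_i = (# negative examples) / (# positive examples): the error on
    # positives is weighted more heavily when they are the minority class
    pos = sum(1 for y in labels if y == 1)
    neg = sum(1 for y in labels if y != 1)
    return neg / pos if pos else 1.0
```

This counteracts the heavy class imbalance of relation extraction data, where most candidate entity pairs are not in a relation.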
All experiments are conducted using 10-fold cross validation on the same data splitting used in (Bunescu et al., 2005; Bunescu and Mooney, 2005b).</Paragraph> <Paragraph position="1"> Table 1 shows the performance of the three kernels defined in Section 3 for protein-protein interactions using the two evaluation methodologies described above.</Paragraph> <Paragraph position="2"> [Caption fragment: "... data set using OARD evaluation methodology."]</Paragraph> <Paragraph position="3"> Finally, Figure 5 shows the learning curve of the combined kernel KSL using the OARD evaluation methodology. The curve reaches a plateau with around 100 Medline abstracts.</Paragraph> </Section> <Section position="3" start_page="405" end_page="405" type="sub_section"> <SectionTitle> 5.3 Results on LLL challenge </SectionTitle> <Paragraph position="0"> The system was evaluated on the &quot;basic&quot; version of the LLL challenge data set (Section 4.2).</Paragraph> <Paragraph position="1"> Table 2 shows the results of KSL returned by the scoring service for the three subsets of the training set (with and without coreferences, and with their union). Table 3 shows the best results obtained at the official competition performed in April 2005. Comparing the results, we see that KSL trained on each subset outperforms the best systems of the LLL challenge. Notice that the best results at the challenge were obtained by different groups and by exploiting the linguistic &quot;enriched&quot; version of the data set. As observed in (Ned'ellec, 2005), the scores obtained using the training set without coreferences and the whole training set are similar.</Paragraph> <Paragraph position="2"> We also report in Table 4 an analysis of the kernel combination. Given that we are interested here in the contribution of each kernel, we evaluated the experiments by 10-fold cross-validation on the whole training set, avoiding the submission process. 
</Paragraph> </Section> <Section position="4" start_page="405" end_page="406" type="sub_section"> <SectionTitle> 5.4 Discussion of Results </SectionTitle> <Paragraph position="0"> The experimental results show that the combined kernel KSL outperforms the basic kernels KGC and KLC on both data sets. In particular, precision significantly increases at the expense of a lower recall. High precision is particularly advantageous when extracting knowledge from large corpora, because it avoids overloading end users with too many false positives.</Paragraph> <Paragraph position="1"> Although the basic kernels were designed to model complementary aspects of the task (i.e.</Paragraph> <Paragraph position="2"> [Footnote: After the challenge deadline, Reidel and Klein (2005) achieved a significant improvement, F1 = 68.4% (without coreferences) and F1 = 64.7% (with and without coreferences).] Table 4: Analysis of the kernel combination on the LLL challenge using 10-fold cross validation.</Paragraph> <Paragraph position="3"> presence of the relation and roles of the interacting entities), they perform reasonably well even when considered separately. In particular, KGC achieved good performance on both data sets. This result was not expected on the LLL challenge, because this task requires not only recognizing the presence of relationships between entities but also identifying their roles. On the other hand, the outcomes of KLC on the AImed data set show that such a kernel helps to identify the presence of relationships as well.</Paragraph> <Paragraph position="4"> At first glance, it may seem strange that KGC outperforms ERK on AImed, as the latter approach exploits a richer representation: sparse sub-sequences of words, PoS tags, entity and chunk types, or WordNet synsets. However, an approach based on n-grams is sufficient to identify the presence of a relationship. This result sounds less surprising if we recall that both approaches cast the relation extraction problem as a text categorization task. 
Approaches to text categorization based on rich linguistic information have obtained lower accuracy than the traditional bag-of-words approach (e.g. (Koster and Seutter, 2003)). Shallow linguistic information seems to be more effective for modeling the local context of the entities.</Paragraph> <Paragraph position="5"> Finally, we obtained worse results performing dimensionality reduction, either based on generic linguistic assumptions (e.g. by removing words from stop lists or with certain PoS tags) or using statistical methods (e.g. the tf.idf weighting scheme).</Paragraph> <Paragraph position="6"> This may be explained by the fact that, in tasks like entity recognition and relation extraction, useful clues are also provided by high-frequency tokens, such as stop words or punctuation marks, and by the relative positions in which they appear.</Paragraph> </Section> </Section> <Section position="7" start_page="406" end_page="407" type="metho"> <SectionTitle> 6 Related Work </SectionTitle> <Paragraph position="0"> First of all, the obvious references for our work are the approaches evaluated on the AImed and LLL challenge data sets.</Paragraph> <Paragraph position="1"> In (Bunescu and Mooney, 2005b), the authors present a generalized subsequence kernel that works with sparse sequences containing combinations of words and PoS tags.</Paragraph> <Paragraph position="2"> The best results on the LLL challenge were obtained by the group from the University of Edinburgh (Reidel and Klein, 2005), which used Markov Logic, a framework that combines log-linear models and First Order Logic, to create a set of weighted clauses which can classify pairs of gene named entities as genic interactions. These clauses are based on chains of syntactic and semantic relations in the parse or Discourse Representation Structure (DRS) of a sentence, respectively.</Paragraph> <Paragraph position="3"> Other relevant approaches include those that adopt kernel methods to perform relation extraction. Zelenko et al. 
(2003) describe a relation extraction algorithm that uses a tree kernel defined over a shallow parse tree representation of sentences. The approach is vulnerable to unrecoverable parsing errors. Culotta and Sorensen (2004) describe a slightly generalized version of this kernel based on dependency trees, in which a bag-of-words kernel is used to compensate for errors in syntactic analysis. A further extension is proposed by Zhao and Grishman (2005). They use composite kernels to integrate information from different syntactic sources (tokenization, sentence parsing, and deep dependency analysis) so that processing errors occurring at one level may be overcome by information from other levels. Bunescu and Mooney (2005a) present an alternative approach which uses the information concentrated in the shortest path between the two entities in the dependency tree.</Paragraph> <Paragraph position="3"> As mentioned in Section 1, another relevant approach is presented in (Roth and Yih, 2002). Classifiers that identify entities and relations among them are first learned from local information in the sentence. This information, along with constraints induced among entity types and relations, is used to perform global probabilistic inference that accounts for the mutual dependencies among the entities.</Paragraph> <Paragraph position="4"> All the previous approaches have been evaluated on different data sets, so it is not possible to draw a firm conclusion about which approach performs best.</Paragraph> </Section> </Paper>