<?xml version="1.0" standalone="yes"?> <Paper uid="W02-2011"> <Title>Combining labelled and unlabelled data: a case study on Fisher kernels and transductive inference for biological entity recognition</Title> <Section position="4" start_page="1" end_page="2" type="metho"> <SectionTitle> 2 Classification for entity extraction </SectionTitle> <Paragraph position="0"> We formulate the following (binary) classification problem: given an input space X, and from a dataset of N input-output pairs (x_k, y_k), learn a classifier h : X → {−1, +1} that minimises</Paragraph> <Paragraph position="2"> the probability P(h(x) ≠ y) over the fixed but unknown joint input-output distribution of (x, y) pairs.</Paragraph> <Paragraph position="3"> In this setting, binary classification is essentially a supervised learning problem.</Paragraph> <Paragraph position="4"> In order to map this to the biological entity recognition problem, we consider, for each candidate term, the following binary decision problem: is the candidate a biological entity (y = 1) or not (y = −1). The input space is a high dimensional feature space containing lexical, morpho-syntactic and contextual features. In order to assess the validity of combining labelled and unlabelled data for the particular task of biological entity extraction, we use the following tools. First, we rely on Support Vector Machines together with transductive inference. (In our case, biological entities are proteins, genes and RNA, cf. section 6.)</Paragraph> <Paragraph position="5"> Transductive inference (Vapnik, 1998; Joachims, 1999) is a training technique that takes both labelled and unlabelled data into account. Secondly, we develop a Fisher kernel (Jaakkola and Haussler, 1999), which derives the similarity from an underlying (unsupervised) model of the data, used as a similarity measure (a kernel) within SVMs. 
The learning process involves the following steps: Transductive inference: learn an SVM classifier h(x) using the combined (labelled and unlabelled) dataset, using traditional kernels.</Paragraph> <Paragraph position="6"> Fisher kernels: 1. Learn a probabilistic model of the data P(x|θ) using combined unlabelled and labelled data; 2. Derive the Fisher kernel K(x, z) expressing the similarity in X-space; 3. Learn an SVM classifier h(x) using this Fisher kernel and inductive inference.</Paragraph> <Paragraph position="7"> 3 Probabilistic models for co-occurrence data In (Gaussier et al., 2002) we presented a general hierarchical probabilistic model which generalises several established models like Naive Bayes (Yang and Liu, 1999), probabilistic latent semantic analysis (PLSA) (Hofmann, 1999) or hierarchical mixtures (Toutanova et al., 2001). In this model, data result from the observation of co-occurring objects. For example, a document collection is expressed as co-occurrences between documents and words; in entity extraction, co-occurring objects may be potential entities and their context, for example. For co-occurring objects i and j, the model is expressed as follows:</Paragraph> <Paragraph position="9"/> P(i, j) = Σ_α P(α) P(i|α) Σ_ν P(ν|α) P(j|ν), (1) where α are latent classes for co-occurrences (i, j) and ν are latent nodes in a hierarchy generating objects j. 
In the case where no hierarchy is needed (i.e. P(ν|α) = δ(ν = α)), the model reduces to PLSA:</Paragraph> <Paragraph position="11"/> P(i, j) = Σ_α P(α) P(i|α) P(j|α), (2) where α are now latent concepts over both i and j. Parameters of the model (class probabilities P(α) and class-conditionals P(i|α) and P(j|α)) are learned using a deterministic annealing version of the expectation-maximisation (EM) algorithm (Hofmann, 1999; Gaussier et al., 2002).</Paragraph> </Section> <Section position="5" start_page="2" end_page="2" type="metho"> <SectionTitle> 4 Fisher kernels </SectionTitle> <Paragraph position="0"> Probabilistic generative models like PLSA and hierarchical extensions (Gaussier et al., 2002) provide a natural way to model the generation of the data, and allow the use of well-founded statistical tools to learn and use the model.</Paragraph> <Paragraph position="1"> In addition, they may be used to derive a model-based measure of similarity between examples, using the so-called Fisher kernels proposed by Jaakkola and Haussler (1999). The idea behind this kernel is that using the structure implied by the generative model will give a more relevant similarity estimate, and allow kernel methods like support vector machines or nearest neighbours to leverage the probabilistic model and yield improved performance (Hofmann, 2000).</Paragraph> <Paragraph position="2"> The Fisher kernel is obtained using the log-likelihood of the model and the Fisher information matrix. Let us consider our collection of documents {x_1, ..., x_N},</Paragraph> <Paragraph position="4"> and denote by ℓ(x) = log P(x|θ) the log-likelihood of the model for data x. The expression of the Fisher kernel (Jaakkola and Haussler, 1999) is then:</Paragraph> <Paragraph position="6"/> K(x, z) = ∇ℓ(x)ᵀ I_F⁻¹ ∇ℓ(z). (3) The Fisher information matrix I_F can be seen as a way to keep the kernel expression independent of parameterisation and is defined as</Paragraph> <Paragraph position="8"/> I_F = E[∇ℓ(x) ∇ℓ(x)ᵀ], where the gradient is w.r.t. θ and the expectation is taken over P(x|θ). 
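To make these definitions concrete, here is a small sketch of equation (3) for a toy model, with I_F estimated empirically as the expectation of outer products of Fisher scores over data drawn from the model. The multinomial toy model and all names here are our own illustration, not the paper's model.

```python
import numpy as np

def fisher_kernel(score_fn, x, z, samples):
    """K(x, z) = grad l(x)^T I_F^{-1} grad l(z), with the Fisher information
    I_F = E[grad l grad l^T] estimated from samples drawn from P(x|theta)."""
    S = np.stack([score_fn(s) for s in samples])   # (N, D) Fisher scores
    I_F = S.T @ S / len(samples)                   # empirical E[grad l grad l^T]
    I_inv = np.linalg.pinv(I_F)                    # pseudo-inverse for stability
    return score_fn(x) @ I_inv @ score_fn(z)

# Toy model: multinomial with parameters theta; the Fisher score of a count
# vector x is d/dtheta sum_w x_w log theta_w = x / theta (componentwise).
theta = np.array([0.5, 0.3, 0.2])
score = lambda x: x / theta

rng = np.random.default_rng(0)
samples = rng.multinomial(20, theta, size=500)     # draws from P(x|theta)
k = fisher_kernel(score, samples[0], samples[1], samples)
```

Since I_F is symmetric positive semi-definite, the resulting kernel is symmetric and K(x, x) is non-negative, as a kernel should be.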
With a suitable parameterization, the information matrix I_F is usually approximated by the identity matrix (Hofmann, 2000), leading to the simpler kernel expression K(x, z) = ∇ℓ(x)ᵀ ∇ℓ(z).</Paragraph> Depending on the model, the various log-likelihoods and their derivatives will yield different Fisher kernel expressions. For PLSA (2), the parameters are θ = [P(α); P(i|α); P(j|α)]. From the derivatives of the likelihood ℓ(x) =</Paragraph> <Paragraph position="11"> Σ_{(i,j) ∈ x} log P(i, j), we derive the following similarity (Hofmann, 2000):</Paragraph> <Paragraph position="13"/> K(x, z) = Σ_α P(α|x) P(α|z) / P(α) + Σ_j P̂(j|x) P̂(j|z) Σ_α P(α|x, j) P(α|z, j) / P(j|α). (4) </Section> <Section position="6" start_page="2" end_page="2" type="metho"> <SectionTitle> 5 Transductive inference </SectionTitle> <Paragraph position="0"> In standard, inductive SVM inference, the annotated data is used to infer a model, which is then applied to unannotated test data. The inference consists in a trade-off between the size of the margin (linked to generalisation abilities) and the number of training errors. Transductive inference (Gammerman et al., 1998; Joachims, 1999) aims at maximising the margin between positives and negatives, while minimising not only the actual number of incorrect predictions on labelled examples, but also the expected number of incorrect predictions on the set of unannotated examples.</Paragraph> <Paragraph position="1"> This is done by including the unknown labels as extra variables in the original optimisation problem. 
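This enlarged optimisation problem is typically attacked iteratively. Below is a deliberately simplified sketch of the label-switching idea: start from the inductive solution, then alternately refit on all examples and relabel the unlabelled ones until the labelling stabilises. Joachims' actual algorithm also maintains the class ratio and anneals a cost parameter, which we omit; here a regularised least-squares classifier stands in for the SVM solver, and all names are ours.

```python
import numpy as np

def fit_linear(X, y):
    """Stand-in for the SVM solver: regularised least-squares classifier."""
    Xb = np.hstack([X, np.ones((len(X), 1))])            # append bias column
    return np.linalg.solve(Xb.T @ Xb + 1e-3 * np.eye(Xb.shape[1]), Xb.T @ y)

def predict(w, X):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return Xb @ w

def transductive_train(X_lab, y_lab, X_unl, n_iter=20):
    """Sketch of transductive training by label switching."""
    w = fit_linear(X_lab, y_lab)                         # inductive start
    y_unl = np.sign(predict(w, X_unl))                   # initial labelling
    y_unl[y_unl == 0] = 1.0
    for _ in range(n_iter):
        # refit on labelled + currently-labelled unlabelled examples
        w = fit_linear(np.vstack([X_lab, X_unl]),
                       np.concatenate([y_lab, y_unl]))
        scores = predict(w, X_unl)
        new_labels = np.where(np.sign(scores) == 0, 1.0, np.sign(scores))
        if np.array_equal(new_labels, y_unl):            # labelling stable
            break
        y_unl = new_labels
    return w, y_unl
```

On well-separated data, a handful of labelled points plus many unlabelled ones typically suffice for the recovered labelling to match the true one.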
In the linearly separable case, the new optimisation problem now amounts to finding a labelling of the unannotated examples and a hyperplane which separates all examples (annotated and unannotated) with maximum margin.</Paragraph> <Paragraph position="2"> In the non-separable case, slack variables are also associated to unannotated examples, and the optimisation problem is now to find a labelling and a hyperplane which optimally solve the trade-off between maximising the margin and minimising the number of misclassified examples (annotated and unannotated).</Paragraph> <Paragraph position="3"> With the introduction of unknown labels as supplementary optimisation variables, the constraints of the quadratic optimisation problem are now nonlinear, which makes solving more difficult. However, approximate iterative algorithms exist which can efficiently train Transductive SVMs. They are based on the principle of gradually improving the solution by switching the labels of unannotated examples which are misclassified at the current iteration, starting from an initial labelling given by the standard (inductive) SVM.</Paragraph> <Paragraph position="4"> Table 1: WUp: is the word capitalized? WAllUp: is the word all capitals? WNum: does the word contain digits?</Paragraph> </Section> <Section position="7" start_page="2" end_page="5" type="metho"> <SectionTitle> 6 Experiments </SectionTitle> <Paragraph position="0"> For our experiments, we used 184 abstracts from the Medline site. In these articles, genes, proteins and RNAs were manually annotated by a biologist as part of the BioMIRE project. These articles contain 1405 occurrences of gene names, 792 of protein names and 81 of RNA names. All these entities are considered relevant biological entities. We focus here on the task of identifying names corresponding to such entities in running text, without differentiating genes from proteins or RNAs. 
Once candidates for biological entity names have been identified, this task amounts to a binary categorisation, relevant candidates corresponding to biological entity names. We divided these abstracts into a training and development set (122 abstracts), and a test set (62 abstracts). We then retained different portions of the training labels, to be used as labelled data, whereas the rest of the data is considered unlabelled.</Paragraph> <Section position="1" start_page="2" end_page="4" type="sub_section"> <SectionTitle> 6.1 Definition of features </SectionTitle> <Paragraph position="0"> First of all, the abstracts are tokenised, tagged and lemmatised. Candidates for biological entity names are then selected on the basis of the following heuristic: a token is considered a candidate if it appears in one of the biological lexicons we have at our disposal, or if it does not belong to our general English lexicon. This simple heuristic allows us to retain 93% (1521 out of 1642) of biological names in the training set (90% in the test set), while considering only 21% of all possible candidates (5845 out of 27350 tokens).</Paragraph> <Paragraph position="1"> It thus provides a good pre-filter which significantly improves the performance, in terms of speed, of our system. The biological lexicons we use were provided by the BioMIRE project, and were derived from the resources available at: http://iubio.bio.indiana.edu/.</Paragraph> <Paragraph position="2"> For each candidate, three types of features were considered. We first retained the part-of-speech and some spelling information (table 1). These features were chosen based on the inspection of gene and protein names in our lexicons.</Paragraph> <Paragraph position="3"> The second type of features relates to the presence of the candidate in our lexical resources (table 2). Lastly, the third type of features describes contextual information. The context we consider contains the four preceding and the four following words. 
However, we did not take into account the position of the words in the context, but only their presence in the right or left context. In addition, we replaced, whenever possible, each word by a feature indicating (a) whether the word was part of the gene lexicon, (b) if not, whether it was part of the protein lexicon, (c) if not, whether it was part of the species lexicon, (d) and if not, whenever the word was neither a noun, an adjective nor a verb, we replaced it by its part-of-speech.</Paragraph> <Paragraph position="4"> For example, the word hairless is associated with the features given in Table 3, when encountered in the following sentence: Inhibition of the DNA-binding activity of Drosophila suppressor of hairless and of its human homolog, KBF2/RBP-J kappa, by direct protein-protein interaction with Drosophila hairless. The word hairless appears in the gene lexicon and is wrongly recognized as an adjective by our tagger. The word human, the fourth word of the right context of hairless, belongs to the species lexicon, and is thus replaced by the feature RC SPECIES. Neither Drosophila nor suppressor belong to the specialized lexicons we use, and, since they are both tagged as nouns, they are left unchanged. Prepositions and conjunctions are replaced by their part-of-speech, and prefixes LC and RC indicate whether they were found in the left or right context. Note that since two prepositions appear in the left context of hairless, the value of the LC PREP feature is 2.</Paragraph> <Paragraph position="5"> Altogether, this amounts to a total of 3690 possible features in the input space X.</Paragraph> <Paragraph position="6"> Using these lexicons alone, the same task with the same test data yields: precision = 22%, recall = 76%. 
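The contextual encoding described above can be sketched as follows. The lexicon sets, the tag names (NOUN, ADJ, VERB, PREP, ...) and the underscore spelling of the LC_/RC_ feature names are hypothetical stand-ins for the BioMIRE resources and tagger output.

```python
def context_features(tokens, pos_tags, idx, gene_lex, protein_lex, species_lex,
                     window=4):
    """Features for the candidate at position idx: for each word in a 4-word
    window, keep only which side it is on (LC_/RC_) and replace it, in order
    of priority, by a lexicon flag or (for non noun/adjective/verb words) its
    part of speech; remaining content words are kept as-is."""
    feats = {}
    lo, hi = max(0, idx - window), min(len(tokens), idx + window + 1)
    for j in range(lo, hi):
        if j == idx:
            continue
        side = "LC" if j < idx else "RC"
        word, tag = tokens[j].lower(), pos_tags[j]
        if word in gene_lex:
            key = f"{side}_GENE"
        elif word in protein_lex:
            key = f"{side}_PROTEIN"
        elif word in species_lex:
            key = f"{side}_SPECIES"
        elif tag not in ("NOUN", "ADJ", "VERB"):
            key = f"{side}_{tag}"          # e.g. LC_PREP for a preposition
        else:
            key = f"{side}_{word}"         # content word kept as-is
        feats[key] = feats.get(key, 0) + 1
    return feats

# The paper's hairless example, with guessed part-of-speech tags.
tokens = ["Inhibition", "of", "the", "DNA-binding", "activity", "of",
          "Drosophila", "suppressor", "of", "hairless", "and", "of",
          "its", "human", "homolog"]
tags = ["NOUN", "PREP", "DET", "ADJ", "NOUN", "PREP", "NOUN", "NOUN",
        "PREP", "ADJ", "CONJ", "PREP", "PRON", "NOUN", "NOUN"]
feats = context_features(tokens, tags, 9, {"hairless"}, set(), {"human"})
```

With these stand-in lexicons, the sketch reproduces the behaviour described in the text: human maps to the RC species flag and the left-context preposition feature gets value 2.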
Note that no adaptation work has been conducted on our tagger, which explains this error.</Paragraph> </Section> <Section position="2" start_page="4" end_page="4" type="sub_section"> <SectionTitle> 6.2 Results </SectionTitle> <Paragraph position="0"> In our experiments, we have used the following methods: SVM trained with inductive inference, and using a linear kernel, a polynomial kernel of degree d = 2 and the so-called "radial basis function" kernel (Schölkopf and Smola, 2002).</Paragraph> <Paragraph position="1"> SVM trained with transductive inference, and using a linear kernel or a polynomial kernel of degree d = 2.</Paragraph> <Paragraph position="2"> SVM trained with inductive inference using Fisher kernels estimated from the whole training data (without using labels), with different numbers of classes c in the PLSA model (4).</Paragraph> <Paragraph position="3"> The proportion of labelled data is indicated in the tables of results. For SVM with inductive inference, only the labelled portion is used. For transductive SVM (TSVM), the remaining, unlabelled portion is used (without the labels). For the Fisher kernels (FK), an unsupervised model is estimated on the full dataset using PLSA, and a SVM is trained with inductive inference on the labelled data only, using the Fisher kernel as similarity measure.</Paragraph> </Section> <Section position="3" start_page="4" end_page="5" type="sub_section"> <SectionTitle> 6.3 Transductive inference </SectionTitle> <Paragraph position="0"> Table 4 gives interesting insight into the effect of transductive inference. As expected, in the limit where little unannotated data is used (100% in the table), there is little to gain from using transductive inference. 
(Table 4: scores (in %) using different proportions of annotated data for the following models: SVM with inductive inference (SVM) and linear (lin) kernel, second degree polynomial kernel (d=2), and RBF kernel (rbf); SVM with transductive inference (TSVM) and linear (lin) kernel or second degree polynomial (d=2) kernel.) Accordingly, performance is roughly equivalent for SVM and TSVM, with a slight advantage for the RBF kernel trained with inductive inference. Interestingly, in the other limit, i.e. when very little annotated data is used, transductive inference does not seem to yield a marked improvement over inductive learning. This finding seems somewhat at odds with the results reported by Joachims (1999) on a different task (text categorisation). We interpret this result as a side-effect of the search strategy, where one tries to optimise both the size of the margin and the labelling of the unannotated examples. In practice, an exact optimisation over this labelling is impractical, and when a large amount of unlabelled data is used, there is a risk that the approximate, sub-optimal search strategy described by Joachims (1999) may fail to yield a solution that is markedly better than the result of inductive inference.</Paragraph> <Paragraph position="1"> For the two intermediate situations, however, transductive inference seems to provide a sizeable performance improvement. Using only 24% of annotated data, transductive learning is able to train a linear kernel SVM that yields approximately the same performance as inductive inference on the full annotated dataset. 
This means that we get comparable performance using only what corresponds to about 30 abstracts, compared to the 122 of the full training set.</Paragraph> </Section> <Section position="4" start_page="5" end_page="5" type="sub_section"> <SectionTitle> 6.4 Fisher kernels </SectionTitle> <Paragraph position="0"> The situation is somewhat different for SVM trained with inductive inference, but using Fisher kernels obtained from a model of the entire (non-annotated) dataset. (Performance is not strictly equivalent because SVM and TSVM use the data differently when optimising the ….) (Table 5: scores (in %) using different proportions of annotated data for the following models: standard SVM with linear (lin) and second degree polynomial kernel (d=2); combination of linear kernel and Fisher kernel obtained from a PLSA with 4 classes (lin+FK4) or 8 classes (lin+FK8), and combination of linear and all Fisher kernels obtained from PLSA using 4, 8, 12 and 16 classes (lin+combi).)</Paragraph> <Paragraph position="1"> As the use of Fisher kernels alone was unable to consistently achieve acceptable results, the similarity we used is a combination of the standard linear kernel and the Fisher kernel (a similar solution was advocated by Hofmann (2000)). Table 5 summarises the results obtained using several types of Fisher kernels, depending on how many classes were used in PLSA. FK8 (resp. FK16) indicates the model using 8 (resp. 16) classes, while combi is a combination of the Fisher kernels obtained using 4, 8, 12 and 16 classes. The effect of Fisher kernels is not as clear-cut as that of transductive inference. For fully annotated data, we obtain results that are similar to the standard kernels, although often better than the linear kernel. Results obtained using 1.5% and 6% annotated data seem somewhat inconsistent, with a large improvement for 1.5%, but a marked degradation for 6%, suggesting that in that case, adding labels actually hurts performance. 
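The lin+FK combination can be sketched as an interpolated Gram matrix over the input vectors and the Fisher scores. The equal weighting and the unit-diagonal normalisation below are our own assumptions; the paper leaves the interpolation weight unoptimised and lists tuning it as future work.

```python
import numpy as np

def combined_gram(X, fisher_scores, lam=0.5):
    """Gram matrix of K(x, y) = lam * <x, y> + (1 - lam) * FK(x, y), where FK
    is the identity-approximation Fisher kernel, i.e. a dot product of Fisher
    scores. Each kernel is scaled to unit diagonal first so that neither term
    dominates; this normalisation and lam = 0.5 are our own choices."""
    lin = X @ X.T
    fk = fisher_scores @ fisher_scores.T
    lin = lin / np.sqrt(np.outer(np.diag(lin), np.diag(lin)))
    fk = fk / np.sqrt(np.outer(np.diag(fk), np.diag(fk)))
    return lam * lin + (1.0 - lam) * fk
```

Since it is a convex combination of two positive semi-definite kernels, the result is itself a valid (positive semi-definite) kernel matrix.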
We conjecture that this may be an artifact of the specific annotated set we selected. For 24% annotated data, the Fisher kernel provides results that are in between inductive and transductive inference using standard kernels.</Paragraph> </Section> </Section> <Section position="8" start_page="5" end_page="5" type="metho"> <SectionTitle> 7 Discussion </SectionTitle> <Paragraph position="0"> The results of our experiments are encouraging in that they suggest that both transductive inference and the use of Fisher kernels are potentially effective ways of taking unannotated data into account to improve performance.</Paragraph> <Paragraph position="1"> These experimental results suggest the following remark. Note that Fisher kernels can be implemented by a simple scalar product (linear kernel) between Fisher scores ∇ℓ(x) (equation 3). The question arises naturally as to whether using non-linear kernels may improve results. On the one hand, Fisher kernels are derived from information-geometric arguments (Jaakkola and Haussler, 1999) which require that the kernel reduces to an inner product of Fisher scores. On the other hand, polynomial and RBF kernels often display better performance than a simple dot product. In order to test this, we have performed experiments using the same features as in section 6.4, but with a second degree polynomial kernel. Overall, results are consistently worse than before, which suggests that the expression of the Fisher kernel as the inner product of Fisher scores is theoretically well-founded and empirically justified.</Paragraph> <Paragraph position="2"> Among possible future work, let us mention the following technical points: 1. Optimising the weight of the contributions of the linear kernel and Fisher kernel, e.g. as K(x, y) = λ⟨x, y⟩ + (1 − λ)FK(x, y), λ ∈ [0, 1].</Paragraph> <Paragraph position="3"> 2. 
Understanding why the Fisher kernel alone (i.e. without interpolation with the linear kernel) is unable to provide a performance boost, despite attractive theoretical properties. In addition, the performance improvements obtained by both transductive inference and Fisher kernels suggest using both in conjunction. To our knowledge, the question of whether this would allow one to "bootstrap" the unlabelled data by using them twice (once for estimating the kernel, once in transductive learning) is still an open research question.</Paragraph> <Paragraph position="4"> Finally, regarding the application that we have targeted, namely entity recognition, the use of additional unlabelled data may help us to overcome the current performance limit on our database. None of the additional experiments conducted internally using probabilistic models and symbolic, rule-based methods have been able to yield F scores higher than 63-64% on the same data. In order to improve on this, we have collected several hundred additional abstracts by querying the MedLine database.</Paragraph> <Paragraph position="5"> After pre-processing, this yields more than a hundred thousand (unlabelled) candidates that we may use with transductive inference and/or Fisher kernels.</Paragraph> </Section> </Paper>