<?xml version="1.0" standalone="yes"?> <Paper uid="P06-1044"> <Title>Automatic Classification of Verbs in Biomedical Texts</Title> <Section position="4" start_page="345" end_page="346" type="metho"> <SectionTitle> 2 The Biomedical Domain and Our Task </SectionTitle> <Paragraph position="0"> Recent years have seen a massive growth in the scientific literature in the domain of biomedicine.</Paragraph> <Paragraph position="1"> For example, the MEDLINE database, which currently contains around 16M references to journal articles, expands with 0.5M new references each year. Because future research in the biomedical sciences depends on making use of all this existing knowledge, there is a strong need for the development of tools which help researchers make use of published experimental results.</Paragraph> <Paragraph position="2"> In recent years, major progress has been made on information retrieval and on the extraction of specific relations, e.g. between proteins and cell types, from biomedical texts (Hirschman et al., 2002). Other tasks, such as the extraction of factual information, remain a bigger challenge. This is partly due to the challenging nature of biomedical texts: they are complex both in terms of syntax and semantics, containing complex nominals, modal subordination, anaphoric links, etc.</Paragraph> <Paragraph position="3"> Researchers have recently begun to use deeper NLP techniques (e.g. statistical parsing) in the domain because they are not challenged by the complex structures to the same extent as shallow techniques (e.g. regular expression patterns) are (Lease and Charniak, 2005). However, deeper techniques require richer domain-specific lexical information for optimal performance than is provided by existing lexicons (e.g. UMLS). This is particularly important for verbs, which are central to the structure and meaning of sentences.</Paragraph> <Paragraph position="4"> Where the lexical information is absent, lexical classes can compensate for it or aid in obtaining it in the ways described in section 1. Consider e.g. the INDICATE and ACTIVATE verb classes in Figure 1. They capture the fact that their members are similar in terms of syntax and semantics: they have similar SCFs and selectional preferences, and they can be used to make similar statements which describe similar events. Such information can be used to build a richer lexicon capable of supporting key tasks such as parsing, predicate-argument identification, event extraction and the identification of biomedical (e.g. interaction) relations. While an abundance of work has been conducted on the semantic classification of biomedical terms and nouns, less work has been done on the (manual or automatic) semantic classification of verbs in the biomedical domain (Friedman et al., 2002; Hatzivassiloglou and Weng, 2002; Spasic et al., 2005). No previous work exists in this domain on the type of lexical (i.e. syntactic-semantic) verb classification this paper focuses on.</Paragraph> <Paragraph position="5"> To get an initial idea about the differences between our target classification and a general language classification, we examined the extent to which individual verbs and their frequencies differ in biomedical and general language texts. We created a corpus of 2230 biomedical journal articles (see section 4.1 for details) and compared the distribution of verbs in this corpus with that in the British National Corpus (BNC) (Leech, 1992).
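Such a comparison is straightforward to replicate with standard tools. A minimal sketch, using hypothetical verb counts (the actual corpora and counts are not reproduced here):

```python
# Sketch: comparing verb distributions across two corpora via Spearman rank
# correlation and Kullback-Leibler distance. The counts are invented for
# illustration; only verbs occurring in both corpora are compared.
import math
from scipy.stats import spearmanr

bio_counts = {"use": 9500, "show": 7200, "bind": 5100, "indicate": 4800, "express": 4200}
bnc_counts = {"use": 8000, "show": 6500, "bind": 300, "indicate": 900, "express": 700}

shared = sorted(set(bio_counts) & set(bnc_counts))

# Spearman rank correlation over the shared verbs
rho, _ = spearmanr([bio_counts[v] for v in shared],
                   [bnc_counts[v] for v in shared])

def normalise(counts):
    total = sum(counts[v] for v in shared)
    return [counts[v] / total for v in shared]

def kl(p, q):
    # base-2 KL distance; the choice of base only rescales the value
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

print(rho, kl(normalise(bio_counts), normalise(bnc_counts)))
```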
We calculated the Spearman rank correlation between the 1165 verbs which occurred in both corpora.</Paragraph> <Paragraph position="6"> The result was only a weak correlation: 0.37 ± 0.03. When the scope was restricted to the 100 most frequent verbs in the biomedical data, the correlation was 0.12 ± 0.10, which is only 1.2σ away from zero. The dissimilarity between the distributions is further indicated by the Kullback-Leibler distance of 0.97. Table 1 illustrates some of these big differences by showing the 15 most frequent verbs in the two corpora.</Paragraph> </Section> <Section position="5" start_page="346" end_page="347" type="metho"> <SectionTitle> 3 Approach </SectionTitle> <Paragraph position="0"> We extended the system of Korhonen et al. (2003) with additional clustering techniques (introduced in sections 3.2.2 and 3.2.4) and used it to obtain the classification for the biomedical domain.</Paragraph> <Paragraph position="1"> The system (i) extracts features from corpus data and (ii) clusters them using five different methods. These steps are described in the following two sections, respectively.</Paragraph> <Section position="1" start_page="346" end_page="346" type="sub_section"> <SectionTitle> 3.1 Feature Extraction </SectionTitle> <Paragraph position="0"> We employ as features the distributions of SCFs specific to given verbs. We extract them from corpus data using the comprehensive subcategorization acquisition system of Briscoe and Carroll (1997) (Korhonen, 2002). The system incorporates RASP, a domain-independent robust statistical parser (Briscoe and Carroll, 2002), which tags, lemmatizes and parses data, yielding complete though shallow parses, and a SCF classifier which incorporates an extensive inventory of 163 verbal SCFs. The SCFs abstract over specific lexically-governed particles and prepositions and specific predicate selectional preferences. In our work, we parameterized two high frequency SCFs for prepositions (the PP and NP + PP SCFs). No filtering of potentially noisy SCFs was done, so as to provide clustering with as much information as possible.</Paragraph> </Section> <Section position="2" start_page="346" end_page="347" type="sub_section"> <SectionTitle> 3.2 Classification </SectionTitle> <Paragraph position="0"> The SCF frequency distributions constitute the input data to automatic classification. We experiment with five clustering methods: the simple hard nearest neighbours method and four probabilistic methods - two variants of Probabilistic Latent Semantic Analysis and two information theoretic methods (the Information Bottleneck and the Information Distortion).</Paragraph> <Paragraph position="1"> The first method collects the nearest neighbours (NN) of each verb. It (i) calculates the Jensen-Shannon divergence (JS) between the SCF distributions of each pair of verbs, (ii) connects each verb with the most similar other verb, and finally (iii) finds all the connected components. The NN method is very simple. It outputs only one clustering configuration and therefore does not allow examining different cluster granularities.</Paragraph>
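The NN procedure is simple enough to state in a few lines of code. A minimal sketch, with invented three-frame SCF distributions (the real feature space has 233 preposition-specific SCF types):

```python
# Sketch of the NN method: JS divergence between SCF distributions, a link
# from each verb to its most similar verb, then connected components.
# The toy SCF distributions are invented for illustration.
import math

def js_divergence(p, q):
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(a, b):
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

scf_dist = {
    "activate": [0.70, 0.20, 0.10],
    "inhibit":  [0.65, 0.25, 0.10],
    "suggest":  [0.10, 0.30, 0.60],
    "indicate": [0.15, 0.25, 0.60],
}

# (ii) connect each verb with the most similar other verb
links = set()
for v in scf_dist:
    nearest = min((u for u in scf_dist if u != v),
                  key=lambda u: js_divergence(scf_dist[v], scf_dist[u]))
    links.add(frozenset((v, nearest)))

# (iii) find the connected components (naive union-find)
parent = {v: v for v in scf_dist}
def find(v):
    while parent[v] != v:
        v = parent[v]
    return v
for a, b in (tuple(link) for link in links):
    parent[find(a)] = find(b)

clusters = {}
for v in scf_dist:
    clusters.setdefault(find(v), []).append(v)
print(list(clusters.values()))  # e.g. [['activate', 'inhibit'], ['suggest', 'indicate']]
```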
<Paragraph position="2"> Probabilistic Latent Semantic Analysis (PLSA, Hofmann (2001)) assumes a generative model for the data, defined by selecting (i) a verb verb_i, (ii) a semantic class class_k from the distribution p(Classes|verb_i), and (iii) a SCF scf_j from the distribution p(SCFs|class_k). PLSA uses Expectation Maximization (EM) to find the distribution p~(SCFs|Clusters, Verbs) which maximises the likelihood of the observed counts. It does this by minimising the free-energy cost function

F_β = −β · E_p~[log p(Verbs, Classes, SCFs)] − H(p~)

where E_p~ denotes expectation under p~ and H(p~) is the entropy of p~. For β = 1, minimising F_β is equivalent to the standard EM procedure, while for β < 1 the distribution p~ tends to be more evenly spread. We use β = 1 and β = 0.75. We currently "harden" the output and assign each verb to the most probable cluster only. (This made sense in this initial work on biomedical classification; in the future, soft clustering could be used as a means to investigate polysemy.)</Paragraph> <Paragraph position="3"> The Information Bottleneck (Tishby et al., 1999) (IB) is an information-theoretic method which controls the balance between: (i) the loss of information by representing verbs as clusters (I(Clusters;Verbs)), which has to be minimal, and (ii) the relevance of the output clusters for representing the SCF distribution (I(Clusters;SCFs)), which has to be maximal.</Paragraph> <Paragraph position="4"> The balance between these two quantities ensures optimal compression of data through clusters. The trade-off between the two constraints is realized through minimising the cost function

L_IB = I(Clusters;Verbs) − β · I(Clusters;SCFs)

where β is a parameter that balances the constraints. IB takes three inputs: (i) SCF-verb distributions, (ii) the desired number of clusters K, and (iii) the initial value of β. It then looks for the minimal β that decreases L_IB compared to its value with the initial β, using the given K. IB delivers as output the probabilities p(K|V). It also gives an indication of the most informative numbers of output clusters: those for which the relevance information increases more sharply between K−1 and K clusters than between K and K+1.</Paragraph> <Paragraph position="5"> The Information Distortion method (Dimitrov and Miller, 2001) (ID) is otherwise similar to IB, but L_ID differs from L_IB by an additional term that adds a bias towards clusters of similar size:

L_ID = L_IB − H(Clusters) = I(Clusters;Verbs) − H(Clusters) − β · I(Clusters;SCFs)</Paragraph> </Section> </Section>
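To make the two objectives concrete, here is a minimal sketch that scores a hard cluster assignment under both cost functions. The verb-by-SCF counts, the assignment and the value of β are all hypothetical, and the sketch only evaluates the costs rather than performing the search over β described above:

```python
# Sketch: evaluating the IB and ID cost functions for a hard clustering.
# All numbers are hypothetical.
import numpy as np

counts = np.array([[8, 1, 1],   # rows: verbs, columns: SCFs
                   [7, 2, 1],
                   [1, 2, 7],
                   [1, 1, 8]], dtype=float)
assign = np.array([0, 0, 1, 1])  # hard cluster of each verb
beta = 5.0                       # hypothetical trade-off parameter

p_vs = counts / counts.sum()     # joint p(Verb, SCF)

# p(Cluster, SCF) under the hard assignment
n_clusters = assign.max() + 1
p_ks = np.zeros((n_clusters, counts.shape[1]))
for v, k in enumerate(assign):
    p_ks[k] += p_vs[v]
p_k = p_ks.sum(axis=1)

def mutual_info(joint):
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return (joint[nz] * np.log2(joint[nz] / (px @ py)[nz])).sum()

# For a hard assignment, I(Clusters;Verbs) reduces to H(Clusters).
h_clusters = -(p_k * np.log2(p_k)).sum()
i_cs = mutual_info(p_ks)         # I(Clusters;SCFs)

L_IB = h_clusters - beta * i_cs  # compress verbs, keep SCF information
L_ID = L_IB - h_clusters         # the -H(Clusters) term favours even sizes
print(L_IB, L_ID)
```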
<Section position="6" start_page="347" end_page="350" type="metho"> <SectionTitle> 4 Experimental Evaluation </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="347" end_page="347" type="sub_section"> <SectionTitle> 4.1 Data </SectionTitle> <Paragraph position="0"> We downloaded the data for our experiment from the MEDLINE database, from three of the 10 leading journals in biomedicine: 1) Genes & Development (molecular biology, molecular genetics), 2) Journal of Biological Chemistry (biochemistry and molecular biology) and 3) Journal of Cell Biology (cellular structure and function). 2230 full-text articles from the years 2003-2004 were used. The data included 11.5M words and 323,307 sentences in total. 192 medium to high frequency verbs (with a minimum of 300 occurrences in the data) were selected for experimentation. (230 verbs were employed initially, but 38 were dropped later so that each coarse-grained class would have a minimum of 2 members in the gold standard.) This test set was big enough to produce a useful classification but small enough to enable thorough evaluation in this first attempt to classify verbs in the biomedical domain.</Paragraph> </Section> <Section position="2" start_page="347" end_page="347" type="sub_section"> <SectionTitle> 4.2 Processing the Data </SectionTitle> <Paragraph position="0"> The data was first processed using the feature extraction module. 233 (preposition-specific) SCF types appeared in the resulting lexicon, 36 per verb on average. The classification module was then applied. NN produced K_nn = 42 clusters. From the other methods we requested K = 2 to 60 clusters. We chose for evaluation the outputs corresponding to the most informative values of K: 20, 33 and 53 for IB, and 17, 33 and 53 for ID.</Paragraph> </Section>
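The output of the feature extraction step is simply a relative frequency distribution over the SCF inventory for each verb. A minimal sketch with hypothetical frame names and observations (the real inventory contains 233 preposition-specific SCF types):

```python
# Sketch: turning raw (verb, SCF) observations from the parser and SCF
# classifier into the per-verb frequency distributions used for clustering.
# Frame names and observations are hypothetical.
from collections import Counter, defaultdict

observations = [
    ("activate", "NP"), ("activate", "NP"), ("activate", "NP_PP-in"),
    ("bind", "NP_PP-to"), ("bind", "NP_PP-to"), ("bind", "PP-to"),
]

counts = defaultdict(Counter)
for verb, scf in observations:
    counts[verb][scf] += 1

scf_inventory = sorted({scf for _, scf in observations})

def scf_distribution(verb):
    """Relative frequencies over the full inventory (no SCF filtering)."""
    total = sum(counts[verb].values())
    return [counts[verb][scf] / total for scf in scf_inventory]

for verb in counts:
    print(verb, scf_distribution(verb))
```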
<Section position="3" start_page="347" end_page="348" type="sub_section"> <SectionTitle> 4.3 Gold Standard </SectionTitle> <Paragraph position="0"> Because no target lexical classification was available for the biomedical domain, human experts (4 domain experts and 2 linguists) were used to create the gold standard. They were asked to examine whether the test verbs similar in terms of their syntactic properties (i.e. verbs with similar SCF distributions) are also similar in terms of semantics (i.e. whether they share a common meaning). Where this was the case, a verb class was identified and named.</Paragraph> <Paragraph position="1"> The domain experts examined the 116 verbs whose analysis required domain knowledge (e.g. activate, solubilize, harvest), while the linguists analysed the remaining 76 general or scientific text verbs (e.g. demonstrate, hypothesize, appear). The linguists used Levin (1993) classes as gold standard classes whenever possible and created novel ones when needed. The domain experts used two purely semantic classifications of biomedical verbs (Friedman et al., 2002; Spasic et al., 2005) as a starting point where this was possible, i.e. where they included our test verbs and also captured their relevant senses. (Purely semantic classes tend to be finer-grained than lexical classes and are not necessarily syntactic in nature; only these two classifications were found to be similar enough to our target classification to provide a useful starting point. Section 5 includes a summary of the similarities and differences between our gold standard and these other classifications.)</Paragraph> <Paragraph position="2"> The experts created a 3-level gold standard which includes both broad and finer-grained classes. Only those classes and memberships were included which all the experts (in the two teams) agreed on. (Experts were allowed to discuss the problematic cases to obtain maximal accuracy; hence no inter-annotator agreement is reported.) The resulting gold standard, including 16, 34 and 50 classes, is illustrated in Table 2 with 1-2 example verbs per class. The table indicates which classes were created by domain experts (BIO) and which by linguists (GEN). Each class was associated with 1-30 member verbs; a minimum of 2 member verbs was required at the coarser-grained levels of 16 and 34 classes. The total number of verbs is indicated in the table (e.g. 10 for the PERFORM class).</Paragraph> <Paragraph position="3"> Table 2: The 3-level gold standard, with a few example verbs per class.

1 Have an effect on activity (BIO/29)
  1.1 Activate / Inactivate
    1.1.1 Change activity: activate, inhibit
    1.1.2 Suppress: suppress, repress
    1.1.3 Stimulate: stimulate
    1.1.4 Inactivate: delay, diminish
  1.2 Affect
    1.2.1 Modulate: stabilize, modulate
    1.2.2 Regulate: control, support
  1.3 Increase / decrease: increase, decrease
  1.4 Modify: modify, catalyze
2 Biochemical events (BIO/12)
  2.1 Express: express, overexpress
  2.2 Modification
    2.2.1 Biochemical modification: dephosphorylate, phosphorylate
    2.2.2 Cleave: cleave
  2.3 Interact: react, interfere
3 Removal (BIO/6)
  3.1 Omit: displace, deplete
  3.2 Subtract: draw, dissect
4 Experimental Procedures (BIO/30)
  4.1 Prepare
    4.1.1 Wash: wash, rinse
    4.1.2 Mix: mix
    4.1.3 Label: stain, immunoblot
    4.1.4 Incubate: preincubate, incubate
    4.1.5 Elute: elute
  4.2 Precipitate: coprecipitate, coimmunoprecipitate
  4.3 Solubilize: solubilize, lyse
  4.4 Dissolve: homogenize, dissolve
  4.5 Place: load, mount
5 Process (BIO/5): linearize, overlap
6 Transfect (BIO/4): inject, microinject
7 Collect (BIO/6)
  7.1 Collect: harvest, select
  7.2 Process: centrifuge, recover
8 Physical Relation Between Molecules (BIO/20)
  8.1 Binding: bind, attach
  8.2 Translocate and Segregate
    8.2.1 Translocate: shift, switch
    8.2.2 Segregate: segregate, export
  8.3 Transmit
    8.3.1 Transport: deliver, transmit
    8.3.2 Link: connect, map
9 Report (GEN/30)
  9.1 Investigate
    9.1.1 Examine: evaluate, analyze
    9.1.2 Establish: test, investigate
    9.1.3 Confirm: verify, determine
  9.2 Suggest
    9.2.1 Presentational: hypothesize, conclude
    9.2.2 Cognitive: consider, believe
  9.3 Indicate: demonstrate, imply
10 Perform (GEN/10)
  10.1 Quantify
    10.1.1 Quantitate: quantify, measure
    10.1.2 Calculate: calculate, record
    10.1.3 Conduct: perform, conduct
  10.2 Score: score, count
11 Release (BIO/4): detach, dissociate
12 Use (GEN/4): utilize, employ
13 Include (GEN/11)
  13.1 Encompass: encompass, span
  13.2 Include: contain, carry
14 Call (GEN/3): name, designate
15 Move (GEN/12)
  15.1 Proceed: progress, proceed
  15.2 Emerge: arise, emerge
16 Appear (GEN/6): appear, occur</Paragraph> </Section> <Section position="4" start_page="348" end_page="348" type="sub_section"> <SectionTitle> 4.4 Measures </SectionTitle> <Paragraph position="0"> The clusters were evaluated against the gold standard using measures which are applicable to all the classification methods and which deliver a numerical value that is easy to interpret.</Paragraph> <Paragraph position="1"> The first measure, the adjusted pairwise precision (APP), evaluates clusters in terms of verb pairs:

APP = (1/|K|) Σ_i [num. of correct pairs in k_i / num. of pairs in k_i] · [(|k_i| − 1) / (|k_i| + 1)]

APP is the average proportion of all within-cluster pairs that are correctly co-assigned. Multiplied by a factor that increases with cluster size, it compensates for a bias towards small clusters.</Paragraph> <Paragraph position="2"> The second measure is modified purity (mPUR), a global measure which evaluates the mean precision of clusters. Each cluster is associated with its prevalent class. The number of verbs in a cluster k that take this class is denoted by n_prevalent(k). Verbs that do not take it are considered as errors. Clusters where n_prevalent(k) = 1 are disregarded so as not to introduce a bias towards singletons:

mPUR = Σ_{k: n_prevalent(k) > 1} n_prevalent(k) / number of verbs

The third measure is the weighted class accuracy (ACC), the proportion of members of the dominant clusters DOM-CLUST_i within all classes c_i:

ACC = Σ_i |verbs in DOM-CLUST_i| / number of verbs

mPUR can be seen to measure the precision of clusters and ACC the recall. We define an F measure as the harmonic mean of mPUR and ACC:

F = 2 · mPUR · ACC / (mPUR + ACC)</Paragraph>
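Written directly from these definitions, the three measures (and F) are only a few lines each. A sketch for hard clusterings, with toy clusters and gold classes:

```python
# Sketch of the evaluation measures above. Clusters and gold classes are
# toy data; gold[v] is the gold-standard class of verb v.
from itertools import combinations

gold = {"activate": "A", "inhibit": "A", "suggest": "B",
        "indicate": "B", "bind": "C"}
clusters = [["activate", "inhibit"], ["suggest", "indicate", "bind"]]

def app(clusters):
    """Adjusted pairwise precision, averaged over clusters."""
    total = 0.0
    for k in clusters:
        pairs = list(combinations(k, 2))
        if pairs:  # singletons contribute 0 (their size factor is 0)
            correct = sum(gold[a] == gold[b] for a, b in pairs)
            total += correct / len(pairs) * (len(k) - 1) / (len(k) + 1)
    return total / len(clusters)

def mpur(clusters):
    """Modified purity: prevalent-class counts; prevalences of 1 ignored."""
    n_verbs = sum(len(k) for k in clusters)
    prevalent = 0
    for k in clusters:
        top = max(sum(gold[v] == c for v in k) for c in {gold[v] for v in k})
        if top > 1:
            prevalent += top
    return prevalent / n_verbs

def acc(clusters):
    """Weighted class accuracy: members of each class's dominant cluster."""
    total = sum(max(sum(gold[v] == c for v in k) for k in clusters)
                for c in set(gold.values()))
    return total / len(gold)

p, r = mpur(clusters), acc(clusters)
print(app(clusters), p, r, 2 * p * r / (p + r))  # APP, mPUR, ACC, F
```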
The statistical significance of the results is measured by randomisation tests in which verbs are swapped between the clusters and the resulting clusters are evaluated. The swapping is repeated 100 times for each output, and the average av_swaps and the standard deviation s_swaps are measured. The significance is the scaled difference

signif = (result − av_swaps) / s_swaps</Paragraph> </Section> <Section position="5" start_page="348" end_page="349" type="sub_section"> <SectionTitle> 4.5 Results from Quantitative Evaluation </SectionTitle> <Paragraph position="0"> Table 3 shows the performance of the five clustering methods for K = 42 clusters (as produced by the NN method) at the 3 levels of gold standard classification. Although the two PLSA variants (particularly PLSA with β = 0.75) produce a fairly accurate coarse-grained classification, they perform worse than all the other methods at the finer-grained levels of the gold standard, particularly according to the global measures. Being based on pairwise similarities, NN shows mostly better performance than IB and ID on the pairwise measure APP, but the global measures are better for IB and ID. The differences are smaller in mPUR (yet significant: 2σ between NN and IB and 3σ between NN and ID) but more notable in ACC (which is e.g. 8-12% better for IB than for NN). Also the F results suggest that the two information theoretic methods are better overall than the simple NN method.</Paragraph> <Paragraph position="1"> IB and ID also have the advantage (over NN) that they can be used to produce a hierarchical verb classification. Table 4 shows the results for IB and ID for the informative values of K. The bold font indicates the results when the match between the values of K and the number of classes at the particular level of the gold standard is the closest. IB is clearly better than ID at all levels of the gold standard. It yields its best results at the medium level (34 classes) with K = 33: F = 77 and APP = 69 (the results for ID are F = 72 and APP = 65). At the most fine-grained level (50 classes), IB is equally good according to F with K = 33, but APP is 8% lower. Although ID is occasionally better than IB according to APP and mPUR (see e.g. the results for 16 classes with K = 53), this never happens in the case where the correspondence between the number of gold standard classes and the value of K is the closest. In other words, the informative values of K prove really informative for IB. The lower performance of ID seems to be due to its tendency to create evenly sized clusters.</Paragraph> <Paragraph position="2"> All the methods perform significantly better than our random baseline. The significance of the results with respect to two swaps was at the 2σ level, corresponding to a 97% confidence that the results are above random.</Paragraph> </Section>
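The randomisation test itself is easy to replicate. A sketch, assuming a scorer such as the F measure above; the paper does not spell out how many verbs are exchanged per trial, so this version exchanges one pair per trial:

```python
# Sketch of the randomisation significance test: perturb the clustering by
# swapping verbs, re-evaluate, and scale the real result by the spread of
# the perturbed scores. `evaluate` stands for any scorer (e.g. F above).
import random
import statistics

def swap_verbs(clusters, rng):
    """Return a copy of the clustering with two verbs exchanged."""
    positions = [(i, j) for i, k in enumerate(clusters) for j in range(len(k))]
    while True:
        (i1, j1), (i2, j2) = rng.sample(positions, 2)
        if i1 != i2:  # swap between clusters, as described above
            break
    swapped = [list(k) for k in clusters]
    swapped[i1][j1], swapped[i2][j2] = swapped[i2][j2], swapped[i1][j1]
    return swapped

def significance(clusters, evaluate, n_trials=100, seed=0):
    rng = random.Random(seed)
    scores = [evaluate(swap_verbs(clusters, rng)) for _ in range(n_trials)]
    av_swaps = statistics.mean(scores)
    s_swaps = statistics.stdev(scores)
    return (evaluate(clusters) - av_swaps) / s_swaps
```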
<Section position="6" start_page="349" end_page="350" type="sub_section"> <SectionTitle> 4.6 Qualitative Evaluation </SectionTitle> <Paragraph position="0"> We performed further, qualitative analysis of the clusters produced by the best performing method, IB. Consider the following clusters:

A: inject, transfect, microinject, cotransfect (6)
B: harvest, select, collect (7.1); centrifuge, process, recover (7.2)
C: wash, rinse (4.1.1); immunoblot (4.1.3); overlap (5)
D: activate (1.1.1)</Paragraph> <Paragraph position="1"> When looking at coarse-grained outputs, interestingly, a K as low as 8 learned the broad distinction between biomedical and general language verbs (the two verb types appeared only rarely in the same clusters) and produced large semantically meaningful groups of classes (e.g. the coarse-grained classes EXPERIMENTAL PROCEDURES, TRANSFECT and COLLECT were mapped together). K = 12 was sufficient to identify several classes with very particular syntax. One of them was TRANSFECT (see A above), whose members were distinguished easily because of their typical SCFs (e.g. inject/transfect/microinject/cotransfect X with/into Y). On the other hand, even K = 53 could not identify classes with very similar (yet not identical) syntax. These included many semantically similar sub-classes (e.g. the two sub-classes of COLLECT shown in B, whose members take similar NP and PP SCFs). However, a few semantically different verbs also clustered wrongly for this reason, such as the ones exemplified in C. In C, immunoblot (from the LABEL class) is still somewhat related to wash and rinse (the WASH class) because they all belong to the larger EXPERIMENTAL PROCEDURES class, but overlap (from the PROCESS class) shows up in the cluster merely because of syntactic idiosyncrasy.</Paragraph> <Paragraph position="2"> While parser errors caused by the challenging biomedical texts were visible in some SCFs (e.g. looking at a sample of SCFs, some adjunct instances were listed in the argument slots of the frames), the cases where this resulted in incorrect classification were not numerous. (This is partly because the mistakes of the parser are somewhat consistent (similar for similar verbs) and partly because the SCFs gather data from hundreds of corpus instances, many of which are analysed correctly.) One representative singleton resulting from these errors is exemplified in D. Activate appears in relatively complicated sentence structures, which gives rise to incorrect SCFs. For example, "MECs cultured on 2D planar substrates transiently activate MAP kinase in response to EGF, whereas..." gets incorrectly analysed as the SCF NP-NP, while "The effect of the constitutively activated ARF6-Q67L mutant was investigated..." receives the incorrect SCF analysis NP-SCOMP. Most parser errors are caused by unknown domain-specific words and phrases.</Paragraph> </Section> </Section> <Section position="7" start_page="350" end_page="350" type="metho"> <SectionTitle> 5 Discussion </SectionTitle> <Paragraph position="0"> Due to differences in the task and experimental setup, direct comparison of our results with previously published ones is impossible. The closest possible comparison point is Korhonen et al. (2003), which reported 50-59% mPUR and 15-19% APP when using IB to assign 110 polysemous (general language) verbs into 34 classes. Our results are substantially better, although we made no effort to restrict our scope to monosemous verbs (most of our test verbs are polysemous according to WordNet (Miller, 1990), but this is not a fully reliable indication because WordNet is not specific to this domain) and although we focussed on a linguistically challenging domain.</Paragraph> <Paragraph position="1"> It seems that our better result is largely due to the higher uniformity of verb senses in the biomedical domain. We could not investigate this effect systematically because no manually sense-annotated data (or a comprehensive list of verb senses) exists for the domain. However, examination of a number of corpus instances suggests that the use of verbs is fairly conventionalized in our data.
Where verbs show less sense variation, they show less SCF variation, which aids the discovery of verb classes. Korhonen et al. (2003) observed the opposite with general language data.</Paragraph> <Paragraph position="2"> We examined, class by class, the extent to which our domain-specific gold standard differs from the related general classification (Levin, 1993) and from the domain classifications (Spasic et al., 2005; Friedman et al., 2002) (recall that the latter are purely semantic classifications, as no lexical ones were available for biomedicine). 33 of the 50 classes in the gold standard are biomedical. Only 6 of these correspond (fully or mostly) to the semantic classes in the domain classifications. 17 are unrelated to any of the classes in Levin (1993), while 16 bear vague resemblance to them (e.g. our TRANSPORT verbs are also listed under Levin's SEND verbs) but are too different (semantically and syntactically) to be combined.</Paragraph> <Paragraph position="3"> 17 of the 50 classes are general (scientific) classes. 4 of these are absent in Levin (e.g. QUANTITATE). 13 are included in Levin, but 8 of them have a more restricted sense (and fewer members) than the corresponding Levin class. Only the remaining 5 classes are identical (in terms of members and their properties) to Levin classes.</Paragraph> <Paragraph position="4"> These results highlight the importance of building or tuning lexical resources specific to different domains, and demonstrate the usefulness of automatic lexical acquisition for this work.</Paragraph> </Section> </Paper>