<?xml version="1.0" standalone="yes"?> <Paper uid="W05-1301"> <Title>Weakly Supervised Learning Methods for Improving the Quality of Gene Name Normalization Data</Title> <Section position="4" start_page="2" end_page="3" type="metho"> <SectionTitle> 3 System Description </SectionTitle> <Paragraph position="0"> The baseline version of our system is essentially a reproduction of the system described in [1] with a few modifications. The great appeal of this system is that, being machine learning based, it has no organism-specific aspects hard-coded in; moving to a new organism involves only re-training (assuming there is training data) and setting one or two parameters using a held-out data set or cross-validation. The system is given a set of abstracts (and associated gene identifiers at training time) and a lexicon. The system first proposes candidate phrases based on all possible phrases up to 8 words in length, with some constraints based on part-of-speech. (Footnote 1: Specifically, we excluded phrases that began with verbs, prepositions, adverbs or determiners; we found this constraint did not affect recall while reducing the number of candidate mentions by more than 50%.) Matches against the lexicon are then carried out by performing exact matching, but ignoring case and removing punctuation from both the lexical entries and the candidate mentions. Only maximal matching strings were used - i.e., sub-strings of matching strings that match the same id are removed. The resulting set of matches of candidate mentions with their matched identifiers results in a set of instances. These instances are then provided with a label - &quot;yes&quot; or &quot;no&quot; - depending on whether the match in the abstract is correct (i.e., whether the gene identifier associated with the match was annotated with the abstract). These instances are used to train a binary maximum entropy classifier that ultimately decides if a match is valid or not.</Paragraph> <Paragraph position="1"> Maximum entropy classifiers model the conditional probability of a class, y (in our setting, y=&quot;yes&quot; or y=&quot;no&quot;), given some observed data, x. The conditional probability has the following form in the binary case (where it is equivalent to logistic regression):</Paragraph> <Paragraph position="2"> $P(y \mid x) = \frac{1}{Z(x)} \exp\big( \sum_i \lambda_i f_i(x, y) \big)$ </Paragraph> <Paragraph position="3"> where $Z(x)$ is the normalization function, the $\lambda_i$ are real-valued model parameters and the $f_i$ are arbitrary real-valued feature functions.</Paragraph> <Paragraph position="4"> One advantage of maximum entropy classifiers is the freedom to use large numbers of statistically non-independent features. We used a number of different feature types in the classifier, including features for words within the phrase; an example is shown in Figure 1. [Figure 1 and the accompanying table detailing the feature types are not preserved in this extraction.]</Paragraph> <Paragraph position="5"> In addition to these features, we created additional features constituting conjunctions of some of these &quot;atomic&quot; features. For example, the conjoined feature Phrase=TOR AND GENEID=MGI104856 is &quot;on&quot; when both conjuncts are true of the instance.</Paragraph> <Paragraph position="6"> To assign identifiers to a new abstract, a set of features is extracted for each matching phrase and gene id pair, just as in training (this constitutes an instance), and presented to the classifier for classification. As the classifier returns a probability for each instance, the gene id associated with the instance with the highest probability is returned as a gene id associated with the abstract, except when that probability is less than some threshold $T$, $0 \le T \le 1$, in which case no gene id is returned for that phrase.</Paragraph>
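[Illustrative sketch - not part of the original paper. The matching and classification steps above can be approximated with scikit-learn's L2-regularized logistic regression standing in for the authors' maximum entropy classifier (the L2 penalty plays the role of the Gaussian prior described below). All function names, feature templates, and the 0/1 label encoding are assumptions for illustration.]

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def featurize(phrase, gene_id):
    """Sparse feature dict for one (phrase, gene id) instance,
    including a conjunction of the two atomic features."""
    feats = {
        "Phrase=" + phrase: 1.0,
        "GENEID=" + gene_id: 1.0,
        # Conjoined feature: "on" only when both conjuncts hold.
        "Phrase=%s_AND_GENEID=%s" % (phrase, gene_id): 1.0,
    }
    for tok in phrase.split():
        feats["PhraseWord=" + tok] = 1.0
    return feats

def train(pairs, labels, c=1.0):
    """Train on (phrase, gene_id) pairs with labels 1 ("yes") / 0 ("no").
    C controls the strength of the L2 penalty (~ Gaussian prior)."""
    vec = DictVectorizer()
    X = vec.fit_transform(featurize(p, g) for p, g in pairs)
    clf = LogisticRegression(C=c)
    clf.fit(X, labels)
    return vec, clf

def assign_ids(vec, clf, candidates, threshold=0.5):
    """Return the highest-probability gene id per phrase, or nothing
    when the best probability falls below the threshold T."""
    X = vec.transform(featurize(p, g) for p, g in candidates)
    p_yes = clf.predict_proba(X)[:, 1]  # P(y="yes" | x)
    best = {}
    for (phrase, gene_id), p in zip(candidates, p_yes):
        if p >= threshold and p > best.get(phrase, ("", -1.0))[1]:
            best[phrase] = (gene_id, p)
    return {ph: gid for ph, (gid, _) in best.items()}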
<Paragraph position="7"> Training the model involves finding the parameters that maximize the log-likelihood of the training data. As is standard with maximum entropy models, we employ a Gaussian prior over the parameters, which biases them towards zero to reduce overfitting.</Paragraph> <Paragraph position="8"> Our model thus has just two parameters which need to be tuned to different datasets (i.e., different organisms): the Gaussian prior and the threshold, T. Tuning the parameters can be done on a held-out set (we used the Task 1B development data) or by cross-validation.</Paragraph> </Section> <Section position="5" start_page="3" end_page="4" type="metho"> <SectionTitle> 4 Weakly Supervised Methods for Re-labeling Noisy Normalization Data </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="3" end_page="4" type="sub_section"> <Paragraph position="0"> The primary contribution of this work is a novel method for re-labeling the noisy training instances within the Task 1B training data sets. Recall that the Task 1B training data were constructed by matching phrases in the abstract against the synonym lists for the gene ids curated for the full-text article for which the abstract was written. In many cases, mentions of the gene in the abstract do not appear exactly as they do in the synonym list, which would result in a missed association of that gene id with the abstract. In other cases, the database curators simply did not curate a gene id mentioned in the abstract because it was not relevant to their particular line of interest.</Paragraph> <Paragraph position="1"> Our method for re-labeling potentially mislabeled instances draws upon existing methods for weakly supervised learning. We describe here the generic algorithm and include specific variations below in the experimental setup.</Paragraph> <Paragraph position="2"> The first step is to partition the training data into two disjoint sets, D1 and D2. (Footnote 2: Note that instances in D1 and D2 are also derived from disjoint sets of abstracts. This helps ensure that very similar instances are unlikely to appear in different partitions.) We then create two instances of the weakly supervised learning problem, where in one instance D1 is viewed as the labeled training data and D2 is viewed as the unlabeled data, and in the other instance their roles are reversed. Re-labeling of instances in D1 is carried out by a classifier or ensemble of classifiers, C2, trained on D2. Similarly, instances in D2 are re-labeled by C1, trained on D1. Those instances for which the classifier assigns high confidence (i.e., for which $P(y = \text{yes} \mid x)$ is high) but for which the existing label disagrees with the classifier are candidates for re-labeling. Figure 2 diagrams this process: thin arrows indicate the training of a classifier from some set of data, while block arrows describe the data flow and re-labeling of instances.</Paragraph> <Paragraph position="3"> One assumption behind this approach is that not all of the errors in the training data labels are correlated. As such, we would expect that, for a particular mislabeled instance in D1, there may be similar positive instances in D2 that provide evidence for re-labeling the mislabeled instance in D1.</Paragraph>
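[Illustrative sketch - not part of the original paper. The generic cross-partition re-labeling loop can be written as follows, reusing the hypothetical train() and featurize() helpers from the sketch in Section 3; the Instance type and the confidence cutoff are likewise assumptions.]

from collections import namedtuple

# A labeled training instance; label is 1 ("yes") or 0 ("no").
Instance = namedtuple("Instance", ["phrase", "gene_id", "label"])

def relabel_candidates(d1, d2, confidence_cutoff=0.9):
    """Train a classifier on each partition, score the opposite
    partition, and collect negative instances that the classifier
    confidently believes are positive."""
    candidates = []
    for train_part, test_part in ((d1, d2), (d2, d1)):
        pairs = [(i.phrase, i.gene_id) for i in train_part]
        labels = [i.label for i in train_part]
        vec, clf = train(pairs, labels)
        X = vec.transform(featurize(i.phrase, i.gene_id) for i in test_part)
        p_yes = clf.predict_proba(X)[:, 1]
        for inst, p in zip(test_part, p_yes):
            # High-confidence disagreement with the existing label.
            if inst.label == 0 and p >= confidence_cutoff:
                candidates.append((inst, p))
    # Rank by confidence; the top k become re-labeling candidates.
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return candidates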
<Paragraph position="4"> Initial experiments using this approach met with failure or negligible gains in performance.</Paragraph> <Paragraph position="5"> We initially attributed this to too many correlated errors. Detailed error analysis revealed, however, that a significant portion of the training instances being re-labeled were derived from matches against the lexicon that were not, in fact, references to genes - i.e., they were more common English words that happened to appear in the synonym lists and to which the classifier mistakenly assigned high probability.</Paragraph> <Paragraph position="6"> Our solution to this problem was to impose a constraint on the instances to be re-labeled: the phrase in the abstract associated with the instance is required to have been tagged as a gene name by a gene name tagger, in addition to the instance receiving a high probability from the re-labeling classifier. Use of a gene name tagger introduces a check against the classifier (trained on the noisy training data) and helps to reduce the chance of introducing false positives into the labeled data.</Paragraph> <Paragraph position="7"> We trained our entity tagger, Carafe, on the Genia corpus [10] together with the BioCreative Task 1A gene name training corpus. Not all of the entity types annotated in the Genia corpus are genes, however; we therefore used an appropriate subset of the entity types found in the corpus. Carafe is based on Conditional Random Fields (CRFs) [11] which, for this task, employed a similar set of features to the CRF described in [12].</Paragraph> </Section> </Section>
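[Illustrative sketch - not part of the original paper. The gene-tagger constraint described above amounts to a filter over re-labeling candidates; is_gene_name is a hypothetical predicate wrapping a trained gene name tagger such as Carafe.]

def tagger_filtered(candidates, is_gene_name):
    """Keep only re-labeling candidates whose phrase the gene name
    tagger also marked as a gene mention, as a check against the
    classifier trained on noisy labels."""
    return [(inst, conf) for inst, conf in candidates
            if is_gene_name(inst.phrase)]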
<Section position="6" start_page="4" end_page="5" type="metho"> <SectionTitle> 5 Experiments and Results </SectionTitle> <Paragraph position="0"> The main goal of our experiments was to demonstrate the benefits of re-labeling potentially noisy training instances in the Task 1B training data. In this work we focus the weakly supervised re-labeling experiments on the mouse data set. In the mouse data there is a strong bias towards false negatives in the training data - i.e., many training instances have a negative label but should have a positive one. Our reasons for focusing on this data are twofold: 1) we believe this situation is likely to be more common in practice, since an organism may have impoverished synonym lists or &quot;gaps&quot; in the curated databases, and 2) the experiments and resulting analyses are made clearer by focusing on re-labeling instances in one direction only (i.e., from negative to positive).</Paragraph> <Paragraph position="2"> In this section, we first describe an initial experiment comparing the baseline system (described above) using the original training data with a version trained on an augmented data set where labels were changed based on a simple heuristic. We then describe our main body of experiments using various weakly supervised learning methods for re-labeling the data. Finally, we report our overall scores on the evaluation data for all three organisms using the best system configurations derived from the development test data.</Paragraph> <Section position="1" start_page="4" end_page="4" type="sub_section"> <SectionTitle> 5.1 Data and Methodology </SectionTitle> <Paragraph position="0"> We used the BioCreative Task 1B data for all our experiments. For the three data sets, there were 5000 abstracts of training data and 250, 110 and 108 abstracts of development test data for mouse, fly and yeast, respectively. The final evaluation data consisted of 250 abstracts for each organism.</Paragraph> <Paragraph position="1"> In the training data, the ratios of positive to negative instances are as follows: 40279/111967 for mouse, 75677/493959 for fly and 25108/3856 for yeast. The number of features in each trained model was 322110 for mouse, 881398 for fly and 108948 for yeast.</Paragraph> <Paragraph position="2"> Given a classifier able to rank all the test instances (in our case, the ranks derive from the probabilities output by the maximum entropy classifier), we return only the top n gene identifiers, where n is the number of correct identifiers in the development test data - this results in a balanced F-measure score. We use this metric for all experiments on the development test data, as it allows better comparison between systems by factoring out the need to tune the threshold.</Paragraph> <Paragraph position="3"> On the evaluation data, we do not know n. The system returns a number of identifiers based on the threshold, T. For these experiments, we set T on the development test data and chose three appropriate values for three different evaluation &quot;submissions&quot;.</Paragraph> </Section> <Section position="2" start_page="4" end_page="5" type="sub_section"> <SectionTitle> 5.2 Experiment Set 1: Effect of Match-Based Re-labeling </SectionTitle> <Paragraph position="0"> Our first set of experiments uses the baseline system described earlier. We compare the results of this system using the Task 1B training data &quot;as provided&quot; with the results obtained by re-labeling some of the negative instances provided to the classifier as positive instances. We re-labeled as positive any instance that matched a gene identifier associated with the abstract, regardless of the (potentially incorrect) label associated with the identifier. The Task 1B dataset creators marked an identifier &quot;no&quot; if an exact lexicon match was not found in the abstract. As our system's matching phase is a bit different (i.e., we remove punctuation and ignore case), this amounts to re-labeling the training data using this looser criterion. The results of this match-based re-labeling are shown in Table 1.</Paragraph> </Section> <Section position="3" start_page="5" end_page="5" type="sub_section"> <SectionTitle> 5.3 Experiment Set 2: Effect of Weakly Supervised Re-labeling </SectionTitle> <Paragraph position="0"> In our next set of experiments we tested a number of different weakly supervised learning configurations. These different methods simply amount to different rankings of the instances to re-label (based on confidence and the gene name tags). The basic algorithm (outlined in Figure 1) remains the same in all cases. Specifically, we investigated three methods for ranking the instances to re-label: 1) naive self-training, 2) self-training with bagging, and 3) co-training.</Paragraph> <Paragraph position="1"> Naive self-training consisted of training a single maximum entropy classifier with the full feature set on each partition and using it to re-label instances from the other partition based on confidence.</Paragraph> <Paragraph position="2"> Self-training with bagging followed the same idea but used an ensemble: for each partition, we trained 20 separate classifiers on random subsets of the training data using the full feature set. The confidence assigned to a test instance was then defined as the product of the confidences of the individual classifiers.</Paragraph>
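[Illustrative sketch - not part of the original paper. The bagged confidence estimate can be computed as below, reusing the hypothetical train() and featurize() helpers from the earlier sketches; the subset size is an assumption, as the paper specifies only random subsets.]

import math
import random

def bagged_confidences(train_part, test_instances,
                       n_classifiers=20, subset_frac=0.7):
    """Train 20 classifiers on random subsets of one partition and
    score each test instance with the product of the individual
    classifiers' P(y="yes" | x)."""
    models = []
    for _ in range(n_classifiers):
        subset = random.sample(train_part, int(subset_frac * len(train_part)))
        pairs = [(i.phrase, i.gene_id) for i in subset]
        labels = [i.label for i in subset]
        models.append(train(pairs, labels))
    confidences = []
    for inst in test_instances:
        feats = featurize(inst.phrase, inst.gene_id)
        # Product of per-classifier probabilities, computed in log
        # space for numerical stability.
        log_conf = sum(math.log(clf.predict_proba(vec.transform([feats]))[0, 1])
                       for vec, clf in models)
        confidences.append(math.exp(log_conf))
    return confidences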
<Paragraph position="3"> Co-training involved training two classifiers for each partition with a feature split. We split the features into context-based features, such as the surrounding words and the number of gene ids matching the current phrase, and lexically-based features, which included the phrase itself, affixes, the number of tokens in the phrase, etc. We computed the aggregated confidence for each instance as the product of the confidences assigned by the resulting context-based and lexically-based classifiers. We ran experiments for each of these three options both with and without the gene tagger. The systems that included the gene tagger ranked all instances derived from tagged phrases above all instances derived from phrases that were not tagged, regardless of the classifier confidence.</Paragraph> <Paragraph position="4"> A final experimental condition we explored was comparing batch re-labeling vs. incremental re-labeling. Batch re-labeling involved training the classifiers once and re-labeling all k instances using the same classifiers. Incremental re-labeling consisted of iteratively re-labeling n instances over k/n epochs, where the classifiers were re-trained on each epoch with the newly re-labeled training data.</Paragraph> <Paragraph position="5"> Interestingly, incremental re-labeling did not perform better than batch re-labeling in our experiments. All results reported here, therefore, used batch re-labeling.</Paragraph> <Paragraph position="6"> After the training data was re-labeled, a single maximum entropy classifier was trained on the entire (now re-labeled) training set. The resulting classifier was then applied to the development set in the manner described in Section 3.</Paragraph> <Paragraph position="7"> [Table 2 caption: Maximum and average balanced F-measure scores on the mouse data set for each of the six system configurations across all values of k, the number of instances re-labeled. The numbers in parentheses indicate the value of k at which the maximum was achieved.]</Paragraph> <Paragraph position="8"> We tested each of these six configurations for different values of k, where k is the total number of instances re-labeled. Table 2 highlights the maximum and average balanced F-measure scores across all values of k for the different system configurations. Both the maximum and averaged scores appear noticeably higher when constraining the instances to re-label with the tagger. The three weakly supervised methods perform comparably, with bagging performing slightly better.</Paragraph> </Section> </Section> </Paper>