<?xml version="1.0" standalone="yes"?> <Paper uid="W05-1302"> <Title>Adaptive String Similarity Metrics for Biomedical Reference Resolution</Title> <Section position="7" start_page="12" end_page="14" type="evalu"> <SectionTitle> 6 Experiments and Results </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="12" end_page="13" type="sub_section"> <SectionTitle> 6.1 Data and Experimental Setup </SectionTitle> <Paragraph position="0"> We used the UMLS MetaThesaurus for all our experiments for three reasons: 1) the UMLS represents a wide-range of important biomedical concepts for many applications and 2) the size of the UMLS (compared with BioCreative Task 1B, for example) promotes statistically significant results as well as sufficient training data 3) the problem of ambiguity (multiple concepts with the same name) is largely absent in the UMLS.</Paragraph> <Paragraph position="1"> The UMLS is a taxonomy of medical and clinical concepts consisting of 1,938,701 lexical entries (phrase strings) where each entry belongs to one (or, in very rarely, more than one) of 887,688 concepts. We prepared the data by first selecting only those lexical entries belonging to a concept containing 12 or more entries. This resulted in a total of 129,463 entries belonging to 7,993 concepts. We then divided this data into a training set of 95,167 entries and test set of 34,296 entries where roughly 70% of the entries for each concept were placed in the training set and 30% in the test set. Thus, the training set and test set both contained some string entries for each of the 7,993 concepts. While restricting the number of entries to 12 or more was somewhat arbitrary, this allowed for at least 7 (70% of 12) entries in the training data for each concept, providing sufficient training data.</Paragraph> <Paragraph position="2"> The task was to assign the correct concept identifier to each of the lexical entries in the test set. This was carried out by finding the most similar string entry in the training data and returning the concept identifier associated with that entry. Since each test instance must be assigned to exactly one concept, our system simply ranked the candidate strings</Paragraph> <Paragraph position="4"> based on the string similarity metric used. We compared the results for different maximum a5 -gram match ratios. Recall that the a5 -gram match mechanism is essentially a filter; higher values correspond to larger candidate pools of strings considered by the string similarity metrics.</Paragraph> <Paragraph position="5"> We used six different string similarity metrics that were applied to the same set of candidate results returned by the a5 -gram matching procedure for each test string. These were TFIDF, Levenstein, q-gram-Best, CRF, SoftTFIDF-Lev and SoftTFIDF-CRF. TFIDF and Levenstein were described earlier. The q-gram-Best metric simply selects the match with the lowest a5 -gram match ratio returned by the a5 -gram match procedure described string similarity metric, with corresponding precision and recall values. The numbers in parentheses indicate the a5 -gram match value for which the highest F-measure was attained.</Paragraph> <Paragraph position="6"> above5. The SoftTFIDF-Lev model is the SoftTFIDF metric described earlier where the secondary metric for similarity between pairs of tokens is the Levenstein distance.</Paragraph> <Paragraph position="7"> The CRF metric is the CRF string similarity model applied to the entire strings. 
<Paragraph position="4"> The CRF metric is the CRF string similarity model applied to the entire strings. This model was trained on pairs of strings that belonged to the same concept in the training data, resulting in 130,504 string pair training instances. The SoftTFIDF-CRF metric is the SoftTFIDF method where the secondary metric is the CRF string similarity model. This CRF model was trained on pairs of tokens (not entire phrases). We derived pairs of tokens by finding the most similar pairs of tokens (similarity was determined here by Levenstein distance) between strings belonging to the same concept in the training data. This resulted in 336,930 string pairs as training instances.</Paragraph>
</Section>
<Section position="2" start_page="13" end_page="14" type="sub_section">
<SectionTitle>6.2 Results</SectionTitle>
<Paragraph position="0"> We computed the precision, recall and F-measure for each of the string similarity metrics across different q-gram match ratios, shown in Fig. 1. Both a precision error and a recall error are introduced when the top-returned concept id is incorrect; only a recall error occurs when no concept id is returned at all, i.e. when the q-gram match procedure returns the empty set of candidate strings. This is more likely to occur for lower q-gram match values and explains the poor recall in those cases. In addition, we computed the mean reciprocal rank of each of the methods. This is computed using the ranked, ordered list of the concepts returned by each method. This scoring method assigns a score of 1/r for each test instance, where r is the position in the ranked list at which the correct concept is found. For example, by returning the correct concept as the 4th element in the ranked list, a method is awarded 1/4 = 0.25. The mean reciprocal rank is just the average score over all the test elements (a short illustrative sketch of this computation appears after the end of this section).</Paragraph>
[Figure: reciprocal rank comparisons for each string similarity metric across different q-gram match ratios.]
<Paragraph position="1"> As can be seen, the SoftTFIDF-CRF string similarity metric out-performs all the other methods on this data set. This approach is robust to both word order variations and character-level differences, the latter with the benefit of being adapted to the domain. Word order is clearly a critical factor in this domain, though the CRF metric, entirely character-based, does surprisingly well - much better than the Levenstein distance. The q-gram-Best metric, being able to handle word order variations and character-level differences, performs fairly well.</Paragraph>
<Paragraph position="2"> The graphs illustrate a tradeoff between efficiency and accuracy (recall). Lower q-gram match ratios return fewer candidates, with correspondingly fewer pairwise string similarities to compute. Precision actually peaks at a q-gram match ratio of around 0.2. Recall levels off even at high q-gram match values for all metrics, indicating that nearly 30% of the test instances are probably too difficult for any string similarity metric. Error analysis indicates that these cases tend to be entries involving synonymous "nicknames". Acquiring such synonyms requires other machinery, e.g., (Yu and Agichtein, 2003).</Paragraph>
</Section>
</Section>
</Paper>
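The reciprocal-rank scoring described in Section 6.2 can be illustrated with a minimal Python sketch. Names are hypothetical, and it assumes (conventionally, and consistent with the recall-error case in the text) a score of 0 when the correct concept never appears in the returned list, e.g. when the q-gram filter returns no candidates.

    # Minimal sketch of mean reciprocal rank as described in Section 6.2.

    def reciprocal_rank(ranked_concepts, correct_concept):
        """1/r where r is the 1-based position of the correct concept,
        or 0.0 if the correct concept is never returned (assumption)."""
        for position, concept in enumerate(ranked_concepts, start=1):
            if concept == correct_concept:
                return 1.0 / position
        return 0.0

    def mean_reciprocal_rank(results):
        """results: iterable of (ranked_concept_list, correct_concept) pairs."""
        scores = [reciprocal_rank(ranked, gold) for ranked, gold in results]
        return sum(scores) / len(scores)

    # A correct concept returned in 4th position scores 1/4 = 0.25,
    # matching the example given in the text.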