<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-1302">
  <Title>Adaptive String Similarity Metrics for Biomedical Reference Resolution</Title>
  <Section position="4" start_page="10" end_page="11" type="metho">
    <SectionTitle>
3 Methods for Computing String Similarity
</SectionTitle>
    <Paragraph position="0"> A central component in the process of normalization or reference resolution is computing string similarity between two strings. Methods for measuring string similarity can generally be broken down into character-based and token-based approaches.</Paragraph>
    <Paragraph position="1"> Character-based approaches typically consist of the edit-distance metric and variants thereof. Edit distance considers the number of edit operations (addition, substitution and deletion) required to transform a string $s_1$ into another string $s_2$. The Levenshtein distance assigns unit cost to all edit operations. Other variations allow arbitrary costs or special costs for starting and continuing a "gap" (i.e., a long sequence of adds or deletes).</Paragraph>
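For illustration, a minimal sketch of the standard unit-cost Levenshtein distance via dynamic programming (not the authors' implementation):

```python
def levenshtein(s1: str, s2: str) -> int:
    """Unit-cost edit distance: additions, deletions and substitutions all cost 1."""
    # prev[j] holds the distance between s1[:i-1] and s2[:j]
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # addition
                            prev[j - 1] + cost))  # substitution (or match)
        prev = curr
    return prev[-1]

# e.g. levenshtein("CDKN1A", "CDKN1") == 1
```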
    <Paragraph position="2"> Token-based approaches include the Jaccard similarity metric and the TF/IDF metric. These methods consider the (possibly weighted) overlap between the tokens of two strings. Hybrid token- and character-based approaches are best represented by SoftTFIDF, which includes not only exact token matches but also close matches (using edit distance, for example). Another approach is to compute the Jaccard similarity (or TF/IDF) between the $n$-grams of the two strings instead of the tokens. See Cohen et al. (2003) for a detailed overview and comparison of some of these methods on different data sets.</Paragraph>
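As a concrete reference point, a sketch of token-level Jaccard similarity and its character $n$-gram variant (illustrative only; TF/IDF weighting and SoftTFIDF are omitted):

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: |A & B| / |A | B|."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def token_jaccard(s1: str, s2: str) -> float:
    return jaccard(set(s1.lower().split()), set(s2.lower().split()))

def ngrams(s: str, n: int = 3) -> set:
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def ngram_jaccard(s1: str, s2: str, n: int = 3) -> float:
    return jaccard(ngrams(s1.lower(), n), ngrams(s2.lower(), n))
```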
    <Paragraph position="3"> [Footnote 3] There are two more description names for the human and mouse entries. The SwissProt database has also associated Gene names with those entries, which are related to some of the possible names that we find in the literature. Those gene names are: CDKN1A, CAP20, CDKN1, CIP1, MDA6, PIC1, SDI1, WAF1, Cdkn1a, Cip1, Waf1. It can be seen that those names are incorporated in the UMLS as protein names.</Paragraph>
    <Paragraph position="4">  Recent work has also focused on automatic methods for adapting these string similarity measures to specific data sets using machine learning. Such approaches include using classifiers to weight various fields for matching database records (Cohen and Richman, 2001). Bilenko and Mooney (2003) present a generative Hidden Markov Model for string similarity.</Paragraph>
  </Section>
  <Section position="5" start_page="11" end_page="12" type="metho">
    <SectionTitle>
4 An Adaptive String Similarity Model
</SectionTitle>
    <Paragraph position="0"> Conditional Random Fields (CRFs) are a recent, increasingly popular approach to sequence labeling problems. Informally, a CRF bears resemblance to a Hidden Markov Model (HMM) in which, for each input position in a sequence, there is an observed variable and a corresponding hidden variable. Like HMMs, CRFs are able to model (Markov) dependencies between the hidden (predicted) variables. However, because CRFs are conditional, discriminatively trained models, they can incorporate arbitrary overlapping (non-independent) features over the entire input space -- just like a discriminative classifier. CRFs are log-linear models that compute the probability of a state sequence, $\vec{s} = \langle s_1, s_2, \ldots, s_n \rangle$, given an observed sequence, $\vec{o} = \langle o_1, o_2, \ldots, o_n \rangle$, as:
$$P(\vec{s} \mid \vec{o}) = \frac{1}{Z_{\vec{o}}} \exp\Big( \sum_{t=1}^{n} \sum_{k} \lambda_k f_k(s_{t-1}, s_t, \vec{o}, t) \Big)$$</Paragraph>
    <Paragraph position="2"> where the $f_k$ are arbitrary feature functions, the $\lambda_k$ are the associated feature weights, and $Z_{\vec{o}}$ is a normalization term. The weights are set to maximize the conditional log-likelihood of the data. Given a trained CRF, the inference problem involves finding the most likely state sequence given a sequence of observations. This is done using a slightly modified version of the Viterbi algorithm (see Lafferty et al. (2001) for more details on CRFs).</Paragraph>
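A minimal sketch of how the unnormalized score inside the exponential is computed for one candidate state sequence; the feature function and weights shown here are illustrative placeholders, not the features used in the paper:

```python
def crf_score(states, obs, feature_fns, weights):
    """Unnormalized log-score: sum over positions t and features k of
    lambda_k * f_k(s_{t-1}, s_t, obs, t).  Position 0 uses a start symbol."""
    total = 0.0
    prev = "<START>"
    for t, s in enumerate(states):
        for f, w in zip(feature_fns, weights):
            total += w * f(prev, s, obs, t)
        prev = s
    return total  # P(states | obs) = exp(total) / Z_obs

# Illustrative feature: does the hidden character equal the observed one?
def f_match(prev_s, s, obs, t):
    return 1.0 if t < len(obs) and s == obs[t] else 0.0
```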
    <Section position="1" start_page="11" end_page="11" type="sub_section">
      <SectionTitle>
4.1 CRFs for String Similarity
</SectionTitle>
      <Paragraph position="0"> CRFs can be used to measure string similarity by viewing the observed sequence, $\vec{o}$, and the state sequence, $\vec{s}$, as sequences of characters. In practice we are presented with two strings, $s_1$ and $s_2$, of possibly differing lengths. A necessary first step is to align the two strings by applying the Levenshtein distance procedure as described earlier. This produces a series of edit operations where each operation has one of three possible forms: 1) $\epsilon \rightarrow o_i$ (addition), 2) $s_j \rightarrow o_i$ (substitution), or 3) $s_j \rightarrow \epsilon$ (deletion). The observed and hidden sequences are then derived by reading off the terms on the right- and left-hand sides of the operations, respectively. Thus, the possible state values include all the characters in our domain plus the special null character, $\epsilon$.</Paragraph>
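A sketch of this alignment step, assuming a standard Levenshtein backtrace (names and the null symbol are ours, not the paper's):

```python
def align(hidden: str, observed: str, eps: str = "<e>"):
    """Align two strings with unit-cost edit distance and return the
    (hidden, observed) character sequences padded with a null symbol."""
    m, n = len(hidden), len(observed)
    # dp[i][j] = edit distance between hidden[:i] and observed[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if hidden[i - 1] == observed[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # addition
                           dp[i - 1][j - 1] + cost)  # substitution / match
    # Backtrace to read off the edit operations.
    h_seq, o_seq = [], []
    i, j = m, n
    while i > 0 or j > 0:
        sub_cost = 0 if i > 0 and j > 0 and hidden[i - 1] == observed[j - 1] else 1
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + sub_cost:
            h_seq.append(hidden[i - 1]); o_seq.append(observed[j - 1]); i -= 1; j -= 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            h_seq.append(hidden[i - 1]); o_seq.append(eps); i -= 1       # deletion
        else:
            h_seq.append(eps); o_seq.append(observed[j - 1]); j -= 1     # addition
    return list(reversed(h_seq)), list(reversed(o_seq))
```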
      <Paragraph position="3">  We employ a set of relatively simple features in our string similarity model described in Table 4. One motivation for keeping the set of features simple was to determine the utility of string similarity CRFs without spending effort designing domain-specific features; this is a primary motivation for taking a machine learning approach in the first place. Additionally, we have found that more specific, discriminating features (e.g., observation tri-grams with state bi-grams) tend to reduce the performance of the CRF on this domain - in some cases considerably.</Paragraph>
    </Section>
    <Section position="2" start_page="11" end_page="12" type="sub_section">
      <SectionTitle>
4.2 Practical Considerations
</SectionTitle>
      <Paragraph position="0"> We discuss a few practical concerns with using CRFs for string similarity.</Paragraph>
      <Paragraph position="1"> The first issue is how to scale CRFs to this task. The inference complexity for CRFs is $O(|S|^2 n)$, where $|S|$ is the size of the vocabulary of states and $n$ is the number of input positions. In our setting, the number of state variable values is very large - one for each character in our alphabet (which is on the order of 40 or more including digits and punctuation).</Paragraph>
      <Paragraph position="2"> Moreover, we typically have very large training sets, largely due to the fact that $\binom{n}{2}$ training pairs are derivable from an equivalence class of size $n$. Given this situation, standard training for CRFs becomes unwieldy, since it involves performing inference over the entire data set repeatedly (typically a few hundred iterations are required to converge).</Paragraph>
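For instance, a small sketch of how positive training pairs grow quadratically with equivalence-class size (an equivalence class here being a set of strings naming the same entity; the helper is illustrative):

```python
from itertools import combinations

def training_pairs(equivalence_classes):
    """Yield all unordered string pairs within each equivalence class:
    a class of size n contributes n*(n-1)/2 positive pairs."""
    for strings in equivalence_classes:
        for s1, s2 in combinations(strings, 2):
            yield s1, s2

# e.g. a class with 11 synonymous gene names already yields 55 training pairs
```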
      <Paragraph position="3">  As such, we resort to an approximation: Voted Perceptron training (Collins, 2002). Voted Perceptron training does not involve maximizing log-likelihood, but instead updates parameters via stochastic gradient descent with a small number of passes over the data.</Paragraph>
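A simplified sketch in the spirit of Collins (2002); this shows the averaged variant, a common stand-in for the voted perceptron, and the `decode` and `feature_vec` callables are assumptions standing in for Viterbi decoding and feature extraction:

```python
from collections import defaultdict

def averaged_perceptron(examples, feature_vec, decode, epochs=5):
    """examples: list of (obs, gold_states) pairs.
    decode(obs, w) returns the best-scoring state sequence under weights w;
    feature_vec(obs, states) returns a dict of feature counts."""
    w = defaultdict(float)      # current weights
    w_sum = defaultdict(float)  # running sum for averaging
    for _ in range(epochs):
        for obs, gold in examples:
            pred = decode(obs, w)
            if pred != gold:
                # Promote gold features, demote predicted features.
                for f, v in feature_vec(obs, gold).items():
                    w[f] += v
                for f, v in feature_vec(obs, pred).items():
                    w[f] -= v
            for f, v in w.items():
                w_sum[f] += v
    n_updates = epochs * max(1, len(examples))
    return {f: v / n_updates for f, v in w_sum.items()}
```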
      <Paragraph position="4"> Another consideration that arises is, given a pair of strings, which string should be considered the "observed" sequence and which the "hidden" sequence? [Footnote 4: Note that a standard use for models such as this is to find the most likely hidden sequence given only the observed sequence. In our setting we are provided the hidden sequence and wish to compute its (log-)probability given the observed sequence.] We have taken to always selecting the longer string as the "observed" string, as it appears most natural, though that decision is somewhat arbitrary. A last observation is that the probability assigned to a pair of strings by the model will be reduced geometrically for longer string pairs (since the probability is computed as a product of $n$ terms, where $n$ is the length of the sequence). We have taken to normalizing the probabilities by the length of the sequence, roughly following the approach of Bilenko and Mooney (2003).</Paragraph>
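One plausible reading of this normalization, assuming the model returns a log-probability, is the per-position geometric mean (a sketch, not necessarily the exact scheme used):

```python
import math

def normalized_similarity(log_prob: float, length: int) -> float:
    """Geometric-mean per-position probability, exp(log P / n),
    so longer pairs are not penalized merely for being long."""
    return math.exp(log_prob / max(length, 1))
```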
      <Paragraph position="5"> A final point here is that it is possible to use Viterbi decoding to find the $k$-best hidden strings given only the observed string. This provides a mechanism to generate domain-specific string alterations for a given string, ranked by their probability. The advantage of this approach is that such alterations can be used to expand a synonym list; exact matching can then be used, greatly increasing efficiency. Work is ongoing in this area.</Paragraph>
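A sketch of how such alterations could expand a synonym list for fast exact lookup; the `k_best_alterations` generator is hypothetical, standing in for $k$-best Viterbi decoding:

```python
def expand_lexicon(lexicon, k_best_alterations, k=5):
    """Map every known synonym and its top-k model-generated alterations
    back to the concept id, enabling O(1) exact-match lookup."""
    index = {}
    for concept_id, synonyms in lexicon.items():
        for s in synonyms:
            index[s] = concept_id
            for alt in k_best_alterations(s, k):
                index.setdefault(alt, concept_id)
    return index

# lookup is then a single dictionary access: index.get(mention)
```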
    </Section>
  </Section>
  <Section position="6" start_page="12" end_page="12" type="metho">
    <SectionTitle>
5 Matching Procedure
</SectionTitle>
    <Paragraph position="0"> Our matching procedure in this paper is set in the context of finding the concept or entity (each with some existing set of known strings) that a given string, $s$, is referring to. In many settings, such as the BioCreative Task 1B task mentioned above, it is necessary to match large numbers of strings against the lexicon - potentially every possible phrase in a large number of documents. As such, very fast matching times (typically on the order of milliseconds) are required. Our method can be broken down into two steps.</Paragraph>
    <Paragraph position="1"> We first select a reasonable candidate set of strings (associated with a concept or lexical entry), $C = \{s_1, s_2, \ldots, s_k\}$, reasonably similar to the given string $s$, using an efficient method. We then use one of a number of string similarity metrics on all the pairs $(s, s_1), (s, s_2), \ldots, (s, s_k)$.</Paragraph>
    <Paragraph position="5"> The set of candidate strings, $s_1, s_2, \ldots, s_k$, is determined by the $n$-gram match ratio - the number of $n$-grams shared between $s$ and a candidate, normalized by the number of $n$-grams involved - where $ngr(w) = \{g \mid g \text{ is an } n\text{-gram of } w\}$.</Paragraph>
    <Paragraph position="8"> This set is retrieved very quickly by creating an $n$-gram index: a mapping between each $n$-gram and the strings (entries) in which it occurs. At query time, the given string is broken into $n$-grams and the sets corresponding to each $n$-gram are retrieved from the index. A straightforward computation finds those entries that have a certain number of $n$-grams in common with the query string $s$, from which the ratio can be readily computed.</Paragraph>
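A minimal sketch of such an $n$-gram index and candidate retrieval; the function names, the denominator of the ratio, and the threshold value are illustrative assumptions:

```python
from collections import defaultdict

def ngrams(s: str, n: int = 3) -> set:
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def build_index(entries, n=3):
    """Map each n-gram to the set of lexicon entries containing it."""
    index = defaultdict(set)
    for entry in entries:
        for g in ngrams(entry.lower(), n):
            index[g].add(entry)
    return index

def candidates(query, index, n=3, min_ratio=0.5):
    """Count shared n-grams per entry and keep entries whose match
    ratio (shared / query n-grams) exceeds a threshold."""
    q_grams = ngrams(query.lower(), n)
    counts = defaultdict(int)
    for g in q_grams:
        for entry in index.get(g, ()):
            counts[entry] += 1
    return [e for e, c in counts.items() if c / max(len(q_grams), 1) >= min_ratio]
```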
    <Paragraph position="9"> Depending on the setting, three options are possible given the returned set of candidates for a string $s$:</Paragraph>
    <Paragraph position="11"> 1. Consider $s$ and $s'$ equivalent, where $s'$ is the most similar candidate string.
2. Consider $s$ and $s'$ equivalent, where $s'$ is the most similar candidate string and $sim(s, s') > \theta$, for some threshold $\theta$.
3. Consider $s$ and $s'$ equivalent for all $s'$ where $sim(s, s') > \theta$.</Paragraph>
    <Paragraph position="13"> In the experiments in this paper, we use the first criterion, since for a given string we know that it should be assigned to exactly one concept (see below).</Paragraph>
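For completeness, the first criterion amounts to taking the candidate with maximal similarity (a sketch; `similarity` stands in for whichever metric is being evaluated):

```python
def resolve(query, candidate_strings, similarity):
    """Criterion 1: assign the query to the single most similar candidate."""
    return max(candidate_strings, key=lambda c: similarity(query, c))
```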
  </Section>
</Paper>