<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-1306">
  <Title>Boosting Precision and Recall of Dictionary-Based Protein Name Recognition</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Method Overview
</SectionTitle>
    <Paragraph position="0"> Our protein name recognition method consists of two phases. In the first phase, we scan the text for protein name candidates using a dictionary. In the second phase, we use a machine learning method to check whether each candidate is really a protein name. We call these two phases the recognition phase and the filtering phase, respectively. An overview of the method is given below.</Paragraph>
    <Paragraph position="1"> Recognition phase: Protein name candidates are identified using a protein name dictionary. To alleviate the problem of spelling variation, we use an approximate string matching technique.</Paragraph>
    <Paragraph position="2"> Filtering phase: Every protein name candidate is classified as &quot;accepted&quot; or &quot;rejected&quot; by a classifier. The classifier uses the context of the term and the term itself as features for the classification. Only &quot;accepted&quot; candidates are recognized as protein names.</Paragraph>
    <Paragraph position="3"> In the following sections, we describe the details of each phase.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Candidate Recognition
</SectionTitle>
    <Paragraph position="0"> The most straightforward way to exploit a dictionary for candidate recognition is the exact (longest) match algorithm. For exact matching, many fast algorithms (e.g., the Boyer-Moore algorithm (1977)) have been proposed. However, the existence of many spelling variations for the same protein name makes exact matching less attractive. For example, even a short protein name such as &quot;EGR-1&quot; has at least the following six variations: EGR-1, EGR 1, Egr-1, Egr 1, egr-1, egr 1.</Paragraph>
    <Paragraph position="1"> Since longer protein names have a huge number of possible variations, it is impossible to enrich the dictionary by expanding each protein name as described above.</Paragraph>
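The six variations listed for EGR-1 come from three capitalizations combined with two separators. The following minimal sketch enumerates them; the helper name `variants` and the split of a name into a head and a tail are assumptions made for illustration, not part of the described method. Names with more tokens multiply their variants in the same way, which is why dictionary expansion does not scale.

```python
from itertools import product


def variants(head: str, tail: str) -> list[str]:
    """Enumerate case and separator variants of a two-part name (illustrative only)."""
    cases = [head.upper(), head.capitalize(), head.lower()]
    separators = ["-", " "]
    return [case + sep + tail for case, sep in product(cases, separators)]


print(variants("EGR", "1"))
# ['EGR-1', 'EGR 1', 'Egr-1', 'Egr 1', 'egr-1', 'egr 1']
```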
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Approximate String Searching
</SectionTitle>
      <Paragraph position="0"> To deal with the problem of spelling variation, we need a kind of 'elastic' matching algorithm, by which a recognition system scans a text to find a term similar to a protein name in the dictionary, if any. Such a task requires a similarity measure. The most popular measure of similarity between two strings is edit distance, which is the minimum number of operations on individual characters (e.g., substitutions, insertions, and deletions) required to transform one string of symbols into another. For example, the edit distance between &quot;EGR-1&quot; and &quot;GR-2&quot; is two, because one substitution (1 for 2) and one deletion (E) are required.</Paragraph>
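      <Paragraph position="2"> To calculate the edit distance between two strings, we can use a dynamic programming technique. Figure 1 illustrates an example. For clarity of presentation, all costs are assumed to be 1. The matrix $C_{0..|x|,\,0..|y|}$ is filled, where $C_{i,j}$ is the minimum number of operations needed to match $x_{1..i}$ to $y_{1..j}$: $$C_{i,0} = i, \qquad (1)$$ $$C_{0,j} = j, \qquad (2)$$ $$C_{i,j} = \min\bigl(C_{i-1,j-1} + \delta(x_i, y_j),\; C_{i-1,j} + 1,\; C_{i,j-1} + 1\bigr), \qquad (3)$$ where $\delta(x_i, y_j)$ is 0 if $x_i = y_j$ and 1 otherwise.</Paragraph>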
      <Paragraph position="4"> The calculation can be done by either a row-wise left-to-right traversal or a column-wise top-to-bottom traversal.</Paragraph>
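The dynamic programming computation can be sketched as follows, using a row-wise left-to-right traversal and unit costs for all three operations. The function name `edit_distance` is chosen here for illustration; only two rows of the matrix are kept in memory because each row depends only on the one above it.

```python
def edit_distance(x: str, y: str) -> int:
    """Unit-cost edit distance computed row by row with dynamic programming."""
    # Row 0: turning the empty prefix of x into y[:j] takes j insertions.
    previous = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        # Column 0: turning x[:i] into the empty string takes i deletions.
        current = [i]
        for j, cy in enumerate(y, start=1):
            substitution = previous[j - 1] + (0 if cx == cy else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


print(edit_distance("EGR-1", "GR-2"))  # 2: delete "E", substitute "1" by "2"
```

With unit costs, "EGR 1" and "FGR-1" are both at distance 1 from "EGR-1", which is the limitation that motivates letter-dependent costs.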
      <Paragraph position="5"> There are many fast algorithms other than dynamic programming for uniform-cost edit distance, where the weight of each edit operation is constant within the same type (Navarro, 2001). However, we expect the distance between &quot;EGR-1&quot; and &quot;EGR 1&quot; to be smaller than that between &quot;EGR-1&quot; and &quot;FGR-1&quot;, whereas their uniform-cost edit distances are equal.</Paragraph>
      <Paragraph position="6"> The dynamic programming based method is flexible enough to allow us to define arbitrary costs for individual operations depending on the letters involved. For example, we can make the cost of substituting a hyphen for a space much lower than that of substituting 'F' for 'E'. Therefore, we use the dynamic programming based method for our task.</Paragraph>
      <Paragraph position="7"> Table 1 shows the cost function used in our experiments. Both insertion and deletion costs are 100 except for spaces and hyphens. Substitution costs for similar letters are 10. Substitution costs for all other pairs of different letters are 50.</Paragraph>
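A letter-dependent cost function of this kind can be plugged into the same dynamic programming scheme. The sketch below is written under stated assumptions: "similar letters" is interpreted as numeral pairs, space and hyphen pairs, and case variants of the same letter, and the insertion or deletion cost for spaces and hyphens, which the text does not state, is set to a placeholder value of 10. All function and constant names are hypothetical.

```python
SEPARATORS = {" ", "-"}
SEPARATOR_INDEL_COST = 10  # assumption: this value is not given in the text


def indel_cost(ch: str) -> int:
    """Cost of inserting or deleting one character."""
    return SEPARATOR_INDEL_COST if ch in SEPARATORS else 100


def substitution_cost(a: str, b: str) -> int:
    """Cost of substituting character b for character a."""
    if a == b:
        return 0
    if a.isdigit() and b.isdigit():
        return 10  # a numeral for a numeral
    if a in SEPARATORS and b in SEPARATORS:
        return 10  # space for hyphen, or hyphen for space
    if a.lower() == b.lower():
        return 10  # capital letter for the corresponding small letter, or vice versa
    return 50


def weighted_distance(x: str, y: str) -> int:
    """Edit distance between x and y under the letter-dependent costs above."""
    previous = [0]
    for cy in y:
        previous.append(previous[-1] + indel_cost(cy))
    for cx in x:
        current = [previous[0] + indel_cost(cx)]
        for j, cy in enumerate(y, start=1):
            current.append(min(
                previous[j - 1] + substitution_cost(cx, cy),
                previous[j] + indel_cost(cx),
                current[j - 1] + indel_cost(cy),
            ))
        previous = current
    return previous[-1]


print(weighted_distance("EGR-1", "EGR 1"))  # 10
print(weighted_distance("EGR-1", "FGR-1"))  # 50
```

Under these costs the spelling variant "EGR 1" is much closer to "EGR-1" than the different name "FGR-1", as intended.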
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 String Searching
</SectionTitle>
      <Paragraph position="0"> We have described a method for calculating the similarity between two strings in the previous section. However, what we need is approximate string searching, in which the recognizer scans a text to find a term similar to a term in the dictionary, if any. The dynamic programming based method can be easily extended for approximate string searching.</Paragraph>
      <Paragraph position="1"> The method is illustrated in Figure 2. The protein name to be matched is &quot;EGR-1&quot; and the text to be scanned is &quot;encoded by EGR include.&quot; String searching can be done simply by setting the elements corresponding to separators (e.g., spaces) in the first row to zero. After filling the whole matrix, one can find that &quot;EGR-1&quot; can be matched to this text at the place of &quot;EGR 1&quot; with cost 1 by searching for the lowest value in the bottom row.
      Table 1 (substitution costs): a numeral for a numeral, 10; space for hyphen, 10; hyphen for space, 10; a capital letter for the corresponding small letter, 10; a small letter for the corresponding capital letter, 10; other letters, 50.</Paragraph>
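The extension from distance calculation to searching can be sketched as follows, with unit costs as in the figure. Two details are assumptions of this sketch rather than statements from the text: first-row cells that do not follow a separator are set to infinity so that a match can only begin at the start of the text or right after a separator, and the illustrative text is chosen so that it contains "EGR 1". The function name `search` is hypothetical.

```python
import math


def search(term: str, text: str, separators: str = " ") -> tuple[float, int]:
    """Return (cost, end): the lowest bottom-row value and the text position where that match ends."""
    # First row: zero at the text start and after each separator, infinity elsewhere.
    previous = [0]
    for ch in text:
        previous.append(0 if ch in separators else math.inf)
    for i, ct in enumerate(term, start=1):
        current = [i]  # the first i term characters matched against nothing
        for j, cx in enumerate(text, start=1):
            current.append(min(
                previous[j - 1] + (0 if ct == cx else 1),  # match or substitution
                previous[j] + 1,                           # term character missing from the text
                current[j - 1] + 1,                        # extra character in the text
            ))
        previous = current
    cost = min(previous)
    return cost, previous.index(cost)


text = "encoded by EGR 1 include"
cost, end = search("EGR-1", text)
print(cost, repr(text[:end]))  # 1 'encoded by EGR 1'
```

Recovering the start of the matched range would additionally require backtracking through the matrix, which this sketch omits.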
      <Paragraph position="2"> To take into account the length of a term, we adopt a normalized cost, which is calculated by dividing the cost by the length of the term: $$\text{(normalized cost)} = \frac{\text{(cost)} + \alpha}{\text{(length of the term)}} \qquad (4)$$ where $\alpha$ is a constant value. When the costs of two terms are the same, the longer one is preferred due to this constant.</Paragraph>
      <Paragraph position="3"> To recognize a protein name in a given text, we perform the above calculation for every term contained in the dictionary and select the term that has the lowest normalized cost.</Paragraph>
      <Paragraph position="4"> If the normalized cost is lower than the predefined threshold, the corresponding range in the text is recognized as a protein name candidate.</Paragraph>
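The selection step can be sketched as below, assuming the normalized cost has the form (cost + alpha) / (length of the term). The value of alpha used in the example is an arbitrary placeholder, and the helper names and the input format (a mapping from dictionary terms to their raw matching costs at one text position) are hypothetical.

```python
from typing import Optional


def normalized_cost(cost: float, term: str, alpha: float) -> float:
    """Length-normalized matching cost; alpha breaks ties in favour of longer terms."""
    return (cost + alpha) / len(term)


def best_candidate(costs: dict[str, float], alpha: float,
                   threshold: float) -> Optional[tuple[str, float]]:
    """Pick the term with the lowest normalized cost, or None if it fails the threshold."""
    if not costs:
        return None
    best = min(costs, key=lambda term: normalized_cost(costs[term], term, alpha))
    score = normalized_cost(costs[best], best, alpha)
    return (best, score) if threshold > score else None


# Both terms match with raw cost 0; the longer one wins because of alpha.
print(best_candidate({"EGR": 0, "EGR-1": 0}, alpha=0.4, threshold=7.0))
```

Without the constant, two exact matches of different lengths would both have a normalized cost of zero and could not be ranked.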
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Implementation Issues for String Searching
</SectionTitle>
      <Paragraph position="0"> A naive way to perform string searching using a dictionary is to conduct the procedure described in the previous section one by one for every term in the dictionary. However, since the size of the dictionary is very large, this naive method takes too much time. Navarro (2001) presented a way to reduce redundant calculations by constructing a trie of the dictionary. The trie is used as a device to avoid repeating the computation of the cost against the same prefix of many patterns. Suppose that we have just calculated the cost of the term &quot;EGR-1&quot; and next have to calculate the cost of the term &quot;EGR-2&quot;. It is clear that we do not have to recalculate the first four rows in the matrix (see Figure 2). Navarro also pointed out that it is possible to determine, prior to reaching the bottom of the matrix, that the current term cannot produce any relevant match: if all the values of the current row are larger than the threshold, then a match cannot occur, since further rows can only increase the cost or at best keep it the same.</Paragraph>
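Both ideas, sharing rows among terms with a common prefix and abandoning a branch once every value in the current row exceeds the threshold, can be sketched with a dictionary-based trie. This is an illustrative sketch with unit costs and a raw (not normalized) cost limit; the nested-dict trie, the function names, and the rule that matches start only at the text start or after a space are assumptions of the example.

```python
import math


def build_trie(terms):
    """Build a nested-dict trie; the key None marks the end of a term."""
    root = {}
    for term in terms:
        node = root
        for ch in term:
            node = node.setdefault(ch, {})
        node[None] = term
    return root


def search_trie(trie, text, max_cost):
    """Return {term: lowest cost} for every term matching the text within max_cost."""
    first_row = [0] + [0 if ch == " " else math.inf for ch in text]
    results = {}

    def visit(node, row):
        for ch, child in node.items():
            if ch is None:
                cost = min(row)  # row is the bottom row for the term stored here
                if max_cost >= cost:
                    results[child] = cost
                continue
            # One new row per trie edge, shared by all terms below this node.
            new_row = [row[0] + 1]
            for j, ct in enumerate(text, start=1):
                new_row.append(min(row[j - 1] + (ch != ct),
                                   row[j] + 1,
                                   new_row[j - 1] + 1))
            if min(new_row) > max_cost:
                continue  # later rows can never be cheaper, so prune this branch
            visit(child, new_row)

    visit(trie, first_row)
    return results


trie = build_trie(["EGR-1", "EGR-2", "NK"])
print(search_trie(trie, "encoded by EGR 1 include", max_cost=1))  # {'EGR-1': 1}
```

Here the rows for "E", "G", "R", and "-" are computed once and reused for both "EGR-1" and "EGR-2".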
      <Paragraph position="1"> 4 Filtering Candidates by a Naive Bayes Classifier
      One of the serious problems of dictionary-based recognition is the large number of false recognitions, mainly caused by short entries in the dictionary. For example, the dictionary constructed from GenBank contains an entry &quot;NK.&quot; However, the word &quot;NK&quot; is frequently used as a part of the term &quot;NK cells.&quot; In this case, &quot;NK&quot; is an abbreviation of &quot;natural killer&quot; and is not a protein name. This entry therefore causes a large number of false recognitions, leading to low precision.</Paragraph>
      <Paragraph position="2"> In the filtering phase, we use a classifier trained on an annotated corpus to suppress this kind of false recognition. The objective of this phase is to improve precision without loss of recall.</Paragraph>
      <Paragraph position="3"> We conduct binary classification (&quot;accepted&quot; or &quot;rejected&quot;) on each candidate. The candidates that are classified as &quot;rejected&quot; are filtered out. In other words, only the candidates that are classified as &quot;accepted&quot; are recognized as protein names.</Paragraph>
      <Paragraph position="4"> In this paper, we use a naive Bayes classifier for this classification task.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Naive Bayes classifier
</SectionTitle>
      <Paragraph position="0"> The naive Bayes classifier is a simple but effective classifier which has been used in numerous applications of information processing such as image recognition, natural language processing and information retrieval (Lewis, 1998; Escudero et al., 2000; Pedersen, 2000; Nigam and Ghani, 2000).</Paragraph>
      <Paragraph position="1"> Here we briefly review the naive Bayes model.</Paragraph>
      <Paragraph position="2"> Let $\vec{x}$ be a vector we want to classify, and $c_k$ be a possible class. What we want to know is the probability that the vector $\vec{x}$ belongs to the class $c_k$. We first transform the probability $P(c_k \mid \vec{x})$ using Bayes' rule, $$P(c_k \mid \vec{x}) = \frac{P(c_k)\, P(\vec{x} \mid c_k)}{P(\vec{x})}. \qquad (5)$$</Paragraph>
      <Paragraph position="4"> Class probability $P(c_k)$ can be estimated from training data. However, direct estimation of $P(c_k \mid \vec{x})$ is impossible in most cases because of the sparseness of training data.</Paragraph>
      <Paragraph position="5"> By assuming the conditional independence among the elements of a vector, $P(\vec{x} \mid c_k)$ is decomposed as follows, $$P(\vec{x} \mid c_k) = \prod_{j} P(x_j \mid c_k), \qquad (6)$$</Paragraph>
      <Paragraph position="7"> where $x_j$ is the $j$th element of vector $\vec{x}$. Then Equation 5 becomes $$P(c_k \mid \vec{x}) = \frac{P(c_k) \prod_{j} P(x_j \mid c_k)}{P(\vec{x})}. \qquad (7)$$</Paragraph>
      <Paragraph position="9"> By this equation, we can calculate $P(c_k \mid \vec{x})$ and classify $\vec{x}$ into the class with the highest $P(c_k \mid \vec{x})$.</Paragraph>
      <Paragraph position="10"> There are some implementation variants of the naive Bayes classifier depending on their event models (McCallum and Nigam, 1998). In this paper, we adopt the multi-variate Bernoulli event model.</Paragraph>
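A minimal sketch of a naive Bayes classifier with the multi-variate Bernoulli event model follows. In this event model every feature in the vocabulary contributes to the score, either through its probability of being present or through its probability of being absent. The add-one smoothing, the use of log probabilities, the class name `BernoulliNB`, and the toy feature strings are assumptions of this sketch; the denominator P(x) is omitted because it is the same for every class.

```python
import math
from collections import Counter, defaultdict


class BernoulliNB:
    """Naive Bayes over binary features (multi-variate Bernoulli event model)."""

    def fit(self, examples):
        """examples: iterable of (set of active features, class label) pairs."""
        self.class_counts = Counter()
        self.feature_counts = defaultdict(Counter)
        self.vocabulary = set()
        for features, label in examples:
            self.class_counts[label] += 1
            self.vocabulary.update(features)
            for feature in features:
                self.feature_counts[label][feature] += 1
        return self

    def log_scores(self, features):
        """Unnormalized log posterior of each class for one feature set."""
        total = sum(self.class_counts.values())
        scores = {}
        for label, n in self.class_counts.items():
            score = math.log(n / total)
            for feature in self.vocabulary:
                # Add-one smoothing (an assumption of this sketch).
                p = (self.feature_counts[label][feature] + 1) / (n + 2)
                score += math.log(p if feature in features else 1.0 - p)
            scores[label] = score
        return scores

    def predict(self, features):
        scores = self.log_scores(features)
        return max(scores, key=scores.get)


training = [
    ({"Wbegin=NK", "W+1=cells"}, "rejected"),
    ({"Wbegin=NK", "W+1=cells"}, "rejected"),
    ({"Wbegin=EGR-1", "W-1=the"}, "accepted"),
    ({"Wbegin=EGR-1", "W+1=protein"}, "accepted"),
]
model = BernoulliNB().fit(training)
print(model.predict({"Wbegin=NK", "W+1=cells"}))  # rejected
```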
    </Section>
    <Section position="5" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Features
</SectionTitle>
      <Paragraph position="0"> As the input of the classifier, the features of the target must be represented in the form of a vector. We use a binary feature vector which contains only the values of 0 or 1 for each element.</Paragraph>
      <Paragraph position="1"> In this paper, we use the local context surrounding a candidate term and the words contained in the term as the features. We call the former contextual features and the latter term features.</Paragraph>
      <Paragraph position="2"> The features used in our experiments are given below. W-1 : the word preceding the term. W+1 : the word following the term. Wbegin : the first word of the term.</Paragraph>
      <Paragraph position="3"> Wend : the last word of the term.</Paragraph>
      <Paragraph position="4"> Wmiddle : the other words of the term without positional information (bag-of-words).</Paragraph>
      <Paragraph position="5"> Suppose the candidate term is &quot;putative zinc finger protein,&quot; and the sentence is: ... encoding a putative zinc finger protein was found to derepress beta-galactosidase ...</Paragraph>
      <Paragraph position="6"> We obtain the following active features for this example.</Paragraph>
      <Paragraph position="7"> {W-1 a}, {W+1 was}, {Wbegin putative}, {Wend protein}, {Wmiddle zinc}, {Wmiddle finger}.</Paragraph>
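The feature extraction for this example can be sketched as follows. The function name `extract_features`, the whitespace tokenization, the representation of each active feature as a (name, word) pair, and the handling of one-word terms (the same word serves as both Wbegin and Wend) are assumptions of the sketch.

```python
def extract_features(tokens, start, end):
    """Active features for the candidate tokens[start:end] (end exclusive, start before end)."""
    features = set()
    # Contextual features: one word on each side of the candidate, if present.
    if start > 0:
        features.add(("W-1", tokens[start - 1]))
    if len(tokens) > end:
        features.add(("W+1", tokens[end]))
    # Term features: first word, last word, and a bag of the words in between.
    term = tokens[start:end]
    features.add(("Wbegin", term[0]))
    features.add(("Wend", term[-1]))
    for word in term[1:-1]:
        features.add(("Wmiddle", word))
    return features


tokens = "encoding a putative zinc finger protein was found".split()
for feature in sorted(extract_features(tokens, 2, 6)):
    print(feature)
```

Each pair corresponds to one element of the binary feature vector that is set to 1; all other elements are 0.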
    </Section>
    <Section position="6" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.3 Training
</SectionTitle>
      <Paragraph position="0"> The training of the classifier is done with an annotated corpus. We first scan the corpus for protein name candidates by dictionary matching. If a recognized candidate is annotated as a protein name, this candidate and its context are used as a positive (&quot;accepted&quot;) example for training. Otherwise, it is used as a negative (&quot;rejected&quot;) example.</Paragraph>
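The labeling of training examples can be sketched as below. The sketch assumes that candidates and annotations are both given as (start, end) spans and that a candidate counts as annotated only when its span coincides exactly with an annotated span; the text does not specify the matching criterion used for training, so this is an assumption, and the function name is hypothetical.

```python
def label_candidates(candidates, annotated_spans):
    """Pair each candidate span with "accepted" or "rejected" for classifier training."""
    annotated = set(annotated_spans)
    return [
        (span, "accepted" if span in annotated else "rejected")
        for span in candidates
    ]


candidates = [(0, 2), (5, 6), (8, 10)]   # spans found by dictionary matching
annotated_spans = [(0, 2), (8, 11)]      # protein annotations in the corpus
print(label_candidates(candidates, annotated_spans))
# [((0, 2), 'accepted'), ((5, 6), 'rejected'), ((8, 10), 'rejected')]
```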
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Experiment
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.1 Corpus and Dictionary
</SectionTitle>
      <Paragraph position="0"> We conducted protein name recognition experiments using the GENIA corpus version 3.01 (Ohta et al., 2002). The GENIA corpus is an annotated corpus that contains 2000 abstracts extracted from the MEDLINE database. These abstracts were selected from the search results for the MeSH terms Human, Blood Cells, and Transcription Factors.</Paragraph>
      <Paragraph position="1"> The biological entities in the corpus are annotated according to the GENIA ontology. Although the corpus has many categories such as protein, DNA, RNA, cell line and tissue, we used only the protein category. When a term was recursively annotated, only the innermost (shortest) annotation was considered. The test data was created by randomly selecting 200 abstracts from the corpus. The remaining 1800 abstracts were used as the training data. The protein name dictionary was constructed from the training data by gathering all the terms that were annotated as proteins.</Paragraph>
      <Paragraph position="2"> Each recognition was counted as correct if both boundaries of the recognized term exactly matched the boundaries of an annotation in the corpus.</Paragraph>
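The exact-boundary criterion can be sketched as a comparison of span sets. The representation of recognitions and annotations as (start, end) pairs and the function name are assumptions of this sketch.

```python
def precision_recall(recognized, annotated):
    """Precision and recall when a recognition is correct only if both boundaries match."""
    recognized, annotated = set(recognized), set(annotated)
    correct = len(recognized.intersection(annotated))
    precision = correct / len(recognized) if recognized else 0.0
    recall = correct / len(annotated) if annotated else 0.0
    return precision, recall


recognized = [(0, 2), (5, 6), (8, 10)]
annotated = [(0, 2), (8, 11)]
# (8, 10) overlaps the annotation (8, 11) but is not counted, since its right boundary differs.
print(precision_recall(recognized, annotated))
```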
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.2 Improving Precision by Filtering
</SectionTitle>
      <Paragraph position="0"> We first conducted experiments to evaluate how much precision is improved by the filtering process.</Paragraph>
      <Paragraph position="1"> In the recognition phase, the longest matching algorithm was used for candidate recognition.</Paragraph>
      <Paragraph position="2"> The results are shown in Table 2. F-measure is defined as the harmonic mean of precision and recall as follows: $$F = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \qquad (8)$$</Paragraph>
      <Paragraph position="4"> The first row shows the performance achieved without filtering. In this case, all the candidates identified in the recognition phase are regarded as protein names. The second row shows the performance achieved with filtering by the naive Bayes classifier. In this case, only the candidates that are classified as &quot;accepted&quot; are regarded as protein names. Notice that the filtering significantly improved the precision (from 48.6% to 74.3%) with a slight loss of recall. The F-measure was also greatly improved (from 57.6% to 69.5%).</Paragraph>
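The harmonic-mean definition can be sketched in a few lines. The second helper, `implied_recall`, is an addition of this sketch: it inverts the definition to estimate the recall implied by a reported precision and F-measure, and its results are only approximate because the reported figures are rounded.

```python
def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def implied_recall(precision: float, f: float) -> float:
    """Solve F = 2PR / (P + R) for R, giving R = FP / (2P - F)."""
    return f * precision / (2 * precision - f)


print(round(implied_recall(48.6, 57.6), 1))  # without filtering
print(round(implied_recall(74.3, 69.5), 1))  # with filtering
```

From the figures quoted above, the implied recall is about 70.7% without filtering and about 65.3% with filtering, subject to rounding in the reported values.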
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.3 Improving Recall by Approximate String Search
</SectionTitle>
      <Paragraph position="0"> We also conducted experiments to evaluate how much we can further improve the recognition performance by using the approximate string searching method described in Section 3. Table 3 shows the results. The leftmost column shows the thresholds of the normalized costs for approximate string searching. As the threshold increased, the precision degraded while the recall improved. The best F-measure was 70.2%, which is 0.7 percentage points higher than that of exact matching (see Table 2).</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.4 Efficacy of Contextual Features
</SectionTitle>
      <Paragraph position="0"> The advantage of using a machine learning technique is that we can exploit the context of a candidate for deciding whether it is really a protein name or not. In order to evaluate the efficacy of contexts, we conducted experiments using different feature sets.</Paragraph>
      <Paragraph position="1"> The threshold of normalized cost was set to 7.0.</Paragraph>
      <Paragraph position="2"> Table 4 shows the results. The first row shows the performance achieved by using only contextual features. The second row shows that achieved by using only term features. The performance achieved by using both feature sets is shown in the third row.</Paragraph>
      <Paragraph position="3"> The results indicate that candidate terms themselves are strong cues for classification. However, the fact that the best performance was achieved when both feature sets were used suggests that the context of a candidate conveys useful information about the semantic class of the candidate.</Paragraph>
    </Section>
  </Section>
</Paper>