<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-1221">
  <Title>Biomedical Named Entity Recognition Using Conditional Random Fields and Rich Feature Sets</Title>
  <Section position="2" start_page="0" end_page="104" type="metho">
    <SectionTitle>
2 Conditional Random Fields
</SectionTitle>
    <Paragraph position="0"> Biomedical named entity recognition can be thought of as a sequence segmentation problem: each word is a token in a sequence to be assigned a label (e.g. PROTEIN, DNA, RNA, CELL-LINE, CELL-TYPE, or OTHER).1 Conditional Random Fields (CRFs) are undirected statistical graphical models, a special case of which is a linear chain that corresponds to a conditionally trained finite-state machine. Such models are well suited to sequence analysis, and CRFs in [Footnote 1: More accurately, the data is in IOB format. B-DNA labels the first word of a DNA mention, I-DNA labels all subsequent words (likewise for other entities), and O labels non-entities. For simplicity, this paper only refers to the entities, not all the IOB label variants.]</Paragraph>
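The IOB encoding described in the footnote can be sketched as follows; this helper is illustrative (not from the paper) and assumes per-token entity labels with None marking non-entities:

```python
def to_iob(entity_labels):
    """Convert per-token entity labels (None for non-entities) to IOB tags."""
    tags = []
    prev = None
    for entity in entity_labels:
        if entity is None:
            tags.append("O")               # non-entity token
        elif entity != prev:
            tags.append("B-" + entity)     # first word of a mention
        else:
            tags.append("I-" + entity)     # subsequent words of the mention
        prev = entity
    return tags
```

Note this simple sketch cannot separate two adjacent mentions of the same type; the training data's gold spans would be needed for that.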
    <Paragraph position="1"> particular have been shown to be useful in part-of-speech tagging (Lafferty et al., 2001), shallow parsing (Sha and Pereira, 2003), and named entity recognition for newswire data (McCallum and Li, 2003). They have also just recently been applied to the more limited task of finding gene and protein mentions (McDonald and Pereira, 2004), with promising early results.</Paragraph>
    <Paragraph position="2"> Let o = &lt;o1,o2,...,on&gt; be a sequence of observed words of length n. Let S be a set of states in a finite state machine, each corresponding to a label l ∈ L (e.g. PROTEIN, DNA, etc.). Let s = &lt;s1,s2,...,sn&gt; be the sequence of states in S that correspond to the labels assigned to words in the input sequence o. Linear-chain CRFs define the conditional probability of a state sequence given an input sequence to be:</Paragraph>
    <Paragraph position="3"> P(s|o) = (1/Zo) exp( Σ_{i=1..n} Σ_{j=1..m} lj fj(si-1, si, o, i) )</Paragraph>
    <Paragraph position="4"> where Zo is a normalization factor over all state sequences, fj(si-1,si,o,i) is one of m functions that describes a feature, and lj is a learned weight for each such feature function. This paper considers the case of CRFs that use a first-order Markov independence assumption with binary feature functions. For example, a feature may have a value of 0 in most cases, but given the text &amp;quot;the ATPase&amp;quot; it has the value 1 along the transition where si-1 corresponds to a state with the label OTHER, si corresponds to a state with the label PROTEIN, and fj is the feature function Word=ATPase ∈ o at position i in the sequence. Other feature functions that could have the value 1 along this transition are Capitalized, MixedCase, and Suffix=ase.</Paragraph>
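As a hedged sketch (helper names are my own, not from the paper), binary feature functions of this form pair an observation test at position i with a state transition:

```python
def make_feature(obs_test, prev_label, cur_label):
    """Build a binary feature f(s_prev, s_cur, o, i) tied to one transition."""
    def f(s_prev, s_cur, o, i):
        return 1 if (s_prev == prev_label and s_cur == cur_label
                     and obs_test(o[i])) else 0
    return f

# Illustrative features that all fire on "ATPase" along the OTHER -> PROTEIN transition.
features = [
    make_feature(lambda w: w == "ATPase", "OTHER", "PROTEIN"),      # Word=ATPase
    make_feature(lambda w: w[0].isupper(), "OTHER", "PROTEIN"),     # Capitalized
    make_feature(lambda w: w.endswith("ase"), "OTHER", "PROTEIN"),  # Suffix=ase
]

o = ["the", "ATPase"]
fires = [f("OTHER", "PROTEIN", o, 1) for f in features]
```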
    <Paragraph position="5"> Intuitively, the learned feature weight lj for each feature fj should be positive for features that are correlated with the target label, negative for features that are anti-correlated with the label, and near zero for relatively uninformative features. These weights are set to maximize the conditional log likelihood of the labeled sequences in a training set D = {&lt;o,s&gt;(1), ..., &lt;o,s&gt;(N)}:</Paragraph>
    <Paragraph position="6"> LL(D) = Σ_{k=1..N} log P(s(k)|o(k))</Paragraph>
    <Paragraph position="7"> When the training state sequences are fully labeled and unambiguous, the objective function is convex, thus the model is guaranteed to find the optimal weight settings in terms of LL(D). Once these settings are found, the labeling for a new, unlabeled sequence can be done using a modified Viterbi algorithm. CRFs are presented in more complete detail by Lafferty et al. (2001).</Paragraph>
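Viterbi decoding for a linear chain can be sketched as below; score() stands in for the weighted feature sum over lj fj(si-1, si, o, i), and all names are illustrative rather than the paper's implementation:

```python
def viterbi(o, labels, score):
    """Return the highest-scoring label sequence for observation sequence o.

    score(prev, cur, o, i) is an additive transition/observation score,
    standing in for sum_j lambda_j * f_j(s_{i-1}, s_i, o, i).
    """
    # best[label] = (best score of any path ending in label, that path)
    best = {l: (score(None, l, o, 0), [l]) for l in labels}
    for i in range(1, len(o)):
        new = {}
        for cur in labels:
            prev_label, (s, path) = max(
                best.items(),
                key=lambda kv: kv[1][0] + score(kv[0], cur, o, i),
            )
            new[cur] = (s + score(prev_label, cur, o, i), path + [cur])
        best = new
    return max(best.values(), key=lambda v: v[0])[1]
```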
    <Paragraph position="8"> These experiments use the MALLET implementation of CRFs (McCallum, 2002), which uses a quasi-Newton method called L-BFGS to find these feature weights efficiently.</Paragraph>
  </Section>
  <Section position="3" start_page="104" end_page="105" type="metho">
    <SectionTitle>
3 Feature Set
</SectionTitle>
    <Paragraph position="0"> One property that makes feature based statistical models like CRFs so attractive is that they reduce the problem to finding an appropriate feature set. This section outlines the two main types of features used in these experiments.</Paragraph>
    <Section position="1" start_page="104" end_page="104" type="sub_section">
      <SectionTitle>
3.1 Orthographic Features
</SectionTitle>
      <Paragraph position="0"> The simplest and most obvious feature set is the vocabulary from the training data. Generalizations over how these words appear (e.g. capitalization, affixes, etc.) are also important. The present model includes the training vocabulary, 17 orthographic features based on regular expressions (e.g. Alphanumeric, HasDash, RomanNumeral), as well as prefixes and suffixes in the character length range [3,5].</Paragraph>
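A minimal sketch of such orthographic features follows; the paper does not list all 17 regular expressions, so the patterns below are illustrative:

```python
import re

# Illustrative regex-based orthographic features (three of the named ones).
ORTHO = {
    "Alphanumeric": re.compile(r"^[A-Za-z0-9]+$"),
    "HasDash":      re.compile(r"-"),
    "RomanNumeral": re.compile(r"^[IVXLCDM]+$"),
}

def ortho_features(word):
    feats = {name for name, rx in ORTHO.items() if rx.search(word)}
    # prefixes and suffixes in the character length range [3, 5]
    for k in range(3, 6):
        if len(word) >= k:
            feats.add("Prefix=" + word[:k])
            feats.add("Suffix=" + word[-k:])
    return feats
```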
      <Paragraph position="1"> Words are also assigned a generalized &amp;quot;word class&amp;quot; similar to Collins (2002), which replaces capital letters with 'A', lowercase letters with 'a', digits with '0', and all other characters with '_'. There is a similar &amp;quot;brief word class&amp;quot; feature which collapses consecutive identical characters into one. Thus the words &amp;quot;IL5&amp;quot; and &amp;quot;SH3&amp;quot; would both be given the features WC=AA0 and BWC=A0, while &amp;quot;F-actin&amp;quot; and &amp;quot;T-cells&amp;quot; would both be assigned WC=A_aaaaa and BWC=A_a.</Paragraph>
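The two word-class mappings can be sketched directly (this assumes the underscore convention for characters that are not letters or digits):

```python
import re

def word_class(w):
    """Collins-style word class: A for uppercase, a for lowercase, 0 for digits,
    _ for everything else."""
    w = re.sub(r"[A-Z]", "A", w)
    w = re.sub(r"[a-z]", "a", w)
    w = re.sub(r"[0-9]", "0", w)
    return re.sub(r"[^Aa0]", "_", w)

def brief_word_class(w):
    """Word class with runs of identical characters collapsed into one."""
    return re.sub(r"(.)\1+", r"\1", word_class(w))
```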
      <Paragraph position="2"> To model local context simply, neighboring words in the window [-1,1] are also added as features. For instance, the middle token in the sequence &amp;quot;human UDG promoter&amp;quot; would have features Word=UDG, Neighbor=human and Neighbor=promoter.</Paragraph>
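The [-1, 1] context window can be sketched as below (the helper name is hypothetical):

```python
def context_features(tokens, i):
    """Features for token i: its own word plus neighbors in the [-1, 1] window."""
    feats = {"Word=" + tokens[i]}
    for j in (i - 1, i + 1):
        if 0 <= j < len(tokens):          # skip out-of-bounds neighbors
            feats.add("Neighbor=" + tokens[j])
    return feats
```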
    </Section>
    <Section position="2" start_page="104" end_page="105" type="sub_section">
      <SectionTitle>
3.2 Semantic Features
</SectionTitle>
      <Paragraph position="0"> In addition to orthography, the model could also benefit from generalized semantic word groups.</Paragraph>
      <Paragraph position="1"> If training sequences contain &amp;quot;PML/RAR alpha,&amp;quot; &amp;quot;beta 2-M,&amp;quot; and &amp;quot;kappa B-specific DNA binding protein&amp;quot; all labeled with PROTEIN, the model might learn that the words &amp;quot;alpha,&amp;quot; &amp;quot;beta,&amp;quot; and &amp;quot;kappa&amp;quot; are indicative of proteins, but cannot capture the fact that they are all semantically related because they are Greek letters. Similarly, words with the feature WC=Aaa are often part of protein names, such as &amp;quot;Rab,&amp;quot; &amp;quot;Alu,&amp;quot; and &amp;quot;Gag.&amp;quot; But the model may have a difficult time setting the weights for this feature when confronted with words like &amp;quot;Phe,&amp;quot; &amp;quot;Arg,&amp;quot; and &amp;quot;Cys,&amp;quot; which are amino acid abbreviations and not often labeled as part of a protein name.</Paragraph>
      <Paragraph position="2"> This sort of semantic domain knowledge can be provided in the form of lexicons. I prepared a total of 17 such lexicons, which include 7 that were entered by hand (Greek letters, amino acids, chemical elements, known viruses, plus abbreviations of all these), and 4 corresponding to genes, chromosome locations, proteins, and cell lines, drawn from online public databases (Cancer Genetics Web, BBID, SwissProt, and the Cell Line Database). Feature functions for the lexicons are set to 1 if they match words in the input sequence exactly. For multi-word lexicon entries, all words are required to match in the input sequence.</Paragraph>
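Exact matching of (possibly multi-word) lexicon entries, as described above, might look like this sketch; the function and example lexicon are illustrative, not the paper's code:

```python
def lexicon_matches(tokens, lexicon):
    """Return (start, end) spans where a lexicon entry matches the input exactly.

    Multi-word entries fire only when every word matches in sequence.
    """
    entries = [entry.split() for entry in lexicon]
    spans = []
    for start in range(len(tokens)):
        for entry in entries:
            end = start + len(entry)
            if tokens[start:end] == entry:
                spans.append((start, end))
    return spans
```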
      <Paragraph position="3"> Since no suitable database of terms for the CELL-TYPE class was found online, a lexicon was constructed by utilizing Google Sets, an online tool which takes a few seed examples and leverages Google's web index to return other terms that appear in formatting and contexts similar to the seeds on web pages across the Internet.</Paragraph>
      <Paragraph position="4"> Several examples from the training data (e.g. &amp;quot;lymphocyte&amp;quot; and &amp;quot;neutrophil&amp;quot;) were used as seeds, and new cell types (e.g. &amp;quot;chondroblast,&amp;quot; which does not even occur in the training data) were returned. The process was repeated until the lexicon grew to roughly 50 entries, though it could probably be more complete.</Paragraph>
      <Paragraph position="6"> With all this information at the model's disposal, it can still be difficult to properly disambiguate between these entities. For example, the acronym &amp;quot;EPC&amp;quot; appears in these static lexicons both as a protein (&amp;quot;eosinophil cationic protein&amp;quot; [sic]) and as a cell line (&amp;quot;epithelioma papulosum cyprini&amp;quot;). Furthermore, a single word like &amp;quot;transcript&amp;quot; is sometimes all that disambiguates between RNA and DNA mentions (e.g. &amp;quot;BMLF1 transcript&amp;quot;). The CRF can learn weights for these individual words, but it may help to build general, dynamic keyword lexicons that are associated with each label to assist in disambiguating between similar classes (and perhaps boost performance on low-frequency labels, such as RNA and CELL-LINE, for which training data are sparse).</Paragraph>
      <Paragraph position="7"> These keyword lexicons are generated automatically as follows. All of the labeled terms are extracted from the training set and separated into five lists (one for each entity class). Stop words, Greek letters, and digits are filtered, and remaining words are tallied for raw frequency counts under each entity class label. These frequencies are then subjected to a kh2 test, where the null hypothesis is that a word's frequency is the same for a given entity as it is for any other entity of interest (i.e. PROTEIN vs. DNA + RNA</Paragraph>
      <Paragraph position="9"> only one degree of freedom). All words for which the null hypothesis is rejected with a p-value &lt; 0.005 are added to the keyword lexicon for its majority class. Some example keywords are listed in table 1.</Paragraph>
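The keyword-selection procedure can be sketched as below. This assumes a 2x2 chi-squared test with one degree of freedom, using 7.879 (the critical value for p = 0.005 at 1 dof) as the threshold; implementation details beyond what the text states are my own:

```python
from collections import Counter

def chi2_keywords(class_word_counts, threshold=7.879):
    """class_word_counts: {entity class -> Counter of word frequencies}.

    For each word and class, run a 2x2 chi-squared test of the word's frequency
    in that class vs. all other classes; words passing the threshold go into
    the lexicon of their majority class (approximated here as a > c).
    """
    totals = {c: sum(cnt.values()) for c, cnt in class_word_counts.items()}
    keywords = {c: set() for c in class_word_counts}
    vocab = set().union(*class_word_counts.values())
    for w in vocab:
        for c, cnt in class_word_counts.items():
            a = cnt[w]                                           # w in class c
            b = totals[c] - a                                    # other words in c
            others = [o for o in class_word_counts if o != c]
            cw = sum(class_word_counts[o][w] for o in others)    # w in other classes
            d = sum(totals[o] for o in others) - cw              # other words elsewhere
            n = a + b + cw + d
            denom = (a + b) * (cw + d) * (a + cw) * (b + d)
            if denom == 0:
                continue
            chi2 = n * (a * d - b * cw) ** 2 / denom
            if chi2 > threshold and a > cw:
                keywords[c].add(w)
    return keywords
```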
      <Paragraph position="10"> [Table caption fragment: results for the feature sets; relaxed F1-scores using left- and right-boundary matching are also reported.]</Paragraph>
    </Section>
  </Section>
</Paper>