<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-3305">
  <Title>A Priority Model for Named Entities</Title>
  <Section position="4" start_page="33" end_page="34" type="metho">
    <SectionTitle>
UMLS Semantic Network enriched with categories
</SectionTitle>
    <Paragraph position="0"> from the GENIA Ontology (Kim et al, 2003), and a few new semantic types. We have populated SemCat with over 5 million entities of interest from  standard knowledge sources like the UMLS (Lindberg et al., 1993), the Gene Ontology (GO) (The Gene Ontology Consortium, 2000), Entrez Gene (Maglott et al., 2005), and GENIA, as well as from the World Wide Web. In this paper, we use SemCat data to compare three probabilistic frameworks for named entity classification.</Paragraph>
  </Section>
  <Section position="5" start_page="34" end_page="34" type="metho">
    <SectionTitle>
3 Methods
</SectionTitle>
    <Paragraph position="0"> We constructed the SemCat database of biomedical entities, and used these entities to train and test three probabilistic approaches to gene and protein name classification: 1) a statistical language model with Witten-Bell smoothing, 2) probabilistic context-free grammars (PCFGs) and 3) a new approach we call a Priority Model for named entities.</Paragraph>
    <Paragraph position="1"> As one component in all of our classification algorithms we use a variable order Markov Model for strings.</Paragraph>
    <Section position="1" start_page="34" end_page="34" type="sub_section">
      <SectionTitle>
3.1 SemCat Database Construction
</SectionTitle>
      <Paragraph position="0"> The UMLS Semantic Network (SN) is an ongoing project at the National Library of Medicine. Many users have modified the SN for their own research domains. For example, Yu et al. (1999) found that the SN was missing critical components in the genomics domain, and added six new semantic types including Protein Structure and Chemical Complex. We found that a subset of the SN would be sufficient for gene and protein name classification, and added some new semantic types for better coverage. We shifted some semantic types from suboptimal nodes to ones that made more sense from a genomics standpoint. For example, there were two problems with Gene or Genome. Firstly, genes and genomes are not synonymous, and secondly, placement under the semantic type Fully Formed Anatomical Structure is suboptimal from a genomics perspective. Since a gene in this context is better understood as an organic chemical, we deleted Gene or Genome, and added the GENIA semantic types for genomics entities under Organic Chemical. The SemCat Physical Object hierarchy is shown in Figure 1. Similar hierarchies exist for the SN Conceptual Entity and Event trees.</Paragraph>
      <Paragraph position="1"> A number of the categories have been supplemented with automatically extracted entities from MEDLINE, derived from regular expression pattern matching. Currently, SemCat has 77 semantic types, and 5.11M non-unique entries. Additional entities from MEDLINE are being manually classified via an annotation website. Unlike the Termino database (Harkema et al. (2004), which contains terminology annotated with morpho-syntactic and conceptual information, SemCat currently consists of gazetteer lists only.</Paragraph>
      <Paragraph position="2"> For our experiments, we generated two sets of training data from SemCat, Gene-Protein (GP) and Not-Gene-Protein (NGP). GP consists of specific terms from the semantic types DNA MOLECULE,</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="34" end_page="123" type="metho">
    <SectionTitle>
PROTEIN MOLECULE, DNA FAMILY,
PROTEIN FAMILY, PROTEIN COMPLEX and
PROTEIN SUBUNIT. NGP consists of entities
</SectionTitle>
    <Paragraph position="0"> from all other SemCat types, along with generic entities from the GP semantic types. Generic entities were automatically eliminated from GP using pattern matching to manually tagged generic phrases like abnormal protein, acid domain, and RNA.</Paragraph>
    <Paragraph position="1"> Many SemCat entries contain commas and parentheses, for example, &amp;quot;receptors, tgf beta.&amp;quot; A better form for natural language processing would be &amp;quot;tgf beta receptors.&amp;quot; To address this problem, we automatically generated variants of phrases in GP with commas and parentheses, and found their counts in MEDLINE. We empirically determined the heuristic rule of replacing the phrase with its second most frequent variant, based on the observation that the most frequent variant is often too generic. For example, the following are the phrase variant counts for &amp;quot;heat shock protein (dnaj)&amp;quot;: * heat shock protein (dnaj) 0 * dnaj heat shock protein 84 * heat shock protein 122954 * heat shock protein dnaj 41 Thus, the phrase kept for GP is dnaj heat shock protein.</Paragraph>
    <Paragraph position="2"> After purifying the sets and removing ambiguous full phrases (ambiguous words were retained), GP contained 1,001,188 phrases, and NGP contained 2,964,271 phrases. From these, we randomly generated three train/test divisions of 90% train/10% test (gp1, gp2, gp3), for the evaluation.</Paragraph>
    <Section position="1" start_page="34" end_page="123" type="sub_section">
      <SectionTitle>
3.2 Variable Order Markov Model for Strings
</SectionTitle>
      <Paragraph position="0"> As one component in our classification algorithms we use a variable order Markov Model for strings.</Paragraph>
      <Paragraph position="1"> Suppose C represents a class and  sents a string of characters. In order to estimate the probability that</Paragraph>
      <Paragraph position="3"> ...</Paragraph>
      <Paragraph position="4"> n pxxx x does not depend on the class and because we are generally comparing probability estimates between classes, we ignore this factor in our calculations and concentrate our efforts on evaluating ()( )</Paragraph>
      <Paragraph position="6"> (2) which is an exact equality. The final step is to give our best approximation to each of the num-</Paragraph>
      <Paragraph position="8"> . To make these approximations we assume that we are given a set of strings and associated probabilities ()</Paragraph>
      <Paragraph position="10"> where for each i , and 0</Paragraph>
      <Paragraph position="12"> p is assumed to represent the probability that belongs to the class C . Then for the given string</Paragraph>
      <Paragraph position="14"/>
      <Paragraph position="16"> or there may be other ways to make this estimate. This basic scheme works well, but we have found that we can obtain a modest improvement by adding a unique start character to the beginning of each string. This character is assumed to occur nowhere else but as the first character in all strings dealt with including any string whose probability we are estimating.</Paragraph>
      <Paragraph position="17"> This forces the estimates of probabilities near the beginnings of strings to come from estimates based on the beginnings of strings. We use this approach in all of our classification algorithms.</Paragraph>
      <Paragraph position="18"> Table 1. Each fragment in the left column appears in the training data and the probability in the right column represents the probability of seeing the underlined portion of the string given the occurrence of the initial ununderlined portion of the string in a training string.</Paragraph>
      <Paragraph position="20"> In Table 1, we give an illustrative example of the string apoe-epsilon which does not appear in the training data. A PubMed search for apoe-epsilon gene returns 269 hits showing the name is known. But it does not appear in this exact form in SemCat.</Paragraph>
    </Section>
    <Section position="2" start_page="123" end_page="123" type="sub_section">
      <SectionTitle>
3.3 Language Model with Witten-Bell Smoothing
</SectionTitle>
      <Paragraph position="0"> ing A statistical n-gram model is challenged when a bigram in the test set is absent from the training set, an unavoidable situation in natural language due to Zipf's law. Therefore, some method for assigning nonzero probability to novel n-grams is required. For our language model (LM), we used Witten-Bell smoothing, which reserves probability mass for out of vocabulary values (Witten and Bell, 1991, Chen and Goodman, 1998). The discounted probability is calculated as  where is the number of distinct words that can appear after in the training data. Actual values assigned to tokens outside the training data are not assigned uniformly but are filled in using a variable order Markov Model based on the strings seen in the training data.</Paragraph>
      <Paragraph position="1">  (1969). For technical details we refer the reader to Charniak (1993). For gene and protein name classification, we tried two different approaches. In the first PCFG method (PCFG-3), we used the following simple productions:  1) CATP - CATP CATP 2) CATP - CATP postCATP 3) CATP - preCATP CATP CATP refers to the category of the phrase, GP or NGP. The prefixes pre and post refer to begin null nings and endings of the respective strings. We trained two separate grammars, one for the positive examples, GP, and one for the negative examples, NGP. Test cases were tagged based on their score from each of the two grammars.</Paragraph>
      <Paragraph position="2"> In the second PCFG method (PCFG-8), we combined the positive and negative training examples into one grammar. The minimum number of non-terminals necessary to cover the training sets gp1-3 was six {CATP, preCATP, postCATP, Not-CATP, preNotCATP, postNotCATP}. CATP represents a string from GP, and NotCATP represents a string from NGP. We used the following  production rules: 1) CATP - CATP CATP 2) CATP - CATP postCATP 3) CATP - preCATP CATP 4) CATP - NotCATP CATP 5) NotCATP - NotCATP NotCATP 6) NotCATP - NotCATP postNotCATP 7) NotCATP- preNotCATP NotCATP 8) NotCATP - CATP NotCATP  It can be seen that (4) is necessary for strings like &amp;quot;human p53,&amp;quot; and (8) covers strings like &amp;quot;p53 pathway.&amp;quot; In order to deal with tokens that do not appear in the training data we use variable order Markov Models for strings. First the grammar is trained on the training set of names. Then any token appearing in the training data will have assigned to it the tags appearing on the right side of any rule of the grammar (essentially part-of-speech tags) with probabilities that are a product of the training. We then construct a variable order Markov Model for each tag type based on the tokens in the training data and the assigned probabilities for that tag type. These Models (three for PCFG-3 and six for PCFG-8) are then used to assign the basic tags of the grammar to any token not seen in training. In this way the grammars can be used to classify any name even if its tokens are not in the training data.</Paragraph>
    </Section>
    <Section position="3" start_page="123" end_page="123" type="sub_section">
      <SectionTitle>
3.5 Priority Model
</SectionTitle>
      <Paragraph position="0"> There are problems with the previous approaches when applied to names. For example, suppose one is dealing with the name &amp;quot;human liver alkaline phosphatase&amp;quot; and class represents protein names and class anatomical names. In that case a language model is no more likely to favor than . We have experimented with PCFGs and have found the biggest challenge to be how to choose the grammar. After a number of attempts we have still found problems of the &amp;quot;human liver alkaline phosphatase&amp;quot; type to persist.</Paragraph>
      <Paragraph position="1">  The difficulties we have experienced with language models and PCFGs have led us to try a different approach to model named entities. As a general rule in a phrase representing a named entity a word to the right is more likely to be the head word or the word determining the nature of the entity than a word to the left. We follow this rule and construct a model which we will call a Priority Model. Let be the set of training data (names) for class and likewise for . Let  probability that the appearance of the token t  of the class of a name. Let be composed of the tokens on the right in the given order. Then we compute the probability</Paragraph>
      <Paragraph position="3"> This formula comes from a straightforward interpretation of priority in which we start on the right side of a name and compute the probability the name belongs to class stepwise. If is the rightmost token we multiple the reliability times the significance  , which represents the contribution of . The remaining or unused probability is and this is passed to the next token to the left, . The probability is scaled by the reliability and then the significance of</Paragraph>
      <Paragraph position="5"> obtain , which is the contribution of toward the probability that the name is of class . The remaining probability is now and this is again passed to the next token to the left, etc. At the last token on the left the reliability is not used to scale because there are no further tokens to the left and</Paragraph>
      <Paragraph position="7"> From (5), (6), and (8) it is straightforward to compute the gradient of as a function of F x  This allows us to apply the priority model to any name to predict its classification based on equation 5.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>