<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-0302">
  <Title>Tagging Gene and Protein Names in Full Text Articles</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
2 Methods
</SectionTitle>
    <Paragraph position="0"> We first give an overview of ABGene's method for extracting gene and protein names from biomedical citations, and then present some modifications to ABGene designed to improve its performance on full text articles.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 ABGene Overview
</SectionTitle>
      <Paragraph position="0"> We previously trained the Brill POS tagger (Brill, 1994) to recognize protein and gene names in biomedical text using a training set of 7,000 Medline sentences. We updated the lexicon included in the Brill package (Brown Corpus plus Wall Street Journal corpus) with entries from the UMLS SPECIALIST lexicon (McCray et al. 1994, Humphreys et al. 1998), and generated a list of bigrams and a word list from all of MEDLINE to customize the training for our purposes. ABGene processing begins by using these automatically generated rules from the Brill tagger to extract single word gene and protein names from biomedical abstracts (see Table 1).</Paragraph>
      <Paragraph position="1"> This is followed by extensive filtering for false positives and false negatives. A key step during the filtering stage is the extraction of multi-word gene and protein names that are prevalent in the literature but inaccessible to the Brill tagger. During the false positive filtering step, the GENE tag is removed from a word if it matches a term from a list of 1,505 precompiled general biological terms (acids, antagonist, assembly, antigen, etc.), 39 amino acid names, 233 restriction enzymes, 593 cell lines, 63,698 organism names from the NCBI Taxonomy Database (Wheeler et al. 2000) or 4,357 non-biological terms. Non-biological terms were obtained by comparing word frequencies in MEDLINE versus the Wall Street Journal (WSJ) using the following expression, where p is the probability of occurrence: log( p(word occurs in MEDLINE) / p(word occurs in WSJ) ) &lt; 1. Additional false positives are found by regular expressions including numbers followed by measurements (25 mg/ml) and common drug suffixes (-ole, -ane, -ate, -ide, -ine, -ite, -ol, -ose, cooh). [Table 1 (fragment; rules generated by the Brill tagger, partly lost in extraction): "... from NNP to GENE if the word gene can appear to the right"; "-A hassuf 2 GENE": change the tag of a word from anything to GENE if it contains the suffix -A; "c- haspref 2 GENE": change the tag of a word from anything to GENE if it contains the prefix c-. NNP = proper noun, CD = cardinal number, CC = coordinating conjunction, JJ = adjective, VBG = verb, gerund/present participle.]</Paragraph>
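As an illustration only, the frequency comparison and the regular-expression checks described in the paragraph above could be sketched in Python as follows. The natural logarithm, the handling of words never seen in MEDLINE, the list of measurement units, and all function names are assumptions made for this sketch; they are not details given in the text.

```python
import math
import re

# Suffixes are the ones listed in the text; the unit list is an assumption.
DRUG_SUFFIX = re.compile(r"(?:ole|ane|ate|ide|ine|ite|ol|ose|cooh)$", re.IGNORECASE)
MEASUREMENT = re.compile(
    r"^\d+(?:\.\d+)?\s*(?:mg|ug|ng|g|ml|l)(?:/(?:ml|l|kg))?$", re.IGNORECASE
)


def non_biological_terms(medline_counts, wsj_counts, threshold=1.0):
    """Return WSJ words whose log(p_MEDLINE / p_WSJ) falls below the threshold."""
    medline_total = sum(medline_counts.values())
    wsj_total = sum(wsj_counts.values())
    flagged = set()
    for word, wsj_n in wsj_counts.items():
        medline_n = medline_counts.get(word, 0)
        if medline_n == 0:
            # Assumption: a WSJ word never seen in MEDLINE is non-biological.
            flagged.add(word)
            continue
        ratio = (medline_n / medline_total) / (wsj_n / wsj_total)
        if threshold > math.log(ratio):  # natural log is an assumption
            flagged.add(word)
    return flagged


def looks_like_false_positive(term):
    """True if the term is a number plus unit or ends in a listed drug suffix."""
    return bool(MEASUREMENT.match(term) or DRUG_SUFFIX.search(term))
```

With toy counts, a word such as "market" that is far more frequent in WSJ than in MEDLINE is flagged, while a word that is much more frequent in MEDLINE is not.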
      <Paragraph position="2"> The false negative filter recovers a single-word name if it: 1) matches a list of 34,555 single-word names and 7,611 compound-word names compiled from LocusLink (Pruitt &amp; Maglott 2001) and the Gene Ontology Consortium (2000) (Wain et al., 2002) and contains a good context word before or after the name, or 2) contains a low-frequency trigram and a good context word before or after the name. The context words were automatically generated by a probabilistic algorithm, using the LocusLink/Gene Ontology set and a large collection of texts in which these gene names occur. We computed a log odds score, or Bayesian weight, for all non-gene-name words, indicating their propensity to predict an adjacent gene name in the texts.</Paragraph>
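The text does not give the exact formula for the context-word weights, so the following Python sketch shows only one plausible reading: a smoothed log ratio of how often a word occurs next to a known gene name versus elsewhere. The one-token window, the additive smoothing, and the function name are all assumptions.

```python
import math
from collections import Counter


def context_word_weights(sentences, gene_names, smoothing=1.0):
    """Log odds that a non-gene word sits directly beside a known gene name."""
    adjacent = Counter()   # word occurrences next to a gene name
    elsewhere = Counter()  # word occurrences anywhere else
    for tokens in sentences:
        for i, word in enumerate(tokens):
            if word in gene_names:
                continue
            # Assumption: "adjacent" means the token immediately left or right.
            neighbors = tokens[max(i - 1, 0):i] + tokens[i + 1:i + 2]
            if any(n in gene_names for n in neighbors):
                adjacent[word] += 1
            else:
                elsewhere[word] += 1
    vocab = set(adjacent) | set(elsewhere)
    adj_total = sum(adjacent.values()) + smoothing * len(vocab)
    else_total = sum(elsewhere.values()) + smoothing * len(vocab)
    return {
        w: math.log(((adjacent[w] + smoothing) / adj_total)
                    / ((elsewhere[w] + smoothing) / else_total))
        for w in vocab
    }
```

Under this reading, words with positive weights would be candidate "good context words"; where the actual cutoff lies is not stated in the text.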
      <Paragraph position="3"> Compound word names are recovered using terms that occur frequently in known gene names. Recombination of these terms produces compound words that also tend to be gene/protein names. These terms include the digits 1-9, the letters a-z, the Roman numerals, the Greek letters, functional descriptors (adhesion), organism identifiers (hamster), activity descriptors (promoting), placement indicators (early), and generic descriptors (light). In addition to the 415 exact terms, we added regular expressions that allow for partial matches or special patterns such as words without vowels, words with numbers and letters, words in capital letters, and common prefixes and suffixes (-gene, -like, -ase).</Paragraph>
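The special patterns named in the paragraph above could be expressed as regular expressions along the following lines. The concrete expressions, the treatment of "y" as a consonant, and the helper name are assumptions for this sketch; the original expressions are not given in the text.

```python
import re

# One hypothetical regular expression per pattern family named in the text.
NAME_PATTERNS = {
    "no_vowels": re.compile(r"^[b-df-hj-np-tv-z]+$", re.IGNORECASE),
    "letters_and_digits": re.compile(r"^(?=.*[A-Za-z])(?=.*\d)[A-Za-z\d-]+$"),
    "all_caps": re.compile(r"^[A-Z]{2,}$"),
    "suffix": re.compile(r"(?:gene|like|ase)$", re.IGNORECASE),
}


def matching_patterns(token):
    """Return the names of all pattern families that the token matches."""
    return [name for name, rx in NAME_PATTERNS.items() if rx.search(token)]
```

A token matching any family would then be a candidate component of a compound name, to be combined with the exact-term list.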
      <Paragraph position="4"> Finally, Bayesian learning (Langley 1996, Mitchell 1997, Wilbur 2000) is applied to rank documents by similarity to documents with known gene/protein names. Documents below a certain threshold are considered to have no gene/protein names in them.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Modifications for Full Text Articles
</SectionTitle>
      <Paragraph position="0"> The full text PMC articles are longer than abstracts, and contain extraneous information like grant numbers and laboratory reagents, along with figures and tables. An attempt to take windows of varying sizes of the full text in order to rank the windows by similarity to abstracts with known gene names was unsuccessful. High scoring windows often hid false positives, and low scoring windows could contain gene and protein name contexts infrequently encountered in Medline abstracts.</Paragraph>
      <Paragraph position="1"> However, we determined that the classifier could be used at the sentence level for full text articles, and we show the effect of assuming that sentences scoring below a threshold of zero do not contain gene/protein names.</Paragraph>
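A minimal sketch of sentence-level filtering is given below, assuming the classifier score of a sentence is a sum of per-word weights. That scoring form, the replacement tag "O", and the function names are assumptions; the text states only that sentences below a zero threshold are assumed to contain no gene/protein names.

```python
def sentence_score(tokens, word_weights):
    """Sum per-word weights; words without a weight contribute nothing."""
    return sum(word_weights.get(t.lower(), 0.0) for t in tokens)


def drop_tags_below_threshold(tagged_sentences, word_weights, threshold=0.0):
    """Remove GENE tags from sentences whose score is below the threshold."""
    kept = []
    for sentence in tagged_sentences:
        tokens = [word for word, _ in sentence]
        if threshold > sentence_score(tokens, word_weights):
            # "O" is a placeholder tag chosen for this sketch.
            sentence = [(w, "O" if tag == "GENE" else tag) for w, tag in sentence]
        kept.append(sentence)
    return kept
```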
      <Paragraph position="2"> We tried to increase the performance of ABGene on the PMC articles by adding a final processing step. We ran ABGene on 2.16 million Medline abstracts similar to documents with known gene names, and extracted 2.42 million unique gene/protein names. We counted the number of times each unique name was given the GENE tag by ABGene in the 2.16 million abstracts, and then extracted three groups of putative gene/protein names from this large set, with count thresholds at 10 (134,809 names), 100 (13,865 names) and 1,000 (1,136 names).</Paragraph>
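The counting step could be sketched as follows, assuming each abstract is available as a list of (name, tag) pairs produced by the tagger. The input representation and the function name are assumptions; the thresholds are the ones given in the text.

```python
from collections import Counter


def putative_name_lists(tagged_abstracts, thresholds=(10, 100, 1000)):
    """Group names by how often they received the GENE tag across abstracts."""
    counts = Counter()
    for abstract in tagged_abstracts:
        for name, tag in abstract:
            if tag == "GENE":
                counts[name] += 1
    # One set per threshold: names tagged GENE at least that many times.
    return {t: {name for name, n in counts.items() if n >= t} for t in thresholds}
```

Each higher threshold yields a subset of the lower one, which matches the shrinking list sizes reported above.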
      <Paragraph position="3"> During the final stage of processing, terms in sentences with scores greater than 2 are checked against these lists of putative gene/protein names. We show the effect of tagging terms with counts of at least 10, 100 and 1,000 in the putative gene/protein list.</Paragraph>
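The final lookup could be sketched as below, assuming sentence scores have already been computed and that a term is a single token. The data layout and the function name are assumptions; the score cutoff of 2 comes from the text.

```python
def tag_known_names(scored_sentences, known_names, min_score=2.0):
    """Tag listed names as GENE, but only in sentences scoring above min_score."""
    result = []
    for score, tagged in scored_sentences:
        if score > min_score:
            tagged = [(w, "GENE" if w in known_names else tag) for w, tag in tagged]
        result.append(tagged)
    return result
```

Passing the list built with a count threshold of 10, 100 or 1,000 as `known_names` would reproduce the three conditions compared in the paragraph above.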
    </Section>
  </Section>
</Paper>