
<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-1217">
  <Title>Exploiting Context for Biomedical Entity Recognition: From Syntax to the Web</Title>
  <Section position="3" start_page="0" end_page="89" type="metho">
    <SectionTitle>
2 System Description
</SectionTitle>
    <Paragraph position="0"> Our system is a Maximum Entropy Markov Model, which further develops a system earlier used for the CoNLL 2003 shared task (Klein et al., 2003) and the 2004 BioCreative critical assessment of information extraction systems, a task that involved identifying gene and protein name mentions but not distinguishing between them (Dingare et al., 2004). Unlike the above two tasks, many of the entities in the current task do not have good internal cues for distinguishing the class of entity: various systematic polysemies and the widespread use of acronyms mean that internal cues are lacking. The challenge was thus to make better use of contextual features, including local and syntactic features, and external resources in order to succeed at this task.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Local Features
</SectionTitle>
      <Paragraph position="0"> We used a variety of features describing the immediate content and context of each word, including the word itself, the previous and next words, word prefixes and suffix of up to a length of 6 characters, word shapes, and features describing the named entity tags assigned to the previous words. Word shapes refer to a mapping of each word onto equivalence classes that encodes attributes such as length, capitalization, numerals, greek letters, and so on.</Paragraph>
      <Paragraph position="1"> For instance, &amp;quot;Varicella-zoster&amp;quot; would become Xxxxx, &amp;quot;mRNA&amp;quot; would become xXXX, and &amp;quot;CPA1&amp;quot; would become XXXd. We also incorporated part-of-speech tagging, using the TnT tagger(Brants, 2000) retrained on the GENIA corpus gold standard part-of-speech tagging. We also used various interaction terms (conjunctions) of these base-level features in various ways. The full set of local features is outlined in Table 1.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="88" type="sub_section">
      <SectionTitle>
2.2 External Resources
</SectionTitle>
      <Paragraph position="0"> We made use of a number of external resources, including gazetteers, web-querying, use of the surrounding abstract, and frequency counts from the British National Corpus.</Paragraph>
      <Paragraph position="1">  in a pair has been assigned a different tag than the other in a window of 4 words  Many entries in gazetteers are ambiguous words, occasionally used in the sense that the gazetteer seeks to represent, but at least as frequently not. So while the information that a token was seen in a gazetteer is an unreliable indicator of whether it is an entity, less frequent words are less likely to be ambiguous than more frequent ones. Additionally, more frequent words are likely to have been seen often in the training data and the system should be better at classifying them, while less frequent words are a common source of error and their classification is more likely to benefit from the use of external resources. We assigned each word in the training and testing data a frequency category corresponding to its frequency in the British National Corpus, a 100 million word balanced corpus, and used conjunctions of this category and certain other features.  Our gazetteer contained only gene names and was compiled from lists from biomedical websites (such as LocusLink) as well as from the Gene Ontology and the data provided for the BioCreative 2004 tasks. The final gazetteer contained 1,731,496 entries. Because it contained only gene names, and for the reasons discussed earlier, we suspect that it was not terribly useful for identifying the presences of entities, but rather that it mainly helped to establish the exact beginning and ending point of multi-word entities recognized mainly through other features.  For each of the named entity classes, we built indicative contexts, such as &amp;quot;X mRNA&amp;quot; for RNA, or &amp;quot;X ligation&amp;quot; for protein. For each entity X which had a frequency lower than 10 in the British National Corpus, we submitted instantiations of each pattern to the web, using the Google API, and obtained the number of hits. The pattern that returned the highest number of hits determined the feature value (e.g., &amp;quot;web-protein&amp;quot;, or &amp;quot;web-RNA&amp;quot;). If no hits were returned by any pattern, a value &amp;quot;O-web&amp;quot; was assigned. This value was also assigned to all words whose frequency was higher than 10 (using yet another value for words with higher frequency did not improve the tagger's performance).</Paragraph>
      <Paragraph position="2">  A number of NER systems have made effective use of how the same token was tagged in different parts of the same document (see (Curran and Clark, 2003) and (Mikheev et al., 1999)). A token which appears in an unindicative context in one sentence may appear in a very obvious context in another sentence in the same abstract. To leverage this we tagged each abstract twice, providing for each token a feature indicating whether it was tagged as an entity elsewhere in the abstract. This information was only useful when combined with information on frequency. null</Paragraph>
    </Section>
    <Section position="3" start_page="88" end_page="89" type="sub_section">
      <SectionTitle>
2.3 Deeper Syntactic Features
</SectionTitle>
      <Paragraph position="0"> While the local features discussed earlier are all fairly surface level, our system also makes use of deeper syntactic features. We fully parsed the training and testing data using the Stanford Parser of (Klein and Manning, 2003) operating on the TnT part-of-speech tagging - we believe that the unlexicalized nature of this parser makes it a particularly suitable statistical parser to use when there is a large domain mismatch between the training material (Wall Street Journal text) and the target domain, but have not yet carefully evaluated this.</Paragraph>
      <Paragraph position="1"> Then, for each word in the sentence which is inside a noun phrase, the head and governor of the noun phrase are extracted. These features are not very useful when identifying only two classes (such as GENE and OTHER in the BioCreative task), but they were quite useful for this task because of the large number of classes which the system needed to distinguish between. Because the classifier is now  choosing between classes where members can look very similar, longer distance information can provide a better representation of the context in which the word appears. For instance, the word phosphorylation occurs in the training corpus 492 times, 482 of which it is was classified as other. However, it is the governor of 738 words, of which 443 are protein, 292 are other and only 3 are cell line.</Paragraph>
      <Paragraph position="2"> We also made use of abbreviation matching to help ensure consistency of labels. Abbreviations and long forms were extracted from the data using the method of (Schwartz and Hearst, 2003). This data was combined with a list of other abbreviations and long forms extracted from the BioCreative 2004 task. Then all occurrences of either the long or short forms in the data was labeled. These labels were included in the system as features and helped to improve boundary detection.</Paragraph>
    </Section>
    <Section position="4" start_page="89" end_page="89" type="sub_section">
      <SectionTitle>
2.4 Adjacent Entities
</SectionTitle>
      <Paragraph position="0"> When training our classifier, we merged the B- and I- labels for each class, so it did not learn how to differentiate between the first word of a class and internal word. There were several motivations for doing this. Foremost was memory concerns; our final system trained on just the six classes had 1.5 million features - we just did not have the resources to train it over more classes without giving up many of our features. Our second motivation was that by merging the beginning and internal labels for a particular class, the classifier would see more examples of that class and learn better how to identify it. The drawback of this move is that when two entities belonging to the same class are adjacent, our classifier will automatically merge them into one entity. We did attempt to split them back up using NP chunks, but this severely reduced performance.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>