File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/concl/04/w04-1221_concl.xml
Size: 2,287 bytes
Last Modified: 2025-10-06 13:54:20
<?xml version="1.0" standalone="yes"?> <Paper uid="W04-1221"> <Title>Biomedical Named Entity Recognition Using Conditional Random Fields and Rich Feature Sets</Title> <Section position="5" start_page="106" end_page="106" type="concl"> <SectionTitle> 5 Conclusions and Future Work </SectionTitle> <Paragraph position="0"> In short, I have presented in detail a framework for recognizing multiple entity classes in biomedical abstracts with Conditional Random Fields. I have shown that a CRF-based model with only simple orthographic features can achieve performance near the current state of the art, while using semantic lexicons (as presented here) does not positively affect performance.8 While the system presented here shows promise, there is still much to be explored.</Paragraph> <Paragraph position="1"> Richer syntactic information such as shallow parsing may be useful. The method introduced in Section 3.2 to generate semantic keywords can also be adapted to generate features for entity-specific morphology (e.g. affixes) and context, both linearly (e.g. neighboring words) and hierarchically (e.g. from a parse).</Paragraph> <Paragraph position="2"> Most interesting, though, might be to investigate why the lexicons do not generally help. One explanation is simply an issue of tokenization. While one abstract refers to &quot;IL12,&quot; others may write &quot;IL-12&quot; or &quot;IL 12.&quot; Similarly, the generalization of entities to groups (e.g. &quot;x antibody&quot; vs. &quot;x antibodies&quot;) can cause problems for these rigid lexicons that require exact matching. Enumerating all such variants for every entry in a lexicon is absurd. Perhaps relaxing the matching criteria and standardizing tokenization for both the input and lexicons will improve their utility.</Paragraph> <Paragraph position="3"> 8 More recent work (not submitted for evaluation) indicates that lexicons are indeed useful, but mainly when training data are limited.
I have also found that using orthographic features with part-of-speech tags and only the RNA and CELL-LINE (rare class) lexicons can boost overall F1 to 70.3 on the evaluation data, with particular improvements for the RNA and CELL-LINE entities.</Paragraph> </Section> </Paper>