<?xml version="1.0" standalone="yes"?> <Paper uid="W04-1208"> <Title>Distributed Modules for Text Annotation and IE applied to the Biomedical Domain</Title> <Section position="7" start_page="52" end_page="52" type="concl"> <SectionTitle> 4 Conclusion </SectionTitle> <Paragraph position="0"> The presented server solution has been set up to support curators of biomedical facts in their work. Its modules identify domain knowledge for molecular biologists and automatically link it to public data resources. We are unaware of any existing solution that, like ours, can integrate modules for information extraction tasks into an XML-based process pipeline. In collaboration with curation teams for UniProt and COSMIC, the modules will be evaluated for their usefulness in the curation process.</Paragraph> <Paragraph position="1"> Eventually, information will be automatically extracted and inserted into public databases.</Paragraph> <Paragraph position="2"> Every module needs proper evaluation. Mutation extraction already produces reliable data (Rebholz-Schuhmann et al., 2004), but will be extended to chromosomal aberrations. The protein-protein interaction module relies on chunk parsing and demonstrates how NLP is integrated as a separate module. Together with the curation teams, individual modules will be adapted to their needs. In particular, the integration of controlled vocabularies for species and tissue types is of strong interest, as are additional NLP modules, e.g. for the identification of gene regulation.</Paragraph> <Paragraph position="3"> Any combination of modules has to take the dependencies between modules into account to allow efficient handling of information extraction tasks. When a user requests tagging of UniProt protein and gene names as well as information extraction for protein-protein interactions, the former request is redundant, because the tagging has to be run anyway for the latter to work. 
Ultimately, the curation teams will propose the combination of modules that they need.</Paragraph> <Paragraph position="4"> Normalization of identified information is another step. One example is the simplification of acronym definitions, e.g. the transformation of &quot;...androgen receptor (AR) ...&quot; into &quot;<ac id='1'>AR</ac>&quot; with metadata accompanying the sentence that specifies the expansion &quot;<ex id='1'>androgen receptor</ex>&quot;. The result is normalized text that is easier to parse and thereby leads to better IE results.</Paragraph> <Paragraph position="5"> The server has been tested on Medline abstracts and on PDF documents (full papers from Medline). As Shah et al. (2003) have shown, the sections of full-text scientific publications have noticeably different information content.</Paragraph> <Paragraph position="6"> The modular system described allows us to easily add a module for sectioning full-text publications.</Paragraph> </Section> </Paper>