<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1140"> <Title>High-Performance Tagging on Medical Texts</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> The application of language technology in the medical field, dubbed medical language processing (MLP), is gaining rapid recognition (for a survey, cf. Friedman and Hripcsak (1999)). It is both important, because there is strong demand for all kinds of computer support for health care and clinical services aimed at improving their quality and decreasing their costs, and challenging, given the miracles of medical sublanguage, the various text genres one encounters, and the enormous breadth of expertise surfacing as medical terminology.</Paragraph> <Paragraph position="1"> However, the development of human language technology for written language material has, until now, focused almost exclusively on newswire or newspaper genres. This is most prominently evidenced by the PENN TREEBANK (Marcus et al., 1993). Its value as one of the most widely used language resources mainly derives from two features. First, it supplies everyday, non-specialist document sources, such as the Wall Street Journal, and, second, it contains value-added, viz. annotated, linguistic data. Since understanding newspaper material imposes no particular requirements on its reader other than mastery of general English and common-sense knowledge, almost everybody can deal with it easily. This is essential for the accomplishment of the second task, viz. the annotation and reuse of part-of-speech (POS) tags and parse trees as the result of linguistic analysis. With the help of such resources, whole generations of state-of-the-art taggers, chunkers, and grammar and lexicon learners have evolved.</Paragraph> <Paragraph position="2"> The medical field poses new challenges. 
First, medical documents exhibit a large variety of structural features not encountered in newspaper documents (the genre problem), and, second, the understanding of medical language requires an enormous amount of a priori medical expertise (the domain problem). Hence, the question arises: how portable are results from the newspaper domain to the medical domain? We will deal with these issues, focusing on the portability of taggers, from two perspectives. We first pick up off-the-shelf technology, in our case the rule-based Brill tagger (Brill, 1995) and the statistically based TNT tagger (Brants, 2000), both trained on newspaper data, and run them on medical text data. One may wonder how taggers trained on newspaper language perform on medical language. Furthermore, one may ask whether it is necessary (and, if so, costly) to retrain these taggers on a medical corpus, if one were at hand. These questions seem to be of particular importance, because the use of off-the-shelf language technology for MLP applications has recently been questioned (Campbell and Johnson, 2001). Answers will be given in Section 2.</Paragraph> <Paragraph position="3"> Once a large annotated medical corpus becomes available, additional questions can be tackled. Will taggers, for example, improve their performance substantially when trained on medical data, or is this more or less irrelevant? Also, if medical sublanguage particularities can already be identified on the level of POS co-occurrences, would it be a good idea to enhance newspaper-oriented, general-purpose tagsets with dedicated medical tags? Finally, does this extension have a bearing on the performance of tagging medical documents and, if so, to what extent? We will elaborate on these questions in Section 4.</Paragraph> </Section> </Paper>