<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-0309">
  <Title>Biomedical Text Retrieval in Languages with a Complex Morphology</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Morphological alterations of a search term have a negative impact on the recall performance of an information retrieval (IR) system (Choueka, 1990; Jäppinen and Niemistö, 1988; Kraaij and Pohlmann, 1996), since they preclude a direct match between the search term proper and its morphological variants in the documents to be retrieved. In order to cope with such variation, morphological analysis is concerned with the reverse processing of inflection (e.g., 'search⊕ed', 'search⊕ing'; '⊕' denotes the string concatenation operator), derivation (e.g., 'search⊕er' or 'search⊕able') and composition (e.g., German 'Blut⊕hoch⊕druck' ['high blood pressure']). The goal is to map all occurring morphological variants to some canonical base form, e.g., 'search' in the examples above.</Paragraph>
    <Paragraph position="1"> The efforts required for performing morphological analysis vary from language to language. For English, known for its limited number of inflection patterns, lexicon-free general-purpose stemmers (Lovins, 1968; Porter, 1980) demonstrably improve retrieval performance.</Paragraph>
    <Paragraph position="2"> This has been reported for other languages, too, depending on the generality of the chosen approach (Jäppinen and Niemistö, 1988; Choueka, 1990; Popovic and Willett, 1992; Ekmekçioglu et al., 1995; Hedlund et al., 2001; Pirkola, 2001). When it comes to a broader scope of morphological analysis, including derivation and composition, only restricted, domain-specific algorithms exist, even for English. This is particularly true for the medical domain. From an IR view, a lot of specialized research has already been carried out for medical applications, with emphasis on the lexico-semantic aspects of dederivation and decomposition (Pacak et al., 1980; Norton and Pacak, 1983; Wolff, 1984; Wingert, 1985; Dujols et al., 1991; Baud et al., 1998).</Paragraph>
    <Paragraph position="3"> While one may argue that single-word compounds are quite rare in English (although this does not hold in the medical domain), this is certainly not true for German and other basically agglutinative languages known for excessive single-word nominal compounding. This problem becomes even more pressing for technical sublanguages, such as medical German (e.g., 'Blut⊕druck⊕mess⊕gerät' translates to 'device for measuring blood pressure').</Paragraph>
    <Paragraph position="4"> The problem one faces from an IR point of view is that besides fairly standardized nominal compounds, which already form a regular part of the sublanguage proper, a myriad of ad hoc compounds are formed on the fly which cannot be anticipated when formulating a retrieval query though they appear in relevant documents. Hence, enumerating morphological variants in a semi-automatically generated lexicon, such as proposed for French (Zweigenbaum et al., 2001), turns out to be infeasible, at least for German and related languages.</Paragraph>
    <Paragraph position="5"> Proceedings of the Workshop on Natural Language Processing in the Biomedical Domain, Philadelphia, July 2002, pp. 61-68. Association for Computational Linguistics.</Paragraph>
    <Paragraph position="6"> Furthermore, medical terminology is characterized by a typical mix of Latin and Greek roots with the corresponding host language (e.g., German), often referred to as neo-classical compounding (McCray et al., 1988). While this is simply irrelevant for general-purpose morphological analyzers, dealing with such phenomena is crucial for any attempt to cope adequately with medical free texts in an IR setting (Wolff, 1984).</Paragraph>
    <Paragraph position="7"> We here propose an approach to document retrieval which is based on the idea of segmenting query and document terms into basic subword units. Hence, this approach combines procedures for deflection, dederivation and decomposition. Subwords cannot be equated with linguistically significant morphemes, in general, since their granularity may be coarser than that of morphemes (cf. our discussion in Section 2). We validate our claims in Section 4 on a substantial biomedical document collection (cf. Section 3).</Paragraph>
  </Section>
</Paper>