<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-0903">
  <Title>Comparing corpora and lexical ambiguity</Title>
  <Section position="1" start_page="0" end_page="15" type="abstr">
    <SectionTitle>
Abstract
</SectionTitle>
    <Paragraph position="0"> In this paper we compare two types of corpus, focusing on the lexical ambiguity of each of them. The first corpus consists mainly of newspaper articles and literature excerpts, while the second belongs to the medical domain. To conduct the study, we have used two different disambiguation tools. However, first of all, we must verify the performance of each system in its respective application domain. We then use these systems in order to assess and compare both the general ambiguity rate and the particularities of each domain. Quantitative results show that medical documents are lexically less ambiguous than unrestricted documents. Our conclusions show the importance of the application area in the design of NLP tools.</Paragraph>
    <Paragraph position="1"> Introduction and background Although some large-scale evaluations carried out on unrestricted texts (Hersh 1998a, Spark-Jones 1999), and even on medical documents (Hersh 1998b), reach rather critical conclusions about using NLP tools for information retrieval, we believe that such tools are likely to solve some lexical ambiguity issues. We also believe that some special settings, particular to the application area, must be taken into account when developing such NLP tools.</Paragraph>
    <Paragraph position="2"> Let us recall two major problems in retrieving documents with NLP engines (Salton, 1988): 1-Expansion: the user is generally as interested in retrieving documents with semantically related words (synonyms, generics, specifics...) as in retrieving documents with exactly the same words. Thus, a query based on the word liver should be able to retrieve documents containing words such as hepatic. This expansion process is usually thesaurus-based. The thesaurus can be built manually or automatically (as, for example, in Nazarenko, 1997).</Paragraph>
    <Paragraph position="3"> 2-Disambiguation: a search based on tokens may retrieve irrelevant documents since tokens are often lexically ambiguous. Thus, face can refer to a body part, as a noun, or an action, as a verb. Finally, this latter problem may be split into two subproblems. The disambiguation task can be based on part-of-speech (POS) or word-sense (WS) information, but the order in which the two should be applied is still under discussion within the community.</Paragraph>
    <Paragraph position="4"> Although the target of our work (Ruch et al., 1999, Bouillon et al., 2000) is a fine-grained semantic disambiguation of medical texts for IR purposes, we believe that POS disambiguation is an important preliminary step. Therefore this paper focuses on POS tagging, and compares morpho-syntactic lexical ambiguities (MSLA) in medical texts to MSLA in unrestricted corpora.</Paragraph>
    <Paragraph position="5"> Although the results of the study conform to preliminary naive expectations, the method is quite original (see footnote 1). Most comparative studies dedicated to corpora have addressed the problem by applying metrics to word entities or word pieces (as in studies working with n-gram strings), or to special sets of words (the indexing terms, see Salton, 1988) as in the vector-space model (see Kilgarriff, 1996, for a survey of these methods), whereas the present paper attempts to compare corpora at a morpho-syntactic (MS) level. Footnote 1: We do not claim to be pioneers in the domain, as other authors (Biber 1998, Folch et al., 2000) are exploring similar metrics. However, it is interesting to note that these authors have rarely questioned the adaptation of NLP tools from a technical point of view, or used such comparisons to feed back into the design of NLP systems.</Paragraph>
    <Paragraph position="6">  1 Validating each tagger in its respective domain In order to conduct the comparative study, we used two different morphological analysers; each one has a specific lexicon tailored for its application field. The first system is specialised for tagging medical texts (Ruch et al., 2000), while the second is a general parser (based on FIPS, cf. Wehrli, 1992).</Paragraph>
    <Paragraph position="7"> For comparing lexical ambiguities on a minimal common base, the output of each morphological analyser is first mapped into its respective tagset (more than 300 fine-grained tags for FIPSTAG, and about 80 for the morpheme-based medical tagger). The tagsets are then converted into a subset of the medical tagger. Finally, about 50 different items constitute this minimal common tagset (MCT), which will serve for comparing both corpora.</Paragraph>
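    <Paragraph> The conversion into the minimal common tagset can be pictured as a many-to-one projection of tags. The following is only a minimal sketch under stated assumptions: the tag names, the two mapping tables and the helper to_common_tagset are hypothetical illustrations, since the actual FIPSTAG, medical and MCT inventories are not listed here.

```python
# Hypothetical fine-grained tags; the real FIPSTAG and medical tagsets
# are not given in the text, so these names are illustrative only.
FIPSTAG_TO_MCT = {
    "NOUN-MASC-SG": "NOUN",
    "NOUN-FEM-PL": "NOUN",
    "VERB-IND-PRES-3SG": "VERB",
    "ADV-LOC": "ADV",
}
MEDICAL_TO_MCT = {
    "n_sg": "NOUN",
    "v_fin": "VERB",
    "adv": "ADV",
}


def to_common_tagset(tags, mapping):
    """Project a set of tagger-specific tags onto the common tagset."""
    return {mapping[tag] for tag in tags}


# Three fine-grained readings collapse into two common-tagset readings.
fips_tags = {"NOUN-MASC-SG", "NOUN-FEM-PL", "VERB-IND-PRES-3SG"}
common = to_common_tagset(fips_tags, FIPSTAG_TO_MCT)
print(sorted(common))
```

    Because several fine-grained tags map to the same common tag, a token may look less ambiguous after projection than in the original tagset, which is why both corpora have to be compared on the same common tagset.</Paragraph>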
    <Paragraph position="8"> We collected two different sets of documents to be tagged at a lexical level via the predefined MCT: this step assigns a set of tags to every token. This set of tags may come from the lexicon or from the POS guesser. As we are using guessers, the empty set (or the tag for unknown tokens) is forbidden. However, first of all, it is necessary to verify the lexical coverage of each system for each corpus, as we need to be sure that the lexical ambiguities provided by each system are necessary and sufficient.</Paragraph>
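    <Paragraph> The lexical tagging step described above (lexicon lookup, with a guesser as fallback so that no token receives an empty tag set) can be sketched as follows. This is an illustrative toy, not the authors' implementation: the lexicon entries, the suffix rule in guess_tags, and the definition of the ambiguity rate as the share of tokens with more than one candidate tag are all assumptions made for the example.

```python
# Toy lexicon with English entries; the real lexicons are not shown here.
TOY_LEXICON = {
    "face": {"NOUN", "VERB"},
    "the": {"DET"},
    "liver": {"NOUN"},
}


def guess_tags(token):
    """Hypothetical guesser: always returns a non-empty tag set."""
    if token.endswith("ly"):
        return {"ADV"}
    return {"NOUN", "ADJ"}


def candidate_tags(token):
    """Use the lexicon when possible, otherwise fall back to the guesser."""
    tags = TOY_LEXICON.get(token.lower())
    return tags if tags else guess_tags(token)


def ambiguity_rate(tokens):
    """Share of tokens carrying more than one candidate tag (assumed metric)."""
    ambiguous = sum(1 for token in tokens if len(candidate_tags(token)) > 1)
    return ambiguous / len(tokens)


tokens = ["the", "face", "liver", "hepatic"]
rate = ambiguity_rate(tokens)
print(rate)
```

    In this toy sample, face is ambiguous through the lexicon and hepatic through the guesser, so half of the tokens are ambiguous.</Paragraph>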
    <Paragraph position="9"> The corpus of unrestricted texts consists of 16003 tokens: about one third newspaper articles (Le Monde), one third literature excerpts (provided by the InaLF, http://www.inalf.fr), and a slightly smaller third consisting mainly of texts for children. Approximately a quarter (3987 tokens) of this corpus is used for evaluating FIPSTAG tagging results (the tool together with some explanations can be found at http://latl.unige.ch). In parallel, we chose three types of medical texts to make up the medical corpus: it represents 16024 tokens, in three equal thirds: discharge summaries, surgical reports, and laboratory or test results (in this case, tables were removed). Again, a regularly distributed quarter (4016 tokens) of this corpus is used for assessing the medical tagger.</Paragraph>
    <Paragraph position="10"> The test samples used for assessing the results of each tagger are annotated manually before measuring the performances, but in both cases we sometimes had to modify the word segmentation of the test samples. This is particularly true for FIPSTAG, which handles several acceptable but unusual collocations (which gather more than one 'word'), as for example en avion (English: by plane), which is considered as one lexical item, tagged as an adverb. For the lexical tagger we had to modify the 'word' segmentation in the other direction (for tagging items smaller than 'words'), as morphemes were also tagged. Table 1 gives the results for FIPSTAG, and table 2 gives the results for the medical tagger. In the case of the medical tagger, together with the error rate and the success rate, we provide results of the residual ambiguity rate: the basic idea is that the system does not attempt to solve what it is not likely to solve well (cf. Ruch et al. 2000; a similar idea can be found in Silberztein 1997).</Paragraph>
    <Paragraph position="11">  A comparison of the tagging scores (99.3 vs. 98.5) confirms that both systems behave in an equivalent way in their respective application area (see footnote 2).</Paragraph>
    <Paragraph position="12"> 2 Out of curiosity, we ran each tagger on a small sample of the other domain. The tests were made without any adaptation. FIPSTAG made 27 errors in a medical sample of 849 tokens, i.e. an error rate of 3.2%. The medical tagger made 18 errors in a general sample of 747 tokens, which means an error rate of 2.4%. In the case of the medical tagger, 41 tokens</Paragraph>
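    <Paragraph> The cross-domain error rates quoted in footnote 2 follow directly from the error and token counts given there. A minimal check, in which the helper name error_rate is our own and only the four counts come from the text:

```python
def error_rate(errors, tokens):
    """Error rate as a percentage of the tokens in the sample."""
    return 100.0 * errors / tokens


# Counts taken from footnote 2.
fipstag_on_medical = error_rate(27, 849)
medical_on_general = error_rate(18, 747)

print(f"FIPSTAG on medical sample: {fipstag_on_medical:.1f}%")
print(f"Medical tagger on general sample: {medical_on_general:.1f}%")
```

    Rounded to one decimal place, these values agree with the 3.2% and 2.4% reported in the footnote.</Paragraph>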
  </Section>
</Paper>