<?xml version="1.0" standalone="yes"?>
<Paper uid="P99-1044">
  <Title>Syntagmatic and Paradigmatic Representations of Term Variation</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> In the classical approach to text retrieval, terms are assigned to queries and documents. The terms are generated by a process called automatic indexing. Then, given a query, the similarity between the query and the documents is computed and a ranked list of documents is produced as output of the system for information access (Salton and McGill, 1983).</Paragraph>
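As a rough illustration of the classical pipeline described above, the following Python sketch indexes texts by their terms, computes a similarity between the query and each document, and returns a ranked list. The whitespace tokenizer, the cosine measure over term counts, and the toy documents are all assumptions made for illustration; the text above does not commit to a particular indexing or similarity scheme.

```python
import math
from collections import Counter

def index_terms(text):
    """Hypothetical stand-in for automatic indexing: lowercase whitespace tokens."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    shared = set(a).intersection(b)
    dot = sum(a[t] * b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0

def rank(query, documents):
    """Return (doc_id, score) pairs sorted by decreasing similarity to the query."""
    q = index_terms(query)
    scored = [(doc_id, cosine(q, index_terms(text))) for doc_id, text in documents.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy collection (illustrative only).
documents = {
    "d1": "blood cell count",
    "d2": "soil moisture measurement",
    "d3": "count of cells in blood",
}
ranking = rank("blood cell", documents)
```

With these toy documents, d3 shares only the token "blood" with the query, because "cells" and "cell" are treated as unrelated terms by this naive indexer.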
    <Paragraph position="1"> The similarity between queries and documents depends on the terms they have in common. The same concept can be formulated in many different ways, known as variants, which should be conflated in order to avoid missing relevant documents. For this purpose, this paper proposes a novel model of term variation that integrates linguistic knowledge and performs accurate term normalization. It relies on previous or ongoing linguistic studies on this topic (Sparck Jones and Tait, 1984; Jacquemin et al., 1997; Hamon et al., 1998). Terms are described in a two-tier framework composed of a paradigmatic level and a syntagmatic level that account for the three linguistic dimensions of term variability (morphology, syntax, and semantics). Term variants are extracted from tagged corpora through FASTR (footnote 1), a unification-based transformational parser described by Jacquemin et al. (1997).</Paragraph>
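To make the effect of conflation concrete, here is a minimal Python sketch in which known variants are mapped to a single canonical term before comparison. The variant table is hand-written and hypothetical, with one invented example for each of the three dimensions named above; it stands in for the output of a variant extractor and does not reproduce FASTR's unification-based mechanism.

```python
# Hypothetical variant table: each known variant maps to a canonical term.
# The entries and their labels are illustrative assumptions, not data from the paper.
VARIANTS = {
    "blood cell": "blood cell",       # the canonical form itself
    "blood cells": "blood cell",      # morphological variant (inflection)
    "cells of blood": "blood cell",   # syntactic variant (permutation)
    "blood corpuscle": "blood cell",  # semantic variant (synonym)
}

def normalize(phrase):
    """Map a phrase to its canonical term if it is a known variant."""
    key = " ".join(phrase.lower().split())
    return VARIANTS.get(key, key)

def same_concept(a, b):
    """True when two phrases conflate to the same canonical term."""
    return normalize(a) == normalize(b)
```

For example, same_concept("Cells of blood", "blood corpuscle") holds under this table, whereas the two phrases share only one token when compared as raw strings.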
    <Paragraph position="2"> Four experiments are performed on the French and the English languages, and a measure of precision is provided for each of them. Two experiments are made on a French corpus [AGRIC] composed of 1.2 x 10^6 words of scientific abstracts in the agricultural domain and two on an English corpus [MEDIC] composed of 1.3 x 10^6 words of scientific abstracts in the medical domain. The two experiments in the French language are [AGRIC] + Word97 and [AGRIC] + AGROVOC. In the former, synonymy links are extracted from the Microsoft Word97 thesaurus; in the latter, semantic classes are extracted from the AGROVOC thesaurus, a thesaurus specialized in the agricultural domain (AGROVOC, 1995). In both experiments, morphological data are produced by a stemming algorithm applied to the MULTEXT lexical database (MULTEXT, 1998). The two experiments on the English language are [MEDIC] + WordNet 1.6 and [MEDIC] + Word97; they correspond to two different sources of semantic knowledge. In both cases, the morphological data are extracted from CELEX (CELEX, 1998).</Paragraph>
    <Paragraph position="3"> 1 FASTR can be downloaded from www.limsi.fr/Individu/jacquemi/FASTR.</Paragraph>
  </Section>
</Paper>