<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-1028">
  <Title>Improved Automatic Keyword Extraction Given More Linguistic Knowledge</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Automatic keyword assignment is a research topic that has received less attention than it deserves, considering keywords' potential usefulness. Keywords may, for example, serve as a dense summary for a document, lead to improved information retrieval, or be the entrance to a document collection. However, relatively few documents have keywords assigned, and therefore finding methods to automate the assignment is desirable.</Paragraph>
    <Paragraph position="1"> A related research area is that of terminology extraction (see, e.g., Bourigault et al. (2001)), where all terms describing a domain are to be extracted.</Paragraph>
    <Paragraph position="2"> The aim of keyword assignment is to find a small set of terms that describes a specific document, independently of the domain it belongs to. However, keyword assignment may well benefit from the results of terminology extraction, as appropriate keywords are often terminological in character.</Paragraph>
    <Paragraph position="3"> In this work, automatic keyword extraction is treated as a supervised machine learning task, an approach first proposed by Turney (2000). Two important issues are how to define the potential terms and which features of these terms are considered discriminative, i.e., how to represent the data and, consequently, what is given as input to the learning algorithm. In this paper, experiments with three term selection approaches are presented: n-grams; noun phrase (NP) chunks; and terms matching any of a set of part-of-speech (POS) tag sequences. Four different features are used: term frequency, collection frequency, relative position of the first occurrence, and the POS tag(s) assigned to the term.</Paragraph>
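    <Paragraph position="4"> As a rough illustration of the data representation described above, the following Python sketch computes the four features for n-gram candidate terms. It is not the implementation used in this work. The function names (ngrams, contains, term_features), the input format (a document given as already POS-tagged (word, tag) pairs), and the exact feature definitions are assumptions made for the example: term frequency is a raw count, collection frequency is taken to be the number of collection documents containing the term, and relative position is the token index of the first occurrence divided by the document length.

```python
def ngrams(tokens, max_n=3):
    """Yield (start_index, term) for every n-gram of length 1..max_n."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield i, tuple(tokens[i:i + n])


def contains(doc_tokens, term):
    """Return True if the token list contains the term as a contiguous sequence."""
    n = len(term)
    return any(tuple(doc_tokens[j:j + n]) == term
               for j in range(len(doc_tokens) - n + 1))


def term_features(tagged_doc, collection, max_n=3):
    """Compute the four features for each n-gram candidate term.

    tagged_doc: list of (word, pos_tag) pairs for one document (assumed input).
    collection: list of lowercased token lists, one per collection document.
    Returns a dict mapping each term (a tuple of words) to its features.
    """
    words = [word.lower() for word, _ in tagged_doc]
    tags = [tag for _, tag in tagged_doc]
    features = {}
    for i, term in ngrams(words, max_n):
        if term not in features:
            # Within each n-gram length, start indices ascend, so the first
            # time a term is seen is its first occurrence in the document.
            features[term] = {
                "tf": 0,  # raw within-document count (assumption)
                "cf": sum(1 for doc in collection if contains(doc, term)),
                "first_pos": i / len(words),  # relative position in [0, 1)
                "pos_tags": tuple(tags[i:i + len(term)]),
            }
        features[term]["tf"] += 1
    return features
```

    Calling term_features on one tagged document together with a tokenized collection returns one feature dictionary per candidate term. Such dictionaries could then be converted into labeled training instances for a supervised learner, with the label indicating whether the term was assigned as a keyword.</Paragraph>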
  </Section>
</Paper>