File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/03/w03-1028_metho.xml
Size: 23,891 bytes
Last Modified: 2025-10-06 14:08:25
<?xml version="1.0" standalone="yes"?> <Paper uid="W03-1028"> <Title>Improved Automatic Keyword Extraction Given More Linguistic Knowledge</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Points of Departure </SectionTitle> <Paragraph position="0"> Treating automatic keyword extraction as a supervised machine learning task means that a classifier is trained by using documents with known keywords. The trained model is subsequently applied to documents for which no keywords are assigned: each defined term from these documents is classified either as a keyword or a non-keyword; or--if a probabilistic model is used--the probability of the defined term being a keyword is given. Turney (2000) presents results for a comparison between an extraction model based on a genetic algorithm and an implementation of bagged C4.5 decision trees for the task. The terms are all stemmed unigrams, bigrams, and trigrams from the documents, after stopword removal. The features used are, for example, the frequency of the most frequent phrase component; the relative number of characters of the phrase; the first relative occurrence of a phrase component; and whether the last word is an adjective, as judged by the unstemmed suffix. Turney reports that the genetic algorithm outputs better keywords than the decision trees. Part of the same training and test material is later used by Frank et al. (1999) for evaluating their algorithm in relation to Turney's algorithm. This algorithm, which is based on naive Bayes, uses a smaller and simpler set of features--term frequency, collection frequency (idf), and relative position--although it performs equally well.</Paragraph> <Paragraph position="1"> Frank et al. also discuss the addition of a fourth feature that significantly improves the algorithm, when trained and tested on domain-specific documents.</Paragraph> <Paragraph position="2"> This feature is the number of times a term is assigned as a keyword to other documents in the collection.</Paragraph> <Paragraph position="3"> It should be noted that the performance of state-of-the-art keyword extraction is much lower than that of many other NLP tasks, such as tagging and parsing, and there is plenty of room for improvement. To give an idea of this, the results obtained by the genetic algorithm trained by Turney (2000) and by the naive Bayes approach of Frank et al. (1999) are presented. The number of terms assigned must be explicitly limited by the user for these algorithms.</Paragraph> <Paragraph position="4"> Turney and Frank et al. report the precision for five and fifteen keywords per document. Recall is not reported in their studies. In Table 1 their results when training and testing on journal articles are shown, and the highest values for the two algorithms are presented.</Paragraph> <Paragraph position="5"> (Table 1: correct terms for Turney (2000)* and Frank et al. (1999)**, for five and fifteen extracted terms.)</Paragraph> <Paragraph position="6"> There are two drawbacks in common with the approaches proposed by Turney (2000) and Frank et al. (1999). First, the number of tokens in a keyword is limited to three.
In the data used to train the classifiers evaluated in this paper, 9.1% of the manually assigned keywords consist of four tokens or more, and the longest keywords have eight tokens.</Paragraph> <Paragraph position="7"> Secondly, the user must state how many keywords to extract from each document, as both algorithms, for each potential keyword, output the probability of the term being a keyword. This could be solved by manually setting a threshold value for the probability, but this decision should preferably be made by the extraction system.</Paragraph> <Paragraph position="8"> Finding potential terms--when no machine learning is involved in the process--by means of POS patterns is a common approach. For example, Barker and Cornacchia (2000) discuss an algorithm where the number of words and the frequency of a noun phrase, as well as the frequency of the head noun, are used to determine which terms are keywords. An extraction system called LinkIT (see e.g., Evans et al. (2000)) compiles the phrases having a noun as the head, and then ranks these according to the heads' frequency. Boguraev and Kennedy (1999) extract technical terms based on the noun phrase patterns suggested by Justeson and Katz (1995); these terms are then the basis for a headline-like characterisation of a document. The final example given in this paper is Daille et al. (1994), who apply statistical filters to the extracted noun phrases. In that study it is concluded that term frequency is the best filter candidate of the scores investigated. When POS patterns are used to extract potential terms, the problem lies in how to restrict the number of terms, and keep only those that are relevant.</Paragraph> <Paragraph position="9"> In the case of professional indexing, the terms are normally limited to a domain-specific thesaurus, but are not restricted to terms actually present in the document to which they are assigned. For example, Steinberger (2001) presents work where, as a first step, all lemmas in a document (after stop word removal) are ranked according to the log-likelihood ratio, thus obtaining a list of content descriptors. These terms are then used to assign thesaurus terms, which have been automatically associated with lemmas during a training phase. In this paper, however, the concern is not to limit the terms to a set of allowed terms.</Paragraph> <Paragraph position="10"> As opposed to Turney (2000) and Frank et al. (1999), who experiment with keyword extraction from full-length texts, this work concerns keyword extraction from abstracts. The reason for this is that many journal papers are not available as full-length texts, but as abstracts only, as is, for example, the case on the Internet.</Paragraph> <Paragraph position="12"> The starting point for this work was to examine whether the data representation suggested by Frank et al. was adequate for constructing a keyword extraction model from and for abstracts. As the results were poor, two alternatives to extracting n-grams as the potential terms were explored. The first approach was to extract all noun phrases in the documents as judged by an NP-chunker. The second selection approach was to define a set of POS tag sequences, and extract all words or sequences of words that matched any of these, relying on a POS tagger. These two different approaches mean that the length of the potential terms is not limited to something arbitrary, but reflects a linguistic property.
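As a rough illustration of the second selection approach, the sketch below extracts every word sequence whose POS tag sequence matches one of a set of allowed patterns. The tokenizer, the tagger, and the small pattern set are stand-ins chosen for this example (NLTK defaults and a handful of noun-centred Penn Treebank tag sequences); the paper's own tagger and its empirically derived pattern set are described in Section 4.1.

```python
# Minimal sketch of POS-pattern-based term selection. The patterns below are
# illustrative Penn Treebank tag sequences, not the pattern set derived from
# the training data in the paper. Requires the NLTK 'punkt' and
# 'averaged_perceptron_tagger' data packages.
import nltk

PATTERNS = {
    ("NN",), ("NNS",),
    ("JJ", "NN"), ("JJ", "NNS"),
    ("NN", "NN"), ("NN", "NNS"),
    ("JJ", "NN", "NN"),
}

def extract_pattern_terms(text, patterns=PATTERNS):
    """Return every word sequence whose POS tag sequence matches a pattern."""
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    max_len = max(len(p) for p in patterns)
    terms = []
    for start in range(len(tagged)):
        for length in range(1, max_len + 1):
            window = tagged[start:start + length]
            if len(window) < length:
                break  # reached the end of the document
            if tuple(tag for _, tag in window) in patterns:
                terms.append(" ".join(word for word, _ in window))
    return terms

print(extract_pattern_terms("Automatic keyword extraction is applied to scientific abstracts."))
```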
The solution to limiting the number of terms--as the majority of the extracted words or phrases are not keywords--was to apply a machine learning algorithm to decide which terms are keywords and which are not. The output from the machine learning algorithm is binary (a term is either a keyword or not); consequently, the system itself limits the number of extracted keywords per document. As for the features, a fourth feature was added to the ones used by Frank et al., namely the POS tag(s) assigned to the term. This feature turned out to dramatically improve the results.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 The Corpus </SectionTitle> <Paragraph position="0"> The collection used for the experiments described in this paper consists of 2 000 abstracts in English, with their corresponding title and keywords from the Inspec database. The abstracts are from the years 1998 to 2002, from journal papers, and from the disciplines Computers and Control, and Information Technology. Each abstract has two sets of keywords--assigned by a professional indexer--associated with it: a set of controlled terms, i.e., terms restricted to the Inspec thesaurus; and a set of uncontrolled terms that can be any suitable terms.</Paragraph> <Paragraph position="1"> Both the controlled terms and the uncontrolled terms may or may not be present in the abstracts. However, the indexers had access to the full-length documents when assigning the keywords. For the experiments described here, only the uncontrolled terms were considered, as these are present in the abstracts to a larger extent (76.2% as opposed to 18.1%).</Paragraph> <Paragraph position="2"> The set of abstracts was arbitrarily divided into three sets: a training set (to construct the model) consisting of 1 000 documents, a validation set (to evaluate the models, and select the best performing one) consisting of 500 documents, and a test set (to get unbiased results) with the remaining 500 abstracts. The set of manually assigned keywords was then removed from the documents. For all experiments the same training, validation, and test sets were used.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Building the Classifiers </SectionTitle> <Paragraph position="0"> This section begins with a discussion of the different ways in which the data were represented: in Section 4.1 the term selection approaches are described, and in Section 4.2 the features are discussed. Thereafter, a brief description of the machine learning approach is given. Finally, in Section 4.4, the training and the evaluation of the classifiers are discussed.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Three Term Selection Approaches </SectionTitle> <Paragraph position="0"> In this section, the three term selection approaches--in other words, the three definitions of what constitutes a term in a document--are described.</Paragraph> <Paragraph position="1"> n-grams In a first set of runs, the terms were defined in a manner similar to Turney (2000) and Frank et al. (1999). (Their studies were introduced in Section 2.) All unigrams, bigrams, and trigrams were extracted. Thereafter a stoplist (from Fox (1992)) was applied, and all terms beginning or ending with a stopword were removed. Finally, all remaining tokens were stemmed using Porter's stemmer (Porter, 1980).
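For concreteness, a minimal sketch of this n-gram selection step, with NLTK's English stopword list standing in for the stoplist of Fox (1992) and NLTK's implementation of Porter's stemmer:

```python
# Minimal sketch of the n-gram candidate selection: all unigrams, bigrams and
# trigrams are extracted, any n-gram beginning or ending with a stopword is
# discarded, and the remaining tokens are stemmed. NLTK's English stopword
# list stands in here for the stoplist of Fox (1992).
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

STOPWORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()

def ngram_candidates(text, max_n=3):
    tokens = [t.lower() for t in nltk.word_tokenize(text) if t.isalnum()]
    candidates = set()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                continue  # discard n-grams that begin or end with a stopword
            candidates.add(" ".join(STEMMER.stem(t) for t in gram))
    return candidates

print(ngram_candidates("The trained model is applied to previously unseen abstracts."))
```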
In this paper, this manner of selecting terms is referred to as the n-gram approach.</Paragraph> <Paragraph position="2"> The implementation differs from Frank et al. (1999) in a number of aspects; for example, a term needs to occur only once (which is true for 80.0% of the keywords present in the training set).</Paragraph> <Paragraph position="4"> NP-chunks That nouns are appropriate as content descriptors seems to be something that most agree upon. When inspecting manually assigned keywords, the vast majority turn out to be nouns or noun phrases with adjectives, and as discussed in Section 2, the research on term extraction focuses on noun patterns. In order not to let the selection of potential terms be an arbitrary process--which is the case when extracting n-grams--and to better capture the idea of keywords having a certain linguistic property, I decided to experiment with noun phrases.</Paragraph> <Paragraph position="5"> In the next set of experiments a partial parser was used to select all NP-chunks from the documents.</Paragraph> <Paragraph position="6"> Experiments with both unstemmed and stemmed terms were performed. This way of defining the terms is in this paper called the chunking approach.</Paragraph> </Section> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> POS Tag Patterns </SectionTitle> <Paragraph position="0"> As about half of the manual keywords present in the training data were lost using the chunking approach, I decided to define another term selection approach. This still captures the idea of keywords having a certain syntactic property, but is based on empirical evidence in the training data.</Paragraph> <Paragraph position="1"> A set of POS tag patterns--in total 56--was defined, and all (part-of-speech tagged) words or sequences of words that matched any of these were extracted. The patterns were those tag sequences of the manually assigned keywords present in the training data that occurred ten or more times. This way of defining the terms is here called the pattern approach. As with the chunking approach, experiments with both unstemmed and stemmed terms were performed.</Paragraph> <Paragraph position="2"> Out of the 56 patterns, 51 contain one or more noun tags.</Paragraph> <Paragraph position="3"> Turning to the features, each term was represented--as in Frank et al. (1999)--by its term frequency, its collection frequency, and the relative position of its first occurrence (i.e., the proportion of the document preceding the first occurrence). The representation differed in that the term frequency and the collection frequency were not weighted together, but kept as two distinct features. In addition, the real values were not discretised, only rounded off to two decimals; thus more of the decision-making was handed over to the algorithm. The collection frequency was calculated for the three data sets separately.</Paragraph> <Paragraph position="4"> In addition, experiments with a fourth feature were performed. This is the POS tag or tags assigned to the term by the same partial parser used for finding the chunks and the tag patterns. When a term consists of several tokens, the tags are treated as a sequence. As an example, an extracted phrase like random JJ excitations NNS gets the atomic feature value JJ NNS. In case a term occurs more than once in the document, the tag or tag sequence assigned is the most frequently occurring one for that term in the entire document.
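A minimal sketch of how the four feature values for a single candidate term might be computed; the function and variable names are my own, and the idf-style collection frequency is an assumption rather than the paper's exact formulation:

```python
# Minimal sketch of the feature representation for one candidate term:
# within-document frequency, collection frequency (here an assumed idf-style
# score), relative position of the first occurrence, and the POS tag sequence
# most frequently assigned to the term in the document. Real values are
# rounded off to two decimals, as in the paper.
import math
from collections import Counter

def term_features(term_tokens, doc_tokens, doc_tags, collection_docs):
    """term_tokens: the candidate term as a token list; doc_tokens/doc_tags:
    tokens and POS tags of the document; collection_docs: all documents in
    the data set, each a token list."""
    n = len(term_tokens)
    positions = [i for i in range(len(doc_tokens) - n + 1)
                 if doc_tokens[i:i + n] == term_tokens]
    tf = len(positions)

    # Collection frequency: an idf-style score over the document collection.
    df = sum(1 for d in collection_docs
             if any(d[i:i + n] == term_tokens for i in range(len(d) - n + 1)))
    idf = round(math.log((len(collection_docs) + 1) / (df + 1)), 2)

    # Proportion of the document preceding the first occurrence.
    rel_pos = round(positions[0] / len(doc_tokens), 2) if positions else 1.0

    # POS tag sequence: the most frequent one for this term in the document;
    # Counter.most_common breaks ties in the order first encountered.
    tag_counts = Counter(tuple(doc_tags[i:i + n]) for i in positions)
    pos_feature = " ".join(tag_counts.most_common(1)[0][0]) if positions else ""

    return {"tf": tf, "idf": idf, "rel_pos": rel_pos, "pos": pos_feature}

doc = "genetic algorithms improve automatic keyword extraction".split()
tags = ["JJ", "NNS", "VBP", "JJ", "NN", "NN"]
print(term_features(["keyword", "extraction"], doc, tags, [doc]))
```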
In case of a draw between tag sequences, the first occurring one is assigned.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 Rule Induction </SectionTitle> <Paragraph position="0"> As usual in machine learning, the input to the learning algorithm consists of examples, where an example here refers to the feature value vector of a potential keyword. An example that is a manual keyword is assigned the class positive, and those that are not are given the class negative. The machine learning approach used for the experiments is that of rule induction, i.e., the model that is constructed from the given examples consists of a set of rules. The strategy used to construct the rules is recursive partitioning (or divide-and-conquer), the goal of which is to maximise the separation between the classes for each rule.</Paragraph> <Paragraph position="1"> The system used allows for different ensemble techniques to be applied, meaning that a number of classifiers are generated and then combined to predict the class. The one used for these experiments is bagging (Breiman, 1996). In bagging, examples from the training data are drawn randomly with replacement until a set of the original size is obtained. This new set is then used to train a classifier. This procedure is repeated n times to generate n classifiers that then vote to classify an instance.</Paragraph> <Paragraph position="2"> It should be noted that my intention is not to argue for this machine learning approach in favour of any other. However, one advantage of rules is that they may be inspected, and thus might give an insight into how the learning component makes its decisions, although this is less applicable when applying ensemble techniques.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.4 The Training and the Evaluation </SectionTitle> <Paragraph position="0"> The feature values were calculated for each extracted unit in the training and the validation sets, that is, for the n-grams, NP-chunks, stemmed NP-chunks, patterns, and stemmed patterns respectively. In other words, the within-document frequency, the collection frequency, and the proportion of the document preceding the first appearance were calculated for each potential term. Also, the POS tag(s) for each term were extracted. In addition, as the machine learning approach is supervised, the class was added, i.e., whether the term is a manually assigned keyword or not. For the stemmed terms, a unit was considered a keyword if it was equal to a stemmed manual keyword. For the unstemmed terms, the term had to match exactly.</Paragraph> <Paragraph position="1"> The measure used to evaluate the results on the validation set was the F-score, defined as</Paragraph> <Paragraph position="2"> F = ((β^2 + 1) · P · R) / (β^2 · P + R),</Paragraph> <Paragraph position="3"> combining the precision (P) and the recall (R) obtained. In this study, the main concern is the precision and the recall for the examples that have been assigned the class positive, that is, how many of the suggested keywords are correct (precision), and how many of the manually assigned keywords are found (recall).
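For concreteness, a minimal sketch of precision, recall, and the F-score computed for one document from the set of suggested terms and the set of manually assigned keywords (function and variable names are my own, and the per-document formulation is a simplification of the evaluation described below):

```python
# Minimal sketch: precision, recall and F-score for one document, computed
# from the suggested terms and the manually assigned keywords.
def evaluate(suggested, manual, beta=1.0):
    suggested, manual = set(suggested), set(manual)
    correct = len(suggested & manual)
    precision = correct / len(suggested) if suggested else 0.0
    recall = correct / len(manual) if manual else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    f_score = ((beta ** 2 + 1) * precision * recall) / (beta ** 2 * precision + recall)
    return precision, recall, f_score

# Example: two of the three manual keywords are found among four suggestions.
print(evaluate({"keyword extraction", "naive bayes", "rule induction", "abstracts"},
               {"keyword extraction", "rule induction", "machine learning"}))
```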
As the proportion of correctly suggested keywords is considered equally important as the proportion of the terms assigned by a professional indexer that is detected, β was assigned the value 1, thus giving precision and recall equal weights.</Paragraph> <Paragraph position="4"> When calculating the recall, the value for the total number of manually assigned keywords present in the documents is used, independent of the number actually present in the different representations. This figure varies slightly for the unstemmed and the stemmed data, and for each of the two the corresponding value is used.</Paragraph> <Paragraph position="5"> Several runs were made for each representation, with the goal of maximising the performance as evaluated on the validation set: first, the weights of the positive examples were adjusted, as the data set is unbalanced. A better performance was obtained when the positive examples in the training data outnumbered the negative ones. Thereafter, experiments with bagging were performed, as well as runs with and without the POS tag feature. The results are presented next.</Paragraph> </Section> </Section> <Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 The Results </SectionTitle> <Paragraph position="0"> In this section, the results obtained by the best performing model for each approach--as judged on the validation set--when run on the previously unseen test set are presented. It should, however, be noted that the number of possible runs is very large, as one may vary, for example, the number of classifiers generated by the ensemble technique. It might well be that better results are possible for any of the representations. As stemming, with few exceptions, led to better results on the validation set over all runs, only these values are presented in this section. In Table 2, the number of assigned terms and the number of correct terms, in total and on average per document, are shown. Also, precision, recall, and the F-score are presented. For each approach, both the results with and without the POS tag feature are given.</Paragraph> <Paragraph position="1"> The length of the abstracts in the test set varies from 338 to 23 tokens (the median is 121 tokens).</Paragraph> <Paragraph position="2"> The number of uncontrolled terms per document ranges from 31 to 2 (the median is 9 keywords). The total number of stemmed keywords present in the stemmed test set is 3 816, and the average number of terms is 7.63.</Paragraph> <Paragraph position="3"> Their distribution over the 500 documents ranges from 27 keywords per document down to 0 (for three documents), with the median being 7.</Paragraph> <Paragraph position="4"> As for bagging, it was noted that although the accuracy (i.e., the number of correctly classified positive and negative examples divided by the total number of examples) improved when increasing the number of classifiers, the F-score often decreased.</Paragraph> <Paragraph position="5"> For the pattern approach without the tag feature the best model consists of a 5-bagged classifier, for the pattern approach with the tag feature a 20-bagged classifier, and finally for the n-gram approach with the tag feature a 10-bagged classifier.
For the other three runs a single classifier had the best performance.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.1 Results of the n-gram Approach </SectionTitle> <Paragraph position="0"> When extracting the terms from the test set according to the n-gram approach, the data consisted of 42 159 negative examples and 3 330 positive examples; thus, in total 45 489 examples were classified by the trained model. Using this manner of extracting the terms meant that 12.8% of the keywords originally present in the test set were lost.</Paragraph> <Paragraph position="1"> To summarise the n-gram approach (see Table 2), without the tag feature it finds on average 4.37 keywords per document, out of the on average 7.63 manual keywords originally present in the abstracts.</Paragraph> <Paragraph position="2"> However, the price paid for these correct terms is high: almost 38 incorrect terms per document.</Paragraph> <Paragraph position="3"> When adding the fourth feature, the number of correct terms decreases slightly, while the number of incorrect terms decreases to a third. Looking at the actual distribution of assigned terms for these two runs, it varies between 134(!) and 5 without the tag feature, and from 48 to 1 with the tag feature.</Paragraph> <Paragraph position="4"> The median is 40 and 14 respectively.</Paragraph> <Paragraph position="5"> The F-scores (β = 1) for these two runs are 17.6 and 33.9 respectively. 33.9 is the highest F-score that was achieved for the six runs presented here.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.2 Results of the Chunking Approach </SectionTitle> <Paragraph position="0"> When extracting the terms according to the stemmed chunking approach, the test set consisted of 13 579 negative and 1 920 positive examples; in total 15 499 examples.</Paragraph> <Paragraph position="1"> An F-score (β = 1) of 22.7 is obtained without the POS tag feature, and 33.0 with this feature. The number of terms on average per document is 16.38 without the tag feature, and 9.58 with it. Looking at each document, the number of keywords assigned varies from 46 to 0 (for three documents) with the median 16, and from 29 to 0 (for four documents) with the median value being 9 terms.</Paragraph> <Paragraph position="2"> Extracting the terms with the chunking approach meant that slightly more than half of the keywords actually present in the test set were lost, and compared to the n-gram approach the number of correct terms assigned was almost halved. The number of incorrect keywords, however, decreased considerably. The real difference shows when the POS tag feature is included: the number of correctly assigned terms is more or less the same for this approach with or without the tag feature, while the number of incorrect terms is halved.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.3 Results of the Pattern Approach </SectionTitle> <Paragraph position="0"> When extracting the terms according to the stemmed pattern approach, the test data consisted of 33 507 examples. Of these, 3 340 were positive and 30 167 negative. In total, 12.5% of the present keywords were lost.</Paragraph> <Paragraph position="1"> The F-scores (β = 1) for the two runs, displayed in Table 2, are 25.6 (without the tag feature) and 28.1 (with the tag feature).
The number of correct terms on average per document is 5.04 and 3.05 without and with the tag feature respectively. The actual number of terms assigned per document varies from 100 to 0 (for three documents) without the tag feature, and from 46 to 0 (for four documents) with the tag feature.</Paragraph> <Paragraph position="2"> The median is 30 and 12 respectively.</Paragraph> </Section> </Section> </Paper>