<?xml version="1.0" standalone="yes"?> <Paper uid="P06-1068"> <Title>Sydney, July 2006. ©2006 Association for Computational Linguistics A Study on Automatically Extracted Keywords in Text Categorization</Title> <Section position="9" start_page="542" end_page="543" type="concl"> <SectionTitle> 6 Concluding Remarks </SectionTitle> <Paragraph position="0"> In the experiments described in this paper, we investigated whether automatically extracted keywords can improve automatic text categorization. More specifically, we investigated what impact keywords have on the task of text categorization by making predictions on the basis of keywords only, represented either as unigrams or intact, and by combining the full-text representation with automatically extracted keywords. The combination was obtained by giving higher weights to words in the full texts that were also extracted as keywords.</Paragraph> <Paragraph position="1"> Throughout the study, we were concerned with the data representation and feature selection procedure. We investigated what feature value should be used (boolean, tf, or tf*idf) and the minimum number of occurrences of the tokens in the training data.</Paragraph> <Paragraph position="2"> We showed that keywords can improve the performance of text categorization. When keywords were used as a complement to the full-text representation, an F-measure of 81.7% was obtained, higher than without the keywords (81.0%).</Paragraph> <Paragraph position="3"> 2 This method has also been used to extract keywords (Mihalcea and Tarau, 2004).</Paragraph> <Paragraph position="4"> Our results also clearly indicate that keywords alone can be used for the text categorization task when treated as unigrams, obtaining an F-measure of 75.0%. 
Lastly, for higher precision (94.2%) in text classification, we can use the stemmed tokens in the headlines.</Paragraph> <Paragraph position="5"> The results presented in this study are lower than the state of the art, even for the full-text run with unigrams, as we did not tune any parameters other than the feature values (boolean, term frequency, or tf*idf) and the threshold for the minimum number of occurrences in the training data.</Paragraph> <Paragraph position="6"> There are, of course, possibilities for further improvements. One possibility could be to combine the tokens in the headlines and keywords in the same way as the full-text representation was combined with the keywords. Another possible improvement concerns the automatic keyword extraction process. The keywords are presented in order of their estimated &quot;keywordness&quot;, based on the added regression value given by the three prediction models. This means that one alternative experiment would be to give different weights depending on the rank the keyword has received from the keyword extraction system. Another alternative would be to use the actual regression value.</Paragraph> <Paragraph position="7"> We would like to emphasize that the automatically extracted keywords used in our experiments are not statistical phrases, such as bigrams or trigrams, but meaningful phrases selected by including linguistic analysis in the extraction procedure. One insight that we can get from these experiments is that the automatically extracted keywords, which themselves have an F-measure of 44.0, can yield an F-measure of 75.0 in the categorization task. One reason for this is that the keywords have been evaluated using manually assigned keywords as the gold standard, meaning that paraphrasing and synonyms are severely punished. Kołcz et al. 
(2001) propose using text categorization as a way to judge automatic text summarization techniques more objectively, by comparing how well an automatic summary fares on the task compared to other automatic summaries (that is, as an extrinsic evaluation method). The same would be valuable for automatic keyword indexing. Also, such an approach would facilitate comparisons between different systems, as common test-beds are lacking.</Paragraph> <Paragraph position="8"> In this study, we showed that automatic text categorization can benefit from automatically extracted keywords, although the bag-of-words representation is competitive with the best performance. Both automatic keyword extraction and automatic text categorization are research areas where further improvements are needed before they can support more efficient information retrieval.</Paragraph> </Section> </Paper>