<?xml version="1.0" standalone="yes"?>
<Paper uid="N04-4005">
  <Title>Enhancing Linguistically Oriented Automatic Keyword Extraction</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Refinements
</SectionTitle>
    <Paragraph position="0"> In this section, three minor modifications made to the keyword extraction algorithm are presented. The first one concerns how the NP-chunks are extracted from the documents: By removing the initial determiner of the NPchunks, a better performance is achieved. The second alteration is to use a general corpus for calculating the collection frequency value. Also the weights for the positive examples are set in a more systematic way, to maximise the performance of the combined model.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Refining the NP-chunk Approach
</SectionTitle>
      <Paragraph position="0"> It was noted in Hulth (2003b) that when extracting NPchunks, the accompanying determiners are also extracted (per definition), but that determiners are rarely found at the initial position of keywords. This means that the automatic evaluation treats such keywords as misclassified, although they might have been correct without the determiner. For this reason the determiners a, an, and the are removed when occurring in the beginning of an extracted NP-chunks. The results for the runs when extracting NP-chunks with and without these determiners are found in Table 1. As can be seen in this table, the recall increases while the precision decreases. However, the high increase in recall leads to an increase in the F-measure from 33.0 to 36.8.</Paragraph>
      <Paragraph position="1"> Assign. Corr. P R F With det. 9.6 2.8 29.7 37.2 33.0 Without det. 15.0 4.2 27.7 54.6 36.8  tial determiners a, an, and the.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Using a General Corpus
</SectionTitle>
      <Paragraph position="0"> In the experiments presented in Hulth (2003a), only the documents present in the training, validation, and test set respectively are used for calculating the collection frequency. This means that the collection is rather homogenous. For this reason, the collection frequency is instead calculated on a set of 200 arbitrarily chosen documents from the British National Corpus (BNC). In Table 2, the performance of two runs when taking the majority vote of the three classifiers removing the subsumed terms is presented. The first run ('Abstracts') is when the collection frequency is calculated from the abstracts. The second run ('Gen. Corp.') is when the BNC documents are used for this calculation. If comparing these two runs, the F-measure increases. In other words, using a more general corpus for this calculation leads to a better performance of the automatic keyword extraction.</Paragraph>
      <Paragraph position="1"> Assign. Corr. P R F</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Setting the Weights
</SectionTitle>
      <Paragraph position="0"> As the data set is unbalanced--there is a larger number of negative than positive examples--the positive examples are given a higher weight when training the prediction models. In the experiments discussed so far, the weights given to the positive examples are those resulting in the best performance for each individual classifier (as described in Hulth (2003a)). For the results presented further, the weights are instead set according to which individual weight that maximises the F-measure for the combined model on the validation set. The weight given to the positive examples for each term selection approach has in a (rather large) number of runs been altered systematically for each classifier, and the combination that results in the best performance is selected. The results on the test set are presented in Table 3. As can be seen in this table, the recall decreases, while the precision and the F-measure increase.</Paragraph>
      <Paragraph position="1"> Assign. Corr. P R F  ual weight and with the best combination, respectively.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Regression vs. Classification
</SectionTitle>
    <Paragraph position="0"> In the experiments presented in Hulth (2003a), the automatic keyword indexing task is treated as a binary classification task, where each candidate term is classified either as a keyword or a non-keyword. RDS allows for the prediction to be treated as a regression task (Breiman et al., 1984). This means that the prediction is given as a numerical value, instead of a category. When training the regression models, the candidate terms being manually assigned keywords are given the value one, and all other candidate terms are assigned the value zero. In this fashion, the prediction is a value between zero and one, and the higher the value, the more likely a candidate term is to be a keyword (according to the model).</Paragraph>
    <Paragraph position="1"> To combine the results from the three models, there are two alternatives. Either the prediction value can be added for all candidate terms, or it can be added only if it is over a certain threshold set for each model, depending on the model's individual performance. Regardless, a candidate term may be selected as a keyword even if it is extracted by only one method, provided that the value is high enough. The threshold values are defined based on the performance of the models on the validation data.</Paragraph>
    <Paragraph position="2"> In Table 4, results for two regression runs on the test data are presented. These two runs are in Table 4 compared to the best performing classification run. The first regression run ('Regression') is when all candidate terms having an added value over a certain threshold are selected. The second presented regression run (Regression with individual threshold: 'Reg. ind. thresh.') is when a threshold is set for each individual model: If a prediction value is below this threshold it does not contribute to the added value for a candidate term. In this case, the threshold for the total score is slightly lower than when no individual threshold is set. Both regression runs have a higher F-measure than the classification run, due to the fact that recall increases, more than what the precision decreases. The run without individual thresholds results in the highest F-measure.</Paragraph>
    <Paragraph position="3"> Assign. Corr. P R F  thesh.' refers to a run where the regression value from each model contributes only if it is over a certain threshold. null</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Defining the Number of Keywords
</SectionTitle>
      <Paragraph position="0"> If closer inspecting the best regression run, this combined model assigns on average 10.8 keywords per document.</Paragraph>
      <Paragraph position="1"> The actual distribution varies from 3 documents with 0 to 1 document with 32 keywords. As mentioned, the prediction value from a regression model is numeric, and indicates how likely a candidate term is to be a keyword. It is thus possible to rank the output, and consequently to limit the number of keywords assigned per document. A set of experiments has been performed with the aim to find what number of keywords per document that results in the highest F-measure, by varying the number of keywords assigned. In these experiments, only terms with an added value over the threshold are considered, and the candidate terms with the highest values are selected first. The best performance is when the maximum of twelve keywords is selected for each document. (The subsumed terms are removed after that the maximum number of keywords is selected.) As can be seen in Table 5 ('All' compared to 'Max. 12'), the F-measure decreases as does the recall, although the precision increases, when limiting the number of keywords.</Paragraph>
      <Paragraph position="2"> There are, however, still some documents that do not get any selected keywords. To overcome this, three terms are assigned to each document even if the added regression value is below the threshold. Doing this gives a slightly lower precision, while the recall increases slightly. The F-measure is unaffected (see Table 5: 3-12).</Paragraph>
      <Paragraph position="3"> Assign. Corr. P R F  and limiting the number of terms assigned per document (Max. 12, and 3-12 respectively).</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>