Candidate Terms Extracted Using a Set of Part-of-Speech Patterns

Candidate terms extracted using a set of part-of-speech patterns. The files below are available for download.


Index of: CANDIDATE_TERM/


Size:Name:Description:
12.799.467  _ALL_CANDID_TERM_BY_POS.ZIP
This file contains all candidate terms extracted using the devised part-of-speech patterns. Each line of the file represents the following information:
  • TERM_ID: a universal integer id assigned to the candidate term
  • STRING LENGTH: the length of the candidate term
  • CORPUS_FREQ: the number of occurrences of the candidate term in the segmented, pre-processed corpus (i.e. SEPID_CORPUS); in other words, the term frequency (tf)
  • DOCUMENT_FREQ: the number of documents in which the candidate term occurs, i.e. the document frequency of the term, which can be used to calculate the inverse document frequency
  • SECTION_FREQ: the number of sections in which the term occurs
  • PARAGRAPH_FREQ: the number of paragraphs in which the term occurs
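As a hedged illustration of how DOCUMENT_FREQ can feed an inverse document frequency, the sketch below uses the common log(N / df) form. The function name `inverse_document_frequency` and the numbers in the usage line are hypothetical; this page does not state which IDF variant is intended or how many documents the corpus contains.

```python
import math


def inverse_document_frequency(document_freq: int, total_documents: int) -> float:
    """Plain idf = log(N / df); one common variant, assumed here for illustration."""
    if document_freq <= 0 or document_freq > total_documents:
        raise ValueError("document_freq must be between 1 and total_documents")
    return math.log(total_documents / document_freq)


# Hypothetical numbers: a term that occurs in 10 of 1000 documents.
print(inverse_document_frequency(10, 1000))
```

Terms that occur in every document get an idf of zero under this variant, while rarer terms get larger values.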
12.132.390  _ALL_CANDID_TERM_BY_POS_DOCUMENT_INDEX.ZIP
An inverted index file that maps terms to documents in the corpus. Each line of the file shows a single occurrence of a term in the form of TERM_ID followed by DOCUMENT_ID (tab separated). Please note DOCUMENT_ID corresponds to an integer id that is assigned to each document in the SEPID_CORPUS.
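A minimal sketch of reading such an index, assuming the file has been unzipped and each non-empty line holds two tab-separated integer ids as described. The function name `load_term_index` and the sample lines are made up for illustration and are not part of the distributed files.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def load_term_index(lines: Iterable[str]) -> Dict[int, Set[int]]:
    """Map each TERM_ID to the set of unit ids (e.g. DOCUMENT_ID) it occurs in."""
    index: Dict[int, Set[int]] = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        term_id, unit_id = line.split("\t")[:2]
        index[int(term_id)].add(int(unit_id))
    return dict(index)


# Made-up sample lines; in practice, iterate over the lines of the unzipped file.
sample = ["1\t10", "1\t12", "2\t10", "1\t10"]
index = load_term_index(sample)
print(sorted(index[1]))
```

Using a set per term collapses repeated (term, document) lines, so the size of each set is the number of distinct documents seen for that term.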
15.097.774  _ALL_CANDID_TERM_BY_POS_SECTION_INDEX.ZIP
Same as above, but for sections: an inverted index file that maps terms to sections in the corpus. Each line of the file shows a single occurrence of a term in the form of TERM_ID followed by SECTION_ID (tab separated).
19.778.365  _ALL_CANDID_TERM_BY_POS_PARAGRAPH_INDEX.ZIP
Same as above, but for paragraphs: each line is a TERM_ID followed by a PARAGRAPH_ID from SEPID_CORPUS (tab separated).
48.039.732  _ALL_CANDID_TERM_BY_POS_SENTENCE_INDEX.ZIP
Same as above, but for sentences. Each line of the file is a TERM_ID followed by a SENTENCE_ID, followed by the START and END positions of the term. START and END are token numbers in the sentence.
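The four-field line format can be parsed along the following lines. This is only a sketch: the field separator is assumed to be whitespace (the page states tab separation only for the other index files), SENTENCE_ID is kept as a string because its format is not documented here, and the names `TermOccurrence` and `parse_sentence_index_line` are hypothetical.

```python
from typing import NamedTuple


class TermOccurrence(NamedTuple):
    term_id: int
    sentence_id: str  # kept as text; the id format is not specified on this page
    start: int        # token number where the term starts
    end: int          # token number where the term ends


def parse_sentence_index_line(line: str) -> TermOccurrence:
    # Assumption: the four fields are separated by whitespace (tabs or spaces).
    term_id, sentence_id, start, end = line.split()
    return TermOccurrence(int(term_id), sentence_id, int(start), int(end))


# Made-up sample line for illustration.
occ = parse_sentence_index_line("42\t1007\t3\t5")
print(occ)
```

Whether START and END count from 0 or 1, and whether END is inclusive, is not stated here, so the sketch stores the values without interpreting them.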
365  POS_SEQUENCE_FILTER
The employed part-of-speech tag sequence patterns for the extraction of candidate terms.
<DIR>  CANDID_TERM_BY_POS_SENTENCE_INDEX/
In this folder, the (candidate-term-id, sentence-id) indices (i.e. those in _all_candid_term_by_pos_sentence_index.zip) are grouped by the publication year of the source documents. The first two characters of each filename give the year of publication. For instance, the file "84_candid_term_by_pos_pattern_sentence_index.zip" contains all term-id to sentence mappings for sentences from publications in the year 1984. These files, together with the additional index files provided in SEPID_CORPUS, can be used to organize candidate terms in chronological order. There are currently 34 files, representing publications from 65 (i.e. 1965) to 06 (i.e. 2006).
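Because the two-digit prefixes sort alphabetically rather than chronologically (06 sorts before 84), a small helper can expand them to four-digit years. The sketch below is an assumption-laden illustration: the function name `publication_year` and the pivot value of 50 are hypothetical choices, picked only because the listed prefixes are either low (2000s) or high (1900s).

```python
def publication_year(filename: str, pivot: int = 50) -> int:
    """Expand the two-digit filename prefix to a four-digit year."""
    prefix = int(filename[:2])
    # Assumption: prefixes below the pivot belong to 20xx, the rest to 19xx.
    return 2000 + prefix if prefix < pivot else 1900 + prefix


names = [
    "00_candid_term_by_pos_pattern_sentence_index.zip",
    "84_candid_term_by_pos_pattern_sentence_index.zip",
    "06_candid_term_by_pos_pattern_sentence_index.zip",
]

# Sort the files chronologically instead of alphabetically.
for name in sorted(names, key=publication_year):
    print(publication_year(name), name)
```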

Directory contains 124.293.185 Bytes in 7 Files

Index of: CANDID_TERM_BY_POS_PATTERN_SENTENCE_INDEX/



To download all these files in one zip file click here.
Size:Name:Description:
2.720.866  00_CANDID_TERM_BY_POS_PATTERN_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 2000.
1.413.694  01_CANDID_TERM_BY_POS_PATTERN_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 2001.
2.062.648  02_CANDID_TERM_BY_POS_PATTERN_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 2002.
2.396.856  03_CANDID_TERM_BY_POS_PATTERN_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 2003.
4.404.305  04_CANDID_TERM_BY_POS_PATTERN_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 2004.
2.505.827  05_CANDID_TERM_BY_POS_PATTERN_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 2005.
4.884.960  06_CANDID_TERM_BY_POS_PATTERN_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 2006.
108.336  65_CANDID_TERM_BY_POS_PATTERN_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1965.
105.363  67_CANDID_TERM_BY_POS_PATTERN_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1967.
217.444  69_CANDID_TERM_BY_POS_PATTERN_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1969.
157.564  73_CANDID_TERM_BY_POS_PATTERN_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1973.
146.083  75_CANDID_TERM_BY_POS_PATTERN_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1975.
193.892  78_CANDID_TERM_BY_POS_PATTERN_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1978.
1.060.563  79_CANDID_TERM_BY_POS_PATTERN_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1979.
592.510  80_CANDID_TERM_BY_POS_PATTERN_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1980.
243.011  81_CANDID_TERM_BY_POS_PATTERN_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1981.
532.555  82_CANDID_TERM_BY_POS_PATTERN_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1982.
532.554  83_CANDID_TERM_BY_POS_PATTERN_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1983.
534.551  84_CANDID_TERM_BY_POS_PATTERN_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1984.
585.908  85_CANDID_TERM_BY_POS_PATTERN_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1985.
1.072.015  86_CANDID_TERM_BY_POS_PATTERN_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1986.
656.015  87_CANDID_TERM_BY_POS_PATTERN_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1987.
1.415.295  88_CANDID_TERM_BY_POS_PATTERN_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1988.
991.309  89_CANDID_TERM_BY_POS_PATTERN_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1989.
1.555.268  90_CANDID_TERM_BY_POS_PATTERN_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1990.
1.378.971  91_CANDID_TERM_BY_POS_PATTERN_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1991.
2.248.958  92_CANDID_TERM_BY_POS_PATTERN_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1992.
1.738.044  93_CANDID_TERM_BY_POS_PATTERN_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1993.
2.510.786  94_CANDID_TERM_BY_POS_PATTERN_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1994.
964.317  95_CANDID_TERM_BY_POS_PATTERN_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1995.
2.095.609  96_CANDID_TERM_BY_POS_PATTERN_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1996.
2.034.305  97_CANDID_TERM_BY_POS_PATTERN_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1997.
2.642.241  98_CANDID_TERM_BY_POS_PATTERN_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1998.
1.481.054  99_CANDID_TERM_BY_POS_PATTERN_SENTENCE_INDEX.ZIP
Term-Sentence indices from articles published in year 1999.

Directory contains 48.183.677 Bytes in 34 Files

Total: 172.476.862 Bytes in 41 Files

This page last edited on 06 October 2025.



