Cleansed Text Files in XML Format

As described elsewhere, the ACL ARC documents are processed using the PDFBox and ParsCit libraries. The ParsCit output is then processed further, and the resulting text files are organized into sections and paragraphs. Each section is additionally annotated with its type (note that the output is noisy). The files are organized by publication year and can be downloaded from the list given below:


Text Files in XML Format


Size:         Name:                Description:
90.760.177    XML.ZIP              Cleansed text files in XML format.
18.084        I05-3019_CLN.XML     Example of a cleansed text file, which can be found in the XML.ZIP file.
<DIR>         XML_BY_SECTION/      The above XML files broken into a set of files, each containing sections of one specific type, e.g. abstract, acknowledgement, etc.
91.454.499    XML_BY_SECTION.ZIP   The contents of the XML_BY_SECTION/ folder in one zip file.

Directory contains 182.232.760 Bytes in 3 Files

Index of: XML_BY_SECTION/


The extracted text from publications organized by section type and publication date.
Size:         Name:        Description:
10.118.757    ABSTR.ZIP    Extracted abstract sections.
1.784.803     ACKNO.ZIP    Extracted acknowledgement sections.
6.633.949     CONCL.ZIP    Extracted conclusion sections.
4.624.872     EVALU.ZIP    Extracted evaluation sections.
12.545.505    INTRO.ZIP    Extracted introduction sections.
53.735.039    METHO.ZIP    Extracted method sections.
950.314       RELAT.ZIP    Extracted related work sections.

Directory contains 90.393.239 Bytes in 7 Files

Total: 272.625.999 Bytes in 10 Files
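
Because the cleansed output is described above as noisy, it may be useful to check which files in a downloaded archive actually parse as XML before relying on them. The following is a minimal sketch in Python, using only the standard library. The function name iter_xml_members, the year-based folder name, and the element names in the demo are assumptions made for illustration; the real internal layout and schema of XML.ZIP are not documented on this page.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def iter_xml_members(zip_source):
    """Yield (member_name, root_or_None) for every .xml member of a zip archive.

    zip_source may be a file path or a binary file-like object.
    Members that are not well-formed XML yield None instead of raising,
    since the cleansed files are described as noisy.
    """
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            # Skip directory entries and any non-XML members.
            if not name.lower().endswith(".xml"):
                continue
            try:
                root = ET.fromstring(archive.read(name))
            except ET.ParseError:
                root = None
            yield name, root


if __name__ == "__main__":
    # Small in-memory archive standing in for XML.ZIP.
    # The folder name and XML content are invented for this demo.
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as demo:
        demo.writestr("2005/I05-3019_CLN.XML", "<doc><p>text</p></doc>")
        demo.writestr("2005/broken_CLN.XML", "<doc><p>unclosed")
    for name, root in iter_xml_members(buffer):
        print(name, "ok" if root is not None else "unparseable")
```

For a real download, the path of the local copy (for example "XML.ZIP") would be passed in place of the in-memory buffer. Files reported as unparseable may still be usable as plain text.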

This page last edited on 06 October 2025.



