Cleansed Segmented Text Files | ACL RD-TEC | Behrang Q. Zadeh | Senior Data Scientist at Henkel's Center of Data Analytics

Cleansed Text Files in XML Format

As described elsewhere, the ACL ARC documents are processed using the PDFBox and ParsCit libraries. The ParsCit output is then processed further, and the resulting text files are organized into sections and paragraphs. Each section is also annotated with its type (note that this output is noisy). The files are organized by publication year and can be downloaded from the list below:

Text Files in XML Format
Size:	Name:	Description:
90.760.177	XML.ZIP	Cleansed text files in XML format.
18.084	I05-3019_CLN.XML	Example of a cleansed text file which can be found in the XML.ZIP file.
<DIR>	XML_BY_SECTION/	The above XML files split into a set of files, each containing sections of one specific type, e.g. abstract, acknowledgement, etc.
91.454.499	XML_BY_SECTION.ZIP	The contents of the XML_BY_SECTION/ folder in one zip file.
Directory contains 182.232.760 Bytes in 3 Files
Index of: XML_BY_SECTION/
The extracted text from publications organized by section type and publication date.
Size:	Name:	Description:
10.118.757	ABSTR.ZIP	Extracted abstract sections.
1.784.803	ACKNO.ZIP	Extracted acknowledgement sections.
6.633.949	CONCL.ZIP	Extracted conclusion sections.
4.624.872	EVALU.ZIP	Extracted evaluation sections.
12.545.505	INTRO.ZIP	Extracted introduction sections.
53.735.039	METHO.ZIP	Extracted method sections.
950.314	RELAT.ZIP	Extracted related work sections.
Directory contains 90.393.239 Bytes in 7 Files
Total: 272.625.999 Bytes in 10 Files
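
Because the element names used inside the cleansed files are not listed on this page, a first step after downloading an archive might be to inspect which tags actually occur. The following is a minimal sketch using only the Python standard library. It assumes the archive members are XML files with an .xml extension in any letter case; the function name tag_counts is hypothetical and not part of the resource.

```python
import zipfile
import xml.etree.ElementTree as ET
from collections import Counter


def tag_counts(zip_source):
    """Map each XML member of the archive to a Counter of its element tag names.

    zip_source may be a path (e.g. a downloaded XML.ZIP) or a binary file object.
    No tag names are assumed; the function only reports what it finds.
    """
    counts = {}
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".xml"):
                continue  # skip directory entries and non-XML members
            with archive.open(name) as handle:
                try:
                    root = ET.parse(handle).getroot()
                except ET.ParseError:
                    # Defensive assumption: skip members that are not well-formed XML.
                    continue
            counts[name] = Counter(element.tag for element in root.iter())
    return counts
```

Calling tag_counts("XML.ZIP") on a downloaded copy should return one entry per parseable file. Looking at the counters for a handful of files is probably enough to see how sections, paragraphs and section types are encoded before writing a more specific parser.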

This page last edited on 06 October 2025.
