HEP Collection

Resource type

This corpus is intended for the study of multi-label text classifiers. It consists of scientific papers in the field of High Energy Physics (HEP) obtained from the CDS document server of the European Organization for Nuclear Research (CERN). The corpus is divided into three subsets (called partitions), where each partition consists of two files: one containing the metadata record of each paper (with information such as the abstract, authors and, of course, classes or keywords) in compressed XML format, and another containing a plain-text version of the complete papers, generated from the PDFs available in the CERN databases (tar + gzip format). Classes are defined by the XML tag KEYWORD. These are the labels manually assigned from the DESY thesaurus. You can get more information about the DESY thesaurus.

  • Partition hepth: 18,114 Theoretical Physics documents (metadata – 5.3 MB) (papers – 226 MB)
  • Partition hepex: 2,599 Experimental Physics documents (metadata – 1.6 MB) (papers – 28 MB)
  • Partition astroph: 2,716 Astrophysics documents (metadata – 1.1 MB) (papers – 29 MB)
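
As a rough illustration of how a partition might be read for multi-label experiments, the following Python sketch collects the KEYWORD labels of each record and iterates over the plain-text papers. Only the KEYWORD tag and the tar + gzip packaging come from the description above. Everything else is an assumption made for illustration: that the metadata file is gzip-compressed, that each direct child of the XML root is one record, that the paper texts decode as UTF-8, and the helper names themselves.

```python
import gzip
import tarfile
import xml.etree.ElementTree as ET
from collections import Counter


def keywords_per_record(xml_bytes, keyword_tag="KEYWORD"):
    """Return one list of keyword strings per record.

    Assumption: every direct child of the XML root is one record.
    """
    root = ET.fromstring(xml_bytes)
    records = []
    for record in root:
        # Collect the text of every KEYWORD element nested in this record.
        labels = [(kw.text or "").strip() for kw in record.iter(keyword_tag)]
        records.append([label for label in labels if label])
    return records


def load_metadata(path):
    """Read a metadata file, assuming it is gzip-compressed XML."""
    with gzip.open(path, "rb") as handle:
        return keywords_per_record(handle.read())


def iter_paper_texts(path):
    """Yield (member name, text) pairs from a tar + gzip archive of papers.

    Assumption: the texts are UTF-8; undecodable bytes are replaced.
    """
    with tarfile.open(path, mode="r:gz") as archive:
        for member in archive:
            if member.isfile():
                handle = archive.extractfile(member)
                yield member.name, handle.read().decode("utf-8", errors="replace")


# Made-up sample, not taken from the corpus; element names other than
# KEYWORD are hypothetical.
SAMPLE = b"""<collection>
  <record><TITLE>A</TITLE><KEYWORD>string model</KEYWORD><KEYWORD>supersymmetry</KEYWORD></record>
  <record><TITLE>B</TITLE><KEYWORD>supersymmetry</KEYWORD></record>
</collection>"""

sample_labels = keywords_per_record(SAMPLE)
# Label frequencies are a common first check for multi-label collections.
label_counts = Counter(label for labels in sample_labels for label in labels)
```

If the real files use a different record layout or compression scheme, only `keywords_per_record` and `load_metadata` should need adjusting.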

Updated on 29.09.2021: Thanks to Jaime Collado, from the University of Jaén, for generating an updated version of the XML files, ensuring their compatibility with current parsers and rules.

Updated on 23.04.2007: Thanks to Ioannis Katakis, from Aristotle University of Thessaloniki (Greece), for fixing some problems in the provided XML.

How to cite

This corpus has been prepared by Arturo Montejo Ráez with metadata supplied by Jens Vigen and the CDS Support Team. For references use:

@InProceedings{montejo2004,
  author    = {Montejo-Ráez, A. and Steinberger, R. and Ureña-López, L. A.},
  title     = {Adaptive selection of base classifiers in one-against-all learning for large multi-labeled collections},
  booktitle = {Advances in Natural Language Processing: 4th International Conference, EsTAL 2004},
  pages     = {1--12},
  year      = {2004},
  editor    = {Vicedo, J. L. and others},
  location  = {Alicante, Spain},
  number    = {3230},
  series    = {Lecture Notes in Artificial Intelligence},
  publisher = {Springer}
}