Automatic document classification in the domain of High Energy Physics

This study proposes a solution to the problem of massive multi-tagging of documents, with a particular focus on documents in the domain of high energy physics.

This problem is known as text categorization: predefined keywords are treated as categories to be assigned to documents based on their textual content. During this research, conducted mainly at CERN, the European Organization for Nuclear Research, the document collection revealed problems not previously covered in the literature. The explicit need for a data management solution that goes beyond scientific analysis and prototyping has shaped the hypothesis throughout the study.

The final solution implemented as a result of this investigation has opened up a wide range of applications, giving me the satisfaction of practical usefulness that pure research often neglects. The reader will find out how exciting this task was. What cannot be included here is the personal enrichment gained by working for four years in an international environment, with a team that made the most advanced computational techniques available to the community of CERN library users, the largest in the world of physics.

Author

Arturo Montejo Raez