Corpora

COPOD

Resource type:

Corpus

Description:

The Corpus Of Patient Opinions in Dutch (COPOD) was built by crawling the well-known medical forum Zorgkaart Nederland on June 28, 2016. It is composed of 156,975 patient reviews describing experiences with physicians across 60 specialties. Each review contains a rating for each of six aspects (accommodation, appointment, therapy, staff attention, information and listening) on a scale from 1 to 10 stars, plus an overall rating that is the average of the aspect ratings.
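As a minimal sketch of how the overall rating relates to the aspect ratings, the snippet below averages the six aspect scores of one review. The dictionary keys and the sample values are illustrative assumptions; they are not the field names or data found in the COPOD files.

```python
from statistics import mean
from typing import Mapping

# Aspect names follow the description above; the keys themselves are
# hypothetical and may differ from the field names in the COPOD files.
ASPECTS = ("accommodation", "appointment", "therapy",
           "staff_attention", "information", "listening")

def overall_rating(ratings: Mapping[str, float]) -> float:
    """Return the mean of the six aspect ratings (each on a 1-10 scale)."""
    missing = [a for a in ASPECTS if a not in ratings]
    if missing:
        raise ValueError(f"missing aspect ratings: {missing}")
    return mean(ratings[a] for a in ASPECTS)

# Made-up example review, not taken from the corpus.
example = {"accommodation": 8, "appointment": 7, "therapy": 9,
           "staff_attention": 9, "information": 8, "listening": 7}
print(overall_rating(example))
```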

How to cite:

Jiménez-Zafra, S. M., Martín-Valdivia, M. T., Maks, I., & Izquierdo, R. (2017). Analysis of patient satisfaction in Dutch and Spanish online reviews. Procesamiento del Lenguaje Natural, 58, 101-108.

Files of the resource:

COPOD.zip

For any questions related to the corpus, please send an email to Salud María Jiménez-Zafra or M. Teresa Martín-Valdivia.

DOS

Resource type:

Corpus

Description:

The Drug Opinions Spanish (DOS) corpus was sourced from the web portal https://www.mimedicamento.es, an independent platform for sharing experiences with drugs. It is composed of 877 opinions about the 30 most reviewed drugs as of March 14, 2017. Each review contains the date on which it was posted, the gender and age of the consumer, the disease and the drug used for it, the textual opinion and a rating for the following satisfaction categories: overall, efficacy, side effects quantity, side effects severity and ease of use. Moreover, each review was manually annotated at aspect level with the side effects described in it, together with an opinion polarity label and an opinion intensity label according to the patient's experience. The corpus has 3,784 sentences containing a total of 2,230 side effects, of which 98 are positive, 2,119 negative and 13 neutral. Regarding the intensity of the side effects, 655 are of high intensity, 1,486 of medium intensity and 89 of low intensity.

How to cite:

Jiménez-Zafra, S. M., Martín-Valdivia, M. T., Molina-González, M. D., & Ureña-López, L. A. (2017). Corpus Annotation for Aspect Based Sentiment Analysis in Medical Domain. Proceedings of the 2nd International Workshop on Extraction and Processing of Rich Semantics from Medical Texts.

Files of the resource:

DOS.zip

For any questions related to the corpus, please send an email to Salud María Jiménez-Zafra or M. Teresa Martín-Valdivia.

COPOS

Resource type:

Corpus

Description:

This corpus was extracted by crawling the website www.masquemedicos.com. It is a collection of patient opinions about medical entities from six countries (Chile, Colombia, Ecuador, Spain, Mexico and Venezuela). It is composed of 743 reviews covering 34 medical specialities, of which 109 are negative and 634 are positive. The reviews are rated on a scale from 0 to 5 stars.

How to cite:

del Arco, F. M. P., Valdivia, M. T. M., Zafra, S. M. J., González, M. D. M., & Cámara, E. M. (2016). COPOS: Corpus Of Patient Opinions in Spanish. Application of Sentiment Analysis Techniques. Procesamiento del Lenguaje Natural, 57, 83-90.

For any questions related to the corpus, please send an email to M. Teresa Martín-Valdivia or Flor Miriam Plaza-del-Arco.

COAR

Resource type:

Corpora

Description:

COAR is a corpus of restaurant reviews for document-level polarity classification. It is composed of 2202 reviews from TripAdvisor, each scored on a scale from 1 (negative) to 5 (positive). The number of opinions per class is:

Rating 1 2 3 4 5 Total
#Opinions 565 246 188 333 870 2202
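Because the classes are unevenly represented, it can be useful to look at each rating's share of the corpus before training a classifier. The following is a small sketch using only the counts from the table above; the function name `class_shares` is an illustrative choice.

```python
# Opinions per rating, copied from the table above.
coar_counts = {1: 565, 2: 246, 3: 188, 4: 333, 5: 870}

def class_shares(counts):
    """Return the total number of opinions and each class's share of it."""
    total = sum(counts.values())
    shares = {rating: n / total for rating, n in counts.items()}
    return total, shares

total, shares = class_shares(coar_counts)
print(f"total opinions: {total}")
for rating, share in shares.items():
    print(f"rating {rating}: {share:.1%}")
```

The same function can be applied to any other per-class count table.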

Files of the resource:

CorpusCOAR.xlsx

For any questions about the corpus, please send an email to M. Dolores Molina or Eugenio Martínez.

SFU-Review-SP-Neg

Resource type:

Corpus

Description:

This corpus is an extension of the SFU Spanish Review Corpus (Brooke et al., 2009) with annotations about negation and its scope. It is a collection of 400 reviews of cars, hotels, washing machines, books, cell phones, music, computers and movies from the Ciao.es website. Each domain contains 25 positive and 25 negative reviews. Each review has been annotated at the token level with the lemma and the PoS and at the sentence level with negative keywords, their linguistic scope, the event and how the polarity of the sentence is affected by negation (if there is a change in the polarity or an increment or reduction of its value), also taking into account intensifiers and diminishers.

How to cite:

Jiménez-Zafra, S. M., Taulé, M., Martín-Valdivia, M. T., Ureña-López, L. A., & Martí, M. A. (2018). SFU Review SP-NEG: a Spanish corpus annotated with negation for sentiment analysis. A typology of negation patterns. Language Resources and Evaluation, 52(2), 533-569.

Jiménez-Zafra, S. M., Martín-Valdivia, M. T., Molina-González, M. D., & Ureña-López, L. A. (2018). Relevance of the SFU Review SP-NEG corpus annotated with the scope of negation for supervised polarity classification in Spanish. Information Processing & Management, 54(2), 240-251.

Jiménez-Zafra, S. M., Martin, M., Lopez, L. A. U., Marti, T., & Taulé, M. (2016). Problematic cases in the annotation of negation in Spanish. In Proceedings of the Workshop on Extra-Propositional Aspects of Meaning in Computational Linguistics (ExProM) (pp. 42-48).

Martí, M. A., Martín-Valdivia, M. T., Taulé, M., Jiménez-Zafra, S. M., Nofre, M., & Marsó, L. (2016). La negación en español: análisis y tipología de patrones de negación. Procesamiento del Lenguaje Natural, 57, 41-48.

Files of the resource:

Version 1.0.0: SFU_Review_SP_Neg.zip

For any questions related to the corpus, please send an email to Salud María Jiménez-Zafra or M. Teresa Martín-Valdivia.

COAH

Resource type:

Corpora

Description:

COAH is a corpus of hotel reviews for document-level polarity classification. It is composed of 1816 reviews from TripAdvisor, each scored on a scale from 1 (negative) to 5 (positive). The number of opinions per class is:

Rating 1 2 3 4 5 Total
#Opinions 312 199 285 489 531 1816

Some linguistic features of the corpus are:

Number of opinions 1816
Number of tokens 272446
Number of words 239749
Number of unique words 154297
Lexical diversity 0.6435
Number of characters 1372737
Number of characters without whitespaces 1135306
Number of nouns 55530
Number of verbs 40318
Number of adjectives 19935
Number of adverbs 16629
Number of lemmas 239749
Number of unique lemmas 138549
Lemma diversity 0.577
Number of senses 106205
Number of unique senses 77397
Mean length of sentences 23.245
Mean of nouns 0.231
Mean of verbs 0.168
Mean of adjectives 0.083
Mean of adverbs 0.069
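The ratio features in this list appear to be derivable from the raw counts. The sketch below recomputes them under two assumptions that the source does not state explicitly: that diversity is the number of unique items divided by the total number of items, and that the "mean of" values are part-of-speech counts divided by the number of words. The results agree with the listed values to about three decimal places.

```python
# Counts copied from the feature list above.
words = 239749
unique_words = 154297
lemmas = 239749
unique_lemmas = 138549
pos_counts = {"nouns": 55530, "verbs": 40318,
              "adjectives": 19935, "adverbs": 16629}

# Assumption: diversity = unique items / total items (a type/token ratio).
lexical_diversity = unique_words / words
lemma_diversity = unique_lemmas / lemmas

# Assumption: "mean of X" = count of X / number of words.
pos_ratios = {pos: n / words for pos, n in pos_counts.items()}

print(f"lexical diversity: {lexical_diversity:.4f}")
print(f"lemma diversity:   {lemma_diversity:.4f}")
for pos, ratio in pos_ratios.items():
    print(f"share of {pos}: {ratio:.4f}")
```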

How to cite:

Molina-González, M. D., Martínez-Cámara, E., Martín-Valdivia, M. T., Ureña-López, L. A. (2014). Cross-domain sentiment analysis using spanish opinionated words. Natural Language Processing and Information Systems, Lecture Notes in Computer Science, vol. 8455, pp. 214-219. Springer International Publishing. DOI: 10.1007/978-3-319-07983-7_28

Files of the resource:

corpus_coah.xml

For any questions about the corpus, please send an email to M. Dolores Molina or Eugenio Martínez.

COST

Resource type:

Corpora

Description:

A corpus of Spanish tweets for sentiment analysis. It is composed of 34634 tweets tagged with noisy labels: 17317 tweets are positive and 17317 are negative, so the corpus is balanced.

How to cite:

Martínez-Cámara, E., Martín-Valdivia, M. T., Ureña-López, L. A., & Mitkov, R. (2015). Polarity classification for Spanish tweets using the COST corpus. Journal of Information Science, 41(3), 263-272. DOI: 10.1177/0165551514566564.

Resource files:

To obtain the corpus, please send an email to Eugenio Martínez Cámara (emcamara@ujaen.es).

Reuters Corpus

Resource type:

Corpora

Description:

Two CDs containing 810,000 English-language news stories from Reuters, taking up 2.5 GB uncompressed. The license is free for non-commercial use. The corpus is supplied upon signed request, with a commitment to reference it in any paper that uses it.

OHSUMED

Resource type:

Corpora

Description:

A test collection (documents, topics and relevance judgments) used in TREC-9. It consists of a set of 348,566 MEDLINE references.

20-Newsgroups

Resource type:

Corpora

Description:

20000 messages taken from 20 Usenet newsgroups. Available for scientific use.

Resource link:

Reuters

Resource type:

Corpora

Description:

A text categorization collection. It is a resource for research in information retrieval, machine learning and other corpus-based research. Available for scientific use.

Resource link:

HEP Collection

Resource type:

Corpora

Description:

This corpus is oriented to the study of multi-label text classifiers. It consists of scientific papers in the field of High Energy Physics (HEP) obtained from the CDS document server of the European Organization for Nuclear Research (CERN). The corpus is divided into three subsets (called partitions), where each partition consists of two files: one containing the record of each item (with information such as the abstract, authors and, of course, classes or keywords) in compressed XML format, and another containing a plain text version of the complete paper generated from the PDF available in the CERN databases (tar + gzip format). Classes are defined by the XML tag KEYWORD. These are the labels manually assigned from the DESY thesaurus. You can get more information about the DESY thesaurus.

  • Partition hepth: 18,114 Theoretical Physics documents (metadata – 5.3 MB) (papers – 226 MB)
  • Partition hepex: 2,599 Experimental Physics documents (metadata – 1.6 MB) (papers – 28 MB)
  • Partition astroph: 2,716 Astrophysics documents (metadata – 1.1 MB) (papers – 29 MB)
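As a rough sketch of how the class labels of a partition could be collected, the function below streams a metadata file and counts the text of every KEYWORD element. It assumes that the metadata is gzip-compressed, that the tag is written exactly as `KEYWORD` without an XML namespace, and that each element holds one label; none of these details is confirmed by the description above. The file name in the usage comment is hypothetical.

```python
import gzip
import xml.etree.ElementTree as ET
from collections import Counter

def keyword_counts(path):
    """Count the text of every KEYWORD element in a gzip-compressed XML file."""
    counts = Counter()
    with gzip.open(path, "rb") as handle:
        # iterparse reads the file incrementally instead of loading it whole.
        for _event, elem in ET.iterparse(handle, events=("end",)):
            if elem.tag == "KEYWORD" and elem.text:
                counts[elem.text.strip()] += 1
            # Drop the element's content once it has been processed.
            elem.clear()
    return counts

# Hypothetical usage; replace the path with the actual metadata file:
# counts = keyword_counts("hepth-metadata.xml.gz")
# print(counts.most_common(10))
```

A label frequency table of this kind gives a first impression of how skewed the multi-label distribution is.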

Updated on 23.04.2007: Thanks to Ioannis Katakis, from Aristotle University of Thessaloniki (Greece), for fixing some problems in the XML provided.

How to reference:

This corpus has been prepared by Arturo Montejo Ráez with metadata supplied by Jens Vigen and the CDS Support Team. For references use:

@InProceedings{montejo2004,
  author =        {Montejo-Ráez, A. and Steinberger, R. and Ureña-López, L. A.},
  title =         {Adaptive selection of base classifiers in one-against-all
                   learning for large multi-labeled collections},
  booktitle =     {Advances in Natural Language Processing: 4th International
                   Conference, EsTAL 2004},
  pages =         {1--12},
  year =          {2004},
  editor =        {Vicedo, J. L. and others},
  location =      {Alicante, Spain},
  number =        {3230},
  series =        {Lecture Notes in Artificial Intelligence},
  publisher =     {Springer}
}

Resource files:

hep-collection.rar