» COAH

Resource type:

Corpora

Description:

COAH is a corpus of hotel reviews for document-level polarity classification tasks. The corpus is composed of 1816 reviews from TripAdvisor, each scored on a scale from 1 (negative) to 5 (positive). The number of opinions per class is:

Rating 1 2 3 4 5 Total
#Opinions 312 199 285 489 531 1816

Some linguistic features of the corpus are:

Number of opinions 1816
Number of tokens 272446
Number of words 239749
Number of unique words 154297
Lexical diversity 0.6435
Number of characters 1372737
Number of characters without whitespaces 1135306
Number of nouns 55530
Number of verbs 40318
Number of adjectives 19935
Number of adverbs 16629
Number of lemmas 239749
Number of unique lemmas 138549
Lemma diversity 0.577
Number of senses 106205
Number of unique senses 77397
Mean sentence length 23.245
Mean of nouns 0.231
Mean of verbs 0.168
Mean of adjectives 0.083
Mean of adverbs 0.069
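
The ratio figures in the list above appear to be derived from the raw counts. The following is a minimal sketch for checking them, assuming that lexical diversity is unique words divided by words, that lemma diversity is unique lemmas divided by lemmas, and that each part-of-speech mean is that category's count divided by the number of words. These definitions are inferred from the published numbers and are not stated by the resource authors.

```python
# Counts copied from the COAH feature list.
counts = {
    "words": 239749,
    "unique_words": 154297,
    "lemmas": 239749,
    "unique_lemmas": 138549,
    "nouns": 55530,
    "verbs": 40318,
    "adjectives": 19935,
    "adverbs": 16629,
}


def ratio(numerator: str, denominator: str) -> float:
    """Return counts[numerator] / counts[denominator]."""
    return counts[numerator] / counts[denominator]


# Assumed definitions (inferred, not documented by the corpus authors).
derived = {
    "lexical_diversity": ratio("unique_words", "words"),
    "lemma_diversity": ratio("unique_lemmas", "lemmas"),
    "mean_nouns": ratio("nouns", "words"),
    "mean_verbs": ratio("verbs", "words"),
    "mean_adjectives": ratio("adjectives", "words"),
    "mean_adverbs": ratio("adverbs", "words"),
}

for name, value in derived.items():
    print(f"{name}: {value:.4f}")
```

Under these assumptions the computed values agree with the published figures to within one unit in the last printed digit, which suggests the published values were truncated rather than rounded.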

How to cite:

Molina-González, M. D., Martínez-Cámara, E., Martín-Valdivia, M. T., & Ureña-López, L. A. (2014). Cross-domain sentiment analysis using Spanish opinionated words. Natural Language Processing and Information Systems, Lecture Notes in Computer Science, vol. 8455, pp. 214-219. Springer International Publishing. DOI: 10.1007/978-3-319-07983-7_28

Files of the resource:

corpus_coah.xml

For any questions about the corpus, please send an email to M. Dolores Molina or Eugenio Martínez.

» COAR

Resource type:

Corpora

Description:

COAR is a corpus of restaurant reviews for document-level polarity classification tasks. The corpus is composed of 2202 reviews from TripAdvisor, each scored on a scale from 1 (negative) to 5 (positive). The number of opinions per class is:

Rating 1 2 3 4 5 Total
#Opinions 565 246 188 333 870 2202

Files of the resource:

CorpusCOAR.xlsx

For any questions about the corpus, please send an email to M. Dolores Molina or Eugenio Martínez.

» COPOD

Resource type:

Corpus

Description:

The Corpus Of Patient Opinions in Dutch (COPOD) has been built by crawling the well-known medical forum Zorgkaart Nederland on June 28, 2016. It is composed of 156,975 patient reviews about their experiences with physicians of 60 specialties. Each review contains a rating for different aspects (accommodation, appointment, therapy, staff attention, information and listening), on a scale from 1 to 10 stars, and an overall rating that corresponds to the average of the ratings of these aspects.
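
As a minimal sketch of how the overall rating relates to the aspect ratings, assume a review is represented as a dictionary holding one score from 1 to 10 per aspect. The field names and sample scores below are hypothetical and do not reflect the actual file format of COPOD.zip.

```python
from statistics import mean

# The six aspects named in the corpus description (identifiers are hypothetical).
ASPECTS = (
    "accommodation",
    "appointment",
    "therapy",
    "staff_attention",
    "information",
    "listening",
)


def overall_rating(aspect_ratings: dict) -> float:
    """Average the six aspect ratings of one review."""
    missing = [aspect for aspect in ASPECTS if aspect not in aspect_ratings]
    if missing:
        raise ValueError(f"missing aspect ratings: {missing}")
    return mean(aspect_ratings[aspect] for aspect in ASPECTS)


# Invented example review, for illustration only.
example_review = {
    "accommodation": 8,
    "appointment": 7,
    "therapy": 9,
    "staff_attention": 10,
    "information": 8,
    "listening": 6,
}

print(overall_rating(example_review))
```

Raising an error on missing aspects is a design choice of this sketch; how the corpus itself treats reviews with incomplete aspect ratings is not stated in the description.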

How to cite:

Jiménez-Zafra, S. M., Martín-Valdivia, M. T., Maks, I., & Izquierdo, R. (2017). Analysis of patient satisfaction in Dutch and Spanish online reviews. Procesamiento del Lenguaje Natural, 58, 101-108.

Files of the resource:

COPOD.zip

For any questions related to the corpus, please send an email to Salud María Jiménez Zafra or M. Teresa Martín-Valdivia.

» COPOS

Resource type:

Corpus

Description:

This corpus was extracted by crawling the website www.masquemedicos.com. The generated corpus is a collection of patient opinions about medical entities from six countries (Chile, Colombia, Ecuador, Spain, Mexico and Venezuela). It is composed of 743 reviews covering 34 medical specialities: 109 negative reviews and 634 positive reviews. The reviews are rated on a scale from 0 to 5 stars.

How to cite:

Plaza-del-Arco, F. M., Martín-Valdivia, M. T., Jiménez-Zafra, S. M., Molina-González, M. D., & Martínez-Cámara, E. (2016). COPOS: Corpus Of Patient Opinions in Spanish. Application of Sentiment Analysis Techniques. Procesamiento del Lenguaje Natural, 57, 83-90.

For any questions related to the corpus, please send an email to M. Teresa Martín-Valdivia or Flor Miriam Plaza-del-Arco.

» COST

Resource type:

Corpora

Description:

Corpus of Spanish tweets for sentiment analysis. The corpus is composed of 34634 tweets tagged with noisy labels: 17317 tweets are positive and 17317 are negative, so the corpus is balanced.

Resource files:

To obtain the corpus, please send an email to Eugenio Martínez Cámara (emcamara@ujaen.es).

» CRiSOL

Resource type:

Lexicon

Description:

CRiSOL is the result of combining two linguistic resources for Sentiment Analysis. One of them is iSOL, a list of opinion-bearing words in Spanish. The other is the widely known opinion lexicon SentiWordNet. The result is a version of SentiWordNet filtered by the words that appear in iSOL. The iSOL and SentiWordNet information in CRiSOL can be used jointly or independently.

CRiSOL is composed of the 8135 words of iSOL, of which 4434 are also linked to their polarity scores in SentiWordNet.
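
The combination described above can be sketched as follows. This is only an illustration under stated assumptions: the words, polarities and scores are invented toy data, and the mapping from Spanish words to SentiWordNet-style (positive, negative) scores is assumed to exist already. How CRiSOL actually links Spanish words to SentiWordNet entries is not described here.

```python
# Toy stand-in for iSOL: Spanish word -> polarity label (invented entries).
isol = {
    "bueno": "positive",
    "malo": "negative",
    "horrible": "negative",
}

# Hypothetical, pre-aligned SentiWordNet-style scores: word -> (positive, negative).
swn_scores = {
    "bueno": (0.75, 0.0),
    "malo": (0.0, 0.625),
    "mesa": (0.0, 0.0),  # not an iSOL word, so it is filtered out
}


def build_combined(isol_words: dict, scores: dict) -> dict:
    """Keep every iSOL word and attach SentiWordNet-style scores when available."""
    combined = {}
    for word, polarity in isol_words.items():
        combined[word] = {
            "isol_polarity": polarity,
            "swn_scores": scores.get(word),  # None when there is no linked score
        }
    return combined


crisol_like = build_combined(isol, swn_scores)
linked = {w for w, entry in crisol_like.items() if entry["swn_scores"] is not None}

print(len(crisol_like), "words,", len(linked), "linked to scores")
```

Keeping both fields side by side mirrors the statement that the two kinds of information can be used jointly or on their own: a classifier could read only `isol_polarity`, only `swn_scores`, or both.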

How to cite:

Molina González, M. Dolores, Martínez Cámara, Eugenio, & Martín Valdivia, M. Teresa. (2015). CRiSOL: Opinion Knowledge-base for Spanish. Procesamiento Del Lenguaje Natural, 55, 143-150.
http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/5226

Files of the resource:

crisol.tar.gz

» DOS

Resource type:

Corpus

Description:

The Drug Opinions Spanish (DOS) corpus was sourced from the web portal https://www.mimedicamento.es, which is an independent platform for sharing experiences with drugs. It is composed of 877 opinions about the 30 most reviewed drugs as of March 14, 2017. Each review contains the date on which it was posted, the gender and age of the consumer, the disease and the drug used for it, the textual opinion, and a rating for the following satisfaction categories: overall, efficacy, side effects quantity, side effects severity and ease of use. Moreover, each review was manually annotated at aspect level with the side effects described in it, together with an opinion polarity label and an opinion intensity label according to the patient's experience. The corpus has 3,784 sentences containing a total of 2,230 side effects, of which 98 are positive, 2,119 negative and 13 neutral. Regarding the intensity of the side effects, 655 are of high intensity, 1,486 of medium intensity and 89 of low intensity.

How to cite:

Jiménez-Zafra, S. M., Martín-Valdivia, M. T., Molina-González, M. D., & Ureña-López, L. A. (2017). Corpus Annotation for Aspect Based Sentiment Analysis in Medical Domain. Proceedings of the 2nd International Workshop on Extraction and Processing of Rich Semantics from Medical Texts.

Files of the resource:

DOS.zip

For any questions related to the corpus, please send an email to Salud María Jiménez-Zafra or M. Teresa Martín-Valdivia.

» emoti-sp

Resource type:

Lexicon

Description:

Linguistic resource for research on Sentiment Analysis of Spanish tweets. The lexicon is composed of 70 positive emoticons and 46 negative emoticons.

Files of the resource:

To obtain the resource, please send an email to Salud M. Jiménez Zafra (sjzafra@ujaen.es) or Eugenio Martínez Cámara (emcamara@ujaen.es).

» eSOL

Resource type:

Lexicon

Description:

eSOL is a list of domain-dependent opinion signal words in Spanish. The domain is movie reviews.

The list was built using a corpus-based approach, in this case on the Spanish Movie Reviews corpus. The list is composed of 2,535 positive words and 5,639 negative words. For more information on how the list was developed see the paper: Semantic Orientation for Polarity Classification in Spanish Reviews.

Molina-González, M. D., Martínez-Cámara, E., Martín-Valdivia, M. T., & Perea-Ortega, J. M. (2013). Semantic orientation for polarity classification in Spanish reviews. Expert Systems with Applications, 40(18), 7250-7257.

http://dx.doi.org/10.1016/j.eswa.2013.06.076

Resource files:

esol.tar.gz

» eSOLdomainGlobal

Resource type:

Lexicon

Description:

One of the main problems in Opinion Analysis is generating resources adapted to a specific domain. eSOLdomainGlobal is a set of lists of opinion signal words in Spanish that cover 8 different domains: cars, hotels, washing machines, books, mobile phones, music, computers and movies. The 8 lists were generated from the iSOL lexicon using a corpus-based approach applied to the Spanish version of the SFU Review Corpus.

Domain Positive words Negative words
Cars 2528 5648
Hotels 2517 5636
Washers 2520 5639
Books 2529 5651
Mobile 2529 5657
Music 2538 5645
Computers 2527 5644
Films 2535 5648

Resource files:

eSOLdomainGlobal.rar

» EVOCA Corpus

Resource type:

Corpora

Description:

EVOCA (English Version of OCA) is an English corpus generated by translating the Arabic corpus OCA. The corpus contains movie reviews and is divided into 250 positive reviews and 250 negative reviews. It was translated in April 2011. Some statistics on the corpus are shown in the following table:

Negative Positive
Total documents 250 250
Total tokens 122,135 153,581
Average tokens per review 488.54 614.32
Total sentences 5,030 3,483
Average sentences per review 20.12 13.93

Rushdi Saleh, M., Martín-Valdivia, M. T., Ureña-López, L. A. & Perea-Ortega, J. M. (2011). Bilingual Experiments with an Arabic-English Corpus for Opinion Mining. Proceedings of Recent Advances in Natural Language Processing, pages 740–745.

For any questions about the corpus, please send an email to Mohammed Saleh or José M. Perea.

Resource files:

EVOCA-corpus.rar

» Hashtags-sp

Resource type:

Lexicon

Description:

Linguistic resource for research on Sentiment Analysis of Spanish tweets. The lexicon is composed of 172 positive Twitter hashtags and 127 negative Twitter hashtags.

Files of the resource:

To obtain the resource, please send an email to Salud M. Jiménez Zafra (sjzafra@ujaen.es) or Eugenio Martínez Cámara (emcamara@ujaen.es).

» HEP Collection

Resource type:

Corpora

Description:

This corpus is intended for the study of multi-label text classifiers. It consists of scientific papers in the field of High Energy Physics (HEP) obtained from the CDS document server of the European Organization for Nuclear Research (CERN). The corpus is divided into three subsets (called partitions), and each partition consists of two files: one containing the record of each item (with information such as the abstract, the authors and, of course, the classes or keywords) in compressed XML format, and another containing a plain text version of the complete paper generated from the PDF available in the CERN databases (tar + gzip format). Classes are defined by the XML tag KEYWORD. These are the labels manually assigned from the DESY thesaurus. More information about the DESY thesaurus is available.

  • Partition hepth: 18,114 Theoretical Physics documents (metadata: 5.3 MB) (papers: 226 MB)
  • Partition hepex: 2,599 Experimental Physics documents (metadata: 1.6 MB) (papers: 28 MB)
  • Partition astroph: 2,716 Astrophysics documents (metadata: 1.1 MB) (papers: 29 MB)
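
Since the classes of each paper are carried by KEYWORD elements in the XML metadata, the label set of each record can be collected with a standard XML parser. The sketch below is illustrative only: the tag name KEYWORD comes from the description above, while the surrounding layout (a `records` root, `record` elements with an `id` attribute) and the sample content are assumptions made for this example and may not match the distributed files.

```python
import xml.etree.ElementTree as ET

# Invented sample; only the KEYWORD tag name is taken from the corpus description.
SAMPLE = """<records>
  <record id="1">
    <abstract>Toy abstract text.</abstract>
    <KEYWORD>string model</KEYWORD>
    <KEYWORD>supersymmetry</KEYWORD>
  </record>
  <record id="2">
    <abstract>Another toy abstract.</abstract>
    <KEYWORD>neutrino</KEYWORD>
  </record>
</records>"""


def labels_per_record(xml_text: str) -> dict:
    """Map each record id to the list of its KEYWORD labels."""
    root = ET.fromstring(xml_text)
    labels = {}
    for record in root.iter("record"):  # hypothetical record element name
        labels[record.get("id")] = [
            kw.text.strip() for kw in record.iter("KEYWORD") if kw.text
        ]
    return labels


print(labels_per_record(SAMPLE))
```

A mapping of this kind, from document to a list of labels, is the usual starting point for multi-label classification experiments, where each document may carry several classes at once.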

Updated on 23.04.2007: Thanks to Ioannis Katakis, from Aristotle University of Thessaloniki (Greece), for correcting some problems in the supplied XML.

How to reference:

This corpus has been prepared by Arturo Montejo Ráez with metadata supplied by Jens Vigen and the CDS Support Team. For references use:

@InProceedings{montejo2004,
  author =        {Montejo-Ráez, A. and Steinberger, R. and Ureña-López, L. A.},
  title =         {Adaptive selection of base classifiers in one-against-all
                   learning for large multi-labeled collections},
  booktitle =     {Advances in Natural Language Processing: 4th International
                   Conference, EsTAL 2004},
  pages =         {1--12},
  year =          {2004},
  editor =        {Vicedo J. L. et al.},
  location =      {Alicante, Spain},
  number =        {3230},
  series =        {Lecture Notes in Artificial Intelligence},
  publisher =     {Springer}
}

Resource files:

hep-collection.rar

» iSOL

Resource type:

Lexicon

Description:

iSOL is a list of domain-independent opinion signal words in Spanish.

The resource was built starting from the word list maintained by Professor Bing Liu (Bing Liu’s Opinion Lexicon). The word list was automatically translated using the Reverso translator and subsequently corrected manually.

The list consists of 2,509 positive and 5,626 negative words. For more information on how the list was developed see the paper: Semantic Orientation for Polarity Classification in Spanish Reviews.

Reference

If you use iSOL, please cite the following paper:

Molina-González, M. D., Martínez-Cámara, E., Martín-Valdivia, M. T., & Perea-Ortega, J. M. (2013). Semantic orientation for polarity classification in Spanish reviews. Expert Systems with Applications, 40(18), 7250-7257.

Files of the resource:

isol.tar.gz

» MCE Corpus

Resource type:

Corpora

Description:

The MuchoCine corpus in English (MCE) is the translated version of the MuchoCine corpus (Spanish Movie Reviews). The MuchoCine corpus was developed by the researcher Fermín Cruz Mata and presented in 2008 in issue 41 of the journal Procesamiento del Lenguaje Natural, in the paper titled Document Classification based on Opinion: experiments with a corpus of Spanish cinema reviews.

The paper Sentiment polarity detection in Spanish reviews combining supervised and unsupervised approaches checks the validity of a methodology for polarity classification in Spanish that combines three classifiers: two supervised classifiers (on texts in English and in another language) and one unsupervised classifier that uses an English language resource for sentiment analysis. This methodology was previously proposed for opinions in Arabic in the paper Improving Polarity Classification of Bilingual Parallel Corpora combining Machine Learning and Semantic Orientation approaches (in press).

The polarity of the documents in the corpus is measured on a scale of 1 to 5, with 1 being very bad and 5 very good. The details of the corpus are:

Polarity Number docs.
1 351
2 923
3 1253
4 890
5 461

The use of this corpus is only allowed for research purposes. If you use it, you must cite the following paper:

Martín-Valdivia, M. T., Martínez-Cámara, E., Perea-Ortega, J. M., & Ureña-López, L. A. (2012). Sentiment polarity detection in Spanish reviews combining supervised and unsupervised approaches. Expert Systems with Applications.

http://dx.doi.org/10.1016/j.eswa.2012.12.084

For any questions about the corpus, please send an email to José M. Perea or Eugenio Martínez Cámara.

Resource files:

MCE-corpus.tar.gz