» OCA Corpus

Resource type:

Corpora

Description:

OCA is an Arabic corpus of movie reviews. This corpus has been generated from comments in Arabic obtained from different web pages shown in the following table:

Name | Webpage | Vote system | Positive | Negative
Cinema Al Rasid | http://cinema.al-rasid.com/ | 10 | 36 | 1
Film Reader | http://filmreader.blogspot.com/ | 5 | 0 | 92
Hot Movie Reviews | http://hotmoviews.blogspot.com | 5 | 45 | 4
Elcinema | http://www.elcinema.com | 10 | 0 | 56
Grind House | http://grindh.com | 10 | 38 | 0
Mzyondubai | http://www.mzyondubai.com | 10 | 0 | 15
Aflamee | http://aflamee.com | 5 | 0 | 1
Grind Film | http://grindfilm.blogspot.com/ | 10 | 0 | 8
Cinema Gate | http://www.cingate.net | Bad/Good | 0 | 1
Emad Ozery Blog | http://emadozery.blogspot.com | 10 | 0 | 1
Fil Fan | http://www.filfan.com | 5 | 81 | 20
Sport4Ever | http://sport4ever.maktoob.com | 10 | 0 | 1
DVD4ArabPos | http://dvd4arab.maktoob.com | 10 | 11 | 0
Gamraii | http://www.gamraii.com | 10 | 39 | 0
Shadows and Phantoms | http://shadowsandphantoms.blogspot.com | 10 | 0 | 50
Total | | | 250 | 250

The corpus was generated in October 2010. Some statistics on it are shown in the following table:

Statistic | Negative | Positive
Total documents | 250 | 250
Total tokens | 94,556 | 121,392
Average tokens per comment | 378 | 485
Total sentences | 4,881 | 3,137
Average sentences per comment | 20 | 13
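
The averages in the table above follow directly from the totals. As a minimal sketch, the following Python snippet recomputes them from the figures in the table; the dictionary layout and the function name `per_comment` are illustrative choices, not part of the corpus distribution.

```python
# Figures copied from the statistics table above.
stats = {
    "negative": {"documents": 250, "tokens": 94_556, "sentences": 4_881},
    "positive": {"documents": 250, "tokens": 121_392, "sentences": 3_137},
}


def per_comment(total: int, documents: int) -> float:
    """Average count per comment (one document is one comment)."""
    return total / documents


# Recompute both averages for each polarity class.
averages = {
    polarity: {
        "tokens": per_comment(s["tokens"], s["documents"]),
        "sentences": per_comment(s["sentences"], s["documents"]),
    }
    for polarity, s in stats.items()
}

for polarity, values in averages.items():
    print(polarity, values)
```

The exact quotients are 378.224 and 485.568 tokens, and 19.524 and 12.548 sentences, so the table's token averages appear to be truncated while the sentence averages match ordinary rounding.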

Rushdi-Saleh, M., Martín-Valdivia, M. T., Alfonso Ureña-López, L. & Perea-Ortega, J. M. (2011). OCA: Opinion corpus for Arabic. Journal of the American Society for Information Science and Technology.
http://dx.doi.org/10.1002/asi.21598

For any questions about the corpus, send an email to Mohammed Saleh or José M. Perea.

Resource files:

OCA-corpus.zip

» SFU-Review-SP-Neg

Resource type:

Corpus

Description:

This corpus is an extension of the SFU Spanish Review Corpus (Brooke et al., 2009) with annotations about negation and its scope. It is a collection of 400 reviews of cars, hotels, washing machines, books, cell phones, music, computers and movies from the Ciao.es website. Each domain contains 25 positive and 25 negative reviews. Each review has been annotated at the token level with the lemma and the PoS and at the sentence level with negative keywords, their linguistic scope, the event and how the polarity of the sentence is affected by negation (if there is a change in the polarity or an increment or reduction of its value), also taking into account intensifiers and diminishers.

How to cite:

Jiménez-Zafra, S. M., Taulé, M., Martín-Valdivia, M. T., Ureña-López, L. A., & Martí, M. A. (2018). SFU Review SP-NEG: a Spanish corpus annotated with negation for sentiment analysis. A typology of negation patterns. Language Resources and Evaluation, 52(2), 533-569.

Jiménez-Zafra, S. M., Martín-Valdivia, M. T., Molina-González, M. D., & Ureña-López, L. A. (2018). Relevance of the SFU Review SP-NEG corpus annotated with the scope of negation for supervised polarity classification in Spanish. Information Processing & Management, 54(2), 240-251.

Jiménez-Zafra, S. M., Martin, M., Lopez, L. A. U., Marti, T., & Taulé, M. (2016). Problematic cases in the annotation of negation in Spanish. In Proceedings of the Workshop on Extra-Propositional Aspects of Meaning in Computational Linguistics (ExProM) (pp. 42-48).

Martí, M. A., Martín-Valdivia, M. T., Taulé, M., Jiménez-Zafra, S. M., Nofre, M., & Marsó, L. (2016). La negación en español: análisis y tipología de patrones de negación. Procesamiento del Lenguaje Natural, 57, 41-48.

Files of the resource:

Version 1.0.0: SFU_Review_SP_Neg.zip

For any questions related to the corpus, please send an email to Salud María Jiménez-Zafra or M. Teresa Martín-Valdivia.

» SinaiSACorpus

Resource type:

Corpora

Description:

This corpus was prepared by the SINAI group in December 2008. SINAI SA (Sentiment Analysis) was created by crawling the Amazon website. Nearly 2,000 comments about different cameras were extracted.

Structure: The SINAI corpus contains 5 directories, each corresponding to a star rating (e.g., directory 1 contains the comments rated with one star). Each directory contains one plain-text file per document/comment.

The number of comments per rating is as follows:

    • 1 star: 78 comments
    • 2 stars: 67 comments
    • 3 stars: 97 comments
    • 4 stars: 411 comments
    • 5 stars: 1,290 comments

Total: 1,943 comments
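
The directory layout described above can be read with a few lines of Python. This is a minimal sketch under stated assumptions: the directories are assumed to be named `1` to `5` (inferred from the "directory 1" example), the files are assumed to be UTF-8 text, and `load_reviews` and `count_by_stars` are hypothetical helper names, not part of the distributed archive.

```python
from pathlib import Path


def load_reviews(root):
    """Return (stars, text) pairs from a tree with one folder per star rating.

    Assumes folders named "1" to "5" under `root`, each holding one
    plain-text file per comment. Missing folders are skipped.
    """
    reviews = []
    for stars in range(1, 6):
        star_dir = Path(root) / str(stars)
        if not star_dir.is_dir():
            continue
        # Sort for a reproducible order across platforms.
        for path in sorted(star_dir.iterdir()):
            if path.is_file():
                # Assumed encoding; undecodable bytes are replaced, not fatal.
                text = path.read_text(encoding="utf-8", errors="replace")
                reviews.append((stars, text))
    return reviews


def count_by_stars(reviews):
    """Count how many comments fall under each star rating."""
    counts = {stars: 0 for stars in range(1, 6)}
    for stars, _ in reviews:
        counts[stars] += 1
    return counts
```

If the extracted archive matches these assumptions, `count_by_stars(load_reviews(path_to_corpus))` should reproduce the per-rating counts listed above, which is a quick way to check that the download is complete.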

Camera | Comments
CanonA590IS | 400
CanonA630 | 300
CanonSD1100IS | 426
KodakCx7430 | 64
KodakV1003 | 95
KodakZ740 | 155
Nikon5700 | 119
Olympus1030SW | 168
PentaxK10D | 126
PentaxK200D | 90
Total | 1,943

Rushdi-Saleh, M., Martín-Valdivia, M. T., Montejo-Ráez, A., & Alfonso Ureña-López, L. (2011). Experiments with SVM to classify opinions in different domains. Expert Systems with Applications.
http://dx.doi.org/10.1016/j.eswa.2011.05.070

Resource files:

SINAI-SA-corpus.zip

» SOL

Resource type:

Lexicon

Description:

SOL is a domain-independent list of opinion-indicating words in Spanish.

The resource was built starting from the word list maintained by Professor Bing Liu (Bing Liu’s Opinion Lexicon). The list was translated automatically using the Reverso translator.

The list consists of 1,397 positive and 3,151 negative words. For more information on how the list was developed, see the article Bilingual Experiments on an Opinion Comparable Corpus.
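
A lexicon of this kind is typically used by counting positive and negative hits in a text. The sketch below illustrates that idea only; it is not the method of the cited article. The file format of the archive is not described here, so the word sets are passed in memory, and the sample words are illustrative and not necessarily entries of SOL.

```python
import re


def polarity(text, positive_words, negative_words):
    """Classify text by counting lexicon hits (positive minus negative)."""
    # \w matches accented letters in Python 3 strings, so Spanish words stay whole.
    tokens = re.findall(r"\w+", text.lower())
    score = sum(t in positive_words for t in tokens) - sum(
        t in negative_words for t in tokens
    )
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# Illustrative word sets (assumption: not taken from the SOL files).
sample_positive = {"bueno", "excelente"}
sample_negative = {"malo", "terrible"}

print(polarity("La película es excelente y el sonido es bueno",
               sample_positive, sample_negative))
```

Plain counting ignores negation and intensifiers, so a phrase such as "no es bueno" would still count as a positive hit.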

Martínez-Cámara, E., Martín-Valdivia, M. T., Molina-González, M. D. & Alfonso Ureña-López, L. (2013). Bilingual Experiments on an Opinion Comparable Corpus. Proceedings of the 4th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis.
http://aclweb.org/anthology/W13-1612

Resource files:

sol.tar.gz

» Spanish QC

Resource type:

Corpora

Description:

This resource consists of 6,305 Spanish questions tagged for question classification in question answering. It follows the taxonomy defined in the paper “X. Li and D. Roth. Learning Question Classifiers”, with the following general and detailed categories:

  • ABBR: abbreviation, expansion
  • DESC: definition, description, manner, reason
  • ENTY: animal, body, color, creation, currency, disease/medical, event, food, instrument, language, letter, other, plant, product, religion, sport, substance, symbol, technique, term, vehicle, word
  • HUM: description, group, individual, title
  • LOC: city, country, mountain, other, state
  • NUM: code, count, date, distance, money, order, other, percent, period, speed, temperature, size, weight

Starting from a set of labeled questions in English, this resource was created with questions in Spanish that were labeled and checked by 3 people.
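
The two-level taxonomy above can be handled in code by splitting each label into its general and detailed parts. The sketch below assumes, without confirmation from the resource itself, that each line of the label file has the form `GENERAL:detailed question text`; the function name `parse_line` and the sample question are hypothetical.

```python
# The six general categories listed above.
GENERAL_CATEGORIES = {"ABBR", "DESC", "ENTY", "HUM", "LOC", "NUM"}


def parse_line(line):
    """Split an assumed 'GENERAL:detailed question' line into its three parts."""
    label, _, question = line.strip().partition(" ")
    general, separator, detailed = label.partition(":")
    if not separator or general not in GENERAL_CATEGORIES:
        raise ValueError(f"unexpected label: {label!r}")
    return general, detailed, question


# Illustrative line (assumption: not copied from the resource file).
print(parse_line("LOC:city ¿Cuál es la capital de Francia?"))
```

Only the general category is validated here, because the exact spelling of the detailed labels in the file is not given in this description.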

García-Cumbreras, M. A., Ureña-López, L. A. & Martínez-Santiago, F. (2006). BRUJA: Question Classification for Spanish. Using Machine Translation and an English Classifier. EACL 2006 Workshop on Multilingual Question Answering – MLQA06.

Resource files:

Clasificacion-QA-6305.label_.txt

» TeCat

Resource type:

Software

Description:

TeCat stands for text categorization. It is a tool for building multi-label automatic text classifiers. With TeCat you can experiment with different collections and classifiers in order to build a multi-label classifier.
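
To illustrate what a multi-label classifier built from binary base classifiers looks like, here is a minimal one-against-all sketch in Python. It is not TeCat's code or API: the class `BinaryWordVote` is a deliberately simple toy base classifier, and all names and the sample documents are assumptions made for this example.

```python
from collections import Counter


class BinaryWordVote:
    """Toy binary classifier: each word votes for or against one label."""

    def fit(self, docs, targets):
        self.pos, self.neg = Counter(), Counter()
        for doc, target in zip(docs, targets):
            # Count each word once per document, on the side of its class.
            (self.pos if target else self.neg).update(set(doc.lower().split()))
        return self

    def predict(self, doc):
        words = set(doc.lower().split())
        # Counter returns 0 for unseen words, so they do not affect the score.
        return sum(self.pos[w] - self.neg[w] for w in words) > 0


def train_one_against_all(docs, label_sets):
    """Train one binary model per label (that label against all others)."""
    labels = sorted(set().union(*label_sets))
    return {
        label: BinaryWordVote().fit(docs, [label in ls for ls in label_sets])
        for label in labels
    }


def predict_labels(models, doc):
    """A document receives every label whose binary model accepts it."""
    return {label for label, model in models.items() if model.predict(doc)}


docs = ["stock market shares", "football match goal", "football club shares sold"]
label_sets = [{"finance"}, {"sport"}, {"finance", "sport"}]
models = train_one_against_all(docs, label_sets)
print(predict_labels(models, "market shares"))
```

Because each label has its own model, a different base classifier can be chosen per label, which is the kind of experimentation the description mentions.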

Montejo-Ráez, A., Ureña-López, L. A., Steinberger, R. Adaptive Selection of Base Classifiers in One-Against-All Learning for Large Multi-labeled Collections. Lecture Notes in Computer Science Volume 3230, 2004, pp 1-12.
If you use this software, please notify the author by email at amontejo AT ujaen point es.

License: GPL

Resource files:

tecat-0.2.tar__0.gz