» SinaiSACorpus

Resource type:

Corpora

Description:

This corpus was prepared by the SINAI group in December 2008. SINAI SA (Sentiment Analysis) was built by collecting reviews from the Amazon website. Nearly 2,000 comments about different cameras were extracted.

Structure: The SINAI corpus contains 5 directories, each corresponding to the number of stars given in the reviews (e.g. directory 1 contains the comments rated with one star). Each directory contains one plain-text file per document/comment.

The number of comments per rating is as follows:

    • 1 star: 78 comments
    • 2 stars: 67 comments
    • 3 stars: 97 comments
    • 4 stars: 411 comments
    • 5 stars: 1,290 comments

Total: 1,943 comments
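
The directory layout described above can be read with a few lines of Python. This is a minimal sketch, not part of the corpus distribution: it assumes the five directories are literally named 1 to 5 inside the unpacked archive and that the files are UTF-8 text, neither of which is confirmed here. The function name load_reviews is hypothetical.

```python
from pathlib import Path


def load_reviews(corpus_root):
    """Return a list of (stars, text) pairs, one per comment file.

    Assumes corpus_root contains directories named "1" to "5"
    (an assumption about the unpacked archive), each holding one
    plain-text file per comment.
    """
    reviews = []
    for stars in range(1, 6):
        star_dir = Path(corpus_root) / str(stars)
        if not star_dir.is_dir():
            continue  # tolerate a missing rating directory
        for path in sorted(star_dir.iterdir()):
            if path.is_file():
                # errors="replace" avoids a crash if the encoding differs
                text = path.read_text(encoding="utf-8", errors="replace")
                reviews.append((stars, text))
    return reviews
```

Calling load_reviews on the unpacked corpus directory should then yield one pair per comment, with the star rating usable directly as a class label.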

Camera Comments
CanonA590IS 400
CanonA630 300
CanonSD1100IS 426
KodakCx7430 64
KodakV1003 95
KodakZ740 155
Nikon5700 119
Olympus1030SW 168
PentaxK10D 126
PentaxK200D 90
Total 1,943

Rushdi-Saleh, M., Martín-Valdivia, M. T., Montejo-Ráez, A., & Alfonso Ureña-López, L. (2011). Experiments with SVM to classify opinions in different domains. Expert Systems with Applications.
http://dx.doi.org/10.1016/j.eswa.2011.05.070

Resource files:

SINAI-SA-corpus.zip

» SMART

Resource type:

NLP and IR Software

Description:

Salton’s Magic Automatic Retriever of Text. An information retrieval system conceived as a tool for evaluating the effectiveness of many types of analysis and search procedures. It incorporates three different methods of language analysis: word, lemma and thesaurus.

Resource link:

» SOL

Resource type:

Lexicon

Description:

SOL is a domain-independent list of opinion-bearing words in Spanish.

The resource was built starting from the word list maintained by Professor Bing Liu (Bing Liu’s Opinion Lexicon). That list was translated automatically using the Reverso translator.

The list consists of 1,397 positive and 3,151 negative words. For more information on how the list was developed, see the article Bilingual Experiments on an Opinion Comparable Corpus (Martínez-Cámara et al., 2013).
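
A list of positive and negative words like this one is typically used by counting matches in a text. The sketch below shows that common baseline only; it is not necessarily how the authors used SOL. The function name and the four toy words are hypothetical, and loading the actual word lists from the archive is left out because the file layout is not described here.

```python
def lexicon_polarity(tokens, positive_words, negative_words):
    """Return (#positive matches - #negative matches) for a token list."""
    score = 0
    for token in tokens:
        word = token.lower()
        if word in positive_words:
            score += 1
        elif word in negative_words:
            score -= 1
    return score


# Toy word sets for illustration only; not taken from SOL.
positive = {"bueno", "excelente"}
negative = {"malo", "terrible"}

tokens = "La cámara es excelente y el precio es bueno".split()
print(lexicon_polarity(tokens, positive, negative))  # prints 2
```

A score above zero suggests a positive opinion and a score below zero a negative one; a real system would also need tokenization that strips punctuation and some handling of negation.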

Martínez-Cámara, E., Martín-Valdivia, M. T., Molina-Gonzalez, M. L. & Alfonso Ureña-López, L. (2013). Bilingual Experiments on an Opinion Comparable Corpus. Proceedings of the 4th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis
http://aclweb.org/anthology/W13-1612

Resource files:

sol.tar.gz

» SOM_PAK

Resource type:

Machine Learning and Data Mining Software

Description:

Software package for Kohonen Self-Organizing Maps. Implementation of the Kohonen algorithm, used for different applications: clustering, visualization, classification, function interpolation, vector quantization, etc. For Windows and Unix. License unknown.

Resource link:

» Spanish QC

Resource type:

Corpora

Description:

This resource consists of 6,305 Spanish questions tagged for question classification in question answering. It follows the taxonomy defined in the paper “X. Li and D. Roth. Learning Question Classifiers”, with the following general and detailed categories:

  • ABBR: abbreviation, expansion
  • DESC: definition, description, manner, reason
  • ENTY: animal, body, color, creation, currency, disease/medical, event, food, instrument, language, letter, other, plant, product, religion, sport, substance, symbol, technique, term, vehicle, word
  • HUM: description, group, individual, title
  • LOC: city, country, mountain, other, state
  • NUM: code, count, date, distance, money, order, other, percent, period, speed, temperature, size, weight

The resource was created starting from a set of labeled questions in English. The resulting Spanish questions were labeled and checked by 3 people.
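
The two-level taxonomy above maps naturally onto a (general, detailed) pair per question. The sketch below assumes each line of the label file has the form GENERAL:detailed followed by a space and the question, as in the English question set by Li and Roth. That format is an assumption and has not been checked against the distributed file. The function name and the sample question are hypothetical.

```python
def parse_labeled_question(line):
    """Split 'GENERAL:detailed question text' into its three parts.

    The line format is an assumption, modeled on the English
    Li and Roth question classification data.
    """
    label, _, question = line.strip().partition(" ")
    general, _, detailed = label.partition(":")
    return general, detailed, question


# Hypothetical sample line, not taken from the corpus.
sample = "LOC:city ¿Cuál es la capital de Francia?\n"
print(parse_labeled_question(sample))
```

Keeping the general and detailed labels separate makes it easy to train either a 6-class classifier or a fine-grained one from the same file.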

García-Cumbreras, M. A., Ureña-López, L. A. & Martínez-Santiago, F. (2006). BRUJA: Question Classification for Spanish. Using Machine Translation and an English Classifier. EACL 2006 Workshop on Multilingual Question Answering – MLQA06.

Resource files:

Clasificacion-QA-6305.label_.txt

» SVM-Light

Resource type:

Machine Learning and Data Mining Software

Description:

Classifier based on Support Vector Machines. Implemented in C. Free for scientific use.

Resource link:

» TeCat

Resource type:

Software

Description:

TeCat stands for text categorization. It is a tool for building automatic multi-label text classifiers. With TeCat you can experiment with different collections and base classifiers in order to build a multi-label classifier.

Montejo-Ráez A., Ureña-López, L. A., Steinberger, R. Adaptive Selection of Base Classifiers in One-Against-All Learning for Large Multi-labeled Collections. Lecture Notes in Computer Science Volume 3230, 2004, pp 1-12.
If you use it, please send a notification email to amontejo AT ujaen point es.

License: GPL

Resource files:

tecat-0.2.tar__0.gz

» TextGarden

Resource type:

Machine Learning and Data Mining Software

Description:

Set of software tools for supervised and unsupervised classification, web mining, visualization, etc. Written in C++; runs on Windows, and on GNU/Linux via Wine. License undetermined; freely usable for research.

Resource link:

» TIMBL-5.1

Resource type:

Machine Learning and Data Mining Software

Description:

Implementation of memory-based (k-NN) classifiers and decision-tree approximations of them. The package includes the IB1, IB2, TRIBL, TRIBL2 and IGTree algorithms, and provides several feature weighting metrics.
A Python interface (Python-TiMBL) is available. Freely available for research and education.

Resource link:

» TnT

Resource type:

NLP and IR Software

Description:

Part-of-speech tagger for natural language processing tasks. Optimized for speed and for training on a wide variety of documents. Free license agreement for non-profit research.

Resource link:

» TREC_EVAL

Resource type:

NLP and IR Software

Description:

Evaluation tool of the Text Retrieval Conference (TREC). It is the standard tool used by the TREC community to evaluate ad hoc retrieval runs, given a results file and a standard set of known (judged) results.
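
To illustrate the kind of measure such an evaluation produces, the sketch below computes average precision for a single query from a ranked list and a set of known relevant documents. It is an independent illustration, not the code or the file handling of TREC_EVAL; the function name and document identifiers are hypothetical, and the ranked list is assumed to contain no duplicate identifiers.

```python
def average_precision(ranked_doc_ids, relevant_doc_ids):
    """Average precision of one ranked list against known relevant docs.

    Precision is taken at each rank where a relevant document appears,
    and the sum is divided by the total number of relevant documents.
    """
    relevant = set(relevant_doc_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


# Relevant documents retrieved at ranks 1 and 3: (1/1 + 2/3) / 2
print(average_precision(["d1", "d2", "d3"], {"d1", "d3"}))
```

Averaging this value over all queries of a run gives mean average precision, one of the figures commonly reported for ad hoc retrieval.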

Resource link:

» Treetagger

Resource type:

NLP and IR Software

Description:

Part-of-speech tagger. Executables are available for Sparc workstations, Linux and Windows PCs, and Macs. Free distribution.

Resource link:

» WEKA

Resource type:

Machine Learning and Data Mining Software

Description:

Java Toolkit for data mining and machine learning. The algorithms can be applied directly to a dataset or called from your own Java code.
Weka contains tools for preprocessing, classification, regression, clustering, association rules, and visualization. GPL license.

Resource link:

» Wikipedia XML Corpus

Resource type:

Machine Learning and Data Mining Software

Description:

Corpus of Wikipedia articles in several languages, oriented to different tasks: classification, text retrieval, multimodal, etc. GNU Free Documentation License.

Resource link:

» XELOPES

Resource type:

Machine Learning and Data Mining Software

Description:

Library for data mining. It has versions in C++ and Java. A GPL version exists.

Resource link:
