*******************************************************************************
                                README
*******************************************************************************

This README.txt shows how to use samulan, our system for Universal, Unsupervised, Uncovered Sentiment Analysis. We provide a demo together with the files needed to analyze texts in both English and Spanish. Samulan is licensed under the GNU General Public License (v3 or later; see also LICENSE.txt).

If you use this software, please cite the following article:

Vilares, D., Gómez-Rodríguez, C., & Alonso, M. A. (2017). Universal, unsupervised (rule-based), uncovered sentiment analysis. Knowledge-Based Systems, 118, 45-55.

@article{vilares2017universal,
  title={Universal, unsupervised (rule-based), uncovered sentiment analysis},
  author={Vilares, David and G{\'o}mez-Rodr{\'\i}guez, Carlos and Alonso, Miguel A},
  journal={Knowledge-Based Systems},
  volume={118},
  pages={45--55},
  year={2017},
  publisher={Elsevier}
}

The universal set of rules used to evaluate our system is located in demo/:

configuration-EN.xml (English configuration)
configuration-ES.xml (Spanish configuration)
configuration-DE.xml (German configuration)

Structure of SentiData:
|- maltparser
| | *.xml parser features file
| | *.mco file (the parser itself)
| | *.conf file
|- *.tagger
| ADJ_EmotionLookupTable.txt
| ADV_EmotionLookupTable.txt
| NOUN_EmotionLookupTable.txt
| VERB_EmotionLookupTable.txt
| Emoticon_LookupTable.txt
| NegatingWordList.txt
| BoosterWordList.txt
| WordToLemmasStrippingList.txt (optional: included here for English for direct comparison with SO-CAL by Taboada et al.)
| LemmasList.txt (optional: included for Spanish for comparison with the Spanish SO-CAL)

In the case of German, only a general EmotionLookupTable (EmotionLookupTable.txt) is available.

1. Prerequisites:

- Java 8 

2. Input file format:

- One line per sample. See the examples inside the demo folder for English (en.txt) and Spanish (es.txt).
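
Since each sample must fit on a single line, texts that contain line breaks need to be flattened before they are passed to samulan. The following is a minimal Python sketch of one way to do that. The function names are hypothetical and not part of samulan, and dropping empty samples is an assumption of this sketch; UTF-8 is used to match the -Dfile.encoding=UTF-8 option of the run command.

```python
import re


def to_input_lines(texts):
    """Collapse each text to a single line, dropping empty samples."""
    lines = []
    for text in texts:
        # Replace any run of whitespace (including newlines and tabs)
        # with one space so the sample stays on a single line.
        line = re.sub(r"\s+", " ", text).strip()
        if line:
            lines.append(line)
    return lines


def write_input_file(texts, path):
    """Write one sample per line, encoded as UTF-8."""
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        for line in to_input_lines(texts):
            handle.write(line + "\n")
```

For example, write_input_file(reviews, "my_reviews.txt") would produce a file that can be given to the -i option (my_reviews.txt is only a placeholder name).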

3. Running the system

- cd to the folder where the files have been downloaded.
- Open a terminal and execute (e.g. for English texts):
  java -Dfile.encoding=UTF-8 -jar -Xmx2g samulan-0.1.0.jar -s EN-SentiData -r configuration-EN.xml -i en.txt -p samulan.properties -v true

- If you plan instead to run samulan on a parsed CoNLL file (have a look at the format and the manual):

  java -Dfile.encoding=UTF-8 -jar -Xmx2g samulan-0.1.0.jar -s EN-SentiData -r configuration-EN.xml -c en_parsed.conll -p samulan.properties -v true
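
When samulan has to be run over several files, the commands above can be assembled from a script. The following Python sketch only rebuilds the two command lines shown in this section. The function names are hypothetical, and the pattern <LANG>-SentiData for the resource folder is an assumption made by analogy with EN-SentiData; check the folder names in your download. It also assumes the script is started from the folder that contains the jar and the configuration files.

```python
import subprocess


def build_command(lang, input_path, parsed=False):
    """Return the samulan command line as a list of arguments.

    lang is a code such as "EN"; parsed=True passes the input with -c
    (parsed CoNLL file) instead of -i (raw text, one sample per line).
    """
    input_flag = "-c" if parsed else "-i"
    return [
        "java", "-Dfile.encoding=UTF-8", "-jar", "-Xmx2g", "samulan-0.1.0.jar",
        "-s", lang + "-SentiData",  # assumed naming, as in EN-SentiData
        "-r", "configuration-" + lang + ".xml",
        input_flag, input_path,
        "-p", "samulan.properties",
        "-v", "true",
    ]


def run_samulan(lang, input_path, parsed=False):
    """Run samulan and return whatever it prints to standard output."""
    result = subprocess.run(
        build_command(lang, input_path, parsed),
        check=True, capture_output=True, encoding="utf-8",
    )
    return result.stdout
```

Calling run_samulan("EN", "en.txt") is then equivalent to the first command above, and run_samulan("EN", "en_parsed.conll", parsed=True) to the second.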
 



##########################################
Updating dictionaries to a specific domain
##########################################

Given a large number of subjective files, building your own dictionary is feasible with simple techniques such as PMI (pointwise mutual information). See, for example:

- Turney, P. D. (2002, July). Thumbs up or thumbs down?: semantic orientation applied to unsupervised classification of reviews. In Proceedings of the 40th annual meeting on association for computational linguistics (pp. 417-424). Association for Computational Linguistics.

- Mohammad, S. M., Kiritchenko, S., & Zhu, X. (2013). NRC-Canada: Building the state-of-the-art in sentiment analysis of tweets. arXiv preprint arXiv:1308.6242.

- Vilares, D., Doval, Y., Alonso, M. A., & Gómez-Rodríguez, C. (2014). LyS at TASS 2014: A prototype for extracting and analysing aspects from Spanish tweets. In TASS 2014 - Workshop on Sentiment Analysis at SEPLN 2014, Workshop Proceedings. XXX Congreso de la Sociedad Española de Procesamiento de Lenguaje Natural (SEPLN 2014), Girona, Spain.

You can also build your own lexicon manually or semi-automatically.
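
As a rough illustration of the PMI idea mentioned in this section, the Python sketch below scores each word as PMI(word, positive) - PMI(word, negative), in the spirit of Mohammad et al. (2013). It is not the procedure used to build the dictionaries shipped with samulan. It assumes you already have tokenized documents labelled (possibly noisily) as positive or negative; the function name, the min_count filter and the add-one smoothing are assumptions of this sketch. Converting the scores into the lookup table files is not shown.

```python
import math
from collections import Counter


def pmi_lexicon(positive_docs, negative_docs, min_count=1, smoothing=1.0):
    """Score each word as PMI(word, positive) - PMI(word, negative).

    Both inputs are lists of tokenized documents (lists of strings).
    A score above zero suggests a positive orientation, below zero a
    negative one.
    """
    pos_counts = Counter(tok for doc in positive_docs for tok in doc)
    neg_counts = Counter(tok for doc in negative_docs for tok in doc)
    n_pos = sum(pos_counts.values())
    n_neg = sum(neg_counts.values())
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes need at least one token")

    scores = {}
    for word in set(pos_counts) | set(neg_counts):
        f_pos = pos_counts[word]
        f_neg = neg_counts[word]
        if f_pos + f_neg < min_count:
            continue  # too rare to give a reliable estimate
        # The PMI difference simplifies to the log ratio below; the
        # smoothing term (an assumption of this sketch) avoids log(0).
        scores[word] = math.log2(
            ((f_pos + smoothing) * n_neg) / ((f_neg + smoothing) * n_pos)
        )
    return scores
```

With real data, a higher min_count helps to discard words that are too infrequent for the estimate to be meaningful.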