This page lists the resources produced by this research project.

  • Multilingual corpora annotated with collocations and lexical functions: GitHub (paper).
  • Data sets of bilingual collocations: GitHub (paper).
  • Automatically aligned collocations in English, Portuguese, and Spanish (link).

Due to space limitations, the following resources are available only on request:

  • English, Portuguese, and Spanish word embeddings trained on large corpora (with lemma_PoS-TAG entries).
  • Cross-lingual models for these three languages, as well as for French and Catalan.
  • Corpora in English, Portuguese, and Spanish (with more than one trillion tokens each), lemmatized, PoS-tagged, and parsed with Universal Dependencies.