Jesus Vilares' NLP & IR bookmarks
GENERAL PURPOSE RESOURCES (mainly)
Tools: Taggers, Parsers, NER, NP
chunking, Language models, Concordances, Summarization,
Other
Corpora:
Large collections, Particular languages, Treebanks,
Discourse,
WSD,
Literature,
Acquisition
SGML/XML
Dictionaries
Lexical/morphological resources
Courses, Syllabi, and other Educational
Resources
Mailing lists
Other stuff on the Web:
General, IR, IE/Wrappers, People, Societies
Kenji Kita's
personal page
- website of Kenji Kita (Center
for Advanced Information Technology, University of Tokushima)
- extremely complete list of Speech
and Language Resources including:
- Universities and Academic Sites
- Natural Language Processing
- Software Tools for NLP
- Lexical Resources, Dictionaries
- Corpora, Text Resources
- Speech Databases
- Text Encoding
- Speech Recognition
- Speech
Synthesis
- Language and Linguistics
- Computer Assisted Language Learning
- Information Retrieval
- Personal Pages of Researchers
- Web Resources in Japan (in Japanese)
- Chinese Language Processing
- etc.
- very complete set of bookmark pages about:
- Corpora, Collections, Data
Archives
(mainly English)
- Non-English
Corpora
- Courses, FAQs, Info,
E-lists, Standards
- Software, Tools, Freq lists
- References, Papers, Journals
- Teaching & Misc Links
- People, Places &
Conferences Index
- etc.
- Papers
- Servers (i.e. demos)
- Software
- Corpora (English only?)
- Courses
MULTILINGUAL RESOURCES (mainly)
Multext
tools
- MULTEXT Project
resources (CNRS)
- Multilingual text editor
- SGML manipulation tools
- Text segmentation tools
- Morpho-lexical tools (lexical/PoS/morphological tools)
- Multilingual text aligner
- Speech processing tools
- Multilingual string libraries
- Character conversion
- see also ISSCO
downloads
- EUconst - The
European constitution
- OO - the
OpenOffice.org corpus
- KDE - KDE system
messages
- KDEdoc - the KDE
manual corpus
- PHP - the PHP manual
corpus
- EUROPARL -
European Parliament Proceedings 1996-2003
- freely downloadable NLP tools:
- sentence-aligner, tokenizers, sentence-splitters, PoS
taggers, lemmatizers, chunkers, XML-tools, encoding managers
- English, German, Swedish, French, Italian, Japanese, Portuguese
- freely downloadable corpus (including composition statistics)
- freely downloadable NLP tools:
- raw-text extractor (i.e. HTML parser), sentence-splitter,
sentence-aligner
University
of Maryland Parallel Corpus Project
- multilingual Bible:
- Cebuano, Chinese, Danish, English (outdated), Finnish, French,
Greek, Indonesian, Latin, Spanish, Swahili, Swedish,
Vietnamese
- web-based bilingual parallel corpora (url database):
- downloaded with STRAND system
- Japanese/Chinese/French/Arabic/Basque-English
Parallel Corpora in Uppsala
- parallel corpora projects in the Language Engineering group
(Språkteknologi) of the Linguistic department of the University of
Uppsala in Sweden
- Parallel corpora (including Nordic languages)
- Parallel corpora around the world
- Complete non-English, parallel & multilingual corpora
resources
Free
Online Dictionaries
- on-line bilingual dictionaries
- English «--»
Afrikaans/Danish/Dutch/Finnish/French/Hungarian/Indonesian/Italian/Japanese/Latin
/Norwegian/Portuguese/Russian/Spanish/Swahili/Swedish
Wordlists
- Word lists in all types of languages
The Bible Tool
- very complete site about the Bible:
- multilingual Bible
- on-line viewing
- Bible-related freely-available software (Windows, Linux, Mac,
iPAQ, etc.)
- related links
- researcher in Natural Language Processing, Machine Learning, and
Information Retrieval
- resources for:
Pascale Fung's personal
page
- website of Pascale Fung (Dept. of Electrical and Electronic
Engineering, Hong Kong Univ. of Science and Technology)
- research about comparable and
quasi-comparable corpora
Pamela
Forner's personal page
- website of Pamela Forner (Cognitive and
Communication Technologies-TTC
division)
- multilingual corpora resources
- parallel corpora (classified according to degree of availability)
- multilingual corpora
- projects
- institutions
- tools: concordancers, sentence/word aligners
- site of Emily M. Bender (Department of Linguistics, University of
Washington)
- corpus
resources (downloads and links): courses, web tools, corpora,
sites, software, conferences, standards, etc.
- extremely
complete resource page
- general resources
- corpora and corpus linguistics.
- multilingual and parallel corpora.
- electronic literary text archives.
- references, standards & educational resources
- tools
- localized resources for dozens of
languages:
- Afrikaans - Albanian - Albanian (Caucasic) - Arabic - Armenian -
Australian lgs. - Awabakal (Yuin-Kuric) - Azerbaijani - Barbadian
(Creole English) - Basque - Bengali - Berbice (Creole Dutch) -
Bulgarian - Catalan - Chinese (incl. Cantonese) - Chiricahua (Apache) -
Commonwealth Antillean Creole French - Commonwealth Windward Islands
Creole English - Czech - Danish - Dutch - English (Modern) - English
(Old & Middle) - Esperanto - Estonian - Farsi - Finnish - French -
French Antillean Creole French - Frisian - Gaelic - Georgian - German -
Gothic - Greek (Classic and Modern) - Gujarati - Gulf of Guinea Creole
Portuguese - Guyana Creole English - Guyanais (Creole French) - Haitian
(Creole French) - Hebrew - Hindi - Hungarian - Icelandic (incl. Old
Norse) - Indoeuropean - Indonesian - Irish (incl. Ogamic, Old &
Middle Irish) - Italian - Jamaican Creole English - Japanese - Karelian
- Korean - Krio (Sierra Leone Creole English) - Kru (Liberian Pidgin
English) - Latin - Latvian - Leeward Islands Creole English -
Lithuanian - Livonian - Louisiana Creole French - Macaísta (Macau
Creole Portuguese) - Malay - Maltese - Mambila - Manx - Mari (Eastern
Meadow) - Mauritian Creole (Isle de France CF) - Mescalero (Apache) -
Miskito Creole English - Mitchif (French-Cree mixed language) - Nahuatl
- Neapolitan - Negerhollands (Creole Dutch) - Norwegian - Occitan -
Palenquero (Creole Spanish) - Panjabi - Polish - Portuguese (incl.
Brazilian & Galego-Portuguese) - Romanian - Russian - Sardinian -
Saxon (Old) - Scots - Serbo-Croatian - Singhalese - Slavonian (Old Church
Slavonian) - Slovak - Slovene - Spanish - Sumerian - Swahili - Swedish
- Tagalog - Taino - Tamil - Tetun (East Timorese) - Thai - Tibetan -
Tocharian (A & B) - Tok Pisin (Creole English) - Turkish -
Ukrainian - Upper Guinea Creole Portuguese - Urdu - Uzbek - Veps -
Vietnamese - Virgin Islands Creole English - Welsh - West African
Pidgin English.
CLEF (Cross Language Evaluation Forum)
- The European TREC
- selected links
about:
- other evaluation forums
- freely available resources:
- ODS (United Nations Official
Documents): parallel documents in
Arabic, Chinese, English, French, Russian and Spanish
- SDA: German/French/Italian
- on-line translators
- on-line dictionaries
- ARCADE
- Conference about evaluation of
parallel text alignment systems