Corpora

This section lists corpora together with their availability, language(s), size, annotations, time span, and source.

| Corpus | Availability | Language(s) | Size | Annotations | Time | Source |
| DWDS-Kernkorpus | browseable via DWDS | GER | 100 Mio | POS-Tagging, lemma | 1900-2000 | DWDS |
| Juilland-D-Korpus | browseable via DWDS | GER | 0.5 Mio | POS-Tagging, lemma | 1920-1939 | DWDS |
| DDR-Korpus | browseable via DWDS | GER | 9 Mio | POS-Tagging, lemma | 1949-1990 | DWDS |
| C4-Korpus | browseable via DWDS | GER | 25.8 Mio | POS-Tagging, lemma | 1920-1939 | DWDS |
| Berliner Zeitung-Korpus | browseable via DWDS | GER | 252 Mio | POS-Tagging, lemma | 1994-2005 | DWDS |
| ZEIT-Korpus | browseable via DWDS | GER | 460 Mio | POS-Tagging, lemma | 1946-2009 | DWDS |
| Der Tagesspiegel-Korpus | browseable via DWDS | GER | 170 Mio | POS-Tagging, lemma | 1996-2005 | DWDS |
| Potsdamer Neueste Nachrichten-Korpus | browseable via DWDS | GER | 15 Mio | POS-Tagging, lemma | 2003-2005 | DWDS |
| Korpus jüdischer Periodika | browseable via DWDS | GER | 26 Mio | POS-Tagging, lemma | 1887-1938 | DWDS |
| Wendekorpus | browseable via DWDS | GER | 281471 | POS-Tagging, lemma | 1993-1996 | DWDS |
| Falko | browseable via Annis2 | GER |  | POS-Tagging, lemma, learner errors | since 2004 | Humboldt University |
| Deutsches Referenzkorpus (!DeReKo) | browseable | GER | 4 GB |  | since 1960s | IDS Mannheim |
| Tiger Treebank | TUD-linglit | GER | 900k |  |  | IMS Stuttgart |
| Negra |  | GER | 355k | POS-Tagging (STTS) |  | Uni Saarland |
| SALSA |  | GER | 700k |  |  | Uni Saarland |
| TüBa D/Z |  | GER | 976k |  | 1986-1999 | Uni Tübingen |
| Penn Treebank | US$3150 | EN | 1 Mio |  |  | LDC |
| British National Corpus (BNC) | TUD-lingLit | EN-UK | 100 Mio | POS-Tagging (CLAWS) | ~1980s | Univ. Oxford |
| Corpus of Contemporary American English (COCA) | browseable | EN-US | 410 Mio | POS-Tagging | 1990-2010 | Brigham Young University |
| Corpus of Historical American English (COHA) | browseable | EN-US |  | POS-Tagging | 1810-2009 | Brigham Young University |
| American National Corpus (ANC) | US$75 | EN-US | 22 Mio |  | post-1989 | LDC |
| Open American National Corpus (OANC) | free of charge | EN-US | 15 Mio |  | post-1989 | ANC |
| North American News Text Corpus | US$300 | EN-US | 350M |  | 1994-1997 | LDC |
| EuroParl | free of charge | DE, EN + | 44M, 50M |  | 1996-2009 | Philipp Koehn |
| JRC-ACQUIS | free of charge | DE, EN + | 32M, 34M |  |  | JRC |
| PukWaC | free of charge | EN | 2G |  |  | WaCKy Initiative |
| deWaC | free of charge | GER | 1.7G |  |  | WaCKy Initiative |
| ukWaC | free of charge | EN | 2G |  |  | WaCKy Initiative |
| WaCkypedia_EN | free of charge | EN | 800M |  |  | WaCKy Initiative |
| Google !Web1T n-grams | TUD-UKP | DE, EN |  |  |  | LDC |
| Vienna-Oxford International Corpus of English (VOICE) | free of charge | EN | 162.3 MB | VOICE mark-up, spelling | 2001-2007 | VOICE project, University of Vienna |
| VU Amsterdam Metaphor Corpus | free of charge | EN | 33 MB | metaphors | ~1980s | University of Amsterdam |
| Speech, Thought and Writing Presentation Corpus (STWP) | free of charge via OTA | EN | 260000 | SW&TP | 20th cent. | STWP, University of Lancaster |
| GerManC. A Historical Corpus of German Newspapers 1650-1800 | free of charge via OTA | GER | 9000 |  | 1650-1800 | University of Manchester |
| The Lancaster Newsbooks Corpus | free of charge via OTA | EN | 7.57 MB | XML | 17th cent. | Tony McEnery, Andrew Hardie, University of Lancaster |
| French Learner Language Oral Corpora (FLLOC) | free of charge | FR | 20 MB | CHAT transcription format |  | Florence Myles, University of Newcastle |
| Hamburg Corpus of Polish in Germany (!HamCoPoliG) | free of charge | PL |  | HIAT transcription format | 2008-2011 | University of Hamburg |
| Dortmunder Chat-Korpus | browseable | GER | 1.06 Mio | XML |  | Angelika Storrer and Michael Beißwenger |
| BMW-Forum-Korpus | browseable | GER |  | tokenized |  | Anke Lüdeling and Julia Richling |
| Polytechnic of Wales Corpus | ICAME | EN-UK | 65,000 tokens | SFL | 1978-1984 | Robin Fawcett, Michael Perkins |
| Darmstädter Korpus Deutscher Fachsprachen | browseable | GER | 4 Mio | tokenized, PoS tagged | 1990s | Leslie L. Siegrist, Sabine Bartsch |