Corpora and other language resources

Tag sets

Penn TreeBank tag set

Reference: Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a large annotated corpus of English: the Penn Treebank. Comput. Linguist. 19, 2 (June 1993), 313–330.

| POS tag | description | example |
|---|---|---|
| CC | coordinating conjunction | and, or |
| CD | cardinal number | 3, three |
| DT | determiner | the, this |
| EX | existential there | there is |
| FW | foreign word | tabula |
| IN | preposition, subordinating conjunction | in, of, like |
| IN/that | that as subordinator | that |
| JJ | adjective | blue, happy |
| JJR | adjective, comparative | bluer, happier |
| JJS | adjective, superlative | bluest, happiest |
| LS | list marker | 1) |
| MD | modal | could, will |
| NN | noun, singular or mass | house |
| NNS | noun, plural | houses |
| NP | proper noun, singular | Carrie |
| NPS | proper noun, plural | Americans |
| PDT | predeterminer | both as in “both the girls” |
| POS | possessive ending | person’s |
| PP | personal pronoun | I, she, it |
| PPZ | possessive pronoun | my, his, your |
| RB | adverb | however, usually, naturally, here, good |
| RBR | adverb, comparative | better |
| RBS | adverb, superlative | best |
| RP | particle | up as in “give up” |
| SENT | sentence-break punctuation | . ! ? |
| SYM | symbol | / [ = * |
| TO | infinitive ‘to’ | to play |
| UH | interjection | aha |
| VB | verb be, base form | be |
| VBD | verb be, past tense | was, were |
| VBG | verb be, gerund/present participle | being |
| VBN | verb be, past participle | been |
| VBP | verb be, sing. present, non-3rd person | am, are |
| VBZ | verb be, 3rd person sing. present | is |
| VH | verb have, base form | have |
| VHD | verb have, past tense | had |
| VHG | verb have, gerund/present participle | having |
| VHN | verb have, past participle | had |
| VHP | verb have, sing. present, non-3rd person | have |
| VHZ | verb have, 3rd person sing. present | has |
| VV | verb, base form | take |
| VVD | verb, past tense | took |
| VVG | verb, gerund/present participle | taking |
| VVN | verb, past participle | taken |
| VVP | verb, sing. present, non-3rd person | take |
| VVZ | verb, 3rd person sing. present | takes |
| WDT | wh-determiner | which, who |
| WP | wh-pronoun | who, what |
| WP$ | possessive wh-pronoun | whose |
| WRB | wh-adverb | where, when |
| # | # | # |
| $ | $ | $ |
| '' | closing quotation marks | ’ ” |
| `` | opening quotation marks | ‘ “ |
| ( | opening brackets | ( { |
| ) | closing brackets | ) } |
| , | comma | , |
| : | punctuation | – ; : — … |
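
The verb tags listed above distinguish be (VB*), have (VH*), and all other verbs (VV*), whereas the tag set described in Marcus et al. (1993) uses a single VB series for all verbs and the names NNP, NNPS, PRP, and PRP$ for proper nouns and pronouns. The following is a minimal sketch of how tags from the list above could be collapsed onto those names, for example when comparing output from different taggers. The names `TO_PENN` and `to_penn` are hypothetical and not part of any tool; the mapping is lossy, since the be/have/other distinction cannot be recovered afterwards.

```python
# Hypothetical helper: collapse the fine-grained tags listed above onto the
# tag names used in Marcus et al. (1993). Illustrative only.
TO_PENN = {
    "NP": "NNP",       # proper noun, singular
    "NPS": "NNPS",     # proper noun, plural
    "PP": "PRP",       # personal pronoun
    "PPZ": "PRP$",     # possessive pronoun
    "IN/that": "IN",   # 'that' as subordinator is plain IN
    "SENT": ".",       # sentence-break punctuation
}


def to_penn(tag: str) -> str:
    """Return the coarser tag name for a tag from the list above."""
    if tag in TO_PENN:
        return TO_PENN[tag]
    # VH* (have) and VV* (other verbs) fold into the shared VB* series.
    if tag[:2] in ("VH", "VV"):
        return "VB" + tag[2:]
    # All remaining tags (CC, JJ, VBD, ...) keep their name.
    return tag


# Usage: convert a tagged sentence given as (token, tag) pairs.
tagged = [("Carrie", "NP"), ("has", "VHZ"), ("taken", "VVN"), ("it", "PP"), (".", "SENT")]
converted = [(token, to_penn(tag)) for token, tag in tagged]
```

Keeping the original tags alongside the converted ones is advisable if the be/have distinction is needed later.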

Corpora

| corpus title | size | time | source | language |
|---|---|---|---|---|
| British National Corpus (BNC) | 100 million tokens | mid 1970s - early 1990s | Oxford | British English |
| The Brown Corpus | 1 million tokens | 1961 | ICAME | American English |
| The Lancaster/Oslo-Bergen Corpus (LOB) | 1 million tokens | 1961 | ICAME | British English |
| International Corpus of English (ICE) | xxxxxx | varieties of world Englishes | International Corpus of English (ICE) at Zurich, CH | world English |
| Mark Davies' English Corpora | xxxxxx | diverse set of corpora | Mark Davies | American English, British English, international English |
| Text corpora in the DWDS | various | various | https://www.dwds.de/r | German |
| DWDS Kernkorpus | | 1900-1999 | Berlin-Brandenburgische Akademie der Wissenschaften: https://www.dwds.de/d/korpora/kern | German |
| DWDS Kernkorpus 21 | | 2000-2010 | Berlin-Brandenburgische Akademie der Wissenschaften: https://www.dwds.de/d/korpora/korpus21 | German |
| Hamburg Dependency Treebank | | German news site heise.de, articles published between 1996 and 2001 | http://hdl.handle.net/11022/0000-0000-7FC7-2 | German |
| IDS-Corpora | | | http://www.ids-mannheim.de/kt/corpora.html | German |
| LIMAS-Korpus | 1 million words, 500 texts / fragments | 1970s | http://www.korpora.org/Limas/ | German |
| Arabic News Texts Corpus (AntCorpus) | | | https://antcorpus.github.io/ | Arabic |
| Wortschatz Leipzig | various sample sizes | Arabic, English, French, German, Russian, misc. | https://wortschatz.uni-leipzig.de/de/download | various |
| Språkbanken Text | | | https://spraakbanken.gu.se/en/resources | Swedish |