corpus

Latin for "body"; in principle, any collection of more than one text. In modern linguistics, a principled collection of natural language material. A corpus can contain samples or whole texts, e.g. sentences, conversations, or books, and is often enriched with additional information such as annotations (POS tags, parsing, etc.). The primary purpose of a corpus is not to preserve and display text, as in an archive or a library; a corpus is compiled with a particular linguistic purpose in mind. Corpora can be employed for research in all fields of linguistics, e.g. lexicology, sociolinguistics, lexicography, discourse studies, or applied linguistics.

Typical characteristics:

  • finite size of the individual texts (described as a sample corpus); exception: monitor corpus
  • sampling and representativeness of some language or text type
  • machine-readable form, stored as an electronic database
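
As a rough illustration of what "machine-readable" and "annotated" can mean in practice, the following minimal sketch stores a toy corpus as token/POS-tag pairs and counts tag frequencies. The texts, the tag labels, and the names `corpus`, `corpus_size`, and `tag_frequencies` are invented for this example and do not correspond to any real corpus or tagset.

```python
from collections import Counter

# Hypothetical toy corpus: two short "texts", each a list of (token, POS-tag) pairs.
# The tags are hand-assigned for illustration and follow no particular tagset.
corpus = {
    "text_01": [("The", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    "text_02": [("A", "DET"), ("dog", "NOUN"), ("barks", "VERB"), ("loudly", "ADV")],
}


def corpus_size(corpus):
    """Return the total number of tokens across all texts."""
    return sum(len(tokens) for tokens in corpus.values())


def tag_frequencies(corpus):
    """Count how often each POS tag occurs across all texts."""
    counts = Counter()
    for tokens in corpus.values():
        counts.update(tag for _, tag in tokens)
    return counts


freqs = tag_frequencies(corpus)
print("tokens:", corpus_size(corpus))
print("tag frequencies:", dict(freqs))
```

Real corpora are far larger and are usually stored in dedicated formats or databases, but the principle is the same: because the material and its annotations are machine-readable, questions such as "how often does each word class occur?" can be answered automatically.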

Types of corpora:

  • Diachronic Corpus
  • Synchronic Corpus
  • Written Corpus
  • Spoken Corpus
  • Multilingual Corpus
  • Parallel Corpus
  • Learner Corpus
  • Development Corpus
  • General Corpus
  • Specialized Corpus