Corpus encoding formats

Status of the tutorial: in progress


The Text Encoding Initiative (TEI) develops guidelines (Sperberg-McQueen & Burnard 1994) for standardising encoding formats for text interchange.

Originally, the TEI Guidelines were implemented in the Standard Generalized Markup Language (SGML); later versions are based on the eXtensible Markup Language (XML).

In the TEI Guidelines, an individual text or document is made up of two parts: a header and the text itself. The two parts can be stored either in a single file or in two separate files that reference each other.

The header contains meta-information about the text, such as the author, title, year and edition, as well as information about the encoding practices used.
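
The split into header and text can be illustrated with a small sketch. The sample document below is hypothetical and heavily simplified: the element names (teiHeader, fileDesc, titleStmt, text, body, p) follow the later XML-based TEI Guidelines, but the namespace and several header elements that a real TEI file would need are left out. The function name read_header_and_text is likewise only chosen for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified TEI-style document (assumption: no namespace,
# and only a minimal subset of header elements is shown).
SAMPLE = """\
<TEI>
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Sample Text</title>
        <author>Jane Doe</author>
      </titleStmt>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p>First paragraph.</p>
      <p>Second paragraph.</p>
    </body>
  </text>
</TEI>
"""


def read_header_and_text(xml_string):
    """Return (metadata from the header, list of paragraph strings)."""
    root = ET.fromstring(xml_string)

    # Part 1: the header holds meta-information about the text.
    header = root.find("teiHeader")
    metadata = {
        "title": header.findtext("fileDesc/titleStmt/title"),
        "author": header.findtext("fileDesc/titleStmt/author"),
    }

    # Part 2: the text itself, here reduced to its paragraphs.
    paragraphs = [p.text for p in root.findall("text/body/p")]
    return metadata, paragraphs


metadata, paragraphs = read_header_and_text(SAMPLE)
print(metadata)
print(paragraphs)
```

Here both parts live in a single file; when header and text are stored separately, the same two lookups would simply run on two parsed documents.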

Two devices are employed within the header and the text: tags and entity references.

Texts are assumed to be made up of elements such as chapters, paragraphs, sentences and words.
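
As a rough sketch of how tags and entity references appear in practice, the hypothetical fragment below marks a chapter, a paragraph, sentences and words with tags (div, p, s and w, as in the XML-based TEI Guidelines) and uses two references: the numeric character reference &#233; for "é" and the predefined XML entity &amp; for "&". The sample content is invented for illustration only.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical fragment: tags delimit the elements, and references such as
# &#233; and &amp; stand in for characters in the running text.
FRAGMENT = """\
<div type="chapter">
  <p>
    <s><w>Caf&#233;</w> <w>&amp;</w> <w>bar</w></s>
    <s><w>Open</w> <w>late</w></s>
  </p>
</div>
"""

root = ET.fromstring(FRAGMENT)

# Count how often each element type occurs (root element included).
tag_counts = Counter(element.tag for element in root.iter())

# The parser resolves the references, so the word list holds plain characters.
words = [w.text for w in root.iter("w")]

print(dict(tag_counts))
print(words)
```

Note that a standard XML parser only resolves numeric character references and the predefined entities on its own; any other named entity has to be declared, typically in a document type definition.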

  • XCES