The concept of the token, also known as a running word, captures each individual occurrence of a word form in a data sample, i.e. a corpus or text. Each string of letters or alphanumeric characters is defined as one token; depending on the counting convention, each punctuation mark may also be counted as one token. If the same word form occurs more than once in a sample, each occurrence is counted as a separate token.

Example: The woman on the hill with the telescope.

Number of tokens: 8 excluding the punctuation mark; 9 including the punctuation mark.
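
The count for the example sentence can be reproduced with a minimal Python sketch. The function name `count_tokens` and its `include_punctuation` flag are hypothetical, chosen for illustration only. The sketch assumes that a word token is a run of letters, digits or underscores (`\w+`) and that every single punctuation character is its own token.

```python
import re

def count_tokens(text, include_punctuation=False):
    """Count running words, optionally counting each punctuation mark too."""
    # \w+ matches a run of word characters (one word token);
    # [^\w\s] matches a single character that is neither a word
    # character nor whitespace (one punctuation token).
    pattern = r"\w+|[^\w\s]" if include_punctuation else r"\w+"
    return len(re.findall(pattern, text))

sentence = "The woman on the hill with the telescope."
print(count_tokens(sentence))                            # 8
print(count_tokens(sentence, include_punctuation=True))  # 9
```

Note that the three occurrences of "the" (including the capitalised "The") contribute three tokens, since every occurrence is counted separately.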

The process of identifying tokens in the data is called tokenisation. It can be implemented differently depending on the research question, for example in whether punctuation marks are counted as tokens.
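
How much the implementation matters can be illustrated with a short sketch. The three tokenisers below are hypothetical examples, not standard definitions, and the sample sentence is invented for illustration. Each one encodes a different assumption about apostrophes, hyphens and punctuation, so the same text yields different token counts.

```python
import re

def tokenise_whitespace(text):
    # Assumption: tokens are whatever is separated by whitespace,
    # so punctuation stays attached to the neighbouring word.
    return text.split()

def tokenise_words(text):
    # Assumption: word-internal apostrophes and hyphens belong to the
    # word; all other punctuation is discarded.
    return re.findall(r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*", text)

def tokenise_split_all(text):
    # Assumption: every punctuation character, including apostrophes
    # and hyphens, is a token of its own.
    return re.findall(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]", text)

sample = "She doesn't like well-known songs."
for tokeniser in (tokenise_whitespace, tokenise_words, tokenise_split_all):
    tokens = tokeniser(sample)
    print(tokeniser.__name__, len(tokens), tokens)
```

Which scheme is appropriate depends on the research question: a study of contractions, for instance, might want "doesn't" kept as one token, while other analyses might prefer to split it.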

context concepts: lemma, type, tokenisation, type-token ratio, TTR, word form