STATUS: under construction
In many cases, the first step in corpus building is to clean up and normalize texts that arrive in formats unsuited to the processing pipeline we have in mind. This may be because the texts are gathered from diverse sources, or because they come as HTML, PDF, RTF, DOCX or any other of the multitude of file types available on modern computer systems. This is typically the starting point when building a bespoke corpus for a specific research undertaking. In my team, we call this process 'textputzing'[1].
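To make the idea concrete, here is a minimal sketch of what one such cleaning pass might look like for a single HTML input, using only the Python standard library. It is an illustration under assumptions, not our team's actual pipeline: the function name `html_to_plain_text`, the helper class `_TextExtractor`, the `BLOCK_TAGS` set and the choice of NFC normalization are all hypothetical, and formats such as PDF or DOCX would need dedicated tools.

```python
import re
import unicodedata
from html.parser import HTMLParser

# Hypothetical, incomplete set of tags treated as word boundaries.
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "td", "h1", "h2", "h3"}


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.parts.append(" ")  # keep adjacent blocks from fusing

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            if self._skip_depth:
                self._skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_plain_text(markup: str) -> str:
    """Strip tags, normalize Unicode to NFC, and collapse whitespace."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any buffered text
    text = unicodedata.normalize("NFC", "".join(parser.parts))
    # \s also matches non-breaking spaces in Python 3 str patterns.
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    sample = "<p>Hello&nbsp;<b>wor</b>ld</p><script>x=1</script><p>Bye</p>"
    print(html_to_plain_text(sample))
```

Collapsing all whitespace into single spaces discards paragraph boundaries, which may or may not be acceptable depending on the corpus; a real pass would likely preserve them.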
The process usually requires the following steps:
[1] The term 'textputzing' was coined in the course of a rather extensive corpus cleaning exercise.