Workshop on Annotation and Query for Corpus Linguistics (AnQCor)

Annotation and query are a mainstay of corpus linguistic research. Corpus studies can benefit greatly from high-quality annotation. It is a means of making linguistic expertise and intuitions explicit and thus accessible to the analyst and to further steps in the research process. Yet, in order to harness the linguistic knowledge provided by corpora and annotations, linguists must also be able to formulate or operationalise their linguistic questions and address them to the corpus.

This workshop offers students of linguistics and other philologies a hands-on tutorial in annotation and query techniques for the analysis of digital corpora. We are going to explore techniques of automatic and semi-automatic annotation together with matching techniques for querying corpus data. Students will learn to use the Stanford NLP Part of Speech Tagger[1] as an example of an automatic annotation process that they can apply to their own corpora. We will also study the structure of linguistic data in plain-text and annotated corpora and the resulting affordances for operationalising linguistic research questions as queries. The query techniques explored in the workshop include the integrated search for linguistic forms and annotations by means of regular expressions, as implemented in standard tools such as text editors (e.g. Notepad++[2]) and widely used concordancing software (e.g. AntConc[3]). The working examples in the workshop are drawn from a corpus of political speeches.
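To give a rough idea of what an integrated search for forms and annotations can look like, here is a minimal sketch in Python. It assumes that the tagged text uses the word_TAG layout with Penn Treebank tags; the sample sentence, the pattern name ADJ_NOUN and the function find_adj_noun are made up for illustration and are not taken from the workshop files. A similar pattern can be tried in a text editor or concordancer with regular expressions switched on, although details of the regex syntax may differ between tools.

```python
import re

# Hypothetical sample in word_TAG format (assumed layout, Penn Treebank tags).
# This sentence is invented for illustration; it is not from the workshop corpus.
tagged = "We_PRP need_VBP a_DT strong_JJ economy_NN and_CC a_DT fair_JJ society_NN ._."

# Query: an adjective (JJ, JJR, JJS) immediately followed by a noun
# (NN, NNS, NNP, NNPS). The word form is everything before the underscore.
ADJ_NOUN = re.compile(r"(?<!\S)([^\s_]+)_JJ[RS]?\s+([^\s_]+)_NNP?S?\b")

def find_adj_noun(text):
    """Return (adjective, noun) pairs found in a word_TAG string."""
    return ADJ_NOUN.findall(text)

print(find_adj_noun(tagged))
# → [('strong', 'economy'), ('fair', 'society')]
```

The query combines a constraint on the annotation (the tags after the underscore) with a capture of the linguistic form (the word before it), which is the basic idea behind operationalising a question such as "which adjectives modify which nouns?".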

Tools used in the workshop:

Tool tutorials

linguisticsweb.org offers tutorials for the tools mentioned above. Please make sure you consult those in order to learn more about usage, settings and options:

Workshop files

There are some files you will need to download in order to use the techniques taught in the workshop:

Additional requirement:

The Stanford Tools are based on Java and thus require the Java SE Development Kit (JDK) 8, not just the Java Runtime Environment (JRE), which is installed on many machines by default. You can download a current version from the Oracle Java website at the following link:

Oracle Java SE Development Kit 8

  • 32-bit version for 32-bit Operating Systems (choose this one if in doubt; it also works on 64-bit systems)
  • 64-bit version for 64-bit Operating Systems
  • remember to agree to the licence agreement at the top of the table
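
If you are unsure whether a JDK (rather than only a JRE) is already set up on your machine, a quick check along the following lines may help. This is only a sketch under one assumption: a JDK installation makes both the java launcher and the javac compiler available on the system PATH, whereas a plain JRE does not include javac. The function name jdk_available is made up for illustration.

```python
import shutil

def jdk_available(which=shutil.which):
    """Return True if both 'java' and 'javac' can be found on the PATH.

    The 'which' parameter only exists so that the lookup can be replaced
    in a test; by default the real PATH is searched.
    """
    return which("java") is not None and which("javac") is not None

if __name__ == "__main__":
    if jdk_available():
        print("java and javac found on PATH")
    else:
        print("java and/or javac not found on PATH; a JDK may be missing")
```

A negative result does not necessarily mean that no JDK is installed; it may simply not have been added to the PATH.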