
Corpus analysis basics

tutorial status: under construction (2019-07-22)

Corpus linguistics has two principal technological mainstays:

  • the digital corpus, i.e. a principled collection of plain text data that forms the data basis of the research
  • tools for search over patterns in the data, for data sorting and display

These two form the backbone of any corpus linguistic research. Digital corpora are available from various sources. In their simplest form, corpora consist of sets of plain text (.txt) files, sometimes containing just the text, sometimes containing text plus annotation such as part-of-speech tagging. For the time being, we are going to work with unannotated plain text corpora.

Tools for searching over patterns in the data offer two basic functions. The first is the creation of lists of words and their frequencies, so-called word-frequency lists. These are useful for getting a sense of what kind of linguistic data a corpus contains and how many instances of particular words and constructions it includes. The other main function is concordancing, i.e. the possibility of searching for words in the corpus and displaying them together with a set amount of context to the left and right of the search word(s). These displays are commonly called key-word-in-context (KWIC) concordances. The two functions, word (frequency) listing and concordancing, are often implemented in the same software because they are used hand in hand. We are going to look at AntConc as an example of a commonly used concordancer, but be aware that there are others out there as well.
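To make the idea of a word-frequency list concrete, here is a minimal Python sketch. It is not how AntConc works internally; the function name word_frequencies and the tokenisation rule (a word is a run of letters, optionally with internal apostrophes) are simplifying assumptions made for illustration only.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Return a Counter mapping lowercased word forms to their counts."""
    # Assumption: a "word" is a run of ASCII letters, optionally with
    # internal apostrophes (e.g. "don't"). Real tools offer more options.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    return Counter(tokens)


# A short sample sentence standing in for a corpus file.
sample = "Ask not what your country can do for you; ask what you can do for your country."

freq = word_frequencies(sample)

# Print the five most frequent word forms, one per line: count, then word.
for word, count in freq.most_common(5):
    print(f"{count}\t{word}")
```

For a real corpus you would read each .txt file into a string first and add up the counters of all files.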

This tutorial offers a first introduction to corpus analysis. It introduces basic techniques for exploring digital corpora by means of computational tools such as AntConc. Please download a version of AntConc that is appropriate for your operating system. There is not much to install here: AntConc ships as a single executable file. All you have to do is copy it to a directory of your choice. Personally, I prefer to have a tools directory where I install all my linguistics tools so that I can use them independently of any particular class. I tend to install all my tools to

C:\Users\Public\utility

so I install AntConc to

C:\Users\Public\utility\AntConc

Next, download our corpus sampler of American Inaugural Speeches from 1961 to 2017 from Moodle. Unpack it to a directory of your choice with a tool such as 7-Zip. I tend to keep all of my corpora in one directory with a subdirectory for each corpus, such as

C:\Users\Public\corpora\american_inaugural_speeches_1961_2017

After you have done that, open AntConc by double-clicking on the .exe file; click File » Open and select a set of .txt files from a corpus on your hard disk. Please avoid white space and special characters in file and directory names, as they can cause hard-to-trace errors in many processing scenarios. Should you have white space and the like in directory or file names, it mostly helps to put the entire path and filename into quotation marks, but more about that later.
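If you want to check a corpus directory for problematic names before loading it, a small script can do that for you. The following is a sketch under a stated assumption: a "safe" name consists only of ASCII letters, digits, dots, underscores and hyphens. The function names is_safe_name and find_risky_names, as well as the second example file name, are made up for illustration.

```python
import re
from pathlib import Path

# Assumption: only ASCII letters, digits, dot, underscore and hyphen are "safe".
SAFE_NAME = re.compile(r"[A-Za-z0-9._-]+")


def is_safe_name(name):
    """Return True if a single file or directory name uses only safe characters."""
    return SAFE_NAME.fullmatch(name) is not None


def find_risky_names(corpus_dir):
    """Return sorted relative paths below corpus_dir that contain an unsafe name."""
    root = Path(corpus_dir)
    risky = []
    for path in root.rglob("*"):
        rel = path.relative_to(root)
        # A path is risky if any of its components contains an unsafe character.
        if not all(is_safe_name(part) for part in rel.parts):
            risky.append(rel.as_posix())
    return sorted(risky)


print(is_safe_name("american_inaugural_speeches_1961_2017"))  # True
print(is_safe_name("inaugural speeches (final).txt"))         # False
```

Calling find_risky_names on your corpus directory returns every file or subdirectory you may want to rename; an empty list means that nothing needs attention.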

The AntConc main window

The AntConc main window is organized into three principal areas plus the menu bar at the top of the window, which contains the drop-down menus such as the standard File menu as well as the software-specific Global Settings, Tool Preferences and Help.

The area on the left, underneath File, is the Corpus Files area, where the files you load into AntConc for analysis are displayed.

After clicking on File, choose Open File(s) to select individual files or Open dir … to select an entire directory of files. The default file type opened by AntConc is plain text (.txt).

Concordancing function

To the right of this there is a large area that is organized into tabs ranging from the main Concordance to Word List and Key-Word List; we will explore all of these in due course. Below this area you will find the box where you can enter a Search Term. There are many options here, and a few presets are already selected to get you started. Once you have loaded your corpus files into the Corpus Files area, you can simply type a search word into the search box and hit Start to display a concordance of the occurrences of the search word in the loaded corpus.
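What a concordancer does when you hit Start can be approximated in a few lines of Python. This is only a sketch of the KWIC idea, not AntConc's actual implementation; the function name kwic, the character-based context window and the square brackets around the hit are assumptions chosen for illustration.

```python
import re


def kwic(text, keyword, width=30):
    """Return one KWIC line per hit: left context, [keyword], right context."""
    # Collapse line breaks and repeated spaces so the columns line up.
    text = " ".join(text.split())
    # Match the keyword as a whole word, ignoring case.
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so that all hits sit in one column.
        lines.append(f"{left:>{width}} [{match.group()}] {right}")
    return lines


# A short sample sentence standing in for a loaded corpus file.
sample = "Ask not what your country can do for you; ask what you can do for your country."

for line in kwic(sample, "country", width=20):
    print(line)
```

The context here is measured in characters; a window measured in words would work just as well. To search a real file, read it into a string first, for example with Path("speech.txt").read_text(encoding="utf-8"), where speech.txt is a placeholder file name.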

Word frequency list

Clusters / N-Grams