TreeTagger

TUTORIAL STATUS: workable, but under revision for updates and extensions [2020-10-06]

TreeTagger version 3.2.3 Languages: English, German, Korean

The TreeTagger is a language-independent probabilistic part-of-speech tagger. It has been applied to many different languages, and models exist for:
English, French, German, Italian, Dutch, Spanish, Bulgarian, Russian, Portuguese, Galician, Chinese, Swahili, Slovak, Latin, Estonian, Korean, Old French …

Tagging with the TreeTagger

In its basic form, a call to the TreeTagger from the terminal looks like this:

tree-tagger -options- <parameter file> <input file> <output file>

But read on to learn more about how it works on your operating system.
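The general call pattern above can be sketched as a small Python helper that only assembles the command line. This is an illustration, not part of the TreeTagger distribution: the function name build_command and the file names are hypothetical, and -token and -lemma are assumed here as the options that make the tagger print the token and the lemma next to the tag. Note that the bare tree-tagger program expects input that is already tokenized (one token per line), which is why the wrapper scripts described below are the more convenient entry point.

```python
from typing import List, Optional, Sequence


def build_command(par_file: str,
                  in_file: str,
                  out_file: Optional[str] = None,
                  options: Sequence[str] = ("-token", "-lemma")) -> List[str]:
    """Assemble: tree-tagger -options- <parameter file> <input file> <output file>."""
    command = ["tree-tagger", *options, par_file, in_file]
    if out_file is not None:
        # Without an output file the tagger writes to the display.
        command.append(out_file)
    return command


# Placeholder file names, for illustration only.
print(" ".join(build_command("english.par", "input.txt", "output.txt")))
```

The resulting list can be handed to subprocess.run once the tagger is installed and on the system path.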

TreeTagger for Windows

Required downloads

Perl

In order to run the TreeTagger, you must install Perl on your machine. On my Windows machine, I am running

Strawberry Perl

But there are other distributions of Perl available, too, such as ActivePerl from ActiveState or Perl from perl.org. There is a 64-bit and a 32-bit version available, so you should either know or check which one is suitable for your system.

Chances are that on a modern Windows 10 machine you will be running a 64-bit Windows, so you will want to download a 64-bit Perl. It is best to check, so go to the control panel (German Systemsteuerung), double-click on System and look it up. Under System you will find a subentry System type that will tell you whether you are running a 64-bit or a 32-bit operating system. Download the appropriate version for your operating system.

Downloading TreeTagger and models

In order to run the TreeTagger on your Windows machine, you must download the software itself plus models for tagging and/or chunking data in different languages.

Download the Windows version of the TreeTagger as either a 64-bit or 32-bit version from:

https://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/#Windows

Unpack the zip-archive to the directory:

C:\TreeTagger

Note: it is possible to run the TreeTagger from other directories in your file system, but this will require some path modifications, e.g. to the TAGDIR path in the batch files for running the tagger. If you opt to do so, I trust that you know what you are doing and will be able to figure out what it takes to get it to run. Whatever you do, be sure that directory names in your TreeTagger paths do not contain any white space or special characters. White space or special characters in paths spell sure disaster for running the TreeTagger and, in fact, many other types of NLP software. If you insist on trying this anyway, you are on your own.

For this tutorial, I assume that you are running the TreeTagger from C:\TreeTagger, which is also what the author of the software suggests.

Next, download the parameter files for the languages you want to tag from the list of model files on the TreeTagger homepage under:

Parameter files

I suggest that as a starting point for English and German you download the following from the TreeTagger website and firmly stick with the UTF-8 files:

  • English parameter file (PENN tagset) (gzip compressed, UTF8, tagset documentation, trained on the Penn treebank)
  • English parameter file (BNC tagset) (gzip compressed, UTF8, tagset documentation, trained on the British National Corpus)
  • German parameter file (gzip compressed, UTF-8, tagset documentation)

Using 7zip or any other archiving software, unpack these to the TreeTagger sub-directory

C:\TreeTagger\lib

After unpacking, you should have files in the ..\lib subdirectory that are called:

english.par
english-bnc.par
german.par

You are now set to tag your first file. I am making available two example files here to get you started: one in English (English-text.txt) and one in German (German-text.txt). Even if you are only interested in tagging German, please start by tagging the English file first to test the tagger. I will tell you why in a minute.

Unpack the two text files to the ..\bin directory for simplicity's sake and for the demo, so you can enter a short command without the potential for errors in file paths. Start with a simple UTF-8 encoded plain text file such as these two example files.

Tagging English

Open a terminal such as the Windows Command Prompt (German Eingabeaufforderung) or PowerShell. Navigate to the directory C:\TreeTagger\bin

To tag an English text, run tag-english.bat followed by the name of the file you want to tag. The full command should look like this:

tag-english.bat English-text.txt

Hit RETURN after the command and watch your screen. You should now be seeing output in three columns on your screen:

token     POS tag   Lemma
People    NNS       People
are       VBP       be
walking   VVG       walk
their     PP$       their
dogs      NNS       dog
under     IN        under
the       DT        the
willow    NN        willow
trees     NNS       tree
.         SENT      .
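For further processing, the three columns can be read into a simple data structure. The following is a minimal Python sketch, assuming that the tagger separates the columns with a tab character and prints one token per line; the names TaggedToken and parse_tagged are hypothetical and chosen only for this illustration.

```python
from typing import List, NamedTuple


class TaggedToken(NamedTuple):
    token: str
    pos: str
    lemma: str


def parse_tagged(text: str) -> List[TaggedToken]:
    """Split tagger output (one token per line, tab-separated) into rows."""
    rows = []
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # skip blank lines and anything that is not a 3-column row
        rows.append(TaggedToken(*parts))
    return rows


# A short sample in the shape of the output shown above.
sample = "People\tNNS\tPeople\nare\tVBP\tbe\nwalking\tVVG\twalk\n"
for row in parse_tagged(sample):
    print(row.token, row.pos, row.lemma)
```

Rows that do not have exactly three columns are skipped silently here; for real data you may prefer to log them instead.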

Note: much like many other command line tools, the TreeTagger writes its output to the display by default. If you would like to write your output to a file, the simplest way is to use the redirection operator (>) followed by an output file name. Unless you specify another path, the file is written to the directory from which you are running the tagger. Like this:

tag-english.bat English-text.txt > English-text-output.txt
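If you prefer to drive the tagger from a script, the same redirection can be sketched in Python. This is only an illustration under stated assumptions: the helper names output_path and tag_to_file are hypothetical, the batch file is started through cmd /c, and the script is assumed to run from C:\TreeTagger\bin, where the batch file and the example text live.

```python
import subprocess
from pathlib import Path
from typing import List


def output_path(infile: Path) -> Path:
    """English-text.txt -> English-text-output.txt (naming used in this tutorial)."""
    return infile.with_name(infile.stem + "-output" + infile.suffix)


def tag_to_file(command: List[str], infile: Path) -> Path:
    """Run a tagging command on infile and write its standard output to a file."""
    outfile = output_path(infile)
    with open(outfile, "wb") as out:  # raw bytes: Python does not re-encode anything
        subprocess.run(command + [str(infile)], stdout=out, check=True)
    return outfile


if __name__ == "__main__":
    # Assumption: started from C:\TreeTagger\bin on Windows.
    script = Path("tag-english.bat")
    text = Path("English-text.txt")
    if script.exists() and text.exists():
        print("written:", tag_to_file(["cmd", "/c", str(script)], text))
```

Because the output file is opened in binary mode, the bytes produced by the tagger end up in the file unchanged.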

If all has gone well up to this point, you are now ready to tag data in another language, for example German.

Tagging German

Tagging German works much the same way as tagging English, except that you need to use a model trained on German language data, which we have already downloaded and unpacked in the ..\lib directory. So in order to tag German-text.txt, you will want to call the batch file for tagging German like this:

tag-german.bat German-text.txt

Run this command from your terminal (Command Prompt or PowerShell) in the directory ..\bin. Pay close attention to the output on your screen. It may look like this:

Figure 1: TreeTagger output to a default shell

You may guess that what has happened here is a bit of a mess-up with special characters such as the German Umlauts. This is due to the standard code page (think of this as the default language setting for your OS and, hence, the terminal) on your machine being set to a legacy code page such as 850. That code page does not match the UTF-8 encoding of the tagger output, so the bytes of each Umlaut are displayed as the wrong characters. For regular operating system stuff like software installations and logging, this will normally not be an issue, but for linguistic processing it sometimes does become an issue. Fortunately, it is one that can be easily overcome and that at any rate does not cause any damage as long as you are writing your output to a file. So when tagging text with Umlauts or other diacritics, or in non-Latin writing systems such as Korean, please write your output straight to a file. Afterwards, check that the output written to the file is ok before embarking on larger annotation tasks on multiple files. Thus, for German, use this command:

tag-german.bat German-text.txt > German-text-output.txt

The output in the resulting file should look like this:

Das          ART       die
Ömchen       NN        Ömchen
führt        VVFIN     führen
das          ART       die
Hündchen     NN        Hündchen
am           APPRART   an+die
Leinchen     NN        Leinchen
außerhalb    APPR      außerhalb
des          ART       die
Gärtchens    NN        Gärtchen
aus          PTKVZ     aus
.            $.        .
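Before tagging many files, it is worth checking that the output file really is UTF-8. The following Python sketch does a simple version of that check; the function name is_valid_utf8 is hypothetical. It only tests whether the file decodes as UTF-8, so it cannot detect every possible problem, and you should still look at a few Umlauts in an editor.

```python
from pathlib import Path


def is_valid_utf8(path: str) -> bool:
    """Return True if the whole file decodes as UTF-8 without errors."""
    try:
        Path(path).read_bytes().decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    name = "German-text-output.txt"  # output file name used in this tutorial
    if Path(name).exists():
        print(name, "is valid UTF-8:", is_valid_utf8(name))
```

If the check fails for a file that was redirected from Windows PowerShell rather than from the Command Prompt, one possible cause is that older PowerShell versions write redirected output as UTF-16; redirecting from the Command Prompt avoids this.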

If you want to fix this in the terminal because, like me, you want to demonstrate the TreeTagger live in the terminal including output written to the screen, or if you run into more encoding problems, it may help to temporarily change the code page for a single terminal session. For UTF-8 encoded text such as the German example, you will want code page 65001 (UTF-8), which you can set for the current session by typing chcp 65001.
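The garbling itself can be reproduced in a few lines of Python, which may help to see why it is harmless for the data. The sketch below assumes a terminal that interprets bytes as code page 850; the function name as_seen_in_terminal is hypothetical. Code page 850 does contain the German Umlauts, but in UTF-8 each Umlaut is stored as two bytes, and a code page 850 terminal shows each of those bytes as a separate character.

```python
def as_seen_in_terminal(text: str, code_page: str = "cp850") -> str:
    """Encode text as UTF-8, then decode the same bytes with a legacy code page."""
    return text.encode("utf-8").decode(code_page)


word = "Hündchen"
garbled = as_seen_in_terminal(word)
print(word, "->", garbled)
print("UTF-8 bytes:", len(word.encode("utf-8")), "for", len(word), "characters")

# The bytes are untouched, so reversing the two steps restores the word.
print(garbled.encode("cp850").decode("utf-8"))
```

The last line shows that nothing is lost: only the display is wrong, which is why output written to a file is unaffected.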

CAUTION: Please NEVER EVER, under any circumstances, change the code page settings for your operating system unless you absolutely know what you are doing and why. Changing the code page of your operating system is a sure-fire way of permanently damaging your operating system to the point where you will have to reinstall! Don't say I did not warn you!

Path settings

This step is not obligatory, but recommended if you are planning to use the TreeTagger productively in the long run and for more than just 'taking a look'.

CAUTION: Do not alter any settings other than what I describe. Note: I take no responsibility for any damage you do to your system. If in doubt, ask someone who can guide you through this.

Ultimately, I suggest you add the TreeTagger to your system path so that the TreeTagger can be called from other directories. Again, these are settings that you will want to change with a bit of caution. In order to add the TreeTagger path to your system path, first of all copy the path to the binaries directory of your TreeTagger installation:

C:\TreeTagger\bin

Then go to your control panel (German Systemsteuerung) » System » Advanced system settings » Environment Variables …

In the window in the lower half of the screen under System variables select the line with Path » click on Edit … and this will open up a sort of table with different system paths that are pre-set by other software installed on your operating system. Do NOT change or delete anything else here; again, you can do some serious damage to the functioning of other software. These paths are required by your system. Leave them alone.

At the top of this window with the table, click on New and then add the above-mentioned TreeTagger path to the list of your paths. Click OK, close all of the dialogs and you should be good to go.
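To check whether the path setting has taken effect, you can ask the operating system where it finds the tagger. The following Python sketch uses shutil.which from the standard library; the helper name locate is hypothetical, and the command names are assumptions based on this tutorial (tag-english.bat on Windows, tree-tagger on macOS and Linux). Open a new terminal first, since terminals that were already open do not pick up the changed path.

```python
import shutil
from typing import Dict, Iterable, Optional


def locate(commands: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each command name to its full path, or None if it is not on the path."""
    return {name: shutil.which(name) for name in commands}


# Assumption: tag-english.bat (Windows) or tree-tagger (macOS / Linux).
for name, path in locate(["tag-english.bat", "tree-tagger"]).items():
    print(name, "->", path or "not found on PATH")
```

If the relevant command is reported as not found, re-check the entry you added under Environment Variables for typos.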

If you are attending one of my courses and would like any assistance with this, please contact me or one of the team members / tutors.

TreeTagger for MacOS X and UNIX-like OSes

Download the tagger package, the tagging scripts and the installation script from the TreeTagger page. Put them into a directory /TreeTagger.

Do not unpack any of the files. Navigate to this directory and execute the installation script by typing

bash install-tagger.sh

To save you from having to always type the entire TreeTagger path, add the TreeTagger sub-directories

TreeTaggerDirectory/bin

and

TreeTaggerDirectory/cmd

to the PATH variable of the operating system. To do this type:

PATH=$PATH:/PathToTreeTaggerDirectory/bin:/PathToTreeTaggerDirectory/cmd
export PATH

To check that this has actually happened, type:

echo $PATH