The Stanford POS Tagger

author: Sabine Bartsch, Technische Universität Darmstadt

This tutorial builds on software and input from the Stanford PoS Tagger website.

Related tutorial: Stanford PoS Tagger: tagging from Python

1 What is the Stanford PoS Tagger?

The Stanford PoS Tagger is a probabilistic part-of-speech tagger developed by the Stanford Natural Language Processing Group. It is widely used in state-of-the-art applications in natural language processing.

The Stanford PoS Tagger is an implementation of a log-linear part-of-speech tagger. It is effectively language independent; its use on data of a particular language depends on the availability of a model trained on data for that language. Tagging models are currently available for English as well as Arabic, Chinese, French, Spanish, and German. They ship with the full download of the Stanford PoS Tagger.

2 Installation and requirements

Requirements: The Stanford PoS Tagger requires Java. As Java is widely used in corpus and computational linguistics and many programmes in this field require it, it is advisable to install the full Java JDK (Java Development Kit), which also includes the JRE (Java Runtime Environment). Please consult the following page to download this software, which is a system prerequisite for many corpus and computational linguistic applications: Open JDK.
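Whether Java is already available can be checked before going any further. The following is a minimal sketch in Python, assuming the Java launcher is called java and is expected on the PATH; the function names find_java and java_version are made up for this illustration.

```python
import shutil
import subprocess
from typing import Optional


def find_java(executable: str = "java") -> Optional[str]:
    """Return the full path of the Java launcher, or None if it is not on PATH."""
    return shutil.which(executable)


def java_version(executable: str = "java") -> Optional[str]:
    """Return the first line printed by `java -version`, or None if Java is missing."""
    path = find_java(executable)
    if path is None:
        return None
    # `java -version` traditionally writes to stderr, so both streams are collected.
    result = subprocess.run([path, "-version"], capture_output=True, text=True)
    lines = (result.stderr + result.stdout).strip().splitlines()
    return lines[0] if lines else None


if __name__ == "__main__":
    print(java_version() or "Java was not found on the PATH")
```

If the script reports that Java was not found, install a JDK first or add its bin directory to the PATH.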

The Stanford PoS Tagger does not require much of an installation. The following steps get you started in no time at all.

  1. Download the latest version from the following website: http://nlp.stanford.edu/software/tagger.shtml
  2. There are two download versions available: the basic English Stanford Tagger version 4.x.x and the full version of the Stanford Tagger version 4.2.x, which includes additional models for English as well as models for Arabic, Chinese, French, Spanish, and German.
  3. Unzip the .zip archive to a directory of your choice. Please make sure that the directory name contains no white space and that the path is not too long as this can cause problems keeping track of files and making backup copies.
  4. File locations: It is advisable to decide on a location for your linguistics tools. I put any tools that are not automatically installed under the default OS location in c:\Users\Public\utility\… with a subfolder for each tool (and version). I find this convenient as it keeps the directory path relatively short.
  5. The Stanford PoS Tagger requires a number of start-up parameters that call up its Java environment as well as the tagger, point to the resources required for processing different languages, and specify the input and output data formats. These are best stored in a batch file for later modification.
  6. The Stanford PoS Tagger also comes with a very simple Graphical User Interface that allows you to test its basic functionality. It is not intended for productive use, but you can part of speech tag an individual sentence to get a feel for the functionality.
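After unzipping, it can be useful to confirm that the directory contains what the commands in this tutorial rely on. The sketch below only checks for the two names used later in this tutorial, stanford-postagger.jar and the models folder; the function name missing_tagger_files is hypothetical, and the directory you pass in is whatever location you chose in step 3.

```python
from pathlib import Path
from typing import List


def missing_tagger_files(tagger_dir: str) -> List[str]:
    """Return the expected entries that are absent from tagger_dir."""
    base = Path(tagger_dir)
    # These two names are the ones the example commands in this tutorial refer to.
    expected = ["stanford-postagger.jar", "models"]
    return [name for name in expected if not (base / name).exists()]
```

An empty list means both entries were found; anything else names what is missing from the directory.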

3 Sample annotation project

In order to invoke the part-of-speech tagger, the following generic command-line parameters have to be supplied:

java -mx500m -classpath stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -model NAME-OF-MODEL -textFile infile.txt > outfile.txt

3.1 Input parameters

java invokes the Java runtime environment
-mx numerical value that assigns memory to the tagger; 500m equals 500 megabytes, which should be sufficient for most tagging tasks
-classpath (short form: -cp) path to the tagger program proper, i.e. stanford-postagger.jar
edu.stanford.nlp.tagger.maxent.MaxentTagger the Java class that runs the tagger
-model different models are available, but one has to be specified, e.g. models\english-left3words-distsim.tagger
-textFile for plain text input files
-xmlInput Example value: body (for the <body> element); the value specified here determines the element of an xml file whose contents are tagged.
-outputFormat format of the output: xml, tsv or slashTags
-tagSeparator separator placed between a word and its tag in the output; example value: \#
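The way these parameters combine into one call can be sketched as a small Python helper that only assembles the argument list and runs nothing. The function name build_tagger_command and its keyword arguments are assumptions made for this illustration; the flags themselves are the ones listed above.

```python
import os
from typing import List, Optional

# Main class of the tagger, as used in the commands of this tutorial.
TAGGER_CLASS = "edu.stanford.nlp.tagger.maxent.MaxentTagger"


def build_tagger_command(
    model: str,
    text_file: str,
    jar: str = "stanford-postagger.jar",
    memory: str = "500m",
    output_format: Optional[str] = None,
    xml_input: Optional[str] = None,
) -> List[str]:
    """Assemble the argument list for one tagger run (nothing is executed)."""
    # os.pathsep is ";" on Windows and ":" on Linux/macOS; the trailing
    # separator mirrors the "stanford-postagger.jar;" classpath used in this tutorial.
    classpath = jar + os.pathsep
    cmd = ["java", "-mx" + memory, "-cp", classpath, TAGGER_CLASS,
           "-model", model, "-textFile", text_file]
    if output_format is not None:
        cmd += ["-outputFormat", output_format]
    if xml_input is not None:
        cmd += ["-xmlInput", xml_input]
    return cmd
```

Keeping the command as a list rather than one long string means each value stays a separate argument, so no quotation marks are needed when the list is later handed to subprocess.run.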

3.2 Example commands for different purposes

These commands are formatted into different lines in order to make them more readable. Please type them into your DOS-box or shell as one single line. It is a good idea to copy each command into an editor as a single line and save it as a plain text file with the filename extension .bat (Windows) or .sh (Linux) so that it can be run as a script. Writing your commands into a so-called batch file makes it easier to modify the commands and to fix errors in case you have mistyped anything. Sample batch files are available here for download. Note that you have to modify the name of the input file to point to a file available on your computer and the name of the output file to a filename of your choice. It is assumed that the input file is located in the base directory of the Stanford PoS Tagger. If your input file is located in another directory, be sure to specify the full path; the same applies to the output file.

3.2.1 How to tag an English plain text file and write output to a plain text file

You can test the tagger by tagging the file “sample-input.txt” that ships with the tagger and is located in the tagger directory. Use the following command to do so:

java -mx500m -cp "stanford-postagger.jar;" edu.stanford.nlp.tagger.maxent.MaxentTagger -model "\models\english-left3words-distsim.tagger" -textFile "sample-input.txt" > "my-sample-output.txt"

For future use, copy the command to a plain text file and save it under the name: my-stanford-pos.bat. You can then run this command from this batch file in the terminal.

The next example shows how you can PoS tag any other file in your file system. In this case, the input file and the output file are both specified with their full paths:

java -mx500m -cp "stanford-postagger.jar;" edu.stanford.nlp.tagger.maxent.MaxentTagger -model "\models\english-left3words-distsim.tagger" -textFile "C:\Users\Public\corpora\BarackObamaSpeeches\OSC2002-2009\P-Obama-Inaugural-Speech-Inauguration.htm.txt" > "C:\Users\Public\corpora\BarackObamaSpeeches\OSC2002-2009\P-Obama-Inaugural-Speech-Inauguration-out.txt"
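The ">" at the end of such a command sends whatever the tagger prints to the named output file. The same redirection can be sketched in Python, assuming the command is already available as a list of arguments; the function name run_to_file is hypothetical.

```python
import subprocess
from pathlib import Path
from typing import Sequence


def run_to_file(cmd: Sequence[str], out_path: str) -> int:
    """Run cmd and write its standard output to out_path; return the exit code."""
    out_file = Path(out_path)
    # Create the output directory if needed; the shell's ">" would not do this.
    out_file.parent.mkdir(parents=True, exist_ok=True)
    with out_file.open("wb") as handle:
        # stdout=handle plays the role of "> outfile.txt" in the shell command.
        completed = subprocess.run(cmd, stdout=handle)
    return completed.returncode
```

A return value of 0 conventionally means the program finished without error. Messages the tagger writes to standard error still appear in the terminal, just as they do with ">" in the shell.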

Please note: you need to copy the file stanford-postagger.bat to your Stanford PoS Tagger directory and make sure the input file is located in the same directory or specify the path to the file as in the Obama Inauguration example above.

CAUTION: Should you decide to copy and paste the above command into your terminal or your own batch file, please make sure that everything is on one single line and there are no line-breaks. Also ensure that the quotation marks are not turned into “curly” typographic quotation marks (see References below for more on this) when you copy and paste; this will sometimes happen depending on your combination of browser and editor. If it does happen, make sure you overwrite them in your editor with simple quotation marks, then save the file.

Note: your text editor may well show this call on two or more lines without actually inserting a line break, simply wrapping the line visually at the window border. It may therefore look as if there is more than one line when technically there is only one.
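Both problems described above, stray line breaks and curly quotation marks, can also be repaired mechanically. The following Python sketch is one possible way of doing so; the function name normalize_command is made up for this illustration, and the sketch assumes that paths contain no white space, as recommended in the installation section.

```python
# Map typographic quotation marks to their plain ASCII counterparts.
CURLY_TO_STRAIGHT = str.maketrans({
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
})


def normalize_command(pasted: str) -> str:
    """Return the pasted command on one line with straight quotation marks."""
    straightened = pasted.translate(CURLY_TO_STRAIGHT)
    # split() with no argument splits on any whitespace, including line breaks,
    # so joining with single spaces removes accidental line breaks.
    return " ".join(straightened.split())
```

Because every run of white space is collapsed to a single space, the function should not be applied to commands whose quoted paths contain several consecutive spaces.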

3.2.2 How to tag an xml input file and write output to an xml output file with a model for English

java -mx300m -cp "stanford-postagger.jar;" edu.stanford.nlp.tagger.maxent.MaxentTagger -model "\models\english-left3words-distsim.tagger" -textFile xmlIn.xml -outputFormat xml -xmlInput body > outfile.xml

4 Multilingual part of speech tagging

Different tagging models are available for the following languages:

  • Arabic
  • Chinese
  • English
  • French
  • German
  • Spanish

In order to tag texts in a different language, select a different model from the \models folder. There are a variety of models available with the tagger both for English and the other languages mentioned above. Additionally, the tagger can be trained for other languages.
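Selecting a model by language can be sketched as a simple lookup. The table below only contains model file names that appear in the commands of this tutorial; the function name model_path and the choice of german-ud.tagger as the German entry are assumptions made for the illustration, and other models from the models folder can be added in the same way.

```python
from pathlib import Path

# Model file names taken from the commands in this tutorial.
MODELS = {
    "english": "english-left3words-distsim.tagger",
    "german": "german-ud.tagger",
    "arabic": "arabic.tagger",
}


def model_path(language: str, models_dir: str = "models") -> str:
    """Return the path of the model file listed for the given language."""
    key = language.strip().lower()
    if key not in MODELS:
        known = ", ".join(sorted(MODELS))
        raise ValueError(f"No model listed for {language!r}; known languages: {known}")
    # Path joins with the separator of the operating system the script runs on.
    return str(Path(models_dir) / MODELS[key])
```

The returned string is what would be passed to the -model parameter.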

Tag sets for different languages

Please note that the tagger uses different tag sets for different languages, as there is no universal tag set that fits all linguistic phenomena in all languages. Make sure you find out which tag set is used in the model for a specific language and what the tags mean.
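One practical way to see which tags a model actually produces is to list the tags that occur in a tagged output file. The Python sketch below assumes plain text output in which each token has the form word, separator, tag, and it assumes an underscore as the separator; if you set -tagSeparator to something else, pass that value instead. The function name tag_counts is hypothetical.

```python
from collections import Counter
from typing import Iterable


def tag_counts(lines: Iterable[str], separator: str = "_") -> Counter:
    """Count how often each tag occurs in word<separator>TAG output."""
    counts: Counter = Counter()
    for line in lines:
        for token in line.split():
            # rpartition splits at the LAST separator, so words that themselves
            # contain the separator keep their full form.
            word, found, tag = token.rpartition(separator)
            if found:
                counts[tag] += 1
    return counts
```

Calling tag_counts on the lines of an output file and looking at the keys of the result gives the inventory of tags to look up in the documentation of the tag set.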

4.1 Tagging German data

In order to use the Stanford PoS Tagger to tag German plain text, all you have to do is change the model to one of the models for German, e.g. “\models\german-fast.tagger”, “\models\german-hgc.tagger” or “\models\german-ud.tagger”, and of course adjust the names of the input and output files.

4.1.1 Tagging German plain text with the german-fast.tagger model

java -mx300m -cp "stanford-postagger.jar;" edu.stanford.nlp.tagger.maxent.MaxentTagger -model "\models\german-fast.tagger" -textFile "goethe-faust-1.txt" > "goethe-faust-1.out"

Note that this is only possible if you have the “\models\german-fast.tagger” in the subfolder “\models”.

4.1.2 Tagging with the german-ud.tagger model

Alternatively, with the new version of the Stanford PoS Tagger (2022), you can use the “german-ud.tagger” model, which is bigger than “german-fast.tagger”, with the following command:

java -mx300m -cp "stanford-postagger.jar;" edu.stanford.nlp.tagger.maxent.MaxentTagger -model "\models\german-ud.tagger" -tokenize False -textFile sample-de-input.txt -encoding "utf-8" > sample-de-output.txt

4.2 Tagging Arabic data

Tagging data in the Arabic language works according to the same pattern as for English and German. All you have to do is select an input file with Arabic language data and specify a model for the Arabic language from the models\ folder:

java -mx1000m -cp "stanford-postagger.jar;" edu.stanford.nlp.tagger.maxent.MaxentTagger -model "models\arabic.tagger" -textFile sample-ar-input.txt > sample-ar-input.pos

5 Summary

The Stanford PoS Tagger is an easy-to-use part-of-speech tagger that is simple to install and free to use. It is language independent; models for different languages are available and the tagger can be trained on new data. The Stanford PoS Tagger is used in state-of-the-art applications.

References

Website for the Stanford PoS Tagger by the Stanford NLP Group: Stanford log-linear part of speech tagger

Butterick's Practical Typography on straight and curly quotes