

Stanford PoS Tagger: tagging from Python

author: Sabine Bartsch, e-mail: mail@linguisticsweb.org

[tutorial status: work in progress - January 2019]

Related tutorial: Stanford PoS Tagger

While we will often be running an annotation tool in a stand-alone fashion directly from the command line, there are many scenarios in which we would like to integrate an automatic annotation tool in a larger workflow, for example with the aim of running pre-processing and annotation steps as well as analyses in one go. In this tutorial, we will be running the Stanford PoS Tagger from a Python script.

The Stanford PoS Tagger is itself written in Java, so it can be easily integrated into and called from Java programs. However, many linguists would rather stick with Python as their programming language, especially when they are using other Python packages such as NLTK as part of their workflow. And while the Stanford PoS Tagger is not written in Python, it can nevertheless be integrated more or less seamlessly into Python programs. In this tutorial, we will look at two principal ways of driving the Stanford PoS Tagger from Python and show how this can be done with a single file and with multiple files in a directory.

Running the Stanford PoS Tagger in NLTK

NLTK integrates a version of the Stanford PoS Tagger as a module that can be run without a separate local installation of the tagger. This is the simplest way of running the Stanford PoS Tagger from Python and a good way of getting started with the tagger. Its disadvantage is that users cannot choose between the models used for tagging. The script below uses the Stanford PoS Tagger module of NLTK to tag an example sentence:

Note the for-loop in lines 15-16 that converts the tagged output (a list of tuples) into the two-column format: word_tag.
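The conversion mentioned above can be sketched in a few lines. This is only an illustration under stated assumptions: the helper name to_word_tag is hypothetical, and the tagged list is written by hand to stand in for real tagger output.

```python
def to_word_tag(tagged, separator="_"):
    """Turn a list of (word, tag) tuples into 'word_tag' strings."""
    return [f"{word}{separator}{tag}" for word, tag in tagged]


# Hand-written stand-in for tagger output (a list of tuples).
tagged = [("This", "DT"), ("is", "VBZ"), ("a", "DT"), ("sentence", "NN"), (".", ".")]

# Print one word_tag pair per line.
for item in to_word_tag(tagged):
    print(item)
```

Keeping the conversion in its own function makes it easy to change the separator later, for example to a tab for a true two-column file.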

This same script can be easily modified to tag a file located in the file system:

Note that you need to adjust the path in line 8 above to point to a UTF-8 encoded plain text file that actually exists in your local file system.
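Reading the input file explicitly as UTF-8, and failing early when the path is wrong, could look roughly like the sketch below. The function name read_utf8_text is hypothetical, and the demo writes a temporary file so that the example does not depend on any path on your machine.

```python
import tempfile
from pathlib import Path


def read_utf8_text(path):
    """Return the contents of a UTF-8 plain text file, or fail with a clear error."""
    file_path = Path(path)
    if not file_path.is_file():
        raise FileNotFoundError(f"No such text file: {file_path}")
    return file_path.read_text(encoding="utf-8")


# Demo with a temporary file; replace this with the path to your own text file.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"
    sample.write_text("This is a sample sentence.", encoding="utf-8")
    text = read_utf8_text(sample)
    print(text)
```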

Driving the Stanford PoS Tagger local installation from Python / NLTK

Instead of running the Stanford PoS Tagger as an NLTK module, you can drive a local tagger installation through an NLTK wrapper module. To use this scenario, you first have to create a local installation of the Stanford PoS Tagger as described in the Stanford PoS Tagger tutorial under section 2, Installation and requirements. In the code itself, you have to point Python to the location of your Java installation:

java_path = "C:/Program Files/Java/jdk1.8.0_192/bin/java.exe"

os.environ["JAVAHOME"] = java_path

You also have to explicitly state the paths to the Stanford PoS Tagger .jar file and the Stanford PoS Tagger model to be used for tagging:

jar = "C:/Users/Public/utility/stanford-postagger-full-2018-10-16/stanford-postagger.jar"

model = "C:/Users/Public/utility/stanford-postagger-full-2018-10-16/models/english-bidirectional-distsim.tagger"

Note that these paths vary according to your system configuration. You will need to check your own file system for the exact locations of these files, although on a Windows system Java is likely to be installed somewhere in C:\Program Files\ or C:\Program Files (x86)\.
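Because wrong paths are the most common source of errors here, it can help to check them before starting the tagger. The following is a minimal sketch, not part of the original scripts: the helper missing_paths and the dictionary keys are hypothetical names, and the paths are simply the example values from above.

```python
import os
from pathlib import Path


def missing_paths(paths):
    """Return the names of all configured paths that do not exist on disk."""
    return [name for name, path in paths.items() if not Path(path).exists()]


# Example values from the text above; adjust them to your own system.
config = {
    "java": "C:/Program Files/Java/jdk1.8.0_192/bin/java.exe",
    "jar": "C:/Users/Public/utility/stanford-postagger-full-2018-10-16/stanford-postagger.jar",
    "model": "C:/Users/Public/utility/stanford-postagger-full-2018-10-16/models/english-bidirectional-distsim.tagger",
}

missing = missing_paths(config)
if missing:
    print("Please check these paths:", ", ".join(missing))
else:
    # Only point Python to Java once every path has been found.
    os.environ["JAVAHOME"] = config["java"]
```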

Running the local Stanford PoS Tagger on a sample sentence

The next example illustrates how you can run the Stanford PoS Tagger on a sample sentence:
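A minimal sketch of such a script is given here. It assumes that NLTK is installed, that the StanfordPOSTagger wrapper class in nltk.tag.stanford is available in your NLTK version, and that java_path, jar and model are set as described above. The function names build_stanford_tagger and tag_sentence are hypothetical, and splitting on whitespace is a deliberately naive stand-in for a real tokenizer.

```python
import os


def build_stanford_tagger(java_path, jar, model):
    """Create NLTK's wrapper around a local Stanford PoS Tagger installation."""
    # Imported here so the rest of the sketch can be read and tried without NLTK.
    from nltk.tag.stanford import StanfordPOSTagger

    os.environ["JAVAHOME"] = java_path
    return StanfordPOSTagger(model, path_to_jar=jar, encoding="utf8")


def tag_sentence(tagger, sentence):
    """Split the sentence on whitespace and return a list of (word, tag) tuples."""
    tokens = sentence.split()
    return tagger.tag(tokens)


# Usage, assuming the three paths exist on your machine:
# tagger = build_stanford_tagger(java_path, jar, model)
# for word, tag in tag_sentence(tagger, "This is a sample sentence ."):
#     print(word + "_" + tag)
```

Passing the tagger in as an argument means that tag_sentence works with any object that offers a tag() method, which makes it easy to swap in a different tagger or model later.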

Running the local Stanford PoS Tagger on a single local file

The code above can be run on a local file with very little modification. In this example, the sentence snippet in line 22 has been commented out and the line with the path to a local file has been uncommented:

Running the local Stanford PoS Tagger on a directory of files

Please note down the name of the directory to which you have unpacked the Stanford PoS Tagger as well as the subdirectory in which the tagging models are located. Also write down (or copy) the name of the directory containing the file(s) you would like to tag. As we will be writing the output of the two subprocesses, tokenization and tagging, to files in your file system, you have to create these output directories and again write down or copy their locations for further use. In this example these directories are called:

data_path

tokenized_data_path

tagged_data_path

Once you have installed the Stanford PoS Tagger, collected and adjusted all of this information in the file below and created the respective directories, you are set to run the following Python program:
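The overall shape of such a program can be sketched as follows. This is an illustration under stated assumptions rather than the original script: the function name tag_directory is hypothetical, input files are assumed to end in .txt, the output directories are assumed to exist already, and the one-item-per-line output layout is a choice made for this example. The tokenizer and tagger are passed in as arguments, so any tokenizing function and any object with a tag() method will do.

```python
from pathlib import Path


def tag_directory(data_path, tokenized_data_path, tagged_data_path, tokenize, tagger):
    """Tokenize and tag every .txt file in data_path, writing both stages to disk."""
    tokenized_dir = Path(tokenized_data_path)
    tagged_dir = Path(tagged_data_path)
    written = []
    for source in sorted(Path(data_path).glob("*.txt")):
        text = source.read_text(encoding="utf-8")
        tokens = tokenize(text)
        # Stage 1: write one token per line.
        (tokenized_dir / source.name).write_text("\n".join(tokens), encoding="utf-8")
        # Stage 2: write one word_tag pair per line.
        lines = [f"{word}_{tag}" for word, tag in tagger.tag(tokens)]
        (tagged_dir / source.name).write_text("\n".join(lines), encoding="utf-8")
        written.append(source.name)
    return written


# Usage, assuming NLTK and a tagger object built from your local installation:
# import nltk
# tag_directory(data_path, tokenized_data_path, tagged_data_path,
#               nltk.word_tokenize, tagger)
```

Each output file keeps the name of its input file, so the tokenized and tagged versions of a text can be matched up by file name.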