This tutorial builds on software and input from the Stanford PoS Tagger website.
Related tutorial: Stanford PoS Tagger: tagging from Python
The Stanford PoS Tagger is a probabilistic part-of-speech tagger developed by the Stanford Natural Language Processing Group. It is widely used in state-of-the-art applications in natural language processing.
The Stanford PoS Tagger is an implementation of a log-linear part-of-speech tagger. It is effectively language independent; whether it can be used on data of a particular language depends on the availability of models trained on data for that language. Tagging models are currently available for English as well as Arabic, Chinese, and German. They ship with the full download of the Stanford PoS Tagger.
Requirements: The Stanford PoS Tagger requires Java. As many programmes in corpus and computational linguistics require Java, it is advisable to install the full Java JDK (Java Development Kit), which also includes the JRE (Java Runtime Environment). Please consult the following page to download this software, which is a system prerequisite for many corpus and computational linguistic applications: Open JDK.
The Stanford PoS Tagger does not require much of an installation. The following steps get you started in no time at all.
c:\Users\Public\utility\…
with subfolders for each tool (and version) at this location. I find this convenient as it keeps the directory path relatively short.
In order to invoke the part-of-speech tagger, the following generic command-line parameters have to be supplied:
java -mx500m -classpath stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger
-model NAME-OF-MODEL
-textFile infile.txt > outfile.txt
java | invokes the Java runtime environment |
-mx | numerical value that assigns memory to the tagger; 500m equals 500 megabytes, which should be sufficient for most tagging tasks |
-classpath | path to the tagger program proper; it is followed by the name of the class to run, edu.stanford.nlp.tagger.maxent.MaxentTagger |
-model | different models are available, but one has to be specified, e.g. \models\english-left3words-distsim.tagger |
-textFile | for plain text input files |
-xmlInput | Example value: body; the value specified here determines the element of an XML file whose contents are tagged. |
-outputFormat | xml, tsv, or slashTags; the separator between word and tag can be set with -tagSeparator, e.g. -tagSeparator \# |
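The parameters listed above can also be assembled programmatically. The following is a minimal Python sketch and not part of the tagger distribution; the helper names `build_tagger_command` and `run_tagger` and their defaults are assumptions made for illustration. Building the argument list works anywhere, but actually running it requires Java and the tagger files to be present.

```python
import subprocess

# Class name taken from the command line shown above.
TAGGER_CLASS = "edu.stanford.nlp.tagger.maxent.MaxentTagger"


def build_tagger_command(model, text_file, memory="500m",
                         jar="stanford-postagger.jar"):
    """Return the tagger call as a list of arguments (hypothetical helper)."""
    return [
        "java",
        f"-mx{memory}",          # memory assigned to the tagger
        "-classpath", jar,       # path to the tagger program
        TAGGER_CLASS,            # class that is run
        "-model", model,         # tagging model to load
        "-textFile", text_file,  # plain text input file
    ]


def run_tagger(command, out_file):
    """Run the command and write its standard output to out_file.

    This takes the place of the "> outfile.txt" redirection in the shell.
    """
    with open(out_file, "wb") as out:
        subprocess.run(command, stdout=out, check=True)


command = build_tagger_command("NAME-OF-MODEL", "infile.txt")
print(" ".join(command))
```

Passing the arguments as a list avoids quoting problems with paths that contain spaces.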
These commands are split across several lines in order to make them more readable. Please type them into your DOS box or shell as one single line. It is a good idea to copy these commands into an editor as a single line and save them as a plain text file with the filename extension .bat (Windows) or .sh (Linux) so that the file can be run as a script. Writing your commands into a so-called batch file makes it easier to modify the commands and to fix errors in case you have mistyped anything. Sample batch files are available here for download. Note that you have to modify the name of the input file to point to a file available on your computer and the name of the output file to a filename of your choice. It is assumed that the input file is located in the base directory of the Stanford PoS Tagger. If your input file is located in another directory, be sure to specify the full path; the same applies to the output file.
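Collapsing a multi-line command into one line and saving it as a script file can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names `to_single_line` and `write_script` are hypothetical, and the command is the generic one shown earlier with its placeholder model and file names.

```python
from pathlib import Path


def to_single_line(command_text):
    """Collapse line breaks and repeated spaces into single spaces."""
    return " ".join(command_text.split())


def write_script(path, command_text):
    """Write the one-line command to a plain text script file."""
    Path(path).write_text(to_single_line(command_text) + "\n", encoding="utf-8")


multi_line = """java -mx500m -classpath stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger
-model NAME-OF-MODEL
-textFile infile.txt > outfile.txt"""

print(to_single_line(multi_line))
```

Calling `write_script("my-stanford-pos.bat", multi_line)` would then save the single-line command under that file name.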
You can test the tagger by tagging the file “sample-input.txt” that ships with the tagger and is located in the tagger directory. Use the following command to do so:
java -mx500m -cp "stanford-postagger.jar;" edu.stanford.nlp.tagger.maxent.MaxentTagger -model "\models\english-left3words-distsim.tagger" -textFile "sample-input.txt" > "my-sample-output.txt"
For future use, copy the command to a plain text file and save it under the name my-stanford-pos.bat. You can then run the command from this batch file in the terminal.
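Once a file has been tagged, the output can be read back for further processing. The sketch below assumes that each token in the output is written as word_TAG and that tokens are separated by whitespace; the underscore separator is an assumption, so pass a different `separator` if your output looks different. The example line is made up for illustration and is not actual tagger output.

```python
def parse_tagged_line(line, separator="_"):
    """Split a line of word<separator>TAG tokens into (word, tag) pairs."""
    pairs = []
    for token in line.split():
        if separator not in token:
            # No tag attached: keep the token and mark the tag as missing.
            pairs.append((token, None))
            continue
        # rpartition splits at the LAST separator, so a word that itself
        # contains the separator character stays intact.
        word, _, tag = token.rpartition(separator)
        pairs.append((word, tag))
    return pairs


example = "This_DT is_VBZ a_DT sample_NN sentence_NN ._."
for word, tag in parse_tagged_line(example):
    print(word, tag)
```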
The next example shows how you can PoS-tag any other file in your file system. In this case, the input file and the output file are both specified with their full paths:
java -mx500m -cp "stanford-postagger.jar;" edu.stanford.nlp.tagger.maxent.MaxentTagger -model "\models\english-left3words-distsim.tagger" -textFile "C:\Users\Public\corpora\BarackObamaSpeeches\OSC2002-2009\P-Obama-Inaugural-Speech-Inauguration.htm.txt" > "C:\Users\Public\corpora\BarackObamaSpeeches\OSC2002-2009\P-Obama-Inaugural-Speech-Inauguration-out.txt"
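When many files are tagged, typing the long output path by hand becomes error-prone. The following sketch derives an output path from an input path in the same way as the example above (same folder, file name up to the first dot, followed by "-out.txt"). The function name `output_path_for` and the naming rule are assumptions for illustration; `PureWindowsPath` is used so that the Windows path is handled the same way on any operating system.

```python
from pathlib import PureWindowsPath


def output_path_for(input_path, suffix="-out.txt"):
    """Return an output path in the same folder as the input file."""
    path = PureWindowsPath(input_path)
    base = path.name.split(".")[0]  # file name up to the first dot
    return str(path.with_name(base + suffix))


source = (r"C:\Users\Public\corpora\BarackObamaSpeeches\OSC2002-2009"
          r"\P-Obama-Inaugural-Speech-Inauguration.htm.txt")
print(output_path_for(source))
```

Note that this simple rule does not suit file names that begin with a dot or that contain dots in the part you want to keep.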
Please note: you need to copy the file stanford-postagger.bat to your Stanford PoS Tagger directory and make sure the input file is located in the same directory, or specify the path to the file as in the Obama Inauguration example above.
CAUTION: Should you decide to copy and paste the above command into your terminal or your own batch file, please make sure that everything is on one single line and there are no line-breaks. Also ensure that the quotation marks are not turned into “curly” typographic quotation marks (see References below for more on this) when you copy and paste; this will sometimes happen depending on your combination of browser and editor. If it does happen, make sure you overwrite them in your editor with simple quotation marks, then save the file.
Note: your text editor may well show this call on two lines without actually inserting a line break, simply wrapping the line visually at the window border. It may then look as if there is more than one line when technically there is only one.
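Replacing typographic quotation marks by hand is easy to get wrong, so the check can be automated. The following is a small sketch; the function name `straighten_quotes` is hypothetical, and the mapping covers only the four most common curly quotation marks.

```python
# Map curly quotation marks to their plain ASCII counterparts.
CURLY_TO_STRAIGHT = str.maketrans({
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
})


def straighten_quotes(text):
    """Return text with curly quotation marks replaced by straight ones."""
    return text.translate(CURLY_TO_STRAIGHT)


pasted = "-textFile \u201csample-input.txt\u201d"
print(straighten_quotes(pasted))
```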
java -mx300m -cp "stanford-postagger.jar;" edu.stanford.nlp.tagger.maxent.MaxentTagger -model "\models\english-left3words-distsim.tagger" -textFile xmlIn.xml -outputFormat xml -xmlInput body > outfile.xml
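To see which part of an XML file a call with -xmlInput body would select, the contents of the named element can be extracted with Python's standard library. This sketch only illustrates the selection of text by element name; it does not reproduce how the tagger itself reads XML, and the sample document and the function name `text_of_elements` are made up for illustration.

```python
import xml.etree.ElementTree as ET


def text_of_elements(xml_text, element_name):
    """Return the text content of every element with the given name."""
    root = ET.fromstring(xml_text)
    return ["".join(element.itertext()) for element in root.iter(element_name)]


sample = (
    "<doc>"
    "<head>This title is not selected.</head>"
    "<body>This text is selected.</body>"
    "</doc>"
)
print(text_of_elements(sample, "body"))
```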
As noted above, tagging models are available for English, Arabic, Chinese, and German.
In order to tag texts in a different language, select a different model from the \models folder. There are a variety of models available with the tagger both for English and the other languages mentioned above. Additionally, the tagger can be trained for other languages.
Please note that for different languages the tagger uses different tag sets as there is no universal tag-set that fits all linguistic phenomena in all languages. Make sure you find out what tag-set is being used in a model for a specific language and what the tags mean.
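One practical way to find out which tags a model actually produces is to count the tags in a tagged output file and then look up the ones you do not recognise. The sketch below assumes tokens of the form word_TAG separated by whitespace; the separator, the function name `count_tags`, and the example line are assumptions for illustration only.

```python
from collections import Counter


def count_tags(tagged_text, separator="_"):
    """Count how often each tag occurs in text made of word<separator>TAG tokens."""
    counts = Counter()
    for token in tagged_text.split():
        if separator in token:
            # The tag is whatever follows the last separator in the token.
            counts[token.rpartition(separator)[2]] += 1
    return counts


example = "The_DT cat_NN sat_VBD on_IN the_DT mat_NN ._."
for tag, frequency in count_tags(example).most_common():
    print(tag, frequency)
```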
In order to use the Stanford PoS Tagger to tag German plain text, all you have to do is change the model to one of the models for German, e.g. “\models\german-fast.tagger”, “\models\german-hgc.tagger” or “\models\german-ud.tagger”, and of course adjust the names of the input and output files.
java -mx300m -cp "stanford-postagger.jar;" edu.stanford.nlp.tagger.maxent.MaxentTagger -model "\models\german-fast.tagger" -textFile "goethe-faust-1.txt" > "goethe-faust-1.out"
Note that this is only possible if you have the file “german-fast.tagger” in the subfolder “\models”.
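Whether a model file is actually present can be checked before the tagger is started. The following sketch assumes that the models live in a subfolder named models inside the tagger directory, as in the commands above; the function name `find_model` is hypothetical.

```python
from pathlib import Path


def find_model(tagger_dir, model_name):
    """Return the path to a model file, or raise if it is missing."""
    model_path = Path(tagger_dir) / "models" / model_name
    if not model_path.is_file():
        raise FileNotFoundError(f"Model not found: {model_path}")
    return model_path


try:
    # "." stands for the tagger directory; adjust it to your installation.
    print(find_model(".", "german-fast.tagger"))
except FileNotFoundError as error:
    print(error)
```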
Alternatively, with the new version of the Stanford PoS Tagger (2022) you can use the “german-ud.tagger” model, which is bigger than “german-fast.tagger”, with the following command:
java -mx300m -cp "stanford-postagger.jar;" edu.stanford.nlp.tagger.maxent.MaxentTagger -model "\models\german-ud.tagger" -tokenize False -encoding "utf-8" -textFile sample-de-input.txt > sample-de-output.txt
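Because the command above declares the input to be UTF-8, an input file saved in another encoding should be converted first, otherwise umlauts and other non-ASCII characters may be misread. The sketch below re-saves a file as UTF-8; the function name `convert_to_utf8` is hypothetical, and the default source encoding latin-1 is only an assumption that you need to replace with the encoding your file really uses.

```python
from pathlib import Path


def convert_to_utf8(source_file, target_file, source_encoding="latin-1"):
    """Read a text file in source_encoding and save a copy as UTF-8."""
    text = Path(source_file).read_text(encoding=source_encoding)
    Path(target_file).write_text(text, encoding="utf-8")
```

A call such as `convert_to_utf8("sample-de-latin1.txt", "sample-de-input.txt")` would then produce the file named in the command; the first file name here is made up for illustration.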
Tagging data in the Arabic language works according to the same pattern as for English and German. All you have to do is select an input file with Arabic language data and specify a model for the Arabic language from the models\ folder:
java -mx1000m -cp "stanford-postagger.jar;" edu.stanford.nlp.tagger.maxent.MaxentTagger -model "models\arabic.tagger" -textFile sample-ar-input.txt > sample-ar-input.pos
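Since the English, German, and Arabic examples differ mainly in the model that is selected, the choice can be captured in a small lookup table. The sketch below uses only model file names that appear in this tutorial; the dictionary `MODEL_FILES` and the function `model_for` are hypothetical helpers, and further languages would have to be added by hand.

```python
# Model file names as used in the commands of this tutorial.
MODEL_FILES = {
    "english": "english-left3words-distsim.tagger",
    "german": "german-fast.tagger",
    "arabic": "arabic.tagger",
}


def model_for(language):
    """Return the model file name for a language, ignoring case."""
    key = language.lower()
    if key not in MODEL_FILES:
        known = ", ".join(sorted(MODEL_FILES))
        raise ValueError(f"No model listed for {language!r}; known: {known}")
    return MODEL_FILES[key]


for language in ("English", "German", "Arabic"):
    print(language, model_for(language))
```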
The Stanford PoS Tagger is an easy-to-use part-of-speech tagger that is simple to install and free to use. It is language independent; models for different languages are available, and the tagger can be trained on new data. The Stanford PoS Tagger is used in state-of-the-art applications.
Stanford NLP Group: Stanford Log-linear Part-Of-Speech Tagger (website for the Stanford PoS Tagger)
Butterick's Practical Typography: Straight and curly quotes