

Stanford Core NLP

author: Sabine Bartsch, TU Darmstadt

Tutorial status: tested with Core NLP 3.9.2 and below

1 What are the Stanford Core NLP Tools?

The Stanford Core NLP Tools bundle the principal Stanford NLP tools, such as the Stanford POS Tagger, the Stanford Named Entity Recognizer and the Stanford Parser, in one integrated package together with models for English and a number of other languages. They are thus a viable choice if you know from the start that you are going to be processing English texts or texts in any of the languages for which models exist. The Stanford Core NLP Tools automatically generate annotations that are the foundation of many types of linguistic analysis, such as part-of-speech tagging or dependency parsing.

2 Requirements and set-up

Requirements

The Stanford Core NLP Tools require a running Java installation. As many software programs in corpus and computational linguistics require Java, and Java is widely used for development in the field anyway, it is advisable to install a full Java JDK (Java Development Kit), which can be downloaded from the Oracle Java JDK site. This also includes the widely used JRE (Java Runtime Environment), which is a prerequisite for the execution of many different software programs.
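
To check whether Java is already installed on your machine and which version it is, you can run the following command in a terminal:

java -version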

Note that the Stanford Core NLP Tools officially target Java 8. Java 9 and 10 also work if you add the following option to your command:

--add-modules java.se.ee

This option loads the Java JAXB module, which the tools rely on but which is no longer loaded by default from Java 9 onwards; in Java 11 the module was removed altogether, so the tools are best run on Java 8 to 10. An example further down the page shows where to insert this option.

Please be aware that the Core NLP tools require more memory than the stand-alone tools, so be sure your machine has sufficient RAM available: around 4 GB of free memory are needed for the Core NLP tools to use all of the annotators simultaneously, and processing larger data sets might require even more. The amount of memory Java is allowed to use is set with the -Xmx option shown in the commands below.

Download

The Stanford Core NLP Tools, including the models for English, can be downloaded from the Stanford CoreNLP website at https://stanfordnlp.github.io/CoreNLP/. Unzip the downloaded archive to a directory of your choice; the commands below are issued from within that directory.

3 Quick start guide

Java-based software such as the Stanford Core NLP Tools can be called directly from the command line by typing in the relevant command. However, it is usually best to write the command to a file and execute that file: this saves you time and spares you frustration, because you can more easily edit and reuse commands and also document within the file what a specific command is supposed to do.

Assuming there is a file called input.txt in the directory of the Core NLP tools that you want to process, a simple command that would call three annotators might look like this:

java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos -file input.txt -outputFormat xml

Note: If you have upgraded to Java 9 or 10, you need to add the extra parameter --add-modules java.se.ee to the above command in order to run the CoreNLP pipeline:

java --add-modules java.se.ee -cp "*" -Xmx4g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,parse -file input.txt -outputFormat xml

The switch -annotators takes three parameters, tokenize,ssplit,pos, which invoke the pre-processing steps tokenization and sentence splitting followed by the part-of-speech tagger; the switch -file takes the name of the input file; and the switch -outputFormat specifies the format of the output to be created, in this case an XML file. You can copy this line into a text file, save it as core-nlp-pos.bat in the Core NLP directory and invoke it from the command line. It will produce an output file called input.txt.xml in the same directory.
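
Such a batch file might look like the following sketch; the lines starting with :: are batch comments documenting what the command does:

:: core-nlp-pos.bat - to be saved in and called from the Core NLP directory
:: tokenizes, sentence-splits and POS-tags input.txt and writes input.txt.xml
java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos -file input.txt -outputFormat xml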

Annotators included in the CoreNLP tools

The Stanford Core NLP Tools offer the following annotators that are invoked by the switch -annotators:

annotator | function
tokenize | The tokenizer subdivides a text into individual tokens, i.e. words, punctuation marks etc.
ssplit | The sentence splitter segments a text into sentences.
pos | The Stanford Part of Speech Tagger assigns word class labels to each token according to a model and annotation scheme.
lemma | The lemmatizer provides the lemma or base form for each token.
ner | The Stanford Named Entity Recognizer identifies tokens that are proper nouns as members of specific classes such as person name, organization name etc.
parse | The Stanford Parser analyses and annotates the syntactic structure of each sentence in the text. The Stanford Parser is actually not just one parser, but offers phrase structure parses as well as dependency parses.
dcoref | The Stanford CorefAnnotator implements pronominal and nominal coreference resolution.

Note that there are dependencies between the annotators, i.e. certain annotators require pre-processing by other annotators; e.g. tokenization and sentence splitting are obligatory pre-processing steps for part-of-speech tagging and parsing. You can look up the dependencies on the Stanford CoreNLP annotator dependencies page.
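
For example, since the lemmatizer requires tokenization, sentence splitting and part-of-speech tagging, a pipeline that includes lemma must list these prerequisites as well:

java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma -file input.txt -outputFormat xml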

Running the Stanford CoreNLP tools with multiple annotators

:: calls the Core NLP Tools with files from the Stanford Core NLP folder and outputs xml

java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,parse,dcoref -file input.txt -outputFormat xml
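
Incidentally, xml is not the only output format: in recent CoreNLP 3.x versions, -outputFormat also accepts, among others, text, json and conll. For instance, to obtain JSON output instead:

java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos -file input.txt -outputFormat json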

Annotating multiple files in a directory

While annotating a single file is sometimes all you want to do, the typical corpus linguistic annotation task is likely to require the annotation of multiple files. So far, we have run different sets of annotators over a single file, using the switch -file followed by the name of the file to be annotated. The next task is setting up a batch file that runs the same annotators on more than one file.

In order to run the Stanford CoreNLP tools on multiple files, we need a list of the files to be annotated. So let us create a list of the files located in a particular directory. This list file should contain one filename per line, each with its full directory path:

c:\Users\Public\CORPUS-DIRECTORY\corpusfile01.txt

c:\Users\Public\CORPUS-DIRECTORY\corpusfile02.txt

c:\Users\Public\CORPUS-DIRECTORY\corpusfile03.txt

c:\Users\Public\CORPUS-DIRECTORY\corpusfileNN.txt

Using DIR or LS to list files

Let us first of all remind ourselves of what we already know:

In order to list all text files in a directory, we use the dir command in the Windows terminal (cmd.exe) or ls in the shell of Unix-like operating systems (Linux, Mac OS) or in the Windows PowerShell. However, the output of a plain dir or ls command is not quite what we need in terms of a list of files with their full path and nothing else.

A plain dir command, for example, prints directory headers, file sizes and dates alongside the file names, which is not usable as an input file for the CoreNLP tools. With dir and ls we are on the right track, but we need to modify these commands to get the correct output format: one full path per line and nothing else. Fortunately, both commands take parameters specifying their output format.

Commands according to shell types and operating systems

Microsoft Windows

Windows command prompt (“Eingabeaufforderung”) (cmd): dir /B /S *.txt > filelist.lst

This writes the desired list of all text files with their full paths, one entry per line, to the file filelist.lst, which can serve as the input file for the CoreNLP tools.

dir | calls the function listing the contents of a directory
/B | bare format: outputs only the file names, without headers, file sizes or dates
/S | includes the files of all subdirectories in the listing
*.txt | lists only files with the extension .txt
> | redirects the output to a file instead of the standard output (terminal)
filelist.lst | is the target file to which the output of the command is written

Unix-style OSes: Linux and Mac OS

Unix-style operating systems such as Linux or Mac OS: ls -d -1 $PWD/*.* > filelist.lst
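
If, as in the Windows example, you only want the .txt files, the find command is an alternative on Linux and Mac OS; given an absolute starting directory, it prints one full path per line (the directory below is a placeholder to be adapted):

find /home/user/CORPUS-DIRECTORY -name "*.txt" > filelist.lst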

Windows PowerShell: (Get-ChildItem C:\CORPUS-DIRECTORY\ -Recurse).fullname > filelist.lst

or shorter: (gci -r C:\CORPUS-DIRECTORY\).fullname > filelist.lst
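
To restrict the PowerShell listing to .txt files, as in the dir example above, a filter can be added:

(Get-ChildItem C:\CORPUS-DIRECTORY\ -Filter *.txt -Recurse).fullname > filelist.lst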

Using the filelist in CoreNLP

To annotate the files listed in filelist.lst, we have to tell CoreNLP that this is the list of files to be annotated. We do this by means of the parameter -filelist followed by the name of filelist.lst, with its full path if it is not located in the CoreNLP directory. Note that the parameter -filelist is used here instead of -file, which we used to address just a single file:

:: Stanford Core NLP batch file to be called from the command line

:: calls the Stanford Core NLP Tools with the input files named in a file list (-filelist) and writes output to a directory of the user's choice (-outputDirectory)

java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,parse,dcoref -filelist "D:\CORPUS-DIRECTORY\filelist.lst" -outputFormat xml -outputDirectory "D:\CORPUS-DIRECTORY\core-nlp-output\"

The last parameter used here is -outputDirectory; it tells the CoreNLP tools where to write the annotated output files and must point to an existing directory. This directory can be located parallel to the directory with the input files so that you can use it as a source for any further processing. Keeping annotated files separate from the original corpus files is a very good idea in order to maintain some order in your file system.
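
Putting it all together, a complete batch file for annotating a whole corpus might look like this sketch (the D:\CORPUS-DIRECTORY paths are examples to be adapted). Since CoreNLP expects the output directory to exist, the if not exist line creates it beforehand if necessary:

:: annotates all files listed in filelist.lst and writes XML output to a separate directory
if not exist "D:\CORPUS-DIRECTORY\core-nlp-output\" mkdir "D:\CORPUS-DIRECTORY\core-nlp-output"
java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,parse,dcoref -filelist "D:\CORPUS-DIRECTORY\filelist.lst" -outputFormat xml -outputDirectory "D:\CORPUS-DIRECTORY\core-nlp-output\"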