The Stanford Core NLP Tools bundle the principal Stanford NLP tools, such as the Stanford POS Tagger, the Stanford Named Entity Recognizer and the Stanford Parser, in one integrated package together with models for English and a growing number of other languages. The Stanford Core NLP tools automatically generate annotations that are the foundation of many types of linguistic analysis, such as part-of-speech tagging or dependency parsing, as shown in the example below. These annotations are applied in so-called pipelines that integrate a number of different annotators into one workflow and generate integrated output in specific formats such as human-readable plain text, XML, JSON or the CoNLL data format. More about this below.
The Core NLP tools are the right choice in scenarios in which a number of different types of annotation are to be applied to a set of files with the aim of generating an integrated output. They should be used in scenarios in which sufficient computing power, especially memory (RAM), is available; due to their resource requirements, they are most efficiently employed on larger numbers of files (more about this later). As most of the annotators integrated into the Core NLP tools also exist as stand-alone tools, users may wonder when to choose one over the other. The Stanford PoS Tagger, for example, exists as a stand-alone tool that comes with the necessary pre-processing such as tokenization and sentence splitting built in, so users might wonder whether to use the stand-alone version or the Core NLP version of the software.
This tutorial is for the most current version of the Stanford Core NLP, 4.4.0 (2021-05). Tutorials for older versions, which may differ in details, are kept here for reference purposes (see below).
The Stanford Core NLP Tools require a running Java installation. As many software programs in corpus and computational linguistics require Java, and Java is widely used for development in the field anyway, it is advisable to install a full Java JDK (Java Development Kit), which can be downloaded as the Open JDK from multiple sources. This also includes the widely used JRE (Java Runtime Environment), which is a prerequisite for the execution of many different software programs.
Note that the Stanford CoreNLP Tools require Java 8 by default; later Java versions should work fine as well. However, if you run into any unforeseen errors under Java 9 or 10, you might want to try adding the following to your command:
--add-modules java.se.ee
[this might be optional, check without it first; note that the java.se.ee module was removed in Java 11, so this workaround only applies to Java 9 and 10]
An example further down the page shows where to insert this option, which loads the Java JAXB module.
Please be aware that the Core NLP tools require more memory than the stand-alone tools, so be sure your machine has sufficient RAM available. It would be good to be able to allocate at least 3 - 4 GB of RAM for the Core NLP tools, more if you are planning to use all of the annotators simultaneously; note that depending on the size of the models and your data set even more RAM may be required (6 - 8 GB of available RAM can come in handy).
Java-based software such as the Stanford Core NLP Tools can be called directly from the command line by typing in the relevant command. However, it is usually best to write the command to a batch file and execute that instead of typing directly at the command prompt. This will save you time and spare you frustration: you can more easily edit and reuse commands, and you can document within the file what a specific command is supposed to do.
Assuming there is a file called input.txt in the directory of the Core NLP tools that you want to process, a simple command that would call three annotators might look like this:
java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos -file input.txt -outputFormat xml
Note: If you have upgraded to Java 9 or 10, you may need to add the extra parameter --add-modules java.se.ee to the above command in order to run the CoreNLP pipeline:
java --add-modules java.se.ee -cp "*" -Xmx4g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,parse -file input.txt -outputFormat xml
The switch -annotators takes three parameters, "tokenize,ssplit,pos", which invoke the pre-processing steps tokenization and sentence splitting followed by the part-of-speech tagger; the switch -file takes the name of the input file, and the switch -outputFormat takes as its parameter the format of the output to be created, in this case an XML file. You can copy this command into a text file, save it as core-nlp-pos.bat in the Core NLP directory and invoke it from the command line. It will produce an output file called input.txt.xml in the same directory.
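The XML output can be post-processed with standard tools. The following Python sketch extracts token/part-of-speech pairs from such a file; the embedded sample mirrors the typical layout of CoreNLP XML output (a root/document/sentences/sentence/tokens/token hierarchy with word and POS child elements), but details can vary between versions, so inspect your own input.txt.xml first.

```python
# Sketch: extract (word, POS) pairs from CoreNLP-style XML output.
# The sample below imitates the typical structure of input.txt.xml;
# verify the element names against your own output before relying on them.
import xml.etree.ElementTree as ET

sample = """<root>
  <document>
    <sentences>
      <sentence id="1">
        <tokens>
          <token id="1"><word>This</word><POS>DT</POS></token>
          <token id="2"><word>works</word><POS>VBZ</POS></token>
          <token id="3"><word>.</word><POS>.</POS></token>
        </tokens>
      </sentence>
    </sentences>
  </document>
</root>"""

def token_pos_pairs(xml_string):
    """Return a list of (word, pos) tuples for every token in the document."""
    root = ET.fromstring(xml_string)
    return [(token.findtext("word"), token.findtext("POS"))
            for token in root.iter("token")]

print(token_pos_pairs(sample))
# [('This', 'DT'), ('works', 'VBZ'), ('.', '.')]
```

For a real output file, replace ET.fromstring with ET.parse("input.txt.xml") and call .getroot().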
The Stanford Core NLP Tools include, among others, the following annotators, which are invoked via the switch -annotators or in the .properties file.
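As an alternative to listing annotators on the command line, a pipeline can be configured in a .properties file that is passed to the tools via the -props switch. The fragment below is a minimal sketch; the keys correspond to the command-line switches of the same name:

```properties
# Minimal pipeline configuration (sketch); keys mirror the
# command-line switches of the same name.
annotators = tokenize, ssplit, pos
outputFormat = xml
```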
Note that the availability of annotators varies between languages, depending on whether models for a language and an annotation process have been made available either by the Stanford NLP team or by members of the community.
annotator | function |
tokenize | The tokenizer subdivides a text into individual tokens, i.e. words, punctuation marks etc. |
ssplit | The sentence splitter segments a text into sentences. |
pos | The Stanford Part of Speech Tagger assigns word class labels to each token according to a model and annotation scheme. |
lemma | The lemmatizer provides the lemma or base form for each token. |
ner | The Stanford Named Entity Recognizer identifies tokens that are proper nouns as members of specific classes such as person name, organization name etc. |
parse | The Stanford Parser analyses and annotates the syntactic structure of each sentence in the text. It is actually not just one parser, but offers phrase structure parses as well as dependency parses. |
dcoref | The Stanford CorefAnnotator implements pronominal and nominal coreference resolution. |
Note that there are dependencies between the annotators, i.e. certain annotators require pre-processing by other annotators; e.g. tokenization and sentence splitting are obligatory pre-processing steps for part of speech tagging and parsing. You can look up the dependencies on the Stanford CoreNLP Annotator dependencies page.
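These dependencies can be made explicit in code. The Python sketch below encodes only the prerequisites mentioned in this tutorial (consult the Stanford CoreNLP Annotator dependencies page for the complete graph) and checks whether an -annotators list is missing any required pre-processing steps:

```python
# Sketch: check a CoreNLP -annotators list against annotator prerequisites.
# Only the dependencies discussed in the text are encoded here; the full
# dependency graph is on the CoreNLP Annotator dependencies page.
PREREQUISITES = {
    "tokenize": [],
    "ssplit": ["tokenize"],
    "pos": ["tokenize", "ssplit"],
    "lemma": ["tokenize", "ssplit", "pos"],
    "ner": ["tokenize", "ssplit", "pos", "lemma"],
    "parse": ["tokenize", "ssplit"],
}

def missing_prerequisites(annotators):
    """Return the prerequisites absent from a comma-separated annotator list."""
    requested = [a.strip() for a in annotators.split(",")]
    missing = []
    for annotator in requested:
        for prerequisite in PREREQUISITES.get(annotator, []):
            if prerequisite not in requested and prerequisite not in missing:
                missing.append(prerequisite)
    return missing

print(missing_prerequisites("tokenize,ssplit,pos"))  # []
print(missing_prerequisites("pos,parse"))            # ['tokenize', 'ssplit']
```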
:: calls the Core NLP Tools with files from the Stanford Core NLP folder and outputs xml
java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,parse,dcoref -file input.txt -outputFormat xml
The example pipelines illustrated so far all annotate English data. However, there are models and configurations available for languages other than English [en], i.e. Arabic [ar], Chinese [zh], French [fr], German [de] and Spanish [es]. In this section, some examples are shown for processing different languages.
Annotator | ar | zh | en | fr | de | es | annotator |
Tokenize / Segment | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | tokenize |
Sentence Split | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ssplit |
Part of Speech | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | pos |
Lemma | | | ✔ | | | | lemma |
Named Entities | | ✔ | ✔ | ✔ | ✔ | ✔ | ner |
Mention Detection | | ✔ | ✔ | | | | entitymentions |
Constituency Parsing | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | parse |
Dependency Parsing | | ✔ | ✔ | ✔ | ✔ | | depparse |
Sentiment Analysis | | | ✔ | | | | sentiment |
Coreference | | ✔ | ✔ | | | | coref |
Open IE | | | ✔ | | | | openie |
Quote Extraction | | | ✔ | | | | quote |
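The support matrix can also be queried programmatically. The Python sketch below encodes the annotator/language combinations from the table above (checkmark placement follows the official CoreNLP documentation; treat it as illustrative and check the documentation for your version):

```python
# Sketch: annotator/language support matrix, following the table above
# (illustrative; check the current CoreNLP documentation for your version).
SUPPORTED_LANGUAGES = {
    "tokenize":  {"ar", "zh", "en", "fr", "de", "es"},
    "ssplit":    {"ar", "zh", "en", "fr", "de", "es"},
    "pos":       {"ar", "zh", "en", "fr", "de", "es"},
    "lemma":     {"en"},
    "ner":       {"zh", "en", "fr", "de", "es"},
    "parse":     {"ar", "zh", "en", "fr", "de", "es"},
    "depparse":  {"zh", "en", "fr", "de"},
    "sentiment": {"en"},
    "coref":     {"zh", "en"},
    "openie":    {"en"},
}

def supported_annotators(language):
    """Return the annotators available for an ISO language code, sorted."""
    return sorted(a for a, langs in SUPPORTED_LANGUAGES.items()
                  if language in langs)

print(supported_annotators("de"))
# ['depparse', 'ner', 'parse', 'pos', 'ssplit', 'tokenize']
```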
In order to process text written in the Chinese language, you can use the following command (also from a batch file):
java -mx3g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP -props StanfordCoreNLP-chinese.properties -file chinese.txt -outputFormat xml
You will need the properties file StanfordCoreNLP-chinese.properties, which ships inside the Chinese models jar; make sure that jar is in the same directory as the CoreNLP tools so that it is found on the classpath.