====== Stanford Core NLP ======
  
==== author: Sabine Bartsch, Technische Universität Darmstadt ====
  
''Tutorial status: tested with Core NLP 3.9.2 and below''
  
===== Related tutorials =====

  * [[linguisticsweb:tutorials:linguistics_tutorials:automaticannotation:stanford_core_nlp_multiple_files|Stanford CoreNLP Tools: Processing multiple files in a directory]]
  * [[linguisticsweb:tutorials:linguistics_tutorials:automaticannotation:stanford_core_nlp_server|Running the Stanford CoreNLP Tools as a local server]]
====== 1 What are the Stanford Core NLP Tools? ======
  
===== Requirements =====
  
The Stanford Core NLP Tools require a working Java installation. As many software programs in corpus and computational linguistics require Java, and Java is widely used for development in the field anyway, it is advisable to install a full Java JDK (Java Development Kit), which can be downloaded as the [[linguisticsweb:tutorials:linguistics_tutorials:basics:environment:java|Open JDK]] from multiple sources. The JDK also includes the widely used JRE (Java Runtime Environment), which is a prerequisite for running many different software programs.
  
Note that the Stanford CoreNLP Tools standardly require Java 8. However, later versions of Java do work fine if you add the following to your command:
It will be shown in an example further down the page where to insert this call for the Java JAXB module.
  
Please be aware that the Core NLP tools require more memory than the stand-alone tools, so make sure your machine has sufficient RAM available. You should be able to allocate at least 3 to 4 GB of RAM to the Core NLP tools, more if you are planning to use all of the annotators simultaneously; note that depending on the size of the models and your data set, even more RAM may be required (6 to 8 GB of free RAM can come in handy).
  
  
''java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,parse,dcoref -file input.txt -outputFormat xml''
  
====== Annotating multiple files in a directory ======

While annotating a single file is sometimes all you want to do, the typical corpus linguistic annotation task is likely to require the annotation of multiple files. So far, we have explored different sets of annotators for annotating a single file, using the switch ''-file'' followed by the name of the file to be annotated. The next task is setting up a batch file to run the same annotators on more than one file.

In order to run the Stanford CoreNLP tools on multiple files, we need a list of the files to be annotated. So let us create a list of the files located in a particular directory. Each line of this list file should contain one filename with its full directory path:

''c:\Users\Public\CORPUS-DIRECTORY\corpusfile01.txt''

''c:\Users\Public\CORPUS-DIRECTORY\corpusfile02.txt''

''c:\Users\Public\CORPUS-DIRECTORY\corpusfile03.txt''

''c:\Users\Public\CORPUS-DIRECTORY\corpusfileNN.txt''

''...''
===== Using DIR or LS to list files =====

Let us first of all remind ourselves of what we already know:

In order to list all text files in a directory, we use the ''dir'' command in the Windows terminal (cmd.exe) or ''ls'' in the shell of Unix-like operating systems (Linux, Mac OS) or the Windows PowerShell. However, the output produced by a plain ''dir'' or ''ls'' command is not quite what we need, namely a list of files with their full paths and nothing else.

Windows command output generated by running the command ''dir'':

{{:linguisticsweb:tutorials:linguistics_tutorials:automaticannotation:cmd-dir.png?600|}}

This is not quite what we need as an input file for the CoreNLP tools. So let us find out how we can get the output we need. With ''dir'' and ''ls'', we are on the right track, but we need to modify those commands in order to get the correct output format. Fortunately, both commands take parameters that specify their output format.
===== Commands according to shell types and operating systems =====

==== Microsoft Windows ====

Windows command prompt ("Eingabeaufforderung") (cmd):
''dir /B /S *.txt > filelist.lst''

writes the desired output, all text files with their full paths, one entry per line, to the file ''filelist.lst'', which can serve as the input file for the CoreNLP tools.

''dir'' lists the contents of a directory

''/B'' uses the bare output format, i.e. prints only the file names without headers or summary information

''/S'' includes files in the specified directory and all of its subdirectories

''*.txt'' lists only files with the extension .txt

''>'' redirects the output to a file instead of standard output (the terminal)

''filelist.lst'' is the target file to which the output of the command is written

==== Unix-style OSes: Linux and Mac OS ====

On Unix-style operating systems such as Linux or Mac OS:
''ls -d -1 $PWD/*.txt > filelist.lst''

writes the full paths of all text files in the current directory to ''filelist.lst'', one entry per line.

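The ''ls'' glob above only picks up files directly in the current directory. If your corpus files are spread over subdirectories, the standard ''find'' command is an alternative on Unix-like systems (a minimal sketch; the directory name ''CORPUS-DIRECTORY'' is just an example):

```shell
# Recursively collect all .txt files below the corpus directory and
# write their absolute paths to filelist.lst, one path per line.
find "$PWD/CORPUS-DIRECTORY" -type f -name '*.txt' > filelist.lst
```

Because ''$PWD'' is expanded by the shell before ''find'' runs, the resulting paths are absolute, which is the format the ''-filelist'' parameter expects.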
==== Windows PowerShell ====

Windows PowerShell:
''(Get-ChildItem C:\CORPUS-DIRECTORY\ -Recurse).fullname > filelist.lst''

or shorter:
''(gci -r C:\CORPUS-DIRECTORY\).fullname > filelist.lst''

===== Using the filelist in CoreNLP =====

To annotate the files listed in ''filelist.lst'', we have to tell CoreNLP that this is the list of annotation target files. We do this by means of the parameter ''-filelist'' followed by the name of ''filelist.lst'' (with its full path if it is not located in the CoreNLP directory). Note that the parameter ''-filelist'' is used here instead of ''-file'', which we used to address just a single file:

**:: Stanford Core NLP batch file to be called from the command line**

**:: calls the Stanford Core NLP Tools with input files from a specified directory (-filelist) and writes output to a directory of the user's choice (-outputDirectory)**

''java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,parse,dcoref -filelist "D:\CORPUS-DIRECTORY\filelist.lst" -outputFormat xml -outputDirectory "D:\CORPUS-DIRECTORY\core-nlp-output\"''

The last parameter used here, ''-outputDirectory'', allows you to tell the CoreNLP tools where to write the annotated output files. It needs to be followed by the path of an existing output directory. This can be located parallel to the directory with the input files so that you can use it as a source for any further processing. Keeping annotated files separate from the original corpus files is a very good idea in order to maintain some order in your file system.

====== Processing pipelines for different languages ======

The example pipelines illustrated so far all annotate English data. However, there are models and configurations available for languages other than English, e.g. Arabic, Chinese, German and Spanish. In this section, some examples are shown for processing different languages.

===== Processing pipeline for Chinese text =====

In order to process text written in the Chinese language, you can use the following command (also from a batch file):

''java -mx3g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP -props StanfordCoreNLP-chinese.properties -file chinese.txt -outputFormat xml''

You will need to have the properties file ''StanfordCoreNLP-chinese.properties'' in the same directory as the CoreNLP tools. You can download the properties file here: {{ :linguisticsweb:tutorials:linguistics_tutorials:automaticannotation:stanfordcorenlp-chinese.zip |Chinese properties file}}