linguisticsweb:tutorials:linguistics_tutorials:automaticannotation:stanford_core_nlp — sabinebartsch, last revised 2019/02/23
  
''Tutorial status: tested with Core NLP 3.9.2 and below''
  
====== 1 What are the Stanford Core NLP Tools? ======
It will be shown in an example further down the page where to insert this call for the Java JAXB module.
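On Java 9 and 10, the JAXB module can be enabled by adding the standard JVM flag ''--add-modules java.xml.bind'' to the ''java'' call. A sketch of where the flag goes, using the basic annotation command from this tutorial (the input file name is a placeholder; note that Java 11 and later removed this module entirely, so on those versions the JAXB jars have to be placed on the classpath instead):

```shell
# On Java 9/10: make the java.xml.bind (JAXB) module visible to CoreNLP.
# The flag goes before the main class name, next to the other JVM options.
java --add-modules java.xml.bind -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,parse,dcoref -file input.txt -outputFormat xml
```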
  
Please be aware that the Core NLP tools require more memory than the stand-alone tools, so be sure your machine has sufficient RAM available. It is advisable to be able to allocate at least 3-4 GB of RAM for the Core NLP tools, more if you are planning to use all of the annotators simultaneously; note that depending on the size of the models and your data set, even more RAM may be required (6-8 GB of free RAM can come in handy).
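The amount of memory the tools may use is controlled by the standard JVM heap flag ''-Xmx''. A sketch of the basic command from further down the page with the heap raised from 2 GB to 4 GB (the input file name is a placeholder):

```shell
# Allow the JVM up to 4 GB of heap by changing -Xmx2g to -Xmx4g.
# Assumes the CoreNLP jars are in the current directory ("*" on the classpath).
java -cp "*" -Xmx4g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,parse,dcoref -file input.txt -outputFormat xml
```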
  
  
  
''java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,parse,dcoref -file input.txt -outputFormat xml''
====== Processing pipeline for Chinese text ======

In order to process text written in the Chinese language, you can use the following command (which can also be run from a batch file):

''java -mx3g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP -props StanfordCoreNLP-chinese.properties -file chinese.txt -outputFormat xml''

You will need to have the properties file ''StanfordCoreNLP-chinese.properties'' in the same directory as the CoreNLP tools. You can download the properties file here: {{ :linguisticsweb:tutorials:linguistics_tutorials:automaticannotation:stanfordcorenlp-chinese.zip |Chinese properties file}}
  
====== Annotating multiple files in a directory ======
''java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,parse,dcoref -filelist "D:\CORPUS-DIRECTORY\filelist.lst" -outputFormat xml -outputDirectory "D:\CORPUS-DIRECTORY\core-nlp-output\"''
  
The last parameter used here, ''-outputDirectory'', tells the CoreNLP tools where to write the annotated output files. It needs to be followed by the path to an existing output directory. This directory can be located parallel to the directory containing the input files so that you can use it as a source for any further processing. Keeping annotated files separate from the original corpus files is a very good idea in order to maintain some order in your file system.
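The ''-filelist'' parameter above expects a plain text file containing one input file path per line. A minimal sketch of generating such a list on a Unix-like system, using throwaway placeholder file names (on Windows, ''dir /b /s *.txt > filelist.lst'' run in the corpus directory achieves the same):

```shell
# Create a small sample corpus directory (placeholder names, for illustration only).
mkdir -p CORPUS-DIRECTORY
printf 'First sample text.\n'  > CORPUS-DIRECTORY/text1.txt
printf 'Second sample text.\n' > CORPUS-DIRECTORY/text2.txt

# Write the full path of every .txt file, one per line, into filelist.lst.
find "$(pwd)/CORPUS-DIRECTORY" -name '*.txt' | sort > filelist.lst

cat filelist.lst
```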