This shows you the differences between two versions of the page.
linguisticsweb:tutorials:linguistics_tutorials:automaticannotation:stanford_core_nlp [2019/01/26 22:42] sabinebartsch [Stanford Core NLP] |
linguisticsweb:tutorials:linguistics_tutorials:automaticannotation:stanford_core_nlp [2019/02/23 21:18] (current) sabinebartsch |
||
---|---|---|---|
Line 4: | Line 4: | ||
''Tutorial status: tested with Core NLP 3.9.2 and below'' | ''Tutorial status: tested with Core NLP 3.9.2 and below'' | ||
- | |||
- | |||
- | |||
====== 1 What are the Stanford Core NLP Tools? ====== | ====== 1 What are the Stanford Core NLP Tools? ====== | ||
Line 26: | Line 23: | ||
It will be shown in an example further down the page where to insert this call for the Java JAXB module. | It will be shown in an example further down the page where to insert this call for the Java JAXB module. | ||
- | Please be aware that the Core NLP tools require more memory than the stand-alone tools, so be sure your machine has sufficient RAM available. It would be good to have 4 GB free for the Core NLP tools to use all of the annotators simultaneously; note that even more might be required for processing larger data sets. | + | Please be aware that the Core NLP tools require more memory than the stand-alone tools, so be sure your machine has sufficient RAM available. It would be good to be able to allocate at least 3 - 4 GB of RAM for the Core NLP tools, more if you are planning to use all of the annotators simultaneously; note that depending on the size of the models and your data set even more RAM may be required (6 - 8 GB free RAM can come in handy). |
Line 68: | Line 65: | ||
''java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,parse,dcoref -file input.txt -outputFormat xml'' | ''java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,parse,dcoref -file input.txt -outputFormat xml'' | ||
+ | |||
+ | |||
+ | ====== Processing pipeline for Chinese text ====== | ||
+ | |||
+ | In order to process text written in the Chinese language, you can use the following command (also from a batch file): | ||
+ | |||
+ | ''java -mx3g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP -props StanfordCoreNLP-chinese.properties -file chinese.txt -outputFormat xml'' | ||
+ | |||
+ | You will need to have the properties file ''StanfordCoreNLP-chinese.properties'' in the same directory as the CoreNLP tools. You can download the properties file from here: {{ :linguisticsweb:tutorials:linguistics_tutorials:automaticannotation:stanfordcorenlp-chinese.zip |Chinese properties file}} | ||
+ | |||
====== Annotating multiple files in a directory ====== | ====== Annotating multiple files in a directory ====== | ||
Line 141: | Line 148: | ||
''java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,parse,dcoref -filelist "D:\CORPUS-DIRECTORY\filelist.lst" -outputFormat xml -outputDirectory "D:\CORPUS-DIRECTORY\core-nlp-output\"'' | ''java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,parse,dcoref -filelist "D:\CORPUS-DIRECTORY\filelist.lst" -outputFormat xml -outputDirectory "D:\CORPUS-DIRECTORY\core-nlp-output\"'' | ||
- | The last parameter used here is called ''-outputDirectory'' and allows you to tell the CoreNLP tools where to write the annotated output files. It needs to be followed by an existing output directory. This can be located in a directory parallel to the directory with the input files so that you can use this directory as a source for any further processing. Keeping annotated files separate from the original corpus files is a very good idea in order to keep some order in your file system. | + | The last parameter used here is called ''-outputDirectory'' and allows you to tell the CoreNLP tools where to write the annotated output files. It needs to be followed by a path to an existing output directory. This can be located in a directory parallel to the directory with the input files so that you can use this directory as a source for any further processing. Keeping annotated files separate from the original corpus files is a very good idea in order to keep some order in your file system. |
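
The ''-filelist'' and ''-outputDirectory'' parameters described above both depend on things existing on disk before the tools are started: a list of input files and an output directory. The following is a minimal Python sketch of how these two preparations might be automated. It is not part of the CoreNLP distribution; the function names ''write_filelist'' and ''build_command'', the ''*.txt'' pattern and the default file name ''filelist.lst'' are assumptions made for illustration, and the sketch assumes that the file list contains one input path per line.

```python
from pathlib import Path


def write_filelist(corpus_dir, list_name="filelist.lst", pattern="*.txt"):
    """Write one absolute path per line for every matching file in corpus_dir.

    Hypothetical helper: the name and defaults are illustrative only.
    """
    corpus = Path(corpus_dir)
    files = sorted(p.resolve() for p in corpus.glob(pattern) if p.is_file())
    list_path = corpus / list_name
    # One path per line; an empty corpus yields an empty list file.
    list_path.write_text("".join(f"{p}\n" for p in files), encoding="utf-8")
    return list_path


def build_command(list_path, output_dir, memory="2g"):
    """Create the output directory if needed and return the java argument list.

    Hypothetical helper: it only assembles the command shown in the tutorial,
    it does not start Java.
    """
    out = Path(output_dir)
    # -outputDirectory expects an existing directory, so create it up front.
    out.mkdir(parents=True, exist_ok=True)
    return [
        "java", "-cp", "*", f"-Xmx{memory}",
        "edu.stanford.nlp.pipeline.StanfordCoreNLP",
        "-annotators", "tokenize,ssplit,pos,lemma,ner,parse,dcoref",
        "-filelist", str(list_path),
        "-outputFormat", "xml",
        "-outputDirectory", str(out),
    ]
```

The returned list could then be handed to ''subprocess.run(cmd, cwd=...)'' with the working directory set to the folder that contains the CoreNLP jar files, since the class path ''"*"'' is resolved relative to that folder. If the machine has enough RAM, a larger value such as ''memory="4g"'' may be passed instead of the default.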