Differences

This shows you the differences between two versions of the page.

linguisticsweb:tutorials:linguistics_tutorials:automaticannotation:stanford_pos_tagger [2019/01/26 22:51]
sabinebartsch [author: Sabine Bartsch, TU Darmstadt]
linguisticsweb:tutorials:linguistics_tutorials:automaticannotation:stanford_pos_tagger [2019/05/15 11:19]
sabinebartsch [2 Installation and requirements]
Line 1: Line 1:
 ====== The Stanford POS Tagger ====== ====== The Stanford POS Tagger ======
  
-==== author: Sabine Bartsch, TU Darmstadt ====+==== author: Sabine Bartsch, Technische Universität Darmstadt ====
 
-Tutorial builds on input from the Stanford NLP website.+Tutorial builds on software and input from the [[https://nlp.stanford.edu/software/tagger.html|Stanford PoS Tagger website]].
  
 Related tutorial: [[linguisticsweb:​tutorials:​linguistics_tutorials:​automaticannotation:​stanford_pos_tagger_python|Stanford PoS Tagger: tagging from Python]] Related tutorial: [[linguisticsweb:​tutorials:​linguistics_tutorials:​automaticannotation:​stanford_pos_tagger_python|Stanford PoS Tagger: tagging from Python]]
Line 17: Line 17:
 =====  2 Installation and requirements ====== =====  2 Installation and requirements ======
  
-Requirements: The Stanford PoS Tagger requires Java. As many programmes in corpus and computational linguistics require Java and as Java is used widely in this field, it is advisable to install the full Java JDK (Java Development Kit) which includes also the JRE (Java Runtime Environment). Please consult the following page to download software that is a system prerequisite for many corpus and computational linguistic applications: [[https://www.oracle.com/technetwork/java/javase/downloads/index.html|Oracle Java]].+Requirements: The Stanford PoS Tagger requires Java. As many programmes in corpus and computational linguistics require Java and as Java is used widely in this field, it is advisable to install the full Java JDK (Java Development Kit) which includes also the JRE (Java Runtime Environment). Please consult the following page to download software that is a system prerequisite for many corpus and computational linguistic applications: [[linguisticsweb:tutorials:linguistics_tutorials:basics:environment:java|Open JDK]].
  
 The Stanford PoS Tagger does not require much of an installation. The following steps get you started in no time at all. The Stanford PoS Tagger does not require much of an installation. The following steps get you started in no time at all.
   - Download the latest version from the following website: http://nlp.stanford.edu/software/tagger.shtml   - Download the latest version from the following website: http://nlp.stanford.edu/software/tagger.shtml
-  - There are two download versions available, the basic **English Stanford Tagger version 3.9.2** and the full version of the **Stanford Tagger version 3.9.2** including additional models for English as well as models for Arabic, Chinese, French, Spanish, and German+  - There are two download versions available, the basic **English Stanford Tagger version 3.9.x** and the full version of the **Stanford Tagger version 3.9.x** including additional models for English as well as models for Arabic, Chinese, French, Spanish, and German
   - Unzip the .zip archive to a directory of your choice. Please make sure that the directory name contains no white space and that the path is not too long as this can cause problems keeping track of files and making backup copies.   - Unzip the .zip archive to a directory of your choice. Please make sure that the directory name contains no white space and that the path is not too long as this can cause problems keeping track of files and making backup copies.
   - File locations: It is advisable to decide on a location for your linguistics tools. In my case, I have long decided to put any tools that are not automatically installed under the default OS location to ''​c:​\Users\Public\utility\...''​ with subfolders for each tool (and version) at this location. I am finding this convenient as it keeps the directory path relatively short.   - File locations: It is advisable to decide on a location for your linguistics tools. In my case, I have long decided to put any tools that are not automatically installed under the default OS location to ''​c:​\Users\Public\utility\...''​ with subfolders for each tool (and version) at this location. I am finding this convenient as it keeps the directory path relatively short.
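
The advice on directory names above can be checked with a few lines of Python. This is only a minimal sketch: the function names, the example path ''/opt/tools/postagger'' and the length limit of 100 characters are assumptions made for illustration, not values required by the tagger.

```python
import shutil

# Assumed threshold for "not too long"; the tutorial gives no exact number.
MAX_PATH_LENGTH = 100


def check_install_dir(path_str, max_length=MAX_PATH_LENGTH):
    """Return a list of problems found in a candidate installation path."""
    problems = []
    # White space in directory names is what the tutorial warns against.
    if any(ch.isspace() for ch in path_str):
        problems.append("path contains white space")
    if len(path_str) > max_length:
        problems.append("path is longer than %d characters" % max_length)
    return problems


def java_available():
    """Return True if a 'java' executable can be found on the search path."""
    return shutil.which("java") is not None


if __name__ == "__main__":
    # Hypothetical location, used only as an example.
    print(check_install_dir("/opt/tools/postagger"))
    print("Java found:", java_available())
```

An empty list means that neither problem was detected; ''java_available()'' only checks that a ''java'' executable is on the search path, not which version it is.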
Line 45: Line 45:
 | **-model**  | different taggers are available, but at least one has to be specified: e.g. edu.stanford.nlp.tagger.maxent.MaxentTagger | | **-model**  | different taggers are available, but at least one has to be specified: e.g. edu.stanford.nlp.tagger.maxent.MaxentTagger |
 | **-textFile** ​ | for plain text input files  | | **-textFile** ​ | for plain text input files  |
-| -xmlInput  | Example value: <body>; The value specified here determines the element of an xml file the contents of which is being tagged.  | +| **-xmlInput**  | Example value: <body>; The value specified here determines the element of an xml file the contents of which is being tagged.  | 
-| **-outputFormat** ​ | xml, tsv, slashTags, -tagSeparator \# |+| **-outputFormat** ​ | xml, tsv, slashTags, -tagSeparator \#|
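
The parameters in the table can be combined into one command line. The following Python sketch only assembles such a call and is not a definitive recipe. It assumes that ''edu.stanford.nlp.tagger.maxent.MaxentTagger'' is the Java class to run, that the jar file is called ''stanford-postagger.jar'' and that ''-model'' points to a model file such as ''models/english-left3words-distsim.tagger''; please check these names against your own download. The file name ''sample-input.txt'' and both function names are hypothetical.

```python
import subprocess

# Assumed main class and jar name; verify both against your download.
MAIN_CLASS = "edu.stanford.nlp.tagger.maxent.MaxentTagger"
OUTPUT_FORMATS = ("slashTags", "xml", "tsv")


def build_tagger_command(model, text_file, output_format="slashTags",
                         jar="stanford-postagger.jar"):
    """Assemble the argument list for one tagger run on a plain text file."""
    if output_format not in OUTPUT_FORMATS:
        raise ValueError("unknown output format: %s" % output_format)
    return [
        "java", "-cp", jar, MAIN_CLASS,
        "-model", model,                 # model file to load
        "-textFile", text_file,          # plain text input
        "-outputFormat", output_format,  # xml, tsv or slashTags
    ]


def run_tagger(command):
    """Run the assembled command and return what the tagger writes to stdout."""
    completed = subprocess.run(command, capture_output=True, text=True,
                               check=True)
    return completed.stdout


if __name__ == "__main__":
    cmd = build_tagger_command("models/english-left3words-distsim.tagger",
                               "sample-input.txt", "tsv")
    # Print the command instead of running it, as Java may not be installed.
    print(" ".join(cmd))
```

Passing the arguments as a list rather than as a single string avoids quoting problems on the command line; ''run_tagger(cmd)'' has to be called from the directory into which the tagger was unzipped so that the relative paths resolve.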
  
  
Line 123: Line 123:
 Please note that for different languages the tagger uses different tag-sets as there is no universal tag-set that fits all linguistic phenomena in all languages. Make sure you find out what tag-set is being used in a model for a specific language and what the tags mean.  Please note that for different languages the tagger uses different tag-sets as there is no universal tag-set that fits all linguistic phenomena in all languages. Make sure you find out what tag-set is being used in a model for a specific language and what the tags mean. 
  
-  * English: the Penn Treebank site. There is a simple listings on the [[http://www.comp.leeds.ac.uk/amalgam/tagsets/upenn.html|AMALGAM project page]] +  * English: [[https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html|Penn Tree Bank tag set]] 
-  * Chinese: [[http://www.cis.upenn.edu/~chinese/|the Penn Chinese Treebank]] +  * Chinese: [[https://verbs.colorado.edu/chinese/posguide.3rd.ch.pdf|Penn Chinese Treebank]] 
-  * German: [[http://www.ims.uni-stuttgart.de/forschung/ressourcen/lexika/TagSets/stts-table.html|Stuttgart-Tübingen Tag Set (STTS)]]+  * German: [[http://www.sfs.uni-tuebingen.de/resources/stts-1999.pdf|Stuttgart-Tübingen Tag Set (STTS)]]
   * French: [[http://www.llf.cnrs.fr/Gens/Abeille/French-Treebank-fr.php|the French Treebank]]    * French: [[http://www.llf.cnrs.fr/Gens/Abeille/French-Treebank-fr.php|the French Treebank]] 