Differences

This shows you the differences between two versions of the page.

linguisticsweb:tutorials:linguistics_tutorials:automaticannotation:stanford_pos_tagger_python [2019/01/30 17:00]
sabinebartsch
linguisticsweb:tutorials:linguistics_tutorials:automaticannotation:stanford_pos_tagger_python [2019/03/07 18:32] (current)
sabinebartsch
Line 14: Line 14:
 NLTK provides a module for the Stanford PoS Tagger, and the simplest way of getting started with PoS tagging from Python is to let NLTK do the tagging without a separate local installation of the tagger. The disadvantage of this approach is that users have no choice between the models used for tagging. Be aware that ''nltk.pos_tag()'', as called in the script, runs the tagger bundled with NLTK; the imported ''StanfordTagger'' class is not actually used, and the Stanford models only come into play with a local installation of the tagger. The script below tags an example sentence:
  
-{{:linguisticsweb:tutorials:linguistics_tutorials:automaticannotation:stanford-pos-nltk-py-01.png?nolink&600|}}
 +<sxh python>
 +# running the Stanford POS Tagger from NLTK
  
-Note the ''for''-loop in lines 15-16 that converts the tagged output (a list of tuples) into the two-column format: ''word_tag''.
 +import nltk
 +from nltk import word_tokenize 
 +from nltk import StanfordTagger 
 + 
 +text_tok = nltk.word_tokenize("​Just a small snippet of text."​) 
 + 
 +# print(text_tok) 
 +pos_tagged = nltk.pos_tag(text_tok) 
 + 
 +# print the list of tuples: (word,​word_class) 
 +print(pos_tagged) 
 + 
 +# for loop to extract the elements of the tuples in the pos_tagged list 
 +# print the word and the pos_tag with the underscore as a delimiter 
 +for word,​word_class in pos_tagged:​ 
 +    print(word + "​_"​ + word_class) 
 +</​sxh>​ 
 + 
 +Note the ''​for''​-loop in lines 17-18 that converts the tagged output (a list of tuples) into the two-column format: ''​word_tag''​.
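As a minimal sketch of that conversion, independent of any tagger, the loop can be tried on a hand-written list of tuples. The tags in ''sample_tagged'' are illustrative values typed in by hand, not actual tagger output, and the helper name ''to_word_tag'' is hypothetical:

```python
# Hand-written sample in the same shape as the tagger output:
# a list of (word, word_class) tuples. The tags are illustrative only.
sample_tagged = [("Just", "RB"), ("a", "DT"), ("small", "JJ"), ("snippet", "NN")]


def to_word_tag(tagged, delimiter="_"):
    """Join each (word, word_class) tuple into a single 'word_tag' string."""
    return [word + delimiter + word_class for word, word_class in tagged]


# print one word_tag item per line, as the for-loop in the script does
for item in to_word_tag(sample_tagged):
    print(item)
```

Returning a list instead of printing directly makes it easy to write the same items to a file later on.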
  
 This same script can be easily modified to tag a file located in the file system:
  
-{{:linguisticsweb:tutorials:linguistics_tutorials:automaticannotation:stanford-pos-nltk-py-02.png?nolink&600|}}
 +<sxh python>
 +# running the Stanford POS Tagger from NLTK 
 +import nltk 
 +from nltk import word_tokenize 
 +from nltk import StanfordTagger 
 + 
 +# point this path to a utf-8 encoded plain text file in your own file system 
 +f = "C:/​Users/​Public/​projects/​python101-2018/​data/​sample-text.txt"​ 
 + 
 +text_raw = open(f, encoding="utf-8").read()
 +text = nltk.word_tokenize(text_raw) 
 +pos_tagged = nltk.pos_tag(text) 
 + 
 +# print the list of tuples: (word,​word_class) 
 +# this is just a test, comment out if you do not want this output 
 +print(pos_tagged) 
 + 
 +# for loop to extract the elements of the tuples in the pos_tagged list 
 +# print the word and the pos_tag with the underscore as a delimiter 
 +for word,​word_class in pos_tagged:​ 
 +    print(word + "​_"​ + word_class) 
 +</​sxh>​
  
 Note that you need to adjust the path assigned to ''f'' (line 7 above) so that it points to a UTF-8 encoded plain text file that actually exists in your local file system.
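A small sketch of how the file could be read defensively, so that a wrong path produces a clear error message before any tagging starts. The helper name ''read_utf8_text'' is hypothetical, and the temporary file only stands in for your own text file:

```python
import os
import tempfile
from pathlib import Path


def read_utf8_text(path):
    """Return the contents of a UTF-8 plain text file, or raise a clear error."""
    file_path = Path(path)
    if not file_path.is_file():
        raise FileNotFoundError(f"No such text file: {file_path}")
    return file_path.read_text(encoding="utf-8")


# Demonstration with a temporary file so that the sketch runs anywhere;
# in practice, pass the path to your own UTF-8 encoded text file.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_file = os.path.join(tmp_dir, "sample-text.txt")
    Path(sample_file).write_text("Just a small snippet of text.", encoding="utf-8")
    print(read_utf8_text(sample_file))
```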
Line 30: Line 70:
 In the code itself, you have to point Python to the location of your Java installation:
  
-''
 +<sxh bash; gutter: false>
 java_path = "C:/Program Files/Java/jdk1.8.0_192/bin/java.exe"
-''
-
-''
 os.environ["JAVAHOME"] = java_path
-''
 +</sxh>
  
 You also have to explicitly state the paths to the Stanford PoS Tagger .jar file and the Stanford PoS Tagger model to be used for tagging:
  
-''
 +<sxh bash; gutter: false>
 jar = "C:/Users/Public/utility/stanford-postagger-full-2018-10-16/stanford-postagger.jar"
-''
-
-''
 model = "C:/Users/Public/utility/stanford-postagger-full-2018-10-16/models/english-bidirectional-distsim.tagger"
-''
 +</sxh>
  
 Note that these paths vary according to your system configuration. Check your own file system for the exact locations of these files; on a Windows system, Java is likely to be installed somewhere in ''C:\Program Files\'' or ''C:\Program Files (x86)''.
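Since a mistyped path is a likely source of errors here, it may help to check the three locations before creating the tagger. The following is only a sketch: the helper name ''missing_paths'' is hypothetical, and the example paths are the ones used in this tutorial, which will probably not exist on your machine:

```python
import os


def missing_paths(named_paths):
    """Return the names of all entries whose path is not an existing file."""
    return [name for name, path in named_paths.items() if not os.path.isfile(path)]


# Example values copied from this tutorial; replace them with your own locations.
paths = {
    "java_path": "C:/Program Files/Java/jdk1.8.0_192/bin/java.exe",
    "jar": "C:/Users/Public/utility/stanford-postagger-full-2018-10-16/stanford-postagger.jar",
    "model": "C:/Users/Public/utility/stanford-postagger-full-2018-10-16/models/english-bidirectional-distsim.tagger",
}

problems = missing_paths(paths)
if problems:
    print("Please check these paths: " + ", ".join(problems))
else:
    print("All paths point to existing files.")
```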
Line 55: Line 89:
 The next example illustrates how you can run the Stanford PoS Tagger on a sample sentence:
  
-{{:linguisticsweb:tutorials:linguistics_tutorials:automaticannotation:stanford-pos-local-py-01.png?nolink&800|}}
 +<sxh python>
 +# Stanford POS tagger - Python workflow for using a locally installed version of the Stanford POS Tagger 
 +# Python version 3.7.1 | Stanford POS Tagger stand-alone version 2018-10-16 
 + 
 +import nltk 
 +import os  # needed for os.environ below
 +from nltk.tag.stanford import StanfordPOSTagger 
 +from nltk.tokenize import word_tokenize 
 + 
 +# enter the path to your local Java JDK, under Windows, the path should look very similar to this example 
 +java_path = "C:/Program Files/​Java/​jdk1.8.0_192/​bin/​java.exe"​ 
 +os.environ["​JAVAHOME"​] = java_path 
 + 
 +# enter the paths to the Stanford POS Tagger .jar file as well as to the model to be used 
 +jar = "C:/​Users/​Public/​utility/​stanford-postagger-full-2018-10-16/​stanford-postagger.jar"​ 
 +model = "C:/​Users/​Public/​utility/​stanford-postagger-full-2018-10-16/​models/​english-bidirectional-distsim.tagger"​ 
 + 
 +pos_tagger = StanfordPOSTagger(model,​ jar, encoding = "​utf-8"​) 
 + 
 +# Tagging this one example sentence as a test: 
 +# this small snippet of text lets you test whether the tagger is running before you attempt to run it on a locally 
 +# stored file (see line 28) 
 +text = "Just a small snippet of text to test the tagger."​ 
 + 
 +# Tagging a locally stored plain text file: 
 +# as soon as the example in line 22 is running ok, comment out that line (#) and comment in the next line and 
 +# enter a path to a local file of your choice; 
 +# the assumption made here is that the file is a plain text file with utf-8 encoding 
 +# text = open("C:/Users/Public/projects/python101-2018/data/sample-text.txt", encoding="utf-8").read()
 + 
 +# nltk word_tokenize() is used here to tokenize the text and assign it to a variable '​words'​ 
 +words = nltk.word_tokenize(text) 
 +# print(words) 
 +# the pos_tagger is called here with the parameter '​words'​ so that the value of the variable '​words'​ is assigned pos tags 
 +tagged_words = pos_tagger.tag(words) 
 +print(tagged_words) 
 +</​sxh>​
  
  
Line 62: Line 132:
 The code above can be run on a local file with very little modification. In this example, the sentence snippet in line 22 has been commented out and the line that opens a local file (line 28) has been commented in:
  
-{{:linguisticsweb:tutorials:linguistics_tutorials:automaticannotation:stanford-pos-local-py-02.png?nolink&800|}}
 +<sxh python>
 +# Stanford POS tagger - Python workflow for using a locally installed version of the Stanford POS Tagger 
 +# Python version 3.7.1 | Stanford POS Tagger stand-alone version 2018-10-16 
 + 
 +import nltk 
 +import os  # needed for os.environ below
 +from nltk.tag.stanford import StanfordPOSTagger 
 +from nltk.tokenize import word_tokenize 
 + 
 +# enter the path to your local Java JDK, under Windows, the path should look very similar to this example 
 +java_path = "C:/Program Files/​Java/​jdk1.8.0_192/​bin/​java.exe"​ 
 +os.environ["​JAVAHOME"​] = java_path 
 + 
 +# enter the paths to the Stanford POS Tagger .jar file as well as to the model to be used 
 +jar = "C:/​Users/​Public/​utility/​stanford-postagger-full-2018-10-16/​stanford-postagger.jar"​ 
 +model = "C:/​Users/​Public/​utility/​stanford-postagger-full-2018-10-16/​models/​english-bidirectional-distsim.tagger"​ 
 + 
 +pos_tagger = StanfordPOSTagger(model,​ jar, encoding = "​utf-8"​) 
 + 
 +# Tagging this one example sentence as a test: 
 +# this small snippet of text lets you test whether the tagger is running before you attempt to run it on a locally 
 +# stored file (see line 28) 
 +# text = "Just a small snippet of text to test the tagger."​ 
 + 
 +# Tagging a locally stored plain text file: 
 +# as soon as the example in line 22 is running ok, comment out that line (#) and comment in the next line and 
 +# enter a path to a local file of your choice; 
 +# the assumption made here is that the file is a plain text file with utf-8 encoding 
 +text = open("C:/Users/Public/projects/python101-2018/data/sample-text.txt", encoding="utf-8").read()
 + 
 +# nltk word_tokenize() is used here to tokenize the text and assign it to a variable '​words'​ 
 +words = nltk.word_tokenize(text) 
 +# print(words) 
 +# the pos_tagger is called here with the parameter '​words'​ so that the value of the variable '​words'​ is assigned pos tags 
 +tagged_words = pos_tagger.tag(words) 
 +print(tagged_words) 
 +</​sxh>​
  
  
Line 70: Line 176:
 As the output of the two subprocesses, tokenization and tagging, will be written to files, you have to create the output directories in your file system and again write down or copy their locations for further use. In this example these directories are called:
  
-''data_path''
 +<sxh bash; gutter:false>
 +data_path 
 +tokenized_data_path 
 +tagged_data_path 
 +</​sxh>​
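If you prefer to create the output directories from Python rather than by hand, a sketch along the following lines should work. The helper name ''ensure_directories'' is hypothetical, and a temporary directory stands in for your actual project folder:

```python
import os
import tempfile


def ensure_directories(*directories):
    """Create each directory (and any missing parents) unless it already exists."""
    for directory in directories:
        os.makedirs(directory, exist_ok=True)


# A temporary directory stands in for your project folder (assumption);
# replace 'base' with a real location in your own file system.
base = tempfile.mkdtemp()

# The empty last argument makes each path end in a separator, so that a
# file name can be appended to it by simple string concatenation.
data_path = os.path.join(base, "data", "")
tokenized_data_path = os.path.join(data_path, "tokenized_data", "")
tagged_data_path = os.path.join(data_path, "tagged_data", "")

ensure_directories(data_path, tokenized_data_path, tagged_data_path)
print(os.path.isdir(tokenized_data_path), os.path.isdir(tagged_data_path))
```

Because of ''exist_ok=True'', running the sketch a second time does not raise an error when the directories are already there.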
  
-''tokenized_data_path''
 +Once you have installed the Stanford PoS Tagger, collected and adjusted all of this information in the file below and created the respective directories, you are set to run the following Python program:
  
-''tagged_data_path''
 +<sxh python; options>
 +# Stanford POS tagger local installation to tag a directory of plain text files 
 +import nltk 
 +from nltk import * 
 +import os
  
-Once you have installed the Stanford PoS Tagger, collected and adjusted all of this information in the file below and created the respective directories, you are set to run the following Python program:
 +# environment variables for the Stanford PoS Tagger
 +java_path = "​C:/​Program Files/​Java/​jdk1.8.0_192/​bin/​java.exe"​ 
 +os.environ["​JAVAHOME"​] = java_path 
 + 
 +from nltk.tag.stanford import StanfordPOSTagger 
 +#from nltk.tokenize import word_tokenize 
 + 
 +model = "​C:/​Users/​Public/​utility/​stanford-postagger-full-2018-10-16/​models/​english-bidirectional-distsim.tagger"​ 
 +jar = "​C:/​Users/​Public/​utility/​stanford-postagger-full-2018-10-16/​stanford-postagger.jar"​ 
 + 
 +pos_tagger = StanfordPOSTagger(model, jar, encoding = "utf-8")
 + 
 +# data path of the input corpus files as well as separate output ​directories ​for tokenized and tagged data 
 +data_path = "​C:/​Users/​Public/​projects/​python101-2018/​data/​BPS/"​ 
 +tokenized_data_path = "​C:/​Users/​Public/​projects/​python101-2018/​data/​BPS/​tokenized_data/"​ 
 +tagged_data_path = "​C:/​Users/​Public/​projects/​python101-2018/​data/​BPS/​tagged_data/"​ 
 + 
 +# apply tokenization and pos-tagging ​to all the txt-files in the directory '​data'​ that follow a specific naming scheme 
 +for filename in os.listdir(data_path):
 +    if filename.startswith("WC") and filename.endswith(".txt"):
 +        # read the raw text of the input file
 +        with open(data_path + filename, encoding = "utf-8") as fr:
 +            raw_text = fr.read()
 +        tok_text = word_tokenize(raw_text)
 +        # write the list of tokens to the tokenized output directory
 +        with open(tokenized_data_path + "tok_" + filename, "w", encoding = "utf-8") as fw1:
 +            fw1.write(str(tok_text))
 +        # write the list of (word, tag) tuples to the tagged output directory
 +        with open(tagged_data_path + "tag_" + filename, "w", encoding = "utf-8") as fw:
 +            fw.write(str(pos_tagger.tag(tok_text)))
 +</​sxh>​
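The program above stores the tagged text as the string form of a Python list of tuples. As a hedged sketch of how such output could be read back for further processing, the standard library function ''ast.literal_eval'' can turn that string into a list again. The helper name ''parse_tagged_output'' and the sample tags are assumptions made for illustration:

```python
import ast


def parse_tagged_output(text):
    """Turn the string form of a list of (word, tag) tuples back into a list."""
    return ast.literal_eval(text)


# Stand-in for the contents of one tagged output file (illustrative tags only)
saved = str([("Just", "RB"), ("a", "DT"), ("snippet", "NN")])

# print each pair in the word_tag format used earlier in this tutorial
for word, word_class in parse_tagged_output(saved):
    print(word + "_" + word_class)
```

''ast.literal_eval'' only accepts Python literals, which makes it a safer choice than ''eval'' for reading data back from files.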
  
-{{:​linguisticsweb:​tutorials:​linguistics_tutorials:​automaticannotation:​stanford-pos-local-py-03.png?​nolink&​800|}}