The Natural Language Toolkit (NLTK)

linguisticsweb.org tutorial status: in progress

The Natural Language Toolkit (NLTK) is a set of modules enhancing the natural language processing capabilities of the programming language Python. It is extremely useful for at least two reasons: it encourages linguists to learn programming and it is easy to learn.

Why learn programming?

Now, you might ask why linguists should learn to write their own programs. Well, if you have ever felt limited by out-of-the-box software, been held back by a tool that requires your data in a format other than the one you have, or tried to build an entire processing chain out of several separate steps, you will quickly appreciate how handy it is to be able to do your own programming. Being able to program also gives you a better understanding of the nature of your data and a deeper understanding of the algorithms and processing steps typically used in natural language processing. And believe me when I say that it is also loads of fun.

Why Python? Why NLTK?

Well, linguists used to be very fond of Perl, and with good reason. Perl has little overhead and offers instant gratification because, like Python, mastering your first steps is relatively easy. There is also good documentation and a vast user community out there to help you with advice and sample code. It is also free (as in free beer), and writing Perl programs requires very little infrastructure: all you have to do is install one of the freely available Perl distributions and a good text editor, get an introductory Perl tutorial, and you are good to go. (The technical requirements are discussed under System requirements.) Plus, Perl was developed by Larry Wall, who, among other things, is a linguist. He developed Perl as a UNIX scripting language and endowed it with very powerful text processing facilities, one of its major strengths being its regular expression (RegEx) handling.

So, if Perl is so great, you may ask, why should I not learn Perl instead of Python? Well, actually, you could, and there is nothing I can or will say to keep you from doing so. After all, oodles of linguists have done so and benefited greatly. At the end of the day, the most important thing is that you look into programming at all and that, if you find it useful, you have fun learning to program. However, there are some good reasons that might convince you to learn Python. Arguments you will always hear from people in favour of Python are that it is newer and more modern, that its code is very readable for humans, and that it is in many ways just as powerful as Perl or even more so. From the perspective of the linguist, two very powerful arguments for Python are its regular expression powers and the existence of the Natural Language Toolkit (NLTK), initiated by Steven Bird, Ed Loper et al. (see below). But first things first.

Python is an object-oriented programming language that has extensive scripting functionality but can also be used in other contexts. Its design makes it well suited to the learner, because the code is and remains very legible. One central feature is that Python uses whitespace indentation instead of curly brackets to mark blocks of code, which also helps keep the code readable and organized. Python drew on a number of earlier languages and is the somewhat later development (it first appeared in 1991, Perl in 1987), which may add to its appeal. Its regular expression syntax closely follows Perl's, so it is just as well suited to text processing scripts.
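To make the points about indentation and regular expressions concrete, here is a minimal sketch that uses only Python's standard library. The function name count_ing_forms and the sample sentence are made up for illustration, and the pattern is a crude approximation: it matches any word ending in "ing", not only verb forms.

```python
import re


def count_ing_forms(text):
    """Count lower-cased word forms that end in 'ing'."""
    # \b is a word boundary, \w+ is one or more word characters.
    pattern = re.compile(r"\b\w+ing\b")
    counts = {}
    for form in pattern.findall(text.lower()):
        # Blocks are marked by indentation alone, not by curly brackets.
        if form in counts:
            counts[form] += 1
        else:
            counts[form] = 1
    return counts


sample = "Reading and writing: reading code is as important as writing it."
print(count_ing_forms(sample))
```

Note how the body of the function, the loop, and the two branches of the if statement are each set off by one further level of indentation.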

Python has NLTK

A very important feature that contributes to the popularity of Python among linguists is that NLTK was developed for Python. NLTK adds custom functionality to Python, implements many standard tools that linguists like to use, such as tokenizers, part-of-speech taggers, parsers and corpus access, and is extremely well documented by the NLTK team. The team have written an O'Reilly book about Python and NLTK:

Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python. Analyzing Text with the Natural Language Toolkit. O'Reilly.

The book is also available online, and many good university libraries hold copies.

NLTK offers many of the tools that linguists and other natural language processing folk find useful. Apart from harnessing Python's regular expression prowess, it comes with ready-made functions for common tasks such as tokenization and part-of-speech (POS) tagging.
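As a small taste of what this looks like in practice, the following sketch tokenizes a made-up sentence, counts token frequencies, and assigns POS tags. It assumes NLTK is installed and deliberately sticks to components that need no additional data downloads: WordPunctTokenizer, FreqDist and RegexpTagger. The suffix rules in patterns are illustrative assumptions rather than a serious tagger; NLTK's trained tokenizers and taggers do much better, but they require model data to be fetched first via nltk.download().

```python
from nltk import FreqDist, RegexpTagger
from nltk.tokenize import WordPunctTokenizer

text = "The linguist tagged the corpus. The tagger tagged 42 tokens."

# Tokenization: split into runs of word characters and runs of punctuation.
tokens = WordPunctTokenizer().tokenize(text)

# Frequency distribution over lower-cased tokens.
freq = FreqDist(token.lower() for token in tokens)

# POS tagging with hand-written rules (illustrative assumptions).
# Patterns are tried in order; the last one is a catch-all.
patterns = [
    (r"^[0-9]+$", "CD"),     # cardinal numbers
    (r".*ed$", "VBD"),       # simple past forms
    (r".*s$", "NNS"),        # plural nouns (crude: also catches "corpus")
    (r"^[.,;:!?]$", "."),    # punctuation
    (r".*", "NN"),           # default: singular noun
]
tagger = RegexpTagger(patterns)
tagged = tagger.tag(tokens)

print(tokens)
print(freq.most_common(3))
print(tagged)
```

The rule-based tagger labels "corpus" as a plural noun because it ends in "s", which shows why trained taggers are usually preferred for real work.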