Beginner's Guide to WordFreak & the OpenNLP Tools

*How-to by Michael Hanl* STATUS: UNDER CONSTRUCTION & REVISION

1 What are WordFreak and the OpenNLP Tools?

WordFreak is a Java-based linguistic annotation tool designed to support both manual and automatic annotation of linguistic data, and to employ active learning for the human correction of automatically annotated data. It provides basic functionality for checking automatic annotations manually and for creating annotations in the interchangeable XML format. If you want to work with a more advanced set of tools than basic POS tags, however, you need to extend WordFreak. One set of tools that is easy to integrate is the OpenNLP tools (WordFreak, 2004).

This tutorial covers the first steps of installing WordFreak and gives a quick-start guide to this linguistic annotation framework. Because it is advisable to use WordFreak in combination with other tools, it also provides step-by-step instructions for installing the OpenNLP tools.

2 Current Projects

Because both tools are open-source projects, their creators are interested in free-of-charge usage and in community development and improvement. Consequently, many smaller projects and private users use WordFreak and OpenNLP on a regular basis for their annotation work and adapt or improve them to suit their needs. One example of a project that builds on the OpenNLP tools is the IKS project, a semantics-based open-source project for small to medium CMS providers (IKS, 2010), in which OpenNLP serves as an extraction layer for named entities (IKS-Wiki, 2010).

Another example, which combines WordFreak with other annotation tools, is the corpus of biomedical discourse “Mining the Bibliome”, which is mainly based on manual annotations or human-checked annotation schemes (Bibliome, 2010). The specific register used in this project makes it difficult to rely on fully automatic annotation tools.

3 System requirements

  • Unix (Mac OS X, Linux) or Windows (XP and newer) operating system
  • Java Runtime Environment (tested with Version 6.2)

The following installation instructions use the Windows environment. If you use another operating system, please adapt the instructions accordingly. A script template for binding OpenNLP to WordFreak and further instructions for Linux can be found on the website provided by Wilcock.

4 Installation

4.1 Download Of The Tool Sets

First of all, you need the WordFreak and OpenNLP packages, which can be downloaded in their current versions from the corresponding websites: WordFreak and OpenNLP. Also download the plug-in Java archive files (jar-files) of the OpenNLP tools from the WordFreak website.

WordFreak itself can be executed by double-clicking on the jar-file. The OpenNLP tools can also be used without a framework environment, but so far only through the command shell, for which they have to be compiled first. For more information on how to use OpenNLP without WordFreak, see the Readme file. In order to use the OpenNLP tools, you need to download the zip-archive and unzip it into the corresponding directory, as described in the following step-by-step instructions.

4.2 Copy & Paste The Tools Into The Corresponding Directories

Create a new folder on your computer (preferably not on the Desktop1)) and name it WordFreak_OpenNLP. Extract the zip-file of the OpenNLP tools into a new folder inside it and rename that folder to OpenNLP2). Copy the jar-file of the WordFreak tool into another new folder inside WordFreak_OpenNLP and name that folder WordFreak. Create a new folder called plugins in the WordFreak directory and move all the plug-in files you downloaded from the WordFreak website into it.

By now your folder structure should look like this and contain the following files:

root folder          sub folder    content
WordFreak_OpenNLP    WordFreak     plugins (folder)
                                   WordFreak.jar
                     OpenNLP       extracted zip-archive

Your WordFreak_OpenNLP directory should now contain a folder named WordFreak, holding the WordFreak jar-file and the “plugins” folder, and a folder named OpenNLP, holding the files extracted from the zip-archive.
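If you would like to double-check the layout before continuing, the following is a minimal Python sketch. It assumes the example folder names used in this guide (WordFreak_OpenNLP, WordFreak, plugins, OpenNLP); the function names are hypothetical helpers and not part of WordFreak or OpenNLP.

```python
from pathlib import Path
import tempfile

# Sub-folders expected below the root folder, following the example names
# used in this guide. Adjust the list if you chose different names.
EXPECTED_FOLDERS = ["WordFreak", "WordFreak/plugins", "OpenNLP"]


def missing_entries(root):
    """Return the expected sub-folders that do not exist below root."""
    root = Path(root)
    return [rel for rel in EXPECTED_FOLDERS if not (root / rel).is_dir()]


def has_wordfreak_jar(root):
    """Check that the WordFreak folder contains at least one .jar file."""
    return any((Path(root) / "WordFreak").glob("*.jar"))


if __name__ == "__main__":
    # Demo on a throw-away directory so the sketch runs anywhere.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp) / "WordFreak_OpenNLP"
        (root / "WordFreak" / "plugins").mkdir(parents=True)
        print("missing folders:", missing_entries(root))
        print("jar present:", has_wordfreak_jar(root))
```

To check your real installation, call both functions with the path to your own WordFreak_OpenNLP folder instead of the temporary directory.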

4.3 Installing The Model Files For The OpenNLP tools

The model files for the corresponding languages can be downloaded from the OpenNLP website. To install them, create a folder named “models” in the OpenNLP folder and move the model files into it, preserving the folder structure found on the server. For example, the EnglishChunk.bin.gz file needs to be moved to the folder “…/models/english/chunker/”.

After installing the models, you need to apply one extra fix for the WordFreak OpenNLP plug-in to work:

  • copy “tag.bin.gz” from “…/models/english/parser” to “…/models/english/postag” and rename the copy to “EnglishPOS.bin.gz”
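The extra fix can also be scripted. The following is a small Python sketch, assuming that the English models were installed with the folder structure described above; the function name apply_pos_model_fix is a hypothetical helper, and the path you pass in must point to your own “models/english” folder.

```python
import shutil
from pathlib import Path


def apply_pos_model_fix(models_english):
    """Copy parser/tag.bin.gz to postag/EnglishPOS.bin.gz.

    models_english is the path to the "models/english" folder.
    Returns the path of the newly created copy.
    """
    models_english = Path(models_english)
    source = models_english / "parser" / "tag.bin.gz"
    target_dir = models_english / "postag"
    # Create the postag folder if it does not exist yet.
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / "EnglishPOS.bin.gz"
    # Copy instead of move, so the parser keeps its own model file.
    shutil.copyfile(source, target)
    return target
```

If the parser model has not been downloaded, shutil.copyfile raises a FileNotFoundError, which is a useful hint that the model folders are incomplete.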

4.4 The Executable Batch-file

In order to use the OpenNLP tools within the WordFreak environment, you need to create a batch-file. Batch-files are frequently used to execute applications under Windows. First, create an empty file in the main directory WordFreak_OpenNLP and name it WordFreak.bat. Open this empty file with an editor such as Notepad++, copy and paste the following content into the file3), and save it:

set CLASSPATH=%~dp0wordfreak\wordfreak-2.2.jar;%~dp0wordfreak\plugins\opennlp-wordfreak-1.1.jar;%~dp0wordfreak\plugins\opennlp-tools-1.4.3.jar;%~dp0wordfreak\plugins\maxent-2.5.2.jar;%~dp0wordfreak\plugins\trove.jar;
java wordfreak -d %~dp0opennlp\models\english

Be aware that the batch-script will not work if you change the names of the folders contained in “WordFreak_OpenNLP”. You should also check whether the names of the files you downloaded differ from those in the script, for example because of updated versions. In that case, either change the script to match your directories and files, or rename the folders and files to match the batch-file. The location of the “WordFreak_OpenNLP” folder itself, however, has no impact on the execution of the batch-script, as the script only looks for folders inside its own directory.
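Because the jar-file names change with every new version, it can be convenient to generate the batch-file from the jars that are actually present. The following Python sketch is one possible way to do this and rests on several assumptions: the lower-case folder names and the start command (java wordfreak -d) are copied from the batch-file in this guide, the function name build_batch_script is hypothetical, and the plug-in jars are listed alphabetically, which presumes that their order on the classpath does not matter.

```python
from pathlib import Path
import tempfile


def build_batch_script(root, tool_dir="wordfreak", main_class="wordfreak",
                       models=r"opennlp\models\english"):
    """Return the text of a WordFreak.bat for the jars found below root.

    tool_dir, main_class and models default to the values used in the
    batch-file of this guide; they are assumptions, not verified settings.
    """
    root = Path(root)
    tool = root / tool_dir
    # WordFreak's own jar first, then every plug-in jar in alphabetical order.
    jars = sorted(tool.glob("*.jar")) + sorted((tool / "plugins").glob("*.jar"))
    # %~dp0 expands to the folder of the batch-file, so paths stay relative.
    entries = ["%~dp0" + "\\".join(jar.relative_to(root).parts) for jar in jars]
    # The classpath is written as one single line without line breaks.
    classpath = ";".join(entries) + ";"
    return f"set CLASSPATH={classpath}\njava {main_class} -d %~dp0{models}\n"


if __name__ == "__main__":
    # Demo with empty placeholder jars in a throw-away directory.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "wordfreak" / "plugins").mkdir(parents=True)
        (root / "wordfreak" / "wordfreak-2.2.jar").touch()
        (root / "wordfreak" / "plugins" / "trove.jar").touch()
        print(build_batch_script(root))
```

The returned text can then be written to WordFreak.bat in the main directory. Compare the result with the batch-file above before relying on it.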

5 Working With WordFreak & The OpenNLP Tools

The current chapter exemplifies the usage of WordFreak and the OpenNLP tools to add and check automatically generated POS-Tags on raw text files.

5.1 Functionality of WordFreak and the OpenNLP Tools

The functionality of WordFreak is basically limited to the functionality of its components, in this case OpenNLP. For English, the OpenNLP tools support sentence detection, tokenization, POS-tagging, parsing, chunking and named-entity detection. For the other languages currently available, however, they only support sentence detection, tokenization and POS-tagging. The main advantage of WordFreak over other tool sets is that it allows you to add annotations manually and to check automatically generated annotations, for instance annotations based on your own tag sets.

With further complementary software you can extend the functionality of WordFreak to cover English coreference resolution and named-entity recognition. Although automatic annotation is supported, the main idea behind WordFreak was to create a tool for manipulating annotations or creating them manually, as done in the Bibliome project (cp. chapter 2). Another tool that can be integrated into the WordFreak framework is LingPipe, a named-entity recognition and coreference system for English, which also provides plug-ins for format support of the commercial corpora MUC-6 and MUC-7 and the non-commercial tool-kit ACE. Although LingPipe as a stand-alone tool supports a variety of annotation schemes, such as hyphenation and syllabification, clustering, Chinese word segmentation and database text mining, the annotations available within WordFreak are limited to automatic named-entity and coreference annotation and an annotation scheme to display and edit these annotations.

5.2 Starting WordFreak

To start the WordFreak framework together with the integrated OpenNLP tools, execute the “WordFreak.bat” file by double-clicking on it. The WordFreak main window opens:

Screenshot 1: The WordFreak main window

5.3 Text Processing

After starting, WordFreak creates an untitled project, which by default is not saved anywhere. Be sure to save your project every now and then to prevent data loss. The descriptions below refer to text files (e.g. “article.txt”). If you use other file types, the steps below may not be applicable.

Before you can process a text, you have to tell WordFreak which raw text you want to work with and load it.4)

5.3.1 Add & Load A Text File

To add and load a text file, click on the “Add” button and select the directory in which you have stored your text file. By default the archive type is set to “TreeBank Files”. To be able to select other file types, you have to choose the corresponding file type from the drop-down list. In this example you need to select “Text Files”. Select the file in the corresponding folder and click “open” to add the file to your untitled project. Answer the question in the pop-up window that appears with “yes” if you want to create an annotation file. The annotation file is created in the folder of the corresponding text file.

You can add other files by repeating the previous steps, or add several files at once by holding the “Control” key (“STRG” on a German keyboard, or the Apple key) while selecting the files you want to add. After adding all the files, the application has to load them together with their annotation files before it can process them. Select the text file you want to load in the WordFreak main window and click the “Load” button in the right panel. A green mark appears to show that the text and annotation files have been loaded:

Screenshot 2: Load a text into WordFreak

WordFreak cannot be used to build processing pipelines, which means that each step has to be run separately in order to add POS tags to a text. Thus, before a text can be annotated, it needs to be preprocessed with a sentence detector and a tokenizer.

5.3.2 Sentence Detection & Tokenization

At the bottom of the main window there are several drop-down menus which offer frequently used options for text processing (cp. screenshot 1). As a first step, we select a sentence detector to tell WordFreak which algorithm to use for sentence detection. You can select the sentence detector “Open Sentence” either from the Tagger drop-down menu at the bottom of the main window or from the Tagger menu in the main menu bar at the top of the main window. Now start the tagging process by either selecting “Tag” from the Tagger menu or pressing the icon below the main menu bar.

In the next step we select the annotation from the Annotation menu (“Set annotation” in the main menu bar) to open an editing panel in which the tagged text can be corrected manually if necessary. For sentence detection we need the sentence annotation.

This process needs to be repeated for the tokenization of the text: select “Open Token” from the Tagger menu and proceed as before.
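To illustrate why the steps have to be run in this order, the following Python sketch chains a toy sentence detector and a toy tokenizer. Both functions are simple regular-expression stand-ins written for this guide; they are not the OpenNLP algorithms and do not use WordFreak. The point is only that the tokenizer works on the sentences produced by the previous step.

```python
import re


def split_sentences(text):
    """Toy sentence detector: split after '.', '!' or '?' plus whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    """Toy tokenizer: words (optionally hyphenated) and single punctuation marks."""
    return re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", sentence)


def preprocess(text):
    """Run sentence detection first, then tokenize each sentence."""
    return [tokenize(sentence) for sentence in split_sentences(text)]


if __name__ == "__main__":
    for tokens in preprocess("WordFreak loads the text. Then we tag it!"):
        print(tokens)
```

A POS tagger would take these token lists as its input, which is why the text has to pass through both preprocessing steps first.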

5.3.3 POS-Tagging

As described before, we select the appropriate tagger for the text from the “Tagger” menu. In this case we select “Open POS” to apply the OpenNLP POS tagger to our preprocessed text file. From the “Annotation” menu we again select the corresponding entry and let WordFreak do the tagging. By default WordFreak uses the Penn Treebank tag-set. If you want to use a different tag-set, you can either install it and select it from the drop-down menu or use one of the tag-sets already installed.

In the next step we check the automatically generated tags in our text. When we select “TextPOS” from the Viewer menu, WordFreak opens a new tab in which we can see our text with the respective tags attached. In the example shown in screenshot 3, an annotation is missing for the word group “according to”. By clicking on “PRP” in the right panel we add the annotation to the previously selected word “according”. The particle “to” is added automatically as part of the preposition.

Screenshot 3: Change annotations in WordFreak

If you prefer a vertical layout of the text and annotations, you can select “Tree” from the Viewer menu, which displays a TreeTagger-like layout of the annotation.

These instructions do not cover all possibilities of the WordFreak framework. Feel free to browse through the menus and discover further functions of these tool sets.

References

1)
Many tools in CCL do not respond well to being installed on the Desktop. This is because the Desktop is only a virtual link and not a 'real location' in the file system, which is not what most tools expect as their location.
2)
Throughout these instructions we use these names as examples. Of course these names can be changed according to your preference, but remember to change the folder names in the executable script file later, too.
3)
Be aware that the classpath must not contain any line breaks. Only the end of the classpath line, directly in front of java WordFreak -d …, should contain a line break.
4)
Created text annotations will not be saved within the raw text file, but in an extra annotation file in XML format. Thus, WordFreak will not make any changes to the original text file.