To create the CoNLL-2003 training and test files for machine learning, download the annotation files from the Shared Task's webpage. You also need the source texts, because the Shared Task provides only the named entity annotations, not the text data: the Reuters corpus for the English files and the ECI Multilingual Text Corpus for the German files.
Once you have the corpus texts and the CoNLL Shared Task annotation data, you can use a script provided by the Shared Task to merge the texts and the annotations. The script has only been tested on Linux, so run it under Linux. First, extract the CoNLL data files from the tar file you downloaded from the webpage.
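The extraction step could look roughly like the following sketch. The function name extract_conll and the archive name in the usage comment are assumptions for illustration; use the name of the file you actually downloaded.

```shell
# Sketch: unpack the downloaded CoNLL archive into a target directory.
# extract_conll is a hypothetical helper, not part of the Shared Task files.
extract_conll() {
    archive="$1"          # path to the downloaded tar file
    target="${2:-.}"      # where to unpack; defaults to the current directory

    if [ ! -f "$archive" ]; then
        echo "archive not found: $archive" >&2
        return 1
    fi
    mkdir -p "$target" || return 1

    # -x extracts, -f names the archive, -C switches to the target directory.
    # Recent GNU tar and bsdtar detect gzip compression on their own when extracting.
    tar -xf "$archive" -C "$target"
}

# Usage (the archive name is an assumption):
#   extract_conll ner.tgz
```

After unpacking, the ner directory with its bin subdirectory should be present in the target directory.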
To build the English language files, insert the first CD of the Reuters corpus into your computer and mount it by typing
    mount /mnt/cdrom
on the command line. Then run the extraction script from the ner directory:
    cd ner
    bin/make.eng
The training file and the two test files will be generated in the ner directory.
To create the German language files, insert the ECI Multilingual Text Corpus CD into your computer and mount it by typing
    mount /mnt/cdrom
on the command line. Then run the extraction script for the German language from the ner directory:
    cd ner
    bin/make.deu
The training file and the two test files will be generated in the ner directory.
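The build steps for either language can be wrapped in a small helper, sketched below. The name build_conll is hypothetical, and the sketch assumes the corpus CD is already mounted (for example with mount /mnt/cdrom) and that the make scripts are executable.

```shell
# Sketch: run bin/make.eng or bin/make.deu from inside the ner directory.
# build_conll is a hypothetical wrapper; it does not mount the CD itself.
build_conll() {
    lang="$1"               # "eng" or "deu"
    ner_dir="${2:-ner}"     # location of the extracted ner directory

    case "$lang" in
        eng|deu) ;;
        *) echo "usage: build_conll eng|deu [ner_dir]" >&2; return 2 ;;
    esac

    script="bin/make.$lang"
    if [ ! -f "$ner_dir/$script" ]; then
        echo "missing $ner_dir/$script; extract the CoNLL archive first" >&2
        return 1
    fi

    # The subshell keeps the caller's working directory unchanged.
    ( cd "$ner_dir" && "$script" )
}

# Usage, after mounting the matching CD:
#   build_conll eng
#   build_conll deu
```

Running the script from inside ner mirrors the manual steps, so the generated files end up in the same place.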
If the corpus data is stored somewhere other than the CD-ROM mount point, open the file make.deu or make.eng in the bin directory. In line 8, the location of the file ger03b05.eci is specified. Change this line to the location of the file on your system. Likewise, change the location in line 10 to match your system. Then save the file and run it as explained above. This will create the three files with the text and the annotation.
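Before editing, it may help to print the two lines in question so you can see which paths they currently contain. The sketch below does only that; show_path_lines is a hypothetical helper, and the line numbers 8 and 10 are taken from the description above.

```shell
# Sketch: print lines 8 and 10 of a make script, prefixed with their line numbers.
# show_path_lines is a hypothetical helper; it does not modify the file.
show_path_lines() {
    script="$1"
    if [ ! -f "$script" ]; then
        echo "not found: $script" >&2
        return 1
    fi
    # NR is awk's current line number; only lines 8 and 10 are printed.
    awk 'NR == 8 || NR == 10 { printf "%d: %s\n", NR, $0 }' "$script"
}

# Usage, from the ner directory:
#   show_path_lines bin/make.deu
```

Keeping a copy of the unmodified script before changing these lines makes it easy to return to the CD-ROM setup later.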