This processing scenario shows how to perform named entity recognition with machine learning approaches using the software Weka. An introduction to the basics of Weka can be found in the Weka tutorial.
Different tools for named entity recognition already exist, such as the Stanford Named Entity Recognizer or the ANNIE tools in GATE. It may, however, prove difficult to find a good system for languages other than English. If you want to work on named entity recognition, it may therefore be necessary to build your own model(s) using machine learning approaches. This is especially useful for languages other than English and for named entities beyond the "standard" categories of person, location, organisation and miscellaneous (the latter comprising, for example, titles of films and books, languages, or adjectives derived from other named entity categories).
For machine learning you need, depending on the task and the algorithm, a large amount of training data. Training data for named entity recognition means texts or sentences that are already annotated with named entity tags (for example the corpus of the CoNLL-2003 Shared Task, with German and English annotations). You furthermore need a development set and a test set which are also annotated with named entities. If your corpus is not already divided into these data sets, you have to split it yourself. You can use already annotated texts, but sometimes you want to test named entity recognition on other texts and genres, for example on web texts. In this case, you have to annotate your own data. When starting your project, keep in mind that the annotation process takes time, and be careful to define annotation guidelines you (and possibly other annotators) can refer to. This is needed to provide transparency about your results and about how you proceeded in annotating the different named entity categories.
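If you do have to split the corpus yourself, a simple random split at the sentence level is usually sufficient. The following Python sketch illustrates an 80/10/10 split; the ratios, the random seed and the representation of sentences as list elements are only assumptions to keep the example self-contained.

import random

def split_corpus(sentences, train_ratio=0.8, dev_ratio=0.1, seed=42):
    """Shuffle the annotated sentences and split them into train/dev/test sets."""
    sentences = list(sentences)
    random.Random(seed).shuffle(sentences)
    n = len(sentences)
    n_train = int(n * train_ratio)
    n_dev = int(n * dev_ratio)
    train = sentences[:n_train]
    dev = sentences[n_train:n_train + n_dev]
    test = sentences[n_train + n_dev:]
    return train, dev, test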
You can use your data to derive a feature set from it. The feature set is read into the machine learning software and used by the algorithm to find rules for the identification of named entities. Examples of such features are PoS tags, chunk tags, prefixes or suffixes, orthographic features (for example whether the first letter is capitalised) or a comparison with a gazetteer (a list of named entities). You can create such a feature set using a programming language such as Python. You have to transform not only your training data into this feature set, but also your development and test data, because all the data has to be in the same format and because you have to provide the software with the same information on the test set as on the training set.
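A minimal sketch of such a feature extractor is given below. The feature names, the tiny example gazetteer and the helper function token_features are illustrative assumptions, not a fixed format; in a real project you would adapt them to your task and data.

GAZETTEER = {"Berlin", "Frankfurt", "Goethe"}  # tiny example gazetteer

def token_features(token, pos_tag, chunk_tag):
    """Return a dictionary of features for a single token."""
    return {
        "token": token,
        "pos": pos_tag,                     # PoS tag, e.g. from a tagger
        "chunk": chunk_tag,                 # chunk tag, e.g. I-NP
        "prefix3": token[:3],               # first three characters
        "suffix3": token[-3:],              # last three characters
        "is_capitalised": token[:1].isupper(),
        "is_all_caps": token.isupper(),
        "contains_digit": any(c.isdigit() for c in token),
        "in_gazetteer": token in GAZETTEER,
    }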
If you want to use the software Weka, you have to create your feature set as a CSV file, because Weka only reads CSV or ARFF (Attribute-Relation File Format) files. Once you have created the CSV file, you can use the CSV loader to read it into Weka (for more information on the usage of Weka and the CSV loader, please refer to the Weka tutorial on linguisticsweb). If you have not split your data into training and test set, you can also do this in Weka by choosing a percentage of your data as test data.
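One way to produce such a CSV file is sketched below, assuming the feature dictionaries from the previous sketch. The file name train_features.csv and the class column ne_tag are arbitrary choices; what matters is that the first row contains the attribute names and that every data set (training, development, test) uses exactly the same columns.

import csv

def write_feature_csv(rows, labels, path="train_features.csv"):
    """rows: list of feature dictionaries, labels: list of NE tags (one per row)."""
    fieldnames = list(rows[0].keys()) + ["ne_tag"]  # NE tag as last (class) column
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()                         # first row: attribute names
        for features, label in zip(rows, labels):
            writer.writerow({**features, "ne_tag": label})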
When you have created your feature set, it may contain many features. Some algorithms, however, give better results when you do not work with a large number of features, but rather with a few very helpful ones. To find these features, Weka offers an attribute evaluator which can rank your features so that you can exclude less helpful ones (for more information on the attribute evaluator, please refer to the Weka tutorial).
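Weka's InfoGainAttributeEval, for example, ranks attributes by their information gain with respect to the class. Purely as an illustration of what such a ranking measures (this is not Weka's own code), the following sketch computes the information gain of a single nominal feature:

from collections import Counter
import math

def entropy(labels):
    """Entropy of the label distribution."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(feature_values, labels):
    """Information gain of one nominal feature with respect to the NE tags."""
    base = entropy(labels)
    total = len(labels)
    remainder = 0.0
    for value in set(feature_values):
        subset = [lab for v, lab in zip(feature_values, labels) if v == value]
        remainder += len(subset) / total * entropy(subset)
    return base - remainder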
Different algorithms have already been tested and used for named entity recognition. Among them are Support Vector Machines (implemented in Weka as LibSVM and SMO), Maximum Entropy, decision trees such as C4.5 (implemented in Weka as J48), and boosting approaches, for example the algorithm AdaBoostM1.
Once you have chosen a classifier, you can start the training process. How to choose the classifier and save the resulting model is explained in the Weka tutorial on linguisticsweb. You can test your models on your development set (or, later, your test set) and save the results. As output, you get a confusion matrix and a table with values such as precision, recall and f-measure for each of your named entity categories.
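If you want to recompute these values outside Weka, for example to compare models at the token level, a rough sketch like the following can be used; it assumes parallel lists of gold and predicted tags, one tag per token:

from collections import Counter

def evaluate(gold, predicted):
    """Compute precision, recall and f-measure per named entity category."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1
            fn[g] += 1
    results = {}
    for category in set(gold) | set(predicted):
        precision = tp[category] / (tp[category] + fp[category]) if tp[category] + fp[category] else 0.0
        recall = tp[category] / (tp[category] + fn[category]) if tp[category] + fn[category] else 0.0
        f_measure = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        results[category] = (precision, recall, f_measure)
    return results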
Figure: Example output of the evaluation of the development set.
The evaluation of your model on the test data allows you to compare your model (your algorithm, its settings and the chosen features) to other, already existing models. If you performed named entity recognition on German texts with the categories person, location, organisation and miscellaneous, you can compare your results to those of the CoNLL-2003 Shared Task on language-independent named entity recognition (German and English). You can find more information on the Shared Task in the linguisticsweb article or on the Shared Task's web page.