Mallet is a Java-based machine learning toolkit that supports a number of typical tasks for exploring large sets of unlabeled data. It is widely used in exploratory approaches in linguistics, for example for topic exploration in text corpora.
Mallet does not require an installation in the sense of running an executable installer. Rather, mallet is set up by extracting the downloaded mallet archive.
Download the file from the mallet website: http://mallet.cs.umass.edu/download.php
Two versions of the mallet download are available:
mallet-2.0.8.tar.gz for Linux / Mac OS users
mallet-2.0.8.zip for Windows users
Download the file that is appropriate for your operating system and extract it. Windows users should extract mallet to the root of the main drive C:\ so that the mallet directory looks like this:
C:\mallet-2.0.8\
Also set the environment variable MALLET_HOME so your operating system can find the path to mallet. In the control panel, go to System -> Advanced system settings -> Environment Variables ..., click New ... under System variables, and create a variable named MALLET_HOME with the path to the mallet directory as its value:
C:\mallet-2.0.8\
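Whether the variable is actually visible to newly started programs can be checked with a few lines of Python. This is only a sketch under assumptions: Python 3 is installed (mallet itself does not need it), the helper name find_mallet_launcher is hypothetical, and the launcher is expected at bin\mallet.bat (Windows) or bin/mallet (Linux / Mac OS) below the directory that MALLET_HOME points to.

```python
import os
from pathlib import Path


def find_mallet_launcher(env=None):
    """Return the path of the mallet launcher below MALLET_HOME, or None."""
    env = os.environ if env is None else env
    home = env.get("MALLET_HOME")
    if not home:
        return None  # variable is not set (or is empty)
    bin_dir = Path(home) / "bin"
    # mallet.bat is the Windows launcher, mallet the Linux / Mac OS script
    for name in ("mallet.bat", "mallet"):
        candidate = bin_dir / name
        if candidate.is_file():
            return candidate
    return None


if __name__ == "__main__":
    launcher = find_mallet_launcher()
    if launcher is None:
        print("MALLET_HOME is not set or does not point to a mallet directory")
    else:
        print("mallet launcher found:", launcher)
```

Open a new terminal before running the check, because a terminal that was already open will not see a variable that was created afterwards.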
Mallet has no graphical user interface, so it is run from the command line in a terminal. To test that mallet has been properly set up on your machine, run the command in line 3 from the command prompt / terminal; because the full path is given, it works from anywhere on your system:
# testing the installation by calling mallet help
# the full path to mallet lets you run this from any directory
C:\mallet-2.0.8\bin\mallet import-dir --help
If all goes well, you should have a long list of help options printed to your screen and no error messages.
Then it is time to look at the first step of the topic modelling process itself: importing and initializing the data. In our case this is the example data that ships with the mallet installation (UTF-8 encoded plain text files in English, en, and German, de); the English files are located in the directory sample-data\web\en inside the mallet directory. Let us go through the command in line 2 step by step. Please look at the command and read the text beneath it before running it:
# importing and initializing the sample data
C:\mallet-2.0.8\bin\mallet import-dir --input "C:\mallet-2.0.8\sample-data\web\en" --stoplist-file "C:\mallet-2.0.8\stoplists\en.txt" --output "C:\Users\Public\tm_output\tutorial.mallet" --keep-sequence --remove-stopwords
The mallet command is followed by the subcommand import-dir and the parameter switch --input, which requires the directory where the input data is located. Next comes the optional parameter --stoplist-file, followed by the path to a file that lists the stopwords to be excluded from the topic modelling (note that example stopword lists also ship with mallet). After that, the parameter --output names the file to which mallet writes the imported data. The directory that holds this file has to be created by you in a location of your choice; make it an empty directory so you can easily keep track of the files being written by mallet. The parameter --keep-sequence keeps the sequence of the data intact, and the last parameter, --remove-stopwords, tells the process to remove the stopwords declared in the stopword list. Now the data is ready for the actual training of the topic model.
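For readers who prefer to script the import step, the command described above can be assembled from its parts. The following is a minimal sketch, assuming Python 3 is available; the function names build_import_command and run_import are hypothetical, and only the switches already used in this tutorial appear in it.

```python
import subprocess
from pathlib import Path


def build_import_command(mallet_bin, input_dir, output_file, stoplist_file=None):
    """Assemble the argument list for `mallet import-dir`."""
    cmd = [
        str(mallet_bin), "import-dir",
        "--input", str(input_dir),
        "--output", str(output_file),
        "--keep-sequence",
        "--remove-stopwords",
    ]
    if stoplist_file is not None:
        # optional: use a specific stopword list
        cmd += ["--stoplist-file", str(stoplist_file)]
    return cmd


def run_import(cmd):
    """Create the output directory if it is missing, then call mallet."""
    output_file = Path(cmd[cmd.index("--output") + 1])
    output_file.parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(cmd, check=True)  # raises CalledProcessError if mallet fails


if __name__ == "__main__":
    # paths follow the Windows layout used in this tutorial
    command = build_import_command(
        r"C:\mallet-2.0.8\bin\mallet.bat",
        r"C:\mallet-2.0.8\sample-data\web\en",
        r"C:\Users\Public\tm_output\tutorial.mallet",
        stoplist_file=r"C:\mallet-2.0.8\stoplists\en.txt",
    )
    print(" ".join(command))  # inspect the command; call run_import(command) to execute it
```

Passing the arguments as a list avoids quoting problems with paths that contain spaces. The example only prints the command, so nothing is written until run_import is called.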
# training a model on the sample data
# write this command to a batch file for easier parameter tweaking
C:\mallet-2.0.8\bin\mallet train-topics --input C:\Users\Public\tm_output\tutorial.mallet --num-topics 20 --optimize-interval 20 --output-state C:\Users\Public\tm_output\topic-state.gz --output-topic-keys C:\Users\Public\tm_output\tutorial_keys.txt --output-doc-topics C:\Users\Public\tm_output\tutorial_composition.txt
The parameters used in the examples above are the following:
--help | shows help options |
--input | for import-dir, the directory where your input files are located; for train-topics, the .mallet file created by the import step |
--stoplist-file | specifies the stopword list for the language to be processed |
--num-topics | number of topics you want to generate; in the example above, 20 topics are generated |
--optimize-interval | number of iterations between hyperparameter optimization steps; in the example above, the hyperparameters are re-estimated every 20 iterations |
--output-state | file to which mallet writes the gzip-compressed state of the trained model (every word with its topic assignment) |
--output-topic-keys | file to which mallet writes the top words (keys) of each topic |
--output-doc-topics | file to which mallet writes the topic composition of each document |
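Once training has finished, the topic keys file can be inspected in any text editor, or read programmatically. The sketch below assumes Python 3 and assumes that each line of the file consists of three tab-separated fields (topic number, topic weight, and the top words separated by spaces), which is how mallet 2.0.8 is understood to write it; the function name parse_topic_keys is hypothetical.

```python
from pathlib import Path


def parse_topic_keys(text):
    """Parse topic-keys text into a list of (topic_id, weight, words) tuples."""
    topics = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # assumed layout: topic number <TAB> weight <TAB> space-separated words
        topic_id, weight, words = line.split("\t", 2)
        topics.append((int(topic_id), float(weight), words.split()))
    return topics


if __name__ == "__main__":
    # output path used in the training command of this tutorial
    keys_file = Path(r"C:\Users\Public\tm_output\tutorial_keys.txt")
    if keys_file.exists():
        content = keys_file.read_text(encoding="utf-8")
        for topic_id, weight, words in parse_topic_keys(content):
            # show the first eight words of every topic
            print(topic_id, round(weight, 3), " ".join(words[:8]))
    else:
        print("No topic keys file found; run the training step first.")
```

If the layout of the file differs on your system, the split on tab characters is the place to adapt.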