General Info

Mahout is a machine learning toolkit for big data that runs on top of Hadoop. At the time of writing (version 0.6) it is rather unstable, so here are some tips for tweaking the samples so you can get started.

Local Installation and Setup

Mahout is installed under /NLP_TOOLS/ml_tools/mahout/latest/ . However, some of the samples on the Mahout website require write access to the Mahout installation directory (sigh), so you'll want to pull the install down into your home dir:

cp -r /NLP_TOOLS/ml_tools/mahout/mahout-distribution-0.6/ ~/tools/

Once you have a copy, build the examples like so:

cd tools/mahout-distribution-0.6/examples

mvn compile

Set these environment variables:

export HADOOP_HOME=/opt/hadoop
export MAHOUT_HOME=~/tools/mahout-distribution-0.6

Mahout tries to run on Hadoop by default. To disable that and run locally, set this variable to anything:

export MAHOUT_LOCAL=blah

Running Examples on 0.6

The package structure and samples are changing with each release, so here are two that have been tweaked to get them working on 0.6.

Random Forests

Here is the original (broken) sample from the Mahout website, tweaked to work on 0.6:

$ curl -o

$ hadoop fs -put rdftest/

$ hadoop jar ~/tools/mahout-distribution-0.6/core/target/mahout-core-0.6-job.jar org.apache.mahout.classifier.df.tools.Describe -p rdftest/ -f rdftest/ -d I 9 N L

$ hadoop jar /home2/megallo/tools/mahout-distribution-0.6/examples/target/mahout-examples-0.6-job.jar org.apache.mahout.classifier.df.BreimanExample -d rdftest/ -ds rdftest/ -i 10 -t 100

There's no documentation for the input format, except a note that it conforms to the UCI format (which is not described anywhere on their website either, thanks guys). So my disclaimer is that the following was figured out by trial and error.

The rules for the input vectors are:

  • one line per document
  • attributes must be in the correct order
  • missing attribute placeholder is a question mark
  • attributes cannot contain spaces or commas
  • delimiter can be either comma or space
  • numeric values can contain periods
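To make those rules concrete, here is a tiny made-up data file (the file name and all values are hypothetical): one ID column to ignore, one categorical attribute, two numeric attributes, and the label.

```shell
# Hypothetical example data (not from any real corpus):
# col 1 = ID (ignored), col 2 = categorical, cols 3-4 = numeric,
# col 5 = label. '?' marks a missing value; fields are comma-delimited.
cat > rdftest-sample.data <<'EOF'
doc1,sunny,85.2,90,no
doc2,rain,?,70,yes
doc3,overcast,64.0,65,yes
EOF
```

In the descriptor syntax explained below, this file would be described as I C 2 N L.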

Mahout uses the Describe class to learn how to read the input file so it can convert it to vectors. You pass in a nigh-unintelligible string of counts and these characters:

  • N : numerical attribute
  • C : categorical (nominal) attribute
  • L : label (nominal) attribute
  • I : ignored attribute

I 2 C 3 N C C L == ignore the first item, then read two categorical values, followed by three numeric values, followed by two more categorical values, then the label
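If it helps, here is a throwaway shell function (mine, not part of Mahout) that expands one of these descriptor strings into one attribute type per line, just to show how the counts distribute:

```shell
# Expand a Describe-style descriptor: a number repeats the next
# type token that many times, e.g. "2 C" becomes "C C".
expand_descriptor() {
  local count=1 token
  for token in $1; do
    case "$token" in
      [0-9]*) count=$token ;;
      *) for _ in $(seq "$count"); do echo "$token"; done
         count=1 ;;
    esac
  done
}

expand_descriptor "I 2 C 3 N C C L"
# prints one type per line: I C C N N N C C L
```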

Describe parses the data file, verifies that it conforms to the descriptor, and writes the descriptor out to a file. Then, when you call the BreimanExample class, you give it the data file, the descriptor file, the number of iterations, and the number of trees. Beware: it will report NaN error values if you ask for too many trees with too few data points.

I'm planning to write a class to invoke the DecisionForest code. BreimanExample is an okay start, but it doesn't write the model to a file for future classification, and I also need a class that invokes TestForest with the model and reports accuracy info. I'll post it here at the end of the quarter.


K-Means Clustering

Here is the sample on the Mahout website, and this shell script is the actual script that has been updated. This one is nicer because it will build the vector files for you from a directory of text files.

As of 0.6, this will not run on Hadoop; you will need to set MAHOUT_LOCAL as described above.

On this line, add -nv and -ow:
./bin/mahout seq2sparse -i ./examples/bin/work/reuters-out-seqdir/ -o ./examples/bin/work/reuters-out-seqdir-sparse -nv -ow

Here, add -ow and -cl so that it will give you the cluster output file:

./bin/mahout kmeans -i ./examples/bin/work/reuters-out-seqdir-sparse/tfidf-vectors/ -c ./examples/bin/work/clusters -o ./examples/bin/work/reuters-kmeans -x 100 -k 20 -ow -cl

Now change the input cluster filename to use clusters-*-final, because the number of the final iteration varies from run to run.

./bin/mahout clusterdump -s ./examples/bin/work/reuters-kmeans/clusters-*-final -d ./examples/bin/work/reuters-out-seqdir-sparse/dictionary.file-0 -dt sequencefile -b 800 -n 20 -o ./examples/bin/work/cluster_output.txt

However, the script stops there: it doesn't actually give you a way to see which original docs went into which clusters, so here's a class that will do that. I pulled this Java class down from the web; it's apparently similar to the one in the Mahout in Action book, but since that's written for version 0.4, it no longer works. Here is the fixed Java class. Please note that you will have to keep adding jar after jar to your classpath for it to run; most of what you need is somewhere in the /opt/hadoop/lib directory, including the Apache Commons stuff. Here is how to invoke it:

java ClusterOutput ./examples/bin/work/reuters-kmeans/clusteredPoints ./examples/bin/work/cluster_vectors.txt ./examples/bin/work/cluster_ids.txt
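About that classpath: rather than adding jars one at a time, you can glob a whole lib directory into one classpath string. A small sketch (the helper name is mine; adjust the directories to your site):

```shell
# Join every .jar in a directory into one colon-separated classpath,
# starting from "." so the ClusterOutput class itself is found.
build_cp() {
  local cp="." j
  for j in "$1"/*.jar; do
    [ -e "$j" ] && cp="$cp:$j"
  done
  echo "$cp"
}

# Hypothetical usage (add the Mahout jars the same way if needed):
# java -cp "$(build_cp /opt/hadoop/lib)" ClusterOutput \
#     ./examples/bin/work/reuters-kmeans/clusteredPoints \
#     ./examples/bin/work/cluster_vectors.txt \
#     ./examples/bin/work/cluster_ids.txt
```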


Topic revision: r1 - 2012-04-27 - 23:00:33 - megallo
