Hadoop Example using WordCount

In this example, we'll run the WordCount example that comes with Hadoop on our local copy of the Brown Corpus.

1. Make a directory in the Hadoop Distributed Filesystem (HDFS) to hold the project:

$ hadoop fs -mkdir brown

Note that, by default, Hadoop assumes all paths that don't start with a / are relative to /user/username, where "username" is your patas username.
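
The path rule above can be sketched as a small shell function. This only illustrates the rule as described here and is not how Hadoop implements it. The name resolve_hdfs_path is hypothetical, and the /user/username home directory is taken from the note above.

```shell
# Hypothetical helper illustrating the default path rule described above.
# Usage: resolve_hdfs_path PATH [USERNAME]
resolve_hdfs_path() {
    case "$1" in
        /*) printf '%s\n' "$1" ;;                          # absolute path: used as-is
        *)  printf '/user/%s/%s\n' "${2:-$USER}" "$1" ;;   # relative path: under /user/username
    esac
}
```

For example, "resolve_hdfs_path brown/input brodbd" prints /user/brodbd/brown/input, which matches the form of the full paths shown in the directory listing in step 4.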

2. Copy the corpus data into HDFS. Note that -put will automatically create the destination directory, so we don't have to make it ahead of time.

$ hadoop fs -put /corpora/ICAME/texts/brown1 brown/input

You can skip this step and run jobs against non-HDFS paths by prefixing the complete path with "file://", for example "file:///corpora/ICAME/texts/brown1". However, this loses the speed advantage of distributing the data among the compute nodes.

3. Launch the WordCount map-reduce job. Note that the output directory will be created automatically, and in fact it's an error if it already exists.

$ hadoop jar /opt/hadoop/hadoop-examples-*.jar wordcount brown/input brown/output
12/03/23 12:41:38 INFO input.FileInputFormat: Total input paths to process : 15
12/03/23 12:41:39 INFO mapred.JobClient: Running job: job_201203211437_0002
12/03/23 12:41:40 INFO mapred.JobClient:  map 0% reduce 0%
12/03/23 12:41:54 INFO mapred.JobClient:  map 6% reduce 0%
(...rest of the output snipped for brevity)
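
To get a feel for what the job computes, a similar count can be approximated locally on a small file with standard Unix tools. This is a rough sketch, not the Hadoop implementation. It assumes the example job splits each line on whitespace and counts tokens case-sensitively with punctuation left attached. The function name local_wordcount is made up for illustration.

```shell
# Hypothetical local approximation of a word count (no Hadoop involved).
# Reads text on stdin, writes "word<TAB>count" lines sorted by word.
local_wordcount() {
    tr -s '[:space:]' '\n' |        # "map": put one token on each line
        sed '/^$/d' |               # drop the empty line left by leading whitespace
        LC_ALL=C sort |             # "shuffle": bring identical tokens together
        uniq -c |                   # "reduce": count each run of identical tokens
        awk '{ print $2 "\t" $1 }'  # reorder to word, tab, count
}
```

Feed it a small local text file on standard input, for instance "local_wordcount < some-small-file.txt". The point of the real job is that the same map and reduce steps are spread across the compute nodes instead of running in one pipeline on one machine.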

4. We can now find the results in our output directory:

$ hadoop fs -ls brown/output
Found 3 items
-rw-r--r--   3 brodbd supergroup          0 2012-03-23 12:42 /user/brodbd/brown/output/_SUCCESS
drwxr-xr-x   - brodbd supergroup          0 2012-03-23 12:41 /user/brodbd/brown/output/_logs
-rw-r--r--   3 brodbd supergroup    1123352 2012-03-23 12:42 /user/brodbd/brown/output/part-r-00000

From here, we can view the output file directly with hadoop fs -cat brown/output/part-r-00000, or transfer it back to our local filesystem with something like hadoop fs -get brown/output/part-r-00000 brown-results.txt.
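
Once the results are on the local filesystem, ordinary tools can summarize them. The sketch below assumes each line of the retrieved file holds a word, a tab, and its count, which is an assumption about the output layout rather than something shown above. The helper name top_words is hypothetical.

```shell
# Hypothetical helper: show the N most frequent words in a retrieved results file.
# Assumes each line is "word<TAB>count".
# Usage: top_words FILE [N]   (N defaults to 10)
top_words() {
    # Sort numerically, descending, on the second tab-separated field,
    # then keep only the first N lines.
    sort -t "$(printf '\t')" -k2,2nr "$1" | head -n "${2:-10}"
}
```

With the file name from the -get example, "top_words brown-results.txt 20" would list the twenty highest counts, provided the file has the assumed layout.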

Keep in mind that HDFS is not backed up, so it's best to copy any data you want to keep back to the local filesystem.

Cleanup can be done with the -rm and -rmr commands, which are equivalent to the shell commands "rm" and "rm -r" respectively. For example, to remove our entire project, we could do:

$ hadoop fs -rmr brown

