Using Hadoop on the Patas cluster

General Info

Hadoop is a framework for scalable, distributed processing. It includes a distributed filesystem (HDFS) and a distributed processing framework (MapReduce). Unlike Condor, which can schedule any type of job, Hadoop jobs must be written specifically for the MapReduce framework. However, for jobs that are well-suited to it, Hadoop automates tasks -- such as splitting input, distributing work, and collecting results -- that you would otherwise have to implement yourself in a Condor job.

Local installation details

Hadoop is installed under /opt/hadoop/bin. This directory is on the system path so you can run Hadoop commands without specifying this directory. You will, however, need to add /opt/hadoop to your Java CLASSPATH when building Java code to run on Hadoop.
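As a sketch of what building against the local install might look like (the `WordCount` class name and file names are hypothetical placeholders; exact jar locations vary by Hadoop version):

```shell
# Compile a MapReduce job, adding /opt/hadoop to the classpath as described above.
# "hadoop classpath" prints the full classpath the hadoop command itself uses.
mkdir -p wordcount_classes
javac -classpath /opt/hadoop:"$(hadoop classpath)" -d wordcount_classes WordCount.java

# Package the compiled classes into a jar for submission to the cluster.
jar cf wordcount.jar -C wordcount_classes .
```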

HDFS directories are laid out somewhat differently than on our local filesystems. Instead of /home2, Hadoop user directories are under /user; i.e., if your NetID is "jdoe", you have a Hadoop user directory under /user/jdoe.
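For example, using the NetID "jdoe" from above (the file name is a hypothetical placeholder):

```shell
# List your HDFS user directory using its absolute path.
hadoop fs -ls /user/jdoe

# Copy a local file into HDFS ("input.txt" is a placeholder name).
hadoop fs -put input.txt /user/jdoe/input.txt

# Relative HDFS paths resolve against /user/<NetID>, so this is equivalent:
hadoop fs -ls .
```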

Job tracking

To see the current job tracker status, visit the Job Tracker Web GUI.


Documentation and examples

  • Official Hadoop documentation -- somewhat terse, but a good starting point.
  • HadoopWordCountExample -- a simple example of how to run a parallel job on our cluster, including copying the data to HDFS and extracting the results.
  • The "hadoop" command, if run by itself, will give simple usage instructions. This also applies to submodules; e.g., "hadoop fs" will list all the commands accepted by the HDFS module.
  • Seeing the Bars of the Hadoop Cage -- advice on how to write Hadoop jobs without locking yourself into the Hadoop model.
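Tying the pieces above together, a typical end-to-end run on the cluster might look like the following sketch (the jar, class, and path names are hypothetical; see the word-count example linked above for a full walkthrough):

```shell
# Stage input data into your HDFS user directory (relative paths resolve to /user/<NetID>).
hadoop fs -put mydata.txt input/mydata.txt

# Submit the job; "wordcount.jar" and "WordCount" are placeholder names.
hadoop jar wordcount.jar WordCount input output

# Inspect the results in HDFS, then copy them back to the local filesystem.
hadoop fs -cat output/part-00000
hadoop fs -get output ./output
```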
Topic revision: r2 - 2013-05-16 - 00:20:03 - brodbd
