
Hadoop Example using WordCount

In this example, we'll run the WordCount example that ships with Hadoop against the Brown Corpus.

  1. Make a directory in the Hadoop Distributed Filesystem (HDFS) to hold the project:
    hadoop fs -mkdir brown
    Note that by default, Hadoop assumes all paths that don't start with a / are relative to /user/username, where "username" is your patas username.
  2. Copy the corpus data into HDFS:
    hadoop fs -put /corpora/ICAME/texts/brown1 brown/input
    Note that you can skip this step and run jobs against non-HDFS paths by prefixing the complete path with "file://", for example "file:///corpora/ICAME/texts/brown1". However, this loses the speed advantages of the distributed filesystem.
  3. Launch the WordCount map-reduce job:
    hadoop jar /opt/hadoop/hadoop-examples-*.jar wordcount brown/input brown/output
    Note that the output directory is created automatically; in fact, the job will fail if it already exists.
  4. Take a look at the results:
    hadoop fs -ls brown/output
    hadoop fs -cat brown/output/part-r-00000
    You can also retrieve the output to the local filesystem:
    hadoop fs -get brown/output/part-r-00000 brown-output.txt
    A sketch just after this list shows one way to pull out the most frequent words.
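
Since each line of the WordCount output is a word and its count, you can inspect the results with ordinary shell tools. A minimal sketch, assuming the standard tab-separated word/count format that the Hadoop WordCount example emits:

hadoop fs -cat brown/output/part-r-00000 | sort -t$'\t' -k2,2 -rn | head -20

This streams the output out of HDFS, sorts it in descending numeric order on the count field, and prints the twenty most frequent words.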

Keep in mind that HDFS is not backed up, so it's best to copy any data you want to keep back to the local filesystem.
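
If a job runs with more than one reducer, its output is split across several part-r-* files. One way to grab everything at once is Hadoop's -getmerge command, which concatenates an HDFS directory into a single local file; for this job's output:

hadoop fs -getmerge brown/output brown-output.txt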

When you're done, you can clean up anything you no longer need with the -rm (equivalent to "rm") or -rmr (equivalent to "rm -r") commands:

hadoop fs -rmr brown/input
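
Once you're completely finished with the example, you can remove the whole project directory in one go. This only deletes the HDFS copies; the original corpus under /corpora is untouched:

hadoop fs -rmr brown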

-- Main.brodbd - 2012-03-21
