Spark on the CLMS cluster

Spark jobs may be submitted from any node; the master URL, if you need it, is spark://diana.local:7077, although most Spark commands will pick this up automatically.
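If you build a Spark context yourself (for example in an interactive PySpark session) rather than relying on spark-submit, the master URL can be set explicitly. A minimal sketch, assuming PySpark is available on your path (the application name is just an illustrative placeholder):

from pyspark import SparkConf, SparkContext

# Point explicitly at the cluster master; this is normally unnecessary when
# submitting with spark-submit from a cluster node.
conf = SparkConf().setAppName("clms-example").setMaster("spark://diana.local:7077")
sc = SparkContext(conf=conf)

# Trivial sanity check: sum the integers 0..99 across the cluster.
print(sc.parallelize(range(100)).sum())
sc.stop()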

The job queue and node status may be monitored via the Spark Master's web interface (by default, Spark serves this on port 8080 of the master node).

General information on Spark, along with examples, can be found on the Apache Spark homepage.

There are also example jobs in /opt/spark/examples. For example, to run the Pi approximation example with 1000 partitions (the example's single command-line argument), you would type

spark-submit /opt/spark/examples/src/main/python/pi.py 1000

This produces a lot of Spark logging on stderr, which can bury the actual output of the job, so you may want to separate stdout and stderr like so:

spark-submit /opt/spark/examples/src/main/python/pi.py 1000 >pi.out 2>pi.err
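For reference, the bundled Pi example does roughly the following (a simplified sketch, not the exact contents of pi.py): the command-line argument controls how many partitions the Monte Carlo sampling is split into, with 100,000 samples per partition.

import sys
from random import random
from operator import add
from pyspark import SparkContext

# Sample points in the square [-1, 1] x [-1, 1] and count how many fall
# inside the unit circle; the ratio approximates pi/4.
sc = SparkContext(appName="PythonPi")
partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 2
n = 100000 * partitions

def inside(_):
    x, y = random() * 2 - 1, random() * 2 - 1
    return 1 if x * x + y * y <= 1 else 0

count = sc.parallelize(range(1, n + 1), partitions).map(inside).reduce(add)
print("Pi is roughly %f" % (4.0 * count / n))
sc.stop()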

Hadoop File System

Files on the local filesystem are not guaranteed to be available on every node of the cluster when a Spark job runs. A workaround is to place your data in HDFS:

hadoop fs -put local_dir_name my_dir

You can verify its contents like so:

hadoop fs -ls my_dir

And access it from Spark using the full HDFS URI (substitute your own username):

hdfs://diana.local:9000/user/username/my_dir

Note that you cannot simply prefix hdfs:// to the bare path printed by hadoop fs -ls; a URI like the following does not work because it lacks the namenode host and port:

hdfs:///user/jcadigan/en_ar_data/fr_eng_1M.eng.true

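A small PySpark script that reads the uploaded directory back out of HDFS might look like the following sketch (username and my_dir are the placeholders from above):

from pyspark import SparkContext

sc = SparkContext(appName="hdfs-read-example")

# Use the full HDFS URI, including the namenode host and port.
path = "hdfs://diana.local:9000/user/username/my_dir"
lines = sc.textFile(path)
print(lines.count())
sc.stop()
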
Logs for debugging

In addition to the stderr output of the Spark job, it may be useful to look at the per-executor logs for each run. In a standard Spark standalone setup these are typically found under the work/ directory of the Spark installation on each worker node (for example /opt/spark/work/<application-id>/<executor-id>/), and they are also reachable through the web interface of each worker.

