Spark on the CLMS cluster

Spark jobs may be submitted from any node. The master URL, if you need it, is spark://diana.local:7077, although most Spark commands will pick this up automatically.
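
If you do need to point spark-submit at the master explicitly, you can pass it with the --master flag; for example, using the Pi example described below:

spark-submit --master spark://diana.local:7077 /opt/spark/examples/src/main/python/pi.py 1000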

The job queue and node status may be monitored via the Spark Master's web interface, http://diana.ling.washington.edu:8080/.

General information on Spark, along with examples, can be found on the Apache Spark homepage.

There are also example jobs in /opt/spark/examples. For example, to run the Pi approximation example with 1000 partitions, you would type

spark-submit /opt/spark/examples/src/main/python/pi.py 1000

This will produce a lot of job logging on stderr, which may bury the actual job output on stdout, so you may want to split the two streams like so:

spark-submit /opt/spark/examples/src/main/python/pi.py 1000 >pi.out 2>pi.err
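
For reference, the core of a Pi-style job is quite short. The following is a minimal sketch (not the bundled pi.py itself, but the same idea) that you could adapt and submit the same way:

import sys
from random import random
from operator import add
from pyspark import SparkContext

if __name__ == "__main__":
    # Number of partitions to split the sampling work over (default 2).
    partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 2
    n = 100000 * partitions

    sc = SparkContext(appName="PiSketch")

    def inside(_):
        # Sample a random point in the unit square and check whether it
        # falls inside the quarter circle of radius 1.
        x, y = random(), random()
        return 1 if x * x + y * y <= 1 else 0

    # Distribute the samples across the cluster and add up the hits.
    count = sc.parallelize(range(n), partitions).map(inside).reduce(add)
    print("Pi is roughly %f" % (4.0 * count / n))
    sc.stop()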

Hadoop Distributed File System (HDFS)

Files on the local filesystem are not guaranteed to be present on every node of the cluster when a Spark job runs. A workaround is to place your data in HDFS:

hadoop fs -put local_dir_name my_dir

You can verify its contents like so:

hadoop fs -ls my_dir

And access it from a Spark job using the full HDFS URI, i.e. the namenode host and port followed by the path under your HDFS home directory:

hdfs://diana.local:9000/user/username/my_dir

Note that the path printed by hadoop fs -ls is not a complete URI; prefixing it with just "hdfs://", without the namenode host and port, does not work:

hdfs:///user/jcadigan/en_ar_data/fr_eng_1M.eng.true

Logs for debugging

In addition to the stderr output of the Spark job itself, it may be useful to look at the per-executor logs for each run, which can be found under:

/opt/spark/work/APP#/EXECUTOR#
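
For example, to list the application directories and read one executor's log (substitute the actual application and executor directory names for APP# and EXECUTOR#; the stdout/stderr file names follow the usual Spark standalone worker layout):

ls /opt/spark/work/
cat /opt/spark/work/APP#/EXECUTOR#/stderr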
