Parallel Processing

One of the challenges of working in natural language processing is the large amount of data that must be processed to get meaningful results. After you've done everything you can to make your program run quickly--written efficient algorithms, bought powerful hardware--it may still take hours or even days to get a result. This is where parallelization comes in.

Parallelization is a technique in which you break your problem down into smaller pieces that can be run simultaneously on multiple machines. It usually consists of three parts:

  1. Dividing the task into independent parallel subtasks
  2. Running the subtasks simultaneously
  3. Collating the results
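
The three parts above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the way jobs are run on the cluster: it uses worker processes on a single machine as a stand-in for multiple machines, and process_chunk is a hypothetical placeholder (it just counts words) for whatever slow per-sentence work your program does. All function names here are invented for the example.

```python
from concurrent.futures import ProcessPoolExecutor


def split_into_chunks(items, n_chunks):
    """Step 1: divide the task into independent, contiguous subtasks."""
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks get one additional item each.
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def process_chunk(sentences):
    """Hypothetical stand-in for the slow work: count words per sentence."""
    return [len(s.split()) for s in sentences]


def collate(partial_results):
    """Step 3: merge the per-chunk results into one list, in input order."""
    return [r for part in partial_results for r in part]


def run_parallel(items, n_chunks, executor_cls=ProcessPoolExecutor):
    chunks = split_into_chunks(items, n_chunks)
    # Step 2: run the subtasks simultaneously; map() preserves chunk order.
    with executor_cls(max_workers=n_chunks) as pool:
        partial_results = list(pool.map(process_chunk, chunks))
    return collate(partial_results)


if __name__ == "__main__":
    sentences = ["the cat sat", "dogs bark", "a b c d", "hello"]
    print(run_parallel(sentences, n_chunks=2))
```

Because each chunk is processed without reference to any other chunk, the collation step only has to concatenate the partial results in their original order.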

The first and last steps entail work for the programmer, who must make sure that the problem is properly modularized and that the parallelization code (if any) is properly written. The second step can be accomplished manually by walking from machine to machine and kicking off processes, but it is much easier to let parallel processing software, like Condor, automate that task. In more sophisticated jobs, you may want to automate the first and last steps as well, using DAGMan to tell Condor which jobs rely on other jobs.

Parallelization is a challenging programming technique in its own right, and the specific approach will vary from task to task. The most important thing is to stay aware of the logical dependencies between different parts of your program. For example, say you have a slow parser, so that running it on a test set of 10,000 sentences takes about a day. Since parsers work on sentences independently of each other, you could break the test set up into ten 1,000-sentence inputs and run them all in parallel. Done correctly, this could give you up to a tenfold speedup, so that your task could finish in as little as about two and a half hours.
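
The arithmetic behind the parser example can be made explicit with a small sketch. It assumes the work divides evenly across jobs; the overhead_hours parameter is a hypothetical addition (not from the example above) that stands for time spent splitting the input, waiting for machines, and collating output, which is why the speedup is "up to" the number of jobs rather than exactly that number.

```python
def ideal_runtime_hours(serial_hours, n_jobs, overhead_hours=0.0):
    """Best-case wall-clock time when the work splits evenly into n_jobs.

    overhead_hours is a hypothetical allowance for splitting, scheduling,
    and collating; it is not parallelized.
    """
    return serial_hours / n_jobs + overhead_hours


def speedup(serial_hours, parallel_hours):
    """How many times faster the parallel run is than the serial run."""
    return serial_hours / parallel_hours


if __name__ == "__main__":
    serial = 24.0  # "about a day" for 10,000 sentences
    best = ideal_runtime_hours(serial, n_jobs=10)
    print(best)                   # 2.4 hours with no overhead
    print(speedup(serial, best))  # 10.0, the upper bound for 10 jobs

    # With half an hour of (assumed) overhead the speedup falls below 10.
    realistic = ideal_runtime_hours(serial, n_jobs=10, overhead_hours=0.5)
    print(speedup(serial, realistic) < 10)  # True
```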

The University of Washington Linguistics department maintains a high-performance computing cluster whose parallel compute nodes are managed by the Condor scheduler. See HowToUseCondor for details of how to describe your job to Condor and submit it.

Topic revision: r2 - 2011-09-13 - 18:11:20 - brodbd