Tuesday, January 4, 2011

CUDA, OpenCL, Cilk, and ParaSail

NVIDIA, AMD, and Intel have all gotten interested in parallel programming languages over the past few years, and for good reason.  It is clear that all programmers will have to start learning how to take advantage of parallelism if they expect their programs to continue to benefit from Moore's law, with its wonderful doubling of computing power every two years.  And chips that make it easier to take advantage of parallelism will be the more successful chips. 

NVIDIA has been championing CUDA as the way to use their GPUs for general purpose parallel processing (GPGPU), though they have also been part of the group designing OpenCL, which is the more hardware-independent approach first developed by Apple and now being standardized by the Khronos group.  AMD (after swallowing ATI) started with CTM (Close to Metal) and some other approaches, but now seem to be putting most of their energy behind OpenCL for GPGPU programming.  Intel has been encouraging a number of different approaches, including Cilk, Threading Building Blocks (TBB), Concurrent Collections (CnC), etc., and they are part of the OpenCL effort as well.

Cilk and OpenCL seem to be emerging as two important, rather distinct approaches to parallel programming.  Cilk makes it easy to spawn threads off to compute an arbitrary expression, and then uses a set of worker processes to service the threads, using the approach we discussed in an earlier post, where threads spawned by a given worker are serviced LIFO, but when a worker runs out of threads, it steals from other workers using a FIFO strategy.  OpenCL, on the other hand, is focused on data parallelism, where many data items are all being manipulated in nearly the same way.  It is roughly a SIMD (Single Instruction Multiple Data) approach, though some variation in control flow is permitted, meaning that it is somewhat closer to the SPMD (Single Program, Multiple Data) approach fostered by the Connection Machine folks.  In OpenCL (and CUDA), the per-data-item code is bundled into a kernel, which is essentially a function that operates on a single data item in a large array or stream of such items.

So how does this relate to ParaSail?  In some ways, Cilk and OpenCL can be seen as lower-level languages, where the programmer has to worry about identifying particularly places where threads should be spawned (in Cilk), or worry about extracting kernels from code that operates on an array or stream (in OpenCL).  ParaSail on the other hand is focused on expressing the algorithm at a relatively high level with pervasive opportunities for parallelism, presuming the compiler or run-time support routines can identify how best to map all of the opportunities for parallelism on to the underlying parallel hardware.

One of the key advantages that ParaSail provides is that there is no hidden aliasing, unlike in C, C++, Java, etc., where aliasing is everywhere.  (Cilk and OpenCL are both piggy-backed on C/C++; various threading APIs are available to Java, C#, Python, etc.)  This makes it dramatically easier for a compiler or run-time to automatically partition computations into pieces to be executed in parallel.  Furthermore, the ParaSail compiler detects all race conditions (and various other problems) at compile time, has no exceptions, allows memory to be partitioned into non-overlapping regions, etc., which simplify both the programmer's model and the process of automatic parallelization.


  1. Have you taken a look at Intel's Array Building Blocks (ArBB)? ArBB is an interesting approach to hiding automatic parallelization and vectorization behind a higher-level interface (in this case, C++, though their software architecture makes me think that other language bindings are possible). It's supposed to help with vectorization especially (which on CPUs used to involve painful assembly-language programming and was not at all future-proof). Future-proof is a big deal because SIMD vector lengths are getting longer, and instruction sets are changing rapidly.

    Anyway, check out the following blog posts on vectorization-related programming models:



  2. Thanks for the pointers. I have glanced at Intel's Threading Building Blocks (TBB), but not at ArBB. Because ParaSail allows for user-defined indexing, etc., I would guess much of the ArBB work might apply. I will take a more detailed look to see whether that might be true. I also took a look at your CPU/GPU blog entries. It is interesting to see how the GPU model is performing some "creative destruction" on old CPU-oriented thinking.

  3. GPUs are just a gateway drug ;-) CPUs aren't going away, though. GPUs are bandwidth (/throughput) - oriented, CPUs are latency-oriented (and are evolving that way by backing off on the flops and energy consumption).

    Maybe the real challenge is whether one programming model can or should cover both kinds of computing.

  4. Hi,
    Is it possible to use Cilk with CUDA on a GPU cluster?

  5. I once asked a Big Data person why they didn't use GPUs for large-scale parallel operations. The answer was that it takes too long to get the data in and out of the GPU compared to having the CPU compute it.