Driving CUDA over the Grid
There are two great principles behind Mathematica’s parallel computing design. The first is that most of the messy plumbing that puts people off grid computing is automated (messaging, process coordination, resource sharing, failure recovery, etc.). The second is that anything that can be done in Mathematica can be done in parallel.
With this week’s release of gridMathematica 8, which adds the 500+ new features of Mathematica 8 into the shared grid engine, one nice example brings together both ideas—and that is driving CUDA hardware, in parallel, over the grid.
The use of GPU cards for technical computing is a hot topic at the moment, with the promise of cheap computing power. But the problem for both CUDA and OpenCL has been that they are relatively hard to program; they are well within the abilities of an expert programmer, but many of Mathematica’s users are principally scientists or engineers, not programmers.
Mathematica 8 introduced a much-improved workflow for controlling GPU cards by automating as much of the communication, control, and memory management as possible (as detailed in the white paper “Heterogeneous Computing in Mathematica 8”). These are exactly the principles we previously applied to CPU-based parallel computing.
So now we have a system that can automatically run any code in parallel, including code that automatically manages GPU hardware. And with the release of gridMathematica 8, we can automatically distribute those tasks over remote hardware. Here is how easy it can be.
Imagine that my organization has just purchased 50 regular PCs, each with a single 512-core GPU on board and each connected somewhere on my network. I want to evaluate a financial instrument half a million times and I want it to be fast, so I want to use all 25,600 of those cores (50 machines × 512 cores). (Note that this trivial example is only meant as an illustration.)
Step 1: Call my system administrator and ask for permission to use the hardware, and also ask them to install gridMathematica and Wolfram Lightweight Grid Manager on the cluster. I will only need to do this once (if I have a good system administrator!).
Step 2: Tell my normal Mathematica installation to use this cluster when it is given parallel tasks. It takes three clicks, plus typing in the number of kernels I want from each machine. If that sounds hard, watch a screencast I made doing just that. I will only need to do this once.
Step 3: Write the parallel, CUDA-enabled code to break the task up, distribute each subtask to each remote PC, place it onto its GPU card, run it there, take the result off the GPU card, return the values back to my local PC, re-allocate tasks (should a machine crash or otherwise go offline), and coordinate them into the result set. Sound hard? Here is the code.
Most of the code is describing the valuation parameters. All the complicated work is automated behind the first two lines. Notice how the code doesn’t say where the remote hardware is, or how many subtasks each machine has to run. I can share that code with someone with a different cluster of PCs, and it will scale automatically.
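For readers who want to see the shape of such a program, here is a minimal sketch. It is not the original listing: the instrument (an American put), every parameter value, and the names `expirations` and `values` are illustrative assumptions. It relies on the CUDALink function `CUDAFinancialDerivative` accepting the same style of parameters as `FinancialDerivative`.

```mathematica
(* Load CUDALink on every available parallel kernel, local or remote *)
ParallelNeeds["CUDALink`"]

(* Hypothetical inputs: 500,000 randomly chosen expiration times *)
expirations = RandomReal[{0.2, 10.0}, 500000];

(* Value one option per expiration; each evaluation runs on the GPU
   of whichever machine the scheduler hands it to *)
values = ParallelMap[
   CUDAFinancialDerivative[
     {"American", "Put"},
     {"StrikePrice" -> 80., "Expiration" -> #},
     {"CurrentPrice" -> 100., "InterestRate" -> 0.06,
      "Volatility" -> 0.45, "Dividend" -> 0.02}
   ] &,
   expirations];
```

Nothing in the sketch names a machine or fixes a batch size. Those come from the kernel configuration set up in Step 2, which is why the same code can run unchanged on a different cluster.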
Of course, this is a trivial example, with entirely independent evaluations and predictable hardware. If I want to run my own CUDA or OpenCL programs, instead of built-in, high-level CUDA-enabled commands, then it takes one extra line of code plus my CUDA code. And if my PCs have more than one GPU, then there is one more line of code needed to allocate which GPU the task runs on. But little, in principle, changes for more complex real-world problems.
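As a hedged sketch of those two extra lines, the example below loads a custom kernel on every parallel kernel with `CUDAFunctionLoad` and chooses a device by setting `$CUDADevice`. The `addTwo` kernel, the block size of 256, and the two-GPUs-per-machine layout are assumptions made for illustration. In particular, alternating devices by `$KernelID` is a crude placeholder; a real deployment would assign devices per host.

```mathematica
(* Load CUDALink on every parallel kernel *)
ParallelNeeds["CUDALink`"]

(* Illustrative CUDA source: add 2 to each element of an integer array.
   mint is the integer type that CUDALink defines for kernel code. *)
src = "
__global__ void addTwo(mint *in, mint *out, mint length) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < length)
        out[index] = in[index] + 2;
}";
DistributeDefinitions[src];

(* Extra line for multi-GPU machines. Assumption: two GPUs per machine,
   with kernels alternating between device 1 and device 2. *)
ParallelEvaluate[$CUDADevice = 1 + Mod[$KernelID, 2]];

(* Extra line for custom CUDA code: compile and load the kernel on each
   parallel kernel, using 256 threads per block *)
ParallelEvaluate[
  addTwo = CUDAFunctionLoad[src, "addTwo",
     {{_Integer, _, "Input"}, {_Integer, _, "Output"}, _Integer}, 256]];

(* Each subtask runs the custom kernel on a remote GPU; the loaded
   function returns a list of its output arguments, hence First *)
ParallelMap[
  First[addTwo[#, ConstantArray[0, Length[#]], Length[#]]] &,
  Partition[Range[1000], 100]]
```

The device is set before the kernel is loaded so that compilation and execution target the same GPU.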
In a way, this blog post mirrors the way that we want technical computing to be: 650 words of ideas wrapped up in 7 lines of code. Look at any other solution for doing this and you will find the ratio of code to ideas to be much higher.