Conway's Game of Life is one of the most popular code katas, and the one we will exercise on our next code retreat this Saturday. I have done this kata, quite a number of times, but this time there was a twist: I used the GPU on my Nvidia GTX 560 to calculate the steps of evolution, and to draw them as well. Here is how the result looks like:
The igniter of this exercise was that I read a book on CUDA programing, and it was quite a disappointment. The examples were really far from anything I would call clean code, or at least easy to understand: it was spaghetti with huge meatballs in it. That kind of examples require a lot of explanation, which basically took up most of the books volume, and there was little left for the useful stuff. So I got curious: is it CUDA that promotes bad code, or the writers of the book didn't care enough? I had a preconception that there was nothing wrong with CUDA - and actually it's a pretty neat language - so I set out on a quest to code Game of Life in CUDA, and try to make it as clean as possible. I don't claim, that my code is perfect, but I hope it's going to be an easy read for anyone who tries.
You can download the result of my quest by clicking here, or just browse it as HTML by clicking here. The rest of the blog entry - basically a retrospective of the project, and a few remarks on CUDA features - might not make much sense without looking at parts of the code. If you'd like to compile, than you will need the nvcc compiler for CUDA, the CUDA Toolkit, GLUT and GoogleTest to run my unit tests. Please note, that the code is not production quality: it may well throw exceptions at you, it's only tested on Ubuntu, and it is not performance optimized in any way. (Even the algorithm is the most naive one.)
Let's start with the candy, shall we? We all heard legends about how fast GPU computing can be. Nvidia markets that really well. They even hired the mythbusters to make a fancy little demonstration about how amazing the GPU was.
But in the end marketing is just for the business guys, for us - tech freaks - all that really matters are the numbers. I had an earlier GameOfLife implementation in C++. It was a bit different in design, but the algorithm was the same, so it's fair to compare them. The test field was a 1000x1000 world. The CPU version on my AMD Athlon(tm) 64 Dual Core Processor 4200+ calculates an evolution step in 0.695 seconds average. On the other hand the GPU version running on my Nvidia GTX 560 finished the same task in 0.001 seconds average: that's about 700 times faster. I guess that's a right time to say "Boo-yah!".
But be aware: the naive implementation of GoL is inherently parallel. It's like it has be designed with a GPU in mind. Most algorithms used today are sequential, and turning them into a parallel one may not be trivial. So before you run to the closest computer store to buy a CUDA enabled video card, look at your problem carefully, and make sure that you can split it up into smaller problems that can be solved individually.
Writing GoL in CUDA was a good practice in refactoring. At first glance CUDA looks like a very low level procedural language strongly based on C, so I started out with one big class for the host code, one for the GLUT calls, and a big ugly kernel function for the device. My first stab at the problem didn't look that much better than the hated examples of the book. However pretty quickly one starts to explore the object oriented features, and it turns out, that almost everything can be shoved into objects, and even the inevitable cudaMemcpy, cudaMemset and cudaMalloc/cudaFree calls can be hidden behind the scenes by separating responsibilities.
Good thing that I had the tests to back me up while I relentlessly took apart everything:first I extracted methods, and renamed variables until each line read like a book. Next I extracted drawing logic from the GLUT wrapper class into it's own object. Than it became apparent that the GLUT wrapper still has multiple responsibilities: it quickly emitted the CudaSharedGLBuffer class that is responsible for the shared resource for CUDA-GL interoperability. At this point the wrapper class was clearly a display, so I renamed it to GlutDisplay.
GlutDisplay is just kind of a class. The problem is that GLUT, is written in C, and as such it has no idea about things like function objects. That meant that I had to add static functions to it to use them as callbacks. Since I already had static functions anyway, I made GlutDisplay a singleton. I'm pretty much against singletons in general - it's just a fancy global, and just as evil - but I don't really see any good alternative here. You may also spot, that there are no tests for classes that have been extracted from the GLUT wrapper. That is partly due to the fact that it's a singleton and tests don't like globals. The other reason is that I don't really see any reasonable way to test if GLUT displays what is expected.
Once the GLUT part was clean, my attention turned to the Game class. The first thing I did was to remove the loader functions from it into a builder class: GameFileLoader. That quickly turned into a class hierarchy with a child for each format. Pretty basic stuff! While I was at it, I also introduced a convenience template function: loadGame<FILE_FORMAT_CLASS>(Game, IStream).
The last refactoring step was the biggest. I had to separate the resource managing code, and the business logic of Game of Life by an extract class refactoring. SyncedSwapBuffer may still be doing more than it should be, but at this point I didn't see any clean way to chop it up. By the way the GameOfLife class already kind of looks like it has envy for some of the features in SyncedSwapBuffer. But a little feature envy is acceptable in return for the clear separation of responsibilities.
While refactoring I came by limitations of CUDA that could be easily eliminated. For example it would be nice, if one could have a device and a host function with the same name, but different implementations. For example one could write accessors for device data, and host data with the same name.
I also miss DeviceToHost and HostToDevice copy constructors. I want to hide memory management entirely from users of my container class. With a HostToDevice copy constructor I could sync the data on the host to the device only when it changed, and the DeviceToHost constructor would just set a flag, that reminds all host accessors to copy data back if they need it. Put that together with the over-loadable accessors suggested earlier, and the client of the container class will never find out about my lazy attitude towards data synchronization.
Finally one can not make device kernel entry points (functions declared as __global__) class members. The object on which the function is invoked is really just a function argument. The only difference, is that it is always passed by reference, and not by value, while kernel calls can only pass arguments by value. Fair enough, it may be confusing at first, but if a class has both a copy constructor and an assignment operator implemented, than the behavior of member function calls could be mimicked.
My first and most important advice for you is this: use RAII objects for resource management. CUDA doesn't come with nicely designed STL containers (although there are libraries available that contain such objects) so it's awfully easy to fall into the trap of writing C like code with lot's of malloc's and other memory management instructions scattered around in your code. On the other hand you will use C++ features (most notably exceptions) that make that kind of code vulnerable. If you need device memory, than malloc it in a constructor, and free it in the destructor. If you need to lock an OpenGL buffer for your CUDA kernel than lock it with a RAII object, so that it is released back to OpenGL after your kernel exits. And so on...
When writing object oriented code in CUDA, there will be a point where you want to call a member function as a device kernel. As mentioned earlier, you can't! Well... at least not straightaway. The trick is to implement the copy constructor for your class, and pass it to the kernel as the first argument. Your kernel won't do anything else, but call your member function with the remaining parameters. The trick is getting the copy constructor right.
In most cases your class will have a SMART pointer to the data on the device that you'd like to manipulate. (Smart is emphasized for a reason.) This smart pointer will need to be a reference counting one, since you'll copy the class, but keep a single copy of the data in device memory. If you don't reference count, than the moment your kernel returns you are short of the data you intended to copy back to the host in the first place. However be careful with reference counting smart pointers: if they gather up in a ring, they start leaking stuff quicker than Julian Assange. So when using shared pointers - not just in CUDA - always have a clear hierarchy of objects that use it, and make sure that each shared pointer points down at least one level.
Try to be lazy with synchronizing the data between the host and the device. In my Game of Life implementation you can spot a way to implement that.
When it comes to Test Driven Development with CUDA, there isn't any good way to directly test device functions. (Unless you know about a good CUDA unittest framework... if so, please let me know about it!) So the best way to go about this, is treating the device functions as private functions. Here it becomes even more important to take TDD seriously: you implement a little more than necessary to pass the test once, and you can be sure that the extra feature will not be tested until it breaks, and debugging it gives you a headache.
Challenges for the reader
Have you ever wondered why CUDA doesn't support recursion? Did you realize that all device functions have to be available in the same compilation unit when a kernel is compiled? Can you guess the reason? Will the answer make you do more extract function refactoring on device code, or you'll be more cautious with that from now?
In my implementation I did not have a separate Cell class, although it may come natural with clean design in mind. Why isn't it natural to do that with CUDA? If you already had a Game of Life implementation that expects to access the cells as individual objects, could you replace the GameOfLife object without changing the rest of the code? (I.e. mimic the interface that used to expose a cell object.) How would you do it? Try changing the interface of my GameOfLife class like that!