Over the last few months, I’ve had an opportunity to spend some time playing with OpenCL. In short, we’re trying to use a GPU to accelerate garbage collection for Java. (Once the work is published, I’ll post more here.) We’ve implemented a simple graph traversal algorithm on an AMD chip using OpenCL. This article doesn’t talk about that effort directly, but instead focuses on a few of the lessons we learned the hard way while getting up to speed on OpenCL. (So I remember them for next time!)
This has been a group effort, but the content, opinions, and mistakes herein are all my own.
Stability & Dev Environment
The first and most important lesson we learned was that each developer needs a dedicated test machine which is not their primary development box. This box needs to be local. When debugging OpenCL programs on real hardware, it is shockingly easy to lock up the entire box. On multiple occasions, we had to perform hard power cycles on our test machine to get it into a usable state.
Even when the box didn’t lock up entirely, a crashed program with an OpenCL kernel outstanding has a bad tendency to prevent future kernels from being executed. Supposedly, there is a timeout that terminates a runaway program, but we never saw this happen in practice. Instead, we ended up rebooting the box quite frequently.
In a related vein, we quickly started replacing every while-loop with a for-loop (over a large, but fixed, number of iterations). This allows you to (sometimes) recover from what would otherwise have been an infinite loop without rebooting the box.
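As a rough illustration of that rewrite, here is a minimal sketch in plain C (the same loop shape works inside a kernel). The function name, the `next[]` link array, and the `MAX_STEPS` cap are all made up for this example; pick a cap comfortably above any legitimate iteration count for your own code.

```c
/* Hypothetical cap on iterations; not something OpenCL defines. */
#define MAX_STEPS 1000000

/*
 * Follows next[] links from `start` until it reaches -1.
 * Returns the number of hops taken, or -1 if the cap was hit first
 * (for example because the links form a cycle).
 */
int bounded_walk(const int *next, int start)
{
    int node = start;
    int step;

    /* Was: while (node != -1) { node = next[node]; } */
    for (step = 0; step < MAX_STEPS && node != -1; ++step) {
        node = next[node];
    }

    /* If we stopped because of the cap, report it instead of spinning forever. */
    return (node == -1) ? step : -1;
}
```

The caller still has to check for the failure value, but a wrong answer you can detect beats a machine you have to power cycle.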
Another important note is that the documentation available from Khronos is at best incomplete and, in a couple of cases, potentially wrong. Many of the function descriptions don’t provide relevant details about usage and none of them provide useful examples. (You can get some of the latter from the AMD and NVIDIA SDKs.) I strongly suggest searching Google for examples before taking the documentation at its word.
OpenCL does not appear to support a mechanism to forcibly abort a kernel. Nor does it support an assertion mechanism. Nor, in the version we used, does it have any form of debug logging (e.g. printf or the like; OpenCL 1.2 adds a printf built-in). The only way to exit a kernel is for every thread to return from the main kernel function. Unfortunately, this means that error reporting - even for cases where you can easily tell what happened - is extremely hard. I don’t have a great solution. We ended up writing data into global memory - so the CPU could access it after termination - and then trying to exit cleanly. This worked sometimes, but was error prone to say the least.
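To make the "write it to global memory and exit cleanly" idea concrete, here is a minimal single-threaded model in plain C. It is not our actual code: the error codes, the three-word record layout, and the function names are all invented for illustration. In a real kernel `rec` would be a `__global` buffer that the host zeroes before launch and reads back afterwards, and the claim on the first word would need to be atomic (for example with `atomic_cmpxchg`) so that only the first error wins.

```c
/* Hypothetical error codes and record layout; none of this comes from OpenCL. */
enum { ERR_NONE = 0, ERR_BAD_INDEX = 1, ERR_STEP_CAP = 2 };
enum { REC_CODE = 0, REC_ITEM = 1, REC_DETAIL = 2, REC_WORDS = 3 };

/*
 * Records the first error only. `rec` stands in for a small buffer in
 * global memory. This model uses a plain test; a kernel would need an
 * atomic compare-and-swap on rec[REC_CODE] instead.
 */
void report_error(int *rec, int code, int item, int detail)
{
    if (rec[REC_CODE] == ERR_NONE) {
        rec[REC_CODE] = code;
        rec[REC_ITEM] = item;
        rec[REC_DETAIL] = detail;
    }
}

/* Models one work-item: validate, report, then return normally. */
void work_item(int item, const int *index, int n, int *out, int *rec)
{
    int i = index[item];

    if (i < 0 || i >= n) {
        report_error(rec, ERR_BAD_INDEX, item, i);
        return; /* in a kernel, make sure returning early skips no barrier */
    }
    out[i] = item;
}
```

After the kernel finishes, the host reads the record back: a non-zero code says something went wrong, and the other two words say which work-item saw it and what value it choked on.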
I haven’t played with the various debuggers and emulators available, but I suspect that would help greatly in debugging.
Synchronization
OpenCL has different synchronization models for threads within a workgroup vs across workgroups on the same device. As far as I can tell, there is no synchronization available between kernels running on different devices on the same machine. (You can use the CPU to coordinate starting and stopping kernels of course.)
Barriers apply only to threads within a single workgroup. The CLK_LOCAL_MEM_FENCE and CLK_GLOBAL_MEM_FENCE flags enforce memory consistency within a single workgroup, not across workgroups. Note that you can have a barrier - where all threads stop - but not have a consistent view of memory if you don’t pass the appropriate flags.
ALL threads within a workgroup must encounter the same barrier. If even a single thread does not, the program will hang indefinitely. (And require a hard reboot of the machine.) This is unpleasant to debug to say the least.
Atomic operations are the only way to synchronize between workgroups. To avoid memory contention (and thus serialization of requests), you probably want only a single thread per workgroup to execute the atomic operation. Doing this requires an additional synchronization (using a barrier within the workgroup and a temporary local memory value) to get all threads within a workgroup consistent.
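Here is a small sequential model of that pattern in plain C, with the kernel-side equivalents shown in comments. It is a sketch rather than our real kernel: the group size, the slot-claiming scenario, and all names are assumptions made for the example. The idea is that one thread claims a block of output slots for the whole workgroup, parks the result in local memory, and everyone else picks it up after a barrier.

```c
#define GROUP_SIZE 4 /* hypothetical workgroup size */

/*
 * Models one workgroup claiming GROUP_SIZE consecutive output slots.
 * `next_slot` stands in for a counter in global memory that is shared
 * by all workgroups.
 */
void run_group(int group, int *next_slot, int *out)
{
    int base; /* kernel: __local int base; */
    int lid;

    /* Phase 1, work-item 0 only.
     * kernel: if (lid == 0) base = atomic_add(next_slot, GROUP_SIZE); */
    base = *next_slot;
    *next_slot += GROUP_SIZE;

    /* kernel: barrier(CLK_LOCAL_MEM_FENCE);
     * Every work-item in the group must reach this barrier. */

    /* Phase 2, every work-item reads the shared base. */
    for (lid = 0; lid < GROUP_SIZE; ++lid) {
        out[base + lid] = group * 100 + lid;
    }
}
```

This way the global counter sees one atomic request per workgroup instead of one per thread, at the cost of a barrier that every thread in the group has to hit.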
Be careful about which versions of the atomic functions you use. OpenCL provides 32-bit vs 64-bit and local vs global memory versions. The ones we used - which unfortunately are extensions and not part of the core language, but thankfully seem pretty common - were cl_khr_int64_base_atomics and cl_khr_int64_extended_atomics. I’ve read some reports that the atomic_op functions don’t behave the same as the atom_op versions. I can’t find confirmation of this in the documentation, but we used the atom_op versions just in case. Another gotcha is that some cards apparently don’t support the local versions. Check your documentation carefully, since by some reports the functions will simply fail silently.
Note that it is unclear whether atomic operations on global memory are visible to the CPU, to different GPUs, or merely to different workgroups on the same device. I haven’t spent much time digging through the documentation, but if this matters to you, check! The one thing that is clear from the documentation is that atomic operations executed by different GPUs on a shared address are not guaranteed to be atomic!
Infrastructure
To get good performance - even just to minimize testing time - you should probably be using precompiled files. (Note: These are not portable binaries and cannot be moved between machines. They are purely a caching mechanism.) You’ll need a mechanism - hash, command line parameter, build system, etc. - to make sure your cached files stay in sync with your source code of course.
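For the hash option, one possible approach is to derive the cache file name from the kernel source and the build options, so that any edit to either produces a different name and the stale file is simply never found. The sketch below uses 64-bit FNV-1a, which is my choice for the example and not something OpenCL provides; the function names and the `kernel-<hash>.bin` naming scheme are likewise made up. Since the cached files are machine specific, you may also want to mix a device or driver identifier into the hash.

```c
#include <stdint.h>
#include <stdio.h>

/* Continues a 64-bit FNV-1a hash `h` over the NUL-terminated string `s`. */
uint64_t fnv1a64_update(uint64_t h, const char *s)
{
    while (*s) {
        h ^= (unsigned char)*s++;
        h *= 0x100000001b3ULL; /* FNV prime */
    }
    return h;
}

/*
 * Writes a cache file name derived from the kernel source and the build
 * options into `out`. Returns 0 on success, -1 if `out` is too small.
 */
int cache_file_name(const char *source, const char *options,
                    char *out, size_t out_len)
{
    uint64_t h = 0xcbf29ce484222325ULL; /* FNV offset basis */
    int n;

    h = fnv1a64_update(h, source);
    h = fnv1a64_update(h, "|"); /* separator between source and options */
    h = fnv1a64_update(h, options);

    n = snprintf(out, out_len, "kernel-%016llx.bin", (unsigned long long)h);
    return (n < 0 || (size_t)n >= out_len) ? -1 : 0;
}
```

At startup you compute the name, try to load that file, and fall back to compiling from source (and writing the file) if it isn’t there.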
Having a separate program which sanity checks your files - e.g. as part of your build system - will save you time in the long run. If I get time, I’ll clean up the hacky mess I’ve been using and post it here.
Generally, the best way to get data from the CPU to the GPU (at least on our setup) is to use CL_MEM_USE_HOST_PTR. There seems to be a lot of confusion about exactly what this does: the top Google results appear inaccurate and the documentation isn’t super clear. Still, some micro benchmarks gave much better results with it than with either of the other two options. (As always, you cannot assume that the CPU and GPU have consistent views of this data or that it’ll be mapped to the same address on both platforms. All synchronization with the GPU kernels has to be explicit.) It’s also unclear to me whether OpenCL is required to copy the data back into host memory after termination or whether that region can be entirely stale. That wasn’t important for our case, so I never tested it. The best discussion I’ve seen is here, but even that’s somewhat unclear on the finer points.
Depending on what you’re doing, you may find some of the various utility libraries useful - COPRTHR: STDCL, SOCL, or oclUtils from the NVIDIA SDK. The only one of these I’ve used is the oclUtils files, which were moderately useful.
Conclusion
I hope this was useful to you. If you have corrections or suggestions, please feel free to contact me.