CUDA programming pitfalls 1: atomic operations
I have been using atomic operations in CUDA to implement some lock-free data structures. A few days ago, while testing a lock-free queue I wrote, I ran into a bug. The program ran correctly in device emulation mode, but on the real device the result was always wrong. Since the code was simple and straightforward, I suspected a compiler bug, so I filed a bug report with NVIDIA and asked for a workaround. Some days later, they told me they had confirmed the bug, and that it occurs when the address of a local variable is passed to an atomic operation. The atomic operations are supposed to work on global memory addresses. When given the address of a local variable, the atomic operation treats it as a global memory address and therefore modifies some unrelated memory, yielding the wrong result. This happens in CUDA 2.1. They may fix it in a later release, but for now, remember: do NOT pass the address of a local variable to the atomic operations.
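To make the pitfall concrete, here is a minimal sketch of the safe pattern next to the problematic one. It is not the queue code from my bug report. The kernel `count_threads` and the host helper `run_count` are hypothetical names made up for this illustration, and the sketch assumes a device that supports integer atomics on global memory.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: every thread increments one counter in global memory.
__global__ void count_threads(int *counter)
{
    // Pitfall (do NOT do this): the target is a thread-local variable,
    // so its address is not a global memory address.
    //   int local = 0;
    //   atomicAdd(&local, 1);

    // Safe: counter points to memory allocated with cudaMalloc,
    // i.e. a global memory address.
    atomicAdd(counter, 1);
}

// Hypothetical host helper: launches the kernel and returns the final
// counter value, or -1 if any CUDA call fails.
int run_count(int blocks, int threads)
{
    int *d_counter = NULL;
    int h_counter = -1;

    if (cudaMalloc(&d_counter, sizeof(int)) != cudaSuccess)
        return -1;
    if (cudaMemset(d_counter, 0, sizeof(int)) != cudaSuccess) {
        cudaFree(d_counter);
        return -1;
    }

    count_threads<<<blocks, threads>>>(d_counter);

    // cudaMemcpy waits for the kernel to finish before copying.
    if (cudaMemcpy(&h_counter, d_counter, sizeof(int),
                   cudaMemcpyDeviceToHost) != cudaSuccess)
        h_counter = -1;

    cudaFree(d_counter);
    return h_counter;
}
```

With the safe version, a call such as `run_count(4, 64)` should return the total number of threads launched. If an atomic operation needs per-thread scratch data, one option under the same assumptions is to give each thread its own slot in a global memory array instead of using a local variable.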