Using the NVIDIA CUDA Stream-Ordered Memory Allocator, Part 1
Most CUDA developers are familiar with the cudaMalloc and cudaFree API functions for allocating GPU-accessible memory. However, there has long been an obstacle with these API functions: they aren't stream ordered. In this post, we introduce new API functions, cudaMallocAsync and cudaFreeAsync, that enable memory allocation and deallocation to be stream-ordered operations. In part 2 of this series, we highlight the benefits of this new capability by sharing some big data benchmark results and provide a code migration guide for modifying your existing applications. We also cover advanced topics to take advantage of stream-ordered memory allocation in the context of multi-GPU access and the use of IPC. All of this helps you improve performance within your existing applications.

The first pattern in the following code example is inefficient because the first cudaFree call has to wait for kernelA to finish, so it synchronizes the device before freeing the memory. To make this run more efficiently, the memory can be allocated up front and sized to the larger of the two sizes, as the second pattern shows.
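A minimal sketch of the two patterns, assuming two kernels used back to back on a single stream; kernelB, the launch configurations, and the function names are illustrative, not from the original post:

```cpp
#include <algorithm>
#include <cuda_runtime.h>

// Illustrative kernels; the kernel bodies are not part of the original post.
__global__ void kernelA(char *buf) { /* ... */ }
__global__ void kernelB(char *buf) { /* ... */ }

// Inefficient: the first cudaFree cannot complete until kernelA finishes,
// so it synchronizes the whole device before the memory is freed.
void perCallAllocation(cudaStream_t stream, size_t sizeA, size_t sizeB)
{
    char *ptrA, *ptrB;
    cudaMalloc(&ptrA, sizeA);
    kernelA<<<256, 256, 0, stream>>>(ptrA);
    cudaFree(ptrA);                          // implicit device synchronization
    cudaMalloc(&ptrB, sizeB);
    kernelB<<<256, 256, 0, stream>>>(ptrB);
    cudaFree(ptrB);
}

// More efficient: allocate once, sized to the larger of the two uses,
// and reuse the same buffer for both kernels.
void upfrontAllocation(cudaStream_t stream, size_t sizeA, size_t sizeB)
{
    char *ptr;
    cudaMalloc(&ptr, std::max(sizeA, sizeB));
    kernelA<<<256, 256, 0, stream>>>(ptr);
    kernelB<<<256, 256, 0, stream>>>(ptr);
    cudaFree(ptr);
}
```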
This increases code complexity in the application because the memory management code is separated from the business logic. The problem is exacerbated when other libraries are involved, for example when kernelA is launched by a library function rather than by the application itself. This is much harder for the application to make efficient because it may not have full visibility or control over what the library is doing. To circumvent this problem, the library would have to allocate memory when the function is invoked for the first time and never free it until the library is deinitialized, as sketched below. This not only increases code complexity, but it also causes the library to hold on to the memory longer than it needs to, potentially denying another portion of the application the use of that memory. Some applications take the idea of allocating memory up front even further by implementing their own custom allocator. This adds a significant amount of complexity to application development. CUDA aims to provide a low-effort, high-performance alternative.
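A sketch of that workaround, assuming the library keeps a single cached scratch buffer; the namespace and function names here are hypothetical:

```cpp
#include <cuda_runtime.h>

__global__ void kernelA(char *buf) { /* ... */ }

// Hypothetical library that allocates on first use and holds the memory
// until deinitialization, so cudaFree never lands on the hot path.
namespace examplelib {

char  *cached     = nullptr;   // lives for the library's entire lifetime
size_t cachedSize = 0;

void compute(cudaStream_t stream, size_t size)
{
    if (size > cachedSize) {               // allocate on first use, or grow
        cudaFree(cached);                  // synchronizes, but only on resize
        cudaMalloc(&cached, size);
        cachedSize = size;
    }
    kernelA<<<256, 256, 0, stream>>>(cached);
}

void deinit()
{
    cudaFree(cached);                      // memory finally released here
    cached     = nullptr;
    cachedSize = 0;
}

} // namespace examplelib
```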
CUDA 11.2 introduced a stream-ordered memory allocator to solve these types of problems, with the addition of cudaMallocAsync and cudaFreeAsync. These new API functions shift memory allocation from global-scope operations that synchronize the entire device to stream-ordered operations that let you compose memory management with GPU work submission. This eliminates the need to synchronize outstanding GPU work and helps restrict the lifetime of the allocation to the GPU work that accesses it.

All the usual stream-ordering rules apply to cudaMallocAsync and cudaFreeAsync. The memory returned from cudaMallocAsync can be accessed by any kernel or memcpy operation as long as the kernel or memcpy is ordered to execute after the allocation operation and before the deallocation operation, in stream order. Deallocation can be performed in any stream, as long as it is ordered to execute after the allocation operation and after all accesses to that memory on the GPU, across all streams. It is now possible to manage memory at function scope, as in the following example of a library function launching kernelA.
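A minimal sketch of both points, assuming illustrative signatures and launch configurations; the second function shows that a free on a different stream is valid once an event establishes the required ordering:

```cpp
#include <cuda_runtime.h>

__global__ void kernelA(char *buf) { /* ... */ }

// Function-scope memory management: allocation, use, and deallocation are
// all composed on the same stream, with no device-wide synchronization.
void libraryFunction(cudaStream_t stream, size_t size)
{
    void *ptr;
    cudaMallocAsync(&ptr, size, stream);        // stream-ordered allocation
    kernelA<<<256, 256, 0, stream>>>((char *)ptr);
    cudaFreeAsync(ptr, stream);                 // ordered after kernelA
}

// Deallocation on a different stream is also legal, provided it is ordered
// after all accesses to the memory; here an event establishes that ordering.
void freeOnAnotherStream(cudaStream_t streamA, cudaStream_t streamB, size_t size)
{
    void *ptr;
    cudaEvent_t accessesDone;
    cudaEventCreate(&accessesDone);

    cudaMallocAsync(&ptr, size, streamA);
    kernelA<<<256, 256, 0, streamA>>>((char *)ptr);
    cudaEventRecord(accessesDone, streamA);

    cudaStreamWaitEvent(streamB, accessesDone, 0);  // streamB waits for kernelA
    cudaFreeAsync(ptr, streamB);                    // free ordered after all accesses
    cudaEventDestroy(accessesDone);
}
```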