This post is part of a series where we discuss some features of the brandnew hipSYCL 0.9.1. Today I want to take a closer look at
Asynchronous buffers and explicit buffer policies
This is a new extension in hipSYCL that can make code using
sycl::buffer objects much clearer while also improving performance. Interested? Then this blog post is for you.
Motivation 1: Buffers are complicated
sycl::buffer is a very complicated object. Depending on a combination of multiple factors the semantics of a
sycl::buffer can be very different. Will it operate directly on input pointers or will it copy input data to some internal storage? Will it submit a writeback in the destructor to copy data back to host?
I have frequently noticed users getting this wrong. This can either lead to correctness issues, for example
- the buffer operates directly on the input pointer, while the user has only intended to provide it as a source of initial data and wanted to reuse it after buffer construction
- no writeback is issued even though the user expected data to be copied back to host.
Or performance bugs might be introduced - these are arguably even worse because you might not notice them right away and they might be difficult to find. Some performance bugs that I have seen in user code are:
- The buffer issued an unexpected writeback, and thus copied data back to host without the user intending it
- The buffer did not operate directly on the pointer provided in the constructor, but instead first copied the data to internal storage which broke performance assumptions on the CPU backend.
Motivation 2: The buffer destructor antipattern
In addition, there is a related performance antipattern that I have noticed frequently. Consider the following code:
We construct three buffers that get an input pointer and then, when the scope closes, issue a writeback in their destructors. The problem here is that the execution of writebacks is not really efficient. The SYCL specification requires that in the destructor, a
buffer has to wait for the completion of all operations that use it. This means that the following sequence of operations will be executed:
b3.~buffer()runs: submit writeback, wait for completion
b2.~buffer()runs: submit writeback, wait for completion
b1.~buffer()runs: submit writeback, wait for completion
Here we have multiple unnecessary cases of synchronization. For performance it is always better to submit all available work asynchronously, and then wait as late as possible with as few wait calls as possible. So, something like the following will in general perform better:
b3.~buffer()runs: submit writeback asynchronously
b2.~buffer()runs: submit writeback asynchronously
b1.~buffer()runs: submit writeback asynchronously
- Maybe do some other work while the writebacks are being processed
- Wait for all writebacks to complete
This has multiple advantages:
- The SYCL implementation can process a larger task graph consisting of multiple writebacks as well as any other operations that might have been submitted previously, allowing for more optimization opportunities
- There is less latency between the writebacks when they are processed by the SYCL backend and hardware, because there is no synchronization in between them.
- The execution of writeback can be overlapped with other work on the host if the wait is executed later.
Note: While the worst case is clearly when the buffers submit writebacks as in this example, even if the buffers do not submit a writeback, there might still be a negative performance impact: Because all buffer destructors need to wait individually, they cause individual and potentially unnecessary flushes of the SYCL task graph.
Enter explicit buffer policies
To address both the destructor antipattern as well as the complexity of buffers, hipSYCL 0.9.1 introduces explicit buffer policies, which allow the user to explicitly specify the desired behavior of a buffer. We introduce the following terminology:
|Destructor blocks?||Writes back ?||Uses external storage?|
For example, a
sync_writeback_view refers to the behavior where the destructor blocks (
sync), a writeback will be issued in the destructor (
writeback) and the buffer will operate directly on provided input data pointers (
These behaviors are not expressed as new C++ types, but as regular
sycl::buffer objects that are initialized with special buffer properties. buffers with explicit behaviors are constructed using factory functions such as
buffer<T, Dim> make_sync_buffer(...).
Since these functions still return a
sycl::buffer<T, Dim>, explicit buffer behaviors integrate well with existing SYCL code that relies on the
Using those factory functions instead of directly constructing
sycl::buffer objects significantly improves code clarity - the programmer can now see with one quick glance at the function call what is going to happen, and what performance implications there are.
view behavior operate directly on the provided input pointer when running on the CPU backend. The pointer must be considered as being in use by the buffer until all operations that the buffer is involved in have completed, including potential writebacks.
buffer behavior will not operate directly on optionally provided input pointers. If an input data pointer is provided, the data content will be copied to internal storage. The pointer is safe to use (or delete) as desired by the user after the buffer constructor returns.
writeback behavior will submit a writeback operation to migrate data back to host in the destructor. This will only lead to an actual data copy if the data on the host is outdated. With hipSYCL explicit buffer behaviors, a writeback needs to be explicitly requested by invoking a buffer factory function with
writeback in its name. This prevents users accidentally introducing performance bugs by means of unnecessary writebacks.
Only buffers with
sync behavior block in their destructor. Buffers of
async behavior do not - and therefore can be used to solve the buffer destructor performance antipattern:
Here async writeback views are used that do not block in their destructor. hipSYCL guarantees that memory allocated by buffer objects will not be freed if there are still operations in flight utilizing those allocations, so kernels and other operations using the buffer objects will complete successfully even if the user-facing buffer object has already been destroyed.
For performance it should be considered best practice to use the async behaviors by default and only use the sync variants when it is absolutely necessary.
Not every combination of buffer behaviors makes sense. hipSYCL currently supports the following factory functions:
For the full API reference, see the hipSYCL documentation.