tf::cudaFlowCapturer class

class for building a CUDA task dependency graph through stream capture

A cudaFlowCapturer inherits all the base methods from tf::cudaFlowCapturerBase to construct a CUDA task graph through stream capture. This class also defines a factory interface, tf::cudaFlowCapturer::make_capturer, for users to create custom capturers with their lifetimes managed by the factory.

The usage of tf::cudaFlowCapturer is similar to tf::cudaFlow, except users can call the method tf::cudaFlowCapturer::on to capture a sequence of asynchronous CUDA operations through the given stream. The following example creates a CUDA graph that captures two kernel tasks, task_1 and task_2, where task_1 runs before task_2.

taskflow.emplace([](tf::cudaFlowCapturer& capturer){

  // capture my_kernel_1 through the given stream managed by the capturer
  auto task_1 = capturer.on([&](cudaStream_t stream){ 
    my_kernel_1<<<grid_1, block_1, shm_size_1, stream>>>(my_parameters_1);
  });

  // capture my_kernel_2 through the given stream managed by the capturer
  auto task_2 = capturer.on([&](cudaStream_t stream){ 
    my_kernel_2<<<grid_2, block_2, shm_size_2, stream>>>(my_parameters_2);
  });

  task_1.precede(task_2);
});

Similar to tf::cudaFlow, a cudaFlowCapturer is a task (tf::Task) created from tf::Taskflow and will be run by one worker thread in the executor. That is, the callable that describes a cudaFlowCapturer will be executed sequentially. Inside a cudaFlow capturer task, different GPU tasks (tf::cudaTask) may run in parallel scheduled by both our capturing algorithm and the CUDA runtime.

Please refer to GPUTaskingcudaFlowCapturer for details.

Constructors, destructors, conversion operators

cudaFlowCapturer()
constructs a standalone cudaFlowCapturer
~cudaFlowCapturer() virtual
destructs the cudaFlowCapturer

Public functions

auto empty() const -> bool
queries the emptiness of the graph
void dump(std::ostream& os) const
dumps the capture graph into a DOT format through an output stream
template<typename T, typename... ArgsT>
auto make_capturer(ArgsT && ... args) -> T*
creates a custom capturer derived from tf::cudaFlowCapturerBase
template<typename OPT, typename... ArgsT>
auto make_optimizer(ArgsT && ... args) -> OPT&
enables different optimization algorithms
template<typename C, std::enable_if_t<std::is_invocable_r_v<void, C, cudaStream_t>, void>* = nullptr>
auto on(C&& callable) -> cudaTask
captures a sequence of CUDA operations from the given callable
auto memcpy(void* dst, const void* src, size_t count) -> cudaTask
copies data between host and device asynchronously through a stream
template<typename T, std::enable_if_t<!std::is_same_v<T, void>, void>* = nullptr>
auto copy(T* tgt, const T* src, size_t num) -> cudaTask
captures a copy task of typed data
auto memset(void* ptr, int v, size_t n) -> cudaTask
initializes or sets GPU memory to the given value byte by byte
template<typename F, typename... ArgsT>
auto kernel(dim3 g, dim3 b, size_t s, F&& f, ArgsT && ... args) -> cudaTask
captures a kernel
template<typename C>
auto single_task(C&& callable) -> cudaTask
captures a kernel that runs the given callable with only one thread
template<typename I, typename C>
auto for_each(I first, I last, C&& callable) -> cudaTask
captures a kernel that applies a callable to each dereferenced element of the data array
template<typename I, typename C>
auto for_each_index(I first, I last, I step, C&& callable) -> cudaTask
captures a kernel that applies a callable to each index in the range with the step size
template<typename I, typename C, typename... S>
auto transform(I first, I last, C&& callable, S... srcs) -> cudaTask
captures a kernel that applies a callable to a source range and stores the result in a target range
template<typename I, typename T, typename C>
auto reduce(I first, I last, T* result, C&& op) -> cudaTask
captures a kernel that performs parallel reduction over a range of items
template<typename I, typename T, typename C>
auto uninitialized_reduce(I first, I last, T* result, C&& op) -> cudaTask
similar to tf::cudaFlowCapturerBase::reduce but does not assume an initial value for the reduction
template<typename C, std::enable_if_t<std::is_invocable_r_v<void, C, cudaStream_t>, void>* = nullptr>
auto rebind_on(cudaTask task, C&& callable) -> cudaTask
rebinds a capture task to another sequence of CUDA operations
auto rebind_memcpy(cudaTask task, void* dst, const void* src, size_t count) -> cudaTask
rebinds a capture task to a memcpy operation
template<typename T, std::enable_if_t<!std::is_same_v<T, void>, void>* = nullptr>
auto rebind_copy(cudaTask task, T* tgt, const T* src, size_t num) -> cudaTask
rebinds a capture task to a copy operation
auto rebind_memset(cudaTask task, void* ptr, int value, size_t n) -> cudaTask
rebinds a capture task to a memset operation
template<typename F, typename... ArgsT>
auto rebind_kernel(cudaTask task, dim3 g, dim3 b, size_t s, F&& f, ArgsT && ... args) -> cudaTask
rebinds a capture task to a kernel operation
template<typename P>
void offload_until(P&& predicate)
offloads the captured cudaFlow onto a GPU and repeatedly runs it until the predicate becomes true
void offload_n(size_t n)
offloads the captured cudaFlow and executes it the given number of times
void offload()
offloads the captured cudaFlow and executes it once

Function documentation

tf::cudaFlowCapturer::cudaFlowCapturer()

constructs a standalone cudaFlowCapturer

A standalone cudaFlow capturer does not go through any taskflow and can be run by the caller thread using explicit offload methods (e.g., tf::cudaFlowCapturer::offload).

template<typename T, typename... ArgsT>
T* tf::cudaFlowCapturer::make_capturer(ArgsT && ... args)

creates a custom capturer derived from tf::cudaFlowCapturerBase

Template parameters
T custom capturer type
ArgsT argument types
Parameters
args arguments to forward to construct the custom capturer
Returns a pointer to the custom capturer

Each cudaFlow capturer keeps a list of custom capturers and manages their lifetimes. The lifetime of each custom capturer is the same as the capturer.

template<typename OPT, typename... ArgsT>
OPT& tf::cudaFlowCapturer::make_optimizer(ArgsT && ... args)

enables different optimization algorithms

Template parameters
OPT optimizer type
ArgsT argument types
Parameters
args arguments to forward to construct the optimizer
Returns a reference to the optimizer

We currently support multiple optimization algorithms for capturing a user-described cudaFlow.

template<typename C, std::enable_if_t<std::is_invocable_r_v<void, C, cudaStream_t>, void>* = nullptr>
cudaTask tf::cudaFlowCapturer::on(C&& callable)

captures a sequence of CUDA operations from the given callable

Template parameters
C callable type constructible with std::function<void(cudaStream_t)>
Parameters
callable a callable to capture CUDA operations with the stream

This method applies a stream created by the capturer to capture a sequence of CUDA operations defined in the callable.

cudaTask tf::cudaFlowCapturer::memcpy(void* dst, const void* src, size_t count)

copies data between host and device asynchronously through a stream

Parameters
dst destination memory address
src source memory address
count size in bytes to copy

The method captures a cudaMemcpyAsync operation through an internal stream.

template<typename T, std::enable_if_t<!std::is_same_v<T, void>, void>* = nullptr>
cudaTask tf::cudaFlowCapturer::copy(T* tgt, const T* src, size_t num)

captures a copy task of typed data

Template parameters
T element type (non-void)
Parameters
tgt pointer to the target memory block
src pointer to the source memory block
num number of elements to copy
Returns cudaTask handle

A copy task transfers num*sizeof(T) bytes of data from a source location to a target location. Direction can be arbitrary among CPUs and GPUs.

cudaTask tf::cudaFlowCapturer::memset(void* ptr, int v, size_t n)

initializes or sets GPU memory to the given value byte by byte

Parameters
ptr pointer to GPU memory
v value to set for each byte of the specified memory
n size in bytes to set

The method captures a cudaMemsetAsync operation through an internal stream to fill the first n bytes of the memory area pointed to by ptr with the constant byte value v.

template<typename F, typename... ArgsT>
cudaTask tf::cudaFlowCapturer::kernel(dim3 g, dim3 b, size_t s, F&& f, ArgsT && ... args)

captures a kernel

Template parameters
F kernel function type
ArgsT kernel function parameters type
Parameters
g configured grid
b configured block
s configured shared memory size in bytes
f kernel function
args arguments to forward to the kernel function by copy
Returns cudaTask handle

template<typename C>
cudaTask tf::cudaFlowCapturer::single_task(C&& callable)

captures a kernel that runs the given callable with only one thread

Template parameters
C callable type
Parameters
callable callable to run by a single kernel thread

template<typename I, typename C>
cudaTask tf::cudaFlowCapturer::for_each(I first, I last, C&& callable)

captures a kernel that applies a callable to each dereferenced element of the data array

Template parameters
I iterator type
C callable type
Parameters
first iterator to the beginning (inclusive)
last iterator to the end (exclusive)
callable a callable object to apply to the dereferenced iterator
Returns cudaTask handle

This method is equivalent to the parallel execution of the following loop on a GPU:

for(auto itr = first; itr != last; ++itr) {
  callable(*itr);
}

template<typename I, typename C>
cudaTask tf::cudaFlowCapturer::for_each_index(I first, I last, I step, C&& callable)

captures a kernel that applies a callable to each index in the range with the step size

Template parameters
I index type
C callable type
Parameters
first beginning index
last ending index (exclusive)
step step size
callable the callable to apply to each element in the data array
Returns cudaTask handle

This method is equivalent to the parallel execution of the following loop on a GPU:

// step is positive [first, last)
for(auto i=first; i<last; i+=step) {
  callable(i);
}

// step is negative [first, last)
for(auto i=first; i>last; i+=step) {
  callable(i);
}

template<typename I, typename C, typename... S>
cudaTask tf::cudaFlowCapturer::transform(I first, I last, C&& callable, S... srcs)

captures a kernel that applies a callable to a source range and stores the result in a target range

Template parameters
I iterator type
C callable type
S source types
Parameters
first iterator to the beginning (inclusive)
last iterator to the end (exclusive)
callable the callable to apply to each element in the range
srcs iterators to the source ranges
Returns cudaTask handle

This method is equivalent to the parallel execution of the following loop on a GPU:

while (first != last) {
  *first++ = callable(*src1++, *src2++, *src3++, ...);
}

template<typename I, typename T, typename C>
cudaTask tf::cudaFlowCapturer::reduce(I first, I last, T* result, C&& op)

captures a kernel that performs parallel reduction over a range of items

Template parameters
I input iterator type
T value type
C callable type
Parameters
first iterator to the beginning (inclusive)
last iterator to the end (exclusive)
result pointer to the result with an initialized value
op binary reduction operator
Returns a tf::cudaTask handle

This method is equivalent to the parallel execution of the following loop on a GPU:

while (first != last) {
  *result = op(*result, *first++);
}

template<typename I, typename T, typename C>
cudaTask tf::cudaFlowCapturer::uninitialized_reduce(I first, I last, T* result, C&& op)

similar to tf::cudaFlowCapturerBase::reduce but does not assume an initial value for the reduction

This method is equivalent to the parallel execution of the following loop on a GPU:

*result = *first++;  // no initial value participates in the loop
while (first != last) {
  *result = op(*result, *first++);
}

template<typename C, std::enable_if_t<std::is_invocable_r_v<void, C, cudaStream_t>, void>* = nullptr>
cudaTask tf::cudaFlowCapturer::rebind_on(cudaTask task, C&& callable)

rebinds a capture task to another sequence of CUDA operations

The method is similar to cudaFlowCapturerBase::on but with an additional argument of a previously created capture task.

cudaTask tf::cudaFlowCapturer::rebind_memcpy(cudaTask task, void* dst, const void* src, size_t count)

rebinds a capture task to a memcpy operation

The method is similar to cudaFlowCapturerBase::memcpy but with an additional argument of a previously created capture task.

template<typename T, std::enable_if_t<!std::is_same_v<T, void>, void>* = nullptr>
cudaTask tf::cudaFlowCapturer::rebind_copy(cudaTask task, T* tgt, const T* src, size_t num)

rebinds a capture task to a copy operation

The method is similar to cudaFlowCapturerBase::copy but with an additional argument of a previously created capture task.

cudaTask tf::cudaFlowCapturer::rebind_memset(cudaTask task, void* ptr, int value, size_t n)

rebinds a capture task to a memset operation

The method is similar to cudaFlowCapturerBase::memset but with an additional argument of a previously created capture task.

template<typename F, typename... ArgsT>
cudaTask tf::cudaFlowCapturer::rebind_kernel(cudaTask task, dim3 g, dim3 b, size_t s, F&& f, ArgsT && ... args)

rebinds a capture task to a kernel operation

The method is similar to cudaFlowCapturerBase::kernel but with an additional argument of a previously created capture task.

template<typename P>
void tf::cudaFlowCapturer::offload_until(P&& predicate)

offloads the captured cudaFlow onto a GPU and repeatedly runs it until the predicate becomes true

Template parameters
P predicate type (a callable that returns bool)
Parameters
predicate a predicate (returns true to stop)

Immediately offloads the cudaFlow captured so far onto a GPU and repeatedly runs it until the predicate returns true.

By default, if users do not offload the cudaFlow capturer, the executor will offload it once.

void tf::cudaFlowCapturer::offload_n(size_t n)

offloads the captured cudaFlow and executes it the given number of times

Parameters
n number of executions