tf::cudaFlowCapturer class

class for building a CUDA task dependency graph through stream capture

A cudaFlowCapturer inherits all the base methods from tf::cudaFlowCapturerBase to construct a CUDA task graph through stream capture. This class also defines a factory interface, tf::cudaFlowCapturer::make_capturer, for users to create custom capturers with their lifetimes managed by the factory.

The usage of tf::cudaFlowCapturer is similar to tf::cudaFlow, except users can call the method tf::cudaFlowCapturer::on to capture a sequence of asynchronous CUDA operations through the given stream. The following example creates a CUDA graph that captures two kernel tasks, task_1 and task_2, where task_1 runs before task_2.

taskflow.emplace([](tf::cudaFlowCapturer& capturer){

  // capture my_kernel_1 through the given stream managed by the capturer
  auto task_1 = capturer.on([&](cudaStream_t stream){ 
    my_kernel_1<<<grid_1, block_1, shm_size_1, stream>>>(my_parameters_1);
  });

  // capture my_kernel_2 through the given stream managed by the capturer
  auto task_2 = capturer.on([&](cudaStream_t stream){ 
    my_kernel_2<<<grid_2, block_2, shm_size_2, stream>>>(my_parameters_2);
  });

  task_1.precede(task_2);
});

Similar to tf::cudaFlow, a cudaFlowCapturer is a task (tf::Task) created from tf::Taskflow and will be run by one worker thread in the executor. That is, the callable that describes a cudaFlowCapturer will be executed sequentially. Inside a cudaFlow capturer task, different GPU tasks (tf::cudaTask) may run in parallel scheduled by both our capturing algorithm and the CUDA runtime.

Please refer to GPUTaskingcudaFlowCapturer for details.

Constructors, destructors, conversion operators

cudaFlowCapturer()
constructs a standalone cudaFlowCapturer
~cudaFlowCapturer() virtual
destructs the cudaFlowCapturer

Public functions

auto empty() const -> bool
queries the emptiness of the graph
void dump(std::ostream& os) const
dumps the capture graph into a DOT format through an output stream
template<typename T, typename... ArgsT>
auto make_capturer(ArgsT && ... args) -> T*
creates a custom capturer derived from tf::cudaFlowCapturerBase
template<typename OPT, typename... ArgsT>
auto make_optimizer(ArgsT && ... args) -> OPT&
enables different optimization algorithms
template<typename C, std::enable_if_t<std::is_invocable_r_v<void, C, cudaStream_t>, void>* = nullptr>
auto on(C&& callable) -> cudaTask
captures a sequence of CUDA operations from the given callable
auto memcpy(void* dst, const void* src, size_t count) -> cudaTask
copies data between host and device asynchronously through a stream
template<typename T, std::enable_if_t<!std::is_same_v<T, void>, void>* = nullptr>
auto copy(T* tgt, const T* src, size_t num) -> cudaTask
captures a copy task of typed data
auto memset(void* ptr, int v, size_t n) -> cudaTask
initializes or sets GPU memory to the given value byte by byte
template<typename F, typename... ArgsT>
auto kernel(dim3 g, dim3 b, size_t s, F&& f, ArgsT && ... args) -> cudaTask
captures a kernel
template<typename C>
auto single_task(C&& callable) -> cudaTask
captures a kernel that runs the given callable with only one thread
template<typename I, typename C>
auto for_each(I first, I last, C&& callable) -> cudaTask
captures a kernel that applies a callable to each dereferenced element of the data array
template<typename I, typename C>
auto for_each_index(I first, I last, I step, C&& callable) -> cudaTask
captures a kernel that applies a callable to each index in the range with the step size
template<typename I, typename C, typename... S>
auto transform(I first, I last, C&& callable, S... srcs) -> cudaTask
captures a kernel that applies a callable to a source range and stores the result in a target range
template<typename I, typename T, typename C>
auto reduce(I first, I last, T* result, C&& op) -> cudaTask
captures a kernel that performs parallel reduction over a range of items
template<typename I, typename T, typename C>
auto uninitialized_reduce(I first, I last, T* result, C&& op) -> cudaTask
similar to tf::cudaFlowCapturerBase::reduce but does not assume an initial value for the reduction
template<typename C, std::enable_if_t<std::is_invocable_r_v<void, C, cudaStream_t>, void>* = nullptr>
auto rebind_on(cudaTask task, C&& callable) -> cudaTask
rebinds a capture task to another sequence of CUDA operations
auto rebind_memcpy(cudaTask task, void* dst, const void* src, size_t count) -> cudaTask
rebinds a capture task to a memcpy operation
template<typename T, std::enable_if_t<!std::is_same_v<T, void>, void>* = nullptr>
auto rebind_copy(cudaTask task, T* tgt, const T* src, size_t num) -> cudaTask
rebinds a capture task to a copy operation
auto rebind_memset(cudaTask task, void* ptr, int value, size_t n) -> cudaTask
rebinds a capture task to a memset operation
template<typename F, typename... ArgsT>
auto rebind_kernel(cudaTask task, dim3 g, dim3 b, size_t s, F&& f, ArgsT && ... args) -> cudaTask
rebinds a capture task to a kernel operation
template<typename P>
void offload_until(P&& predicate)
offloads the captured cudaFlow onto a GPU and repeatedly runs it until the predicate becomes true
void offload_n(size_t n)
offloads the captured cudaFlow and executes it the given number of times
void offload()
offloads the captured cudaFlow and executes it once

Function documentation

tf::cudaFlowCapturer::cudaFlowCapturer()

constructs a standalone cudaFlowCapturer

A standalone cudaFlow capturer does not go through any taskflow and can be run by the caller thread using explicit offload methods (e.g., tf::cudaFlowCapturer::offload).

template<typename T, typename... ArgsT>
T* tf::cudaFlowCapturer::make_capturer(ArgsT && ... args)

creates a custom capturer derived from tf::cudaFlowCapturerBase

Template parameters
T custom capturer type
ArgsT argument types
Parameters
args arguments to forward to construct the custom capturer
Returns a pointer to the custom capturer

Each cudaFlow capturer keeps a list of custom capturers and manages their lifetimes. The lifetime of each custom capturer is the same as the capturer.

template<typename OPT, typename... ArgsT>
OPT& tf::cudaFlowCapturer::make_optimizer(ArgsT && ... args)

enables different optimization algorithms

Template parameters
OPT optimizer type
ArgsT argument types
Parameters
args arguments to forward to construct the optimizer
Returns a reference to the optimizer

We currently support multiple optimization algorithms for capturing a user-described cudaFlow.

template<typename C, std::enable_if_t<std::is_invocable_r_v<void, C, cudaStream_t>, void>* = nullptr>
cudaTask tf::cudaFlowCapturer::on(C&& callable)

captures a sequence of CUDA operations from the given callable

Template parameters
C callable type constructible with std::function<void(cudaStream_t)>
Parameters
callable a callable to capture CUDA operations with the stream

This method applies a stream created by the capturer to capture a sequence of CUDA operations defined in the callable.

cudaTask tf::cudaFlowCapturer::memcpy(void* dst, const void* src, size_t count)

copies data between host and device asynchronously through a stream

Parameters
dst destination memory address
src source memory address
count size in bytes to copy

The method captures a cudaMemcpyAsync operation through an internal stream.

template<typename T, std::enable_if_t<!std::is_same_v<T, void>, void>* = nullptr>
cudaTask tf::cudaFlowCapturer::copy(T* tgt, const T* src, size_t num)

captures a copy task of typed data

Template parameters
T element type (non-void)
Parameters
tgt pointer to the target memory block
src pointer to the source memory block
num number of elements to copy
Returns cudaTask handle

A copy task transfers num*sizeof(T) bytes of data from a source location to a target location. Direction can be arbitrary among CPUs and GPUs.

cudaTask tf::cudaFlowCapturer::memset(void* ptr, int v, size_t n)

initializes or sets GPU memory to the given value byte by byte

Parameters
ptr pointer to GPU memory
v value to set for each byte of the specified memory
n size in bytes to set

The method captures a cudaMemsetAsync operation through an internal stream to fill the first n bytes of the memory area pointed to by ptr with the constant byte value v.

template<typename F, typename... ArgsT>
cudaTask tf::cudaFlowCapturer::kernel(dim3 g, dim3 b, size_t s, F&& f, ArgsT && ... args)

captures a kernel

Template parameters
F kernel function type
ArgsT kernel function parameters type
Parameters
g configured grid
b configured block
s configured shared memory size in bytes
f kernel function
args arguments to forward to the kernel function by copy
Returns cudaTask handle

template<typename C>
cudaTask tf::cudaFlowCapturer::single_task(C&& callable)

captures a kernel that runs the given callable with only one thread

Template parameters
C callable type
Parameters
callable callable to run by a single kernel thread

template<typename I, typename C>
cudaTask tf::cudaFlowCapturer::for_each(I first, I last, C&& callable)

captures a kernel that applies a callable to each dereferenced element of the data array

Template parameters
I iterator type
C callable type
Parameters
first iterator to the beginning (inclusive)
last iterator to the end (exclusive)
callable a callable object to apply to the dereferenced iterator
Returns cudaTask handle

This method is equivalent to the parallel execution of the following loop on a GPU:

for(auto itr = first; itr != last; ++itr) {
  callable(*itr);
}

template<typename I, typename C>
cudaTask tf::cudaFlowCapturer::for_each_index(I first, I last, I step, C&& callable)

captures a kernel that applies a callable to each index in the range with the step size

Template parameters
I index type
C callable type
Parameters
first beginning index
last ending index (exclusive)
step step size
callable the callable to apply to each element in the data array
Returns cudaTask handle

This method is equivalent to the parallel execution of the following loop on a GPU:

// step is positive [first, last)
for(auto i=first; i<last; i+=step) {
  callable(i);
}

// step is negative [first, last)
for(auto i=first; i>last; i+=step) {
  callable(i);
}

template<typename I, typename C, typename... S>
cudaTask tf::cudaFlowCapturer::transform(I first, I last, C&& callable, S... srcs)

captures a kernel that applies a callable to a source range and stores the result in a target range

Template parameters
I iterator type
C callable type
S source types
Parameters
first iterator to the beginning (inclusive)
last iterator to the end (exclusive)
callable the callable to apply to each element in the range
srcs iterators to the source ranges
Returns cudaTask handle

This method is equivalent to the parallel execution of the following loop on a GPU:

while (first != last) {
  *first++ = callable(*src1++, *src2++, *src3++, ...);
}

template<typename I, typename T, typename C>
cudaTask tf::cudaFlowCapturer::reduce(I first, I last, T* result, C&& op)

captures a kernel that performs parallel reduction over a range of items

Template parameters
I input iterator type
T value type
C callable type
Parameters
first iterator to the beginning (inclusive)
last iterator to the end (exclusive)
result pointer to the result with an initialized value
op binary reduction operator
Returns a tf::cudaTask handle

This method is equivalent to the parallel execution of the following loop on a GPU:

while (first != last) {
  *result = op(*result, *first++);
}

template<typename I, typename T, typename C>
cudaTask tf::cudaFlowCapturer::uninitialized_reduce(I first, I last, T* result, C&& op)

similar to tf::cudaFlowCapturerBase::reduce but does not assume an initial value for the reduction

This method is equivalent to the parallel execution of the following loop on a GPU:

*result = *first++;  // no initial value participates in the loop
while (first != last) {
  *result = op(*result, *first++);
}

template<typename C, std::enable_if_t<std::is_invocable_r_v<void, C, cudaStream_t>, void>* = nullptr>
cudaTask tf::cudaFlowCapturer::rebind_on(cudaTask task, C&& callable)

rebinds a capture task to another sequence of CUDA operations

The method is similar to cudaFlowCapturerBase::on but with an additional argument of a previously created capture task.

cudaTask tf::cudaFlowCapturer::rebind_memcpy(cudaTask task, void* dst, const void* src, size_t count)

rebinds a capture task to a memcpy operation

The method is similar to cudaFlowCapturerBase::memcpy but with an additional argument of a previously created capture task.

template<typename T, std::enable_if_t<!std::is_same_v<T, void>, void>* = nullptr>
cudaTask tf::cudaFlowCapturer::rebind_copy(cudaTask task, T* tgt, const T* src, size_t num)

rebinds a capture task to a copy operation

The method is similar to cudaFlowCapturerBase::copy but with an additional argument of a previously created capture task.

cudaTask tf::cudaFlowCapturer::rebind_memset(cudaTask task, void* ptr, int value, size_t n)

rebinds a capture task to a memset operation

The method is similar to cudaFlowCapturerBase::memset but with an additional argument of a previously created capture task.

template<typename F, typename... ArgsT>
cudaTask tf::cudaFlowCapturer::rebind_kernel(cudaTask task, dim3 g, dim3 b, size_t s, F&& f, ArgsT && ... args)

rebinds a capture task to a kernel operation

The method is similar to cudaFlowCapturerBase::kernel but with an additional argument of a previously created capture task.

template<typename P>
void tf::cudaFlowCapturer::offload_until(P&& predicate)

offloads the captured cudaFlow onto a GPU and repeatedly runs it until the predicate becomes true

Template parameters
P predicate type (a callable that returns bool)
Parameters
predicate a predicate (returns true to stop)

Immediately offloads the cudaFlow captured so far onto a GPU and repeatedly runs it until the predicate returns true.

By default, if users do not offload the cudaFlow capturer, the executor will offload it once.

void tf::cudaFlowCapturer::offload_n(size_t n)

offloads the captured cudaFlow and executes it the given number of times

Parameters
n number of executions