tf::cudaFlow class

class for building a CUDA task dependency graph

A cudaFlow is a high-level interface over CUDA Graph for performing GPU operations using the task dependency graph model. The class provides a set of methods for creating and launching different tasks on one or multiple CUDA devices, such as kernel tasks, data transfer tasks, and memory operation tasks. The following example creates a cudaFlow of two kernel tasks, task1 and task2, where task1 runs before task2.

tf::Taskflow taskflow;
tf::Executor executor;

taskflow.emplace([&](tf::cudaFlow& cf){
  // create two kernel tasks 
  tf::cudaTask task1 = cf.kernel(grid1, block1, shm_size1, kernel1, args1);
  tf::cudaTask task2 = cf.kernel(grid2, block2, shm_size2, kernel2, args2);
  
  // kernel1 runs before kernel2
  task1.precede(task2);
});

executor.run(taskflow).wait();

A cudaFlow is a task (tf::Task) created from tf::Taskflow and will be run by one worker thread in the executor. That is, the callable that describes a cudaFlow is executed sequentially. Inside a cudaFlow task, different GPU tasks (tf::cudaTask) may run in parallel, scheduled by the CUDA runtime.

Please refer to GPU Tasking (cudaFlow) for details.

Constructors, destructors, conversion operators

cudaFlow()
constructs a standalone cudaFlow
~cudaFlow()
destroys the cudaFlow and its associated native CUDA graph and executable graph

Public functions

auto empty() const -> bool
queries the emptiness of the graph
void dump(std::ostream& os) const
dumps the cudaFlow graph into a DOT format through an output stream
void dump_native_graph(std::ostream& os) const
dumps the native CUDA graph into a DOT format through an output stream
auto noop() -> cudaTask
creates a no-operation task
template<typename C>
auto host(C&& callable) -> cudaTask
creates a host task that runs a callable on the host
template<typename F, typename... ArgsT>
auto kernel(dim3 g, dim3 b, size_t s, F&& f, ArgsT && ... args) -> cudaTask
creates a kernel task
template<typename F, typename... ArgsT>
auto kernel_on(int d, dim3 g, dim3 b, size_t s, F&& f, ArgsT && ... args) -> cudaTask
creates a kernel task on a specific GPU
auto memset(void* dst, int v, size_t count) -> cudaTask
creates a memset task that fills untyped data with a byte value
auto memcpy(void* tgt, const void* src, size_t bytes) -> cudaTask
creates a memcpy task that copies untyped data in bytes
template<typename T, std::enable_if_t<is_pod_v<T> && (sizeof(T)==1||sizeof(T)==2||sizeof(T)==4), void>* = nullptr>
auto zero(T* dst, size_t count) -> cudaTask
creates a memset task that sets a typed memory block to zero
template<typename T, std::enable_if_t<is_pod_v<T> && (sizeof(T)==1||sizeof(T)==2||sizeof(T)==4), void>* = nullptr>
auto fill(T* dst, T value, size_t count) -> cudaTask
creates a memset task that fills a typed memory block with a value
template<typename T, std::enable_if_t<!std::is_same_v<T, void>, void>* = nullptr>
auto copy(T* tgt, const T* src, size_t num) -> cudaTask
creates a memcpy task that copies typed data
template<typename P>
void offload_until(P&& predicate)
offloads the cudaFlow onto a GPU and repeatedly runs it until the predicate becomes true
void offload_n(size_t N)
offloads the cudaFlow and executes it the given number of times
void offload()
offloads the cudaFlow and executes it once
template<typename C>
void update_host(cudaTask task, C&& callable)
updates parameters of a host task created from tf::cudaFlow::host
template<typename... ArgsT>
void update_kernel(cudaTask task, dim3 g, dim3 b, size_t shm, ArgsT && ... args)
updates parameters of a kernel task created from tf::cudaFlow::kernel
template<typename T, std::enable_if_t<!std::is_same_v<T, void>, void>* = nullptr>
void update_copy(cudaTask task, T* tgt, const T* src, size_t num)
updates parameters of a memcpy task to form a copy task
void update_memcpy(cudaTask task, void* tgt, const void* src, size_t bytes)
updates parameters of a memcpy task
void update_memset(cudaTask task, void* dst, int ch, size_t count)
updates parameters of a memset task
template<typename T, std::enable_if_t<is_pod_v<T> && (sizeof(T)==1||sizeof(T)==2||sizeof(T)==4), void>* = nullptr>
void update_fill(cudaTask task, T* dst, T value, size_t count)
updates parameters of a memset task to form a fill task
template<typename T, std::enable_if_t<is_pod_v<T> && (sizeof(T)==1||sizeof(T)==2||sizeof(T)==4), void>* = nullptr>
void update_zero(cudaTask task, T* dst, size_t count)
updates parameters of a memset task to form a zero task
template<typename C>
auto single_task(C&& callable) -> cudaTask
runs a callable with only a single kernel thread
template<typename I, typename C>
auto for_each(I first, I last, C&& callable) -> cudaTask
applies a callable to each dereferenced element of the data array
template<typename I, typename C>
auto for_each_index(I first, I last, I step, C&& callable) -> cudaTask
applies a callable to each index in the range with the step size
template<typename I, typename C, typename... S>
auto transform(I first, I last, C&& callable, S... srcs) -> cudaTask
applies a callable to a source range and stores the result in a target range
template<typename I, typename T, typename C>
auto reduce(I first, I last, T* result, C&& op) -> cudaTask
performs parallel reduction over a range of items
template<typename I, typename T, typename C>
auto uninitialized_reduce(I first, I last, T* result, C&& op) -> cudaTask
similar to tf::cudaFlow::reduce but does not assume any initial value to reduce
template<typename C>
auto capture(C&& callable) -> cudaTask
constructs a subflow graph through tf::cudaFlowCapturer

Function documentation

tf::cudaFlow::cudaFlow()

constructs a standalone cudaFlow

A standalone cudaFlow does not go through any taskflow and can be run by the caller thread using explicit offload methods (e.g., tf::cudaFlow::offload).

void tf::cudaFlow::dump_native_graph(std::ostream& os) const

dumps the native CUDA graph into a DOT format through an output stream

The native CUDA graph may be different from the upper-level cudaFlow graph when flow capture is involved.

cudaTask tf::cudaFlow::noop()

creates a no-operation task

Returns a tf::cudaTask handle

An empty node performs no operation during execution, but can be used for transitive ordering. For example, a phased execution graph with 2 groups of n nodes with a barrier between them can be represented using an empty node and 2*n dependency edges, rather than no empty node and n^2 dependency edges.

template<typename C>
cudaTask tf::cudaFlow::host(C&& callable)

creates a host task that runs a callable on the host

Template parameters
C callable type
Parameters
callable a callable object with neither arguments nor return (i.e., constructible from std::function<void()>)
Returns a tf::cudaTask handle

A host task can only execute CPU-specific functions and cannot do any CUDA calls (e.g., cudaMalloc).

template<typename F, typename... ArgsT>
cudaTask tf::cudaFlow::kernel(dim3 g, dim3 b, size_t s, F&& f, ArgsT && ... args)

creates a kernel task

Template parameters
F kernel function type
ArgsT kernel function parameters type
Parameters
g configured grid
b configured block
s configured shared memory size in bytes
f kernel function
args arguments to forward to the kernel function by copy
Returns a tf::cudaTask handle

template<typename F, typename... ArgsT>
cudaTask tf::cudaFlow::kernel_on(int d, dim3 g, dim3 b, size_t s, F&& f, ArgsT && ... args)

creates a kernel task on a specific GPU

Template parameters
F kernel function type
ArgsT kernel function parameters type
Parameters
d device identifier to launch the kernel
g configured grid
b configured block
s configured shared memory size in bytes
f kernel function
args arguments to forward to the kernel function by copy
Returns a tf::cudaTask handle

cudaTask tf::cudaFlow::memset(void* dst, int v, size_t count)

creates a memset task that fills untyped data with a byte value

Parameters
dst pointer to the destination device memory area
v value to set for each byte of specified memory
count size in bytes to set
Returns a tf::cudaTask handle

A memset task fills the first count bytes of the device memory area pointed to by dst with the byte value v.

cudaTask tf::cudaFlow::memcpy(void* tgt, const void* src, size_t bytes)

creates a memcpy task that copies untyped data in bytes

Parameters
tgt pointer to the target memory block
src pointer to the source memory block
bytes bytes to copy
Returns a tf::cudaTask handle

A memcpy task transfers bytes of data from a source location to a target location. Direction can be arbitrary among CPUs and GPUs.

template<typename T, std::enable_if_t<is_pod_v<T> && (sizeof(T)==1||sizeof(T)==2||sizeof(T)==4), void>* = nullptr>
cudaTask tf::cudaFlow::zero(T* dst, size_t count)

creates a memset task that sets a typed memory block to zero

Template parameters
T element type (size of T must be either 1, 2, or 4)
Parameters
dst pointer to the destination device memory area
count number of elements
Returns a tf::cudaTask handle

A zero task zeroes the first count elements of type T in the device memory area pointed to by dst.

template<typename T, std::enable_if_t<is_pod_v<T> && (sizeof(T)==1||sizeof(T)==2||sizeof(T)==4), void>* = nullptr>
cudaTask tf::cudaFlow::fill(T* dst, T value, size_t count)

creates a memset task that fills a typed memory block with a value

Template parameters
T element type (size of T must be either 1, 2, or 4)
Parameters
dst pointer to the destination device memory area
value value to fill for each element of type T
count number of elements
Returns a tf::cudaTask handle

A fill task fills the first count elements of type T with value in the device memory area pointed to by dst. The value to fill is interpreted in type T rather than in bytes.

template<typename T, std::enable_if_t<!std::is_same_v<T, void>, void>* = nullptr>
cudaTask tf::cudaFlow::copy(T* tgt, const T* src, size_t num)

creates a memcpy task that copies typed data

Template parameters
T element type (non-void)
Parameters
tgt pointer to the target memory block
src pointer to the source memory block
num number of elements to copy
Returns a tf::cudaTask handle

A copy task transfers num*sizeof(T) bytes of data from a source location to a target location. Direction can be arbitrary among CPUs and GPUs.

template<typename P>
void tf::cudaFlow::offload_until(P&& predicate)

offloads the cudaFlow onto a GPU and repeatedly runs it until the predicate becomes true

Template parameters
P predicate type (a binary callable)
Parameters
predicate a binary predicate (returns true for stop)

Immediately offloads the present cudaFlow onto a GPU and repeatedly runs it until the predicate returns true.

An offloaded cudaFlow forces the underlying graph to be instantiated. After instantiation, you should not modify the graph topology; you may only update node parameters.

By default, if users do not offload the cudaFlow, the executor will offload it once.

void tf::cudaFlow::offload_n(size_t N)

offloads the cudaFlow and executes it the given number of times

Parameters
N number of executions

template<typename C>
void tf::cudaFlow::update_host(cudaTask task, C&& callable)

updates parameters of a host task created from tf::cudaFlow::host

The method updates the parameters of a host callable associated with the given task.

template<typename... ArgsT>
void tf::cudaFlow::update_kernel(cudaTask task, dim3 g, dim3 b, size_t shm, ArgsT && ... args)

updates parameters of a kernel task created from tf::cudaFlow::kernel

The method updates the parameters of a kernel associated with the given task. The kernel function itself cannot be changed.

template<typename T, std::enable_if_t<!std::is_same_v<T, void>, void>* = nullptr>
void tf::cudaFlow::update_copy(cudaTask task, T* tgt, const T* src, size_t num)

updates parameters of a memcpy task to form a copy task

The method updates the parameters of a copy task. The source/destination memory may have different address values but must be allocated from the same contexts as the original source/destination memory.

void tf::cudaFlow::update_memcpy(cudaTask task, void* tgt, const void* src, size_t bytes)

updates parameters of a memcpy task

The method updates the parameters of a memcpy task. The source/destination memory may have different address values but must be allocated from the same contexts as the original source/destination memory.

void tf::cudaFlow::update_memset(cudaTask task, void* dst, int ch, size_t count)

updates parameters of a memset task

The method updates the parameters of a memset task. The destination memory may have a different address value but must be allocated from the same context as the original destination memory.

template<typename T, std::enable_if_t<is_pod_v<T> && (sizeof(T)==1||sizeof(T)==2||sizeof(T)==4), void>* = nullptr>
void tf::cudaFlow::update_fill(cudaTask task, T* dst, T value, size_t count)

updates parameters of a memset task to form a fill task

The method updates the parameters of a fill task. The given arguments and type must comply with the rules of tf::cudaFlow::fill. The destination memory may have a different address value but must be allocated from the same context as the original destination memory.

template<typename T, std::enable_if_t<is_pod_v<T> && (sizeof(T)==1||sizeof(T)==2||sizeof(T)==4), void>* = nullptr>
void tf::cudaFlow::update_zero(cudaTask task, T* dst, size_t count)

updates parameters of a memset task to form a zero task

The method updates the parameters of a zero task. The given arguments and type must comply with the rules of tf::cudaFlow::zero. The destination memory may have a different address value but must be allocated from the same context as the original destination memory.

template<typename C>
cudaTask tf::cudaFlow::single_task(C&& callable)

runs a callable with only a single kernel thread

Template parameters
C callable type
Parameters
callable callable to run by a single kernel thread
Returns a tf::cudaTask handle

template<typename I, typename C>
cudaTask tf::cudaFlow::for_each(I first, I last, C&& callable)

applies a callable to each dereferenced element of the data array

Template parameters
I iterator type
C callable type
Parameters
first iterator to the beginning (inclusive)
last iterator to the end (exclusive)
callable a callable object to apply to the dereferenced iterator
Returns a tf::cudaTask handle

This method is equivalent to the parallel execution of the following loop on a GPU:

for(auto itr = first; itr != last; itr++) {
  callable(*itr);
}

template<typename I, typename C>
cudaTask tf::cudaFlow::for_each_index(I first, I last, I step, C&& callable)

applies a callable to each index in the range with the step size

Template parameters
I index type
C callable type
Parameters
first beginning index
last ending index (exclusive)
step step size
callable the callable to apply to each element in the data array
Returns a tf::cudaTask handle

This method is equivalent to the parallel execution of the following loop on a GPU:

// step is positive [first, last)
for(auto i=first; i<last; i+=step) {
  callable(i);
}

// step is negative [first, last)
for(auto i=first; i>last; i+=step) {
  callable(i);
}

template<typename I, typename C, typename... S>
cudaTask tf::cudaFlow::transform(I first, I last, C&& callable, S... srcs)

applies a callable to a source range and stores the result in a target range

Template parameters
I iterator type
C callable type
S source types
Parameters
first iterator to the beginning (inclusive)
last iterator to the end (exclusive)
callable the callable to apply to each element in the range
srcs iterators to the source ranges
Returns a tf::cudaTask handle

This method is equivalent to the parallel execution of the following loop on a GPU:

while (first != last) {
  *first++ = callable(*src1++, *src2++, *src3++, ...);
}

template<typename I, typename T, typename C>
cudaTask tf::cudaFlow::reduce(I first, I last, T* result, C&& op)

performs parallel reduction over a range of items

Template parameters
I input iterator type
T value type
C callable type
Parameters
first iterator to the beginning (inclusive)
last iterator to the end (exclusive)
result pointer to the result with an initialized value
op binary reduction operator
Returns a tf::cudaTask handle

This method is equivalent to the parallel execution of the following loop on a GPU:

while (first != last) {
  *result = op(*result, *first++);
}

template<typename I, typename T, typename C>
cudaTask tf::cudaFlow::uninitialized_reduce(I first, I last, T* result, C&& op)

similar to tf::cudaFlow::reduce but does not assume any initial value to reduce

This method is equivalent to the parallel execution of the following loop on a GPU:

*result = *first++;  // the first element seeds the result; no external initial value participates
while (first != last) {
  *result = op(*result, *first++);
}

template<typename C>
cudaTask tf::cudaFlow::capture(C&& callable)

constructs a subflow graph through tf::cudaFlowCapturer

Template parameters
C callable type constructible from std::function<void(tf::cudaFlowCapturer&)>
Parameters
callable the callable to construct a capture flow
Returns a tf::cudaTask handle

A captured subflow forms a sub-graph to the cudaFlow and can be used to capture custom (or third-party) kernels that cannot be directly constructed from the cudaFlow.

Example usage:

taskflow.emplace([&](tf::cudaFlow& cf){
  
  tf::cudaTask my_kernel = cf.kernel(my_arguments);
  
  // create a flow capturer to capture custom kernels
  tf::cudaTask my_subflow = cf.capture([&](tf::cudaFlowCapturer& capturer){
    capturer.on([&](cudaStream_t stream){
      invoke_custom_kernel_with_stream(stream, custom_arguments);
    }); 
  });

  my_kernel.precede(my_subflow);
});