tf::cudaFlow class
#include <src/taskflow/cuda/cuda_flow.hpp>
cudaFlow class for building a CUDA task dependency graph
A cudaFlow is a high-level interface over CUDA Graph for performing GPU operations using the task dependency graph model. The class provides a set of methods for creating and launching different tasks on one or multiple CUDA devices, for instance, kernel tasks, data transfer tasks, and memory operation tasks. The following example creates a cudaFlow of two kernel tasks, task1 and task2, where task1 runs before task2.
tf::Taskflow taskflow;
tf::Executor executor;

taskflow.emplace([&](tf::cudaFlow& cf){
  // create two kernel tasks
  tf::cudaTask task1 = cf.kernel(grid1, block1, shm_size1, kernel1, args1);
  tf::cudaTask task2 = cf.kernel(grid2, block2, shm_size2, kernel2, args2);
  // kernel1 runs before kernel2
  task1.precede(task2);
});

executor.run(taskflow).wait();
A cudaFlow is a task (tf::Task) created from a taskflow and is run by one worker thread in the executor. Please refer to GPUTaskingcudaFlow for details.
Constructors, destructors, conversion operators
- cudaFlow()
- constructs a standalone cudaFlow
Public functions
- auto empty() const -> bool
- queries the emptiness of the graph
- void dump(std::ostream& os) const
- dumps the cudaFlow graph into a DOT format through an output stream
- void dump_native_graph(std::ostream& os) const
- dumps the native CUDA graph into a DOT format through an output stream
- auto noop() -> cudaTask
- creates a no-operation task
-
template<typename C>auto host(C&& callable) -> cudaTask
- creates a host task that runs a callable on the host
-
template<typename F, typename... ArgsT>auto kernel(dim3 g, dim3 b, size_t s, F&& f, ArgsT && ... args) -> cudaTask
- creates a kernel task
-
template<typename F, typename... ArgsT>auto kernel_on(int d, dim3 g, dim3 b, size_t s, F&& f, ArgsT && ... args) -> cudaTask
- creates a kernel task on a specific GPU
- auto memset(void* dst, int v, size_t count) -> cudaTask
- creates a memset task that fills untyped data with a byte value
- auto memcpy(void* tgt, const void* src, size_t bytes) -> cudaTask
- creates a memcpy task that copies untyped data in bytes
-
template<typename T, std::enable_if_t<is_pod_v<T> && (sizeof(T)==1||sizeof(T)==2||sizeof(T)==4), void>* = nullptr>auto zero(T* dst, size_t count) -> cudaTask
- creates a memset task that sets a typed memory block to zero
-
template<typename T, std::enable_if_t<is_pod_v<T> && (sizeof(T)==1||sizeof(T)==2||sizeof(T)==4), void>* = nullptr>auto fill(T* dst, T value, size_t count) -> cudaTask
- creates a memset task that fills a typed memory block with a value
-
template<typename T, std::enable_if_t<!std::is_same_v<T, void>, void>* = nullptr>auto copy(T* tgt, const T* src, size_t num) -> cudaTask
- creates a memcopy task that copies typed data
-
template<typename P>void offload_until(P&& predicate)
- offloads the cudaFlow onto a GPU and repeatedly runs it until the predicate becomes true
- void offload_n(size_t N)
- offloads the cudaFlow and executes it the given number of times
- void offload()
- offloads the cudaFlow and executes it once
-
template<typename C>void update_host(cudaTask task, C&& callable)
- updates parameters of a host task created from tf::cudaFlow::host
-
template<typename... ArgsT>void update_kernel(cudaTask task, dim3 g, dim3 b, size_t shm, ArgsT && ... args)
- updates parameters of a kernel task created from tf::cudaFlow::kernel
-
template<typename T, std::enable_if_t<!std::is_same_v<T, void>, void>* = nullptr>void update_copy(cudaTask task, T* tgt, const T* src, size_t num)
- updates parameters of a memcpy task to form a copy task
- void update_memcpy(cudaTask task, void* tgt, const void* src, size_t bytes)
- updates parameters of a memcpy task
- void update_memset(cudaTask task, void* dst, int ch, size_t count)
- updates parameters of a memset task
-
template<typename T, std::enable_if_t<is_pod_v<T> && (sizeof(T)==1||sizeof(T)==2||sizeof(T)==4), void>* = nullptr>void update_fill(cudaTask task, T* dst, T value, size_t count)
- updates parameters of a memset task to form a fill task
-
template<typename T, std::enable_if_t<is_pod_v<T> && (sizeof(T)==1||sizeof(T)==2||sizeof(T)==4), void>* = nullptr>void update_zero(cudaTask task, T* dst, size_t count)
- updates parameters of a memset task to form a zero task
-
template<typename C>auto single_task(C&& callable) -> cudaTask
- runs a callable with only a single kernel thread
-
template<typename I, typename C>auto for_each(I first, I last, C&& callable) -> cudaTask
- applies a callable to each dereferenced element of the data array
-
template<typename I, typename C>auto for_each_index(I first, I last, I step, C&& callable) -> cudaTask
- applies a callable to each index in the range with the step size
-
template<typename I, typename C, typename... S>auto transform(I first, I last, C&& callable, S... srcs) -> cudaTask
- applies a callable to a source range and stores the result in a target range
-
template<typename I, typename T, typename C>auto reduce(I first, I last, T* result, C&& op) -> cudaTask
- performs parallel reduction over a range of items
-
template<typename I, typename T, typename C>auto uninitialized_reduce(I first, I last, T* result, C&& op) -> cudaTask
- similar to tf::cudaFlow::reduce but does not assume any initial value to reduce
-
template<typename C>auto capture(C&& callable) -> cudaTask
- constructs a subflow graph through tf::cudaFlowCapturer
Function documentation
tf:: cudaFlow:: cudaFlow()
constructs a standalone cudaFlow
A standalone cudaFlow does not go through any taskflow and can be run by the caller thread using explicit offload methods (e.g., tf::cudaFlow::offload).
void tf:: cudaFlow:: dump_native_graph(std::ostream& os) const
dumps the native CUDA graph into a DOT format through an output stream
The native CUDA graph may be different from the upper-level cudaFlow graph when flow capture is involved.
cudaTask tf:: cudaFlow:: noop()
creates a no-operation task
Returns | a tf::cudaTask handle |
---|---|
An empty node performs no operation during execution, but can be used for transitive ordering. For example, a phased execution graph with 2 groups of n nodes with a barrier between them can be represented using an empty node and 2*n dependency edges, rather than no empty node and n^2 dependency edges.
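The phased pattern above can be sketched as follows; this is an illustrative sketch only, assuming two hypothetical kernels phase1_kernel and phase2_kernel plus placeholder grid/block/args:

```cpp
// Sketch only: phase1_kernel, phase2_kernel, grid, block, args, and n
// are hypothetical placeholders.
taskflow.emplace([&](tf::cudaFlow& cf){
  tf::cudaTask barrier = cf.noop();  // empty node acting as the barrier
  for(size_t i = 0; i < n; ++i) {
    // n edges into the barrier and n edges out: 2*n edges total,
    // instead of n^2 direct edges between the two groups
    cf.kernel(grid, block, 0, phase1_kernel, args).precede(barrier);
    barrier.precede(cf.kernel(grid, block, 0, phase2_kernel, args));
  }
});
```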
template<typename C>
cudaTask tf:: cudaFlow:: host(C&& callable)
creates a host task that runs a callable on the host
Template parameters | |
---|---|
C | callable type |
Parameters | |
callable | a callable object with neither arguments nor return (i.e., constructible from std::function<void()> ) |
Returns | a tf::cudaTask handle |
A host task can only execute CPU-specific functions and cannot do any CUDA calls (e.g., cudaMalloc).
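As a sketch, a host task can interleave CPU-side work (such as logging) between device tasks; h_data, d_data, bytes, and my_kernel below are hypothetical names:

```cpp
// Sketch only: d_data/h_data/bytes/grid/block/my_kernel are placeholders.
taskflow.emplace([&](tf::cudaFlow& cf){
  tf::cudaTask h2d  = cf.memcpy(d_data, h_data, bytes);
  tf::cudaTask note = cf.host([](){
    std::printf("data transferred; kernel may start\n");  // CPU-only work
  });
  tf::cudaTask k = cf.kernel(grid, block, 0, my_kernel, d_data);
  h2d.precede(note);
  note.precede(k);
});
```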
template<typename F, typename... ArgsT>
cudaTask tf:: cudaFlow:: kernel(dim3 g,
dim3 b,
size_t s,
F&& f,
ArgsT && ... args)
creates a kernel task
Template parameters | |
---|---|
F | kernel function type |
ArgsT | kernel function parameters type |
Parameters | |
g | configured grid |
b | configured block |
s | configured shared memory size in bytes |
f | kernel function |
args | arguments to forward to the kernel function by copy |
Returns | a tf::cudaTask handle |
template<typename F, typename... ArgsT>
cudaTask tf:: cudaFlow:: kernel_on(int d,
dim3 g,
dim3 b,
size_t s,
F&& f,
ArgsT && ... args)
creates a kernel task on a specific GPU
Template parameters | |
---|---|
F | kernel function type |
ArgsT | kernel function parameters type |
Parameters | |
d | device identifier to launch the kernel |
g | configured grid |
b | configured block |
s | configured shared memory size in bytes |
f | kernel function |
args | arguments to forward to the kernel function by copy |
Returns | a tf::cudaTask handle |
cudaTask tf:: cudaFlow:: memset(void* dst,
int v,
size_t count)
creates a memset task that fills untyped data with a byte value
Parameters | |
---|---|
dst | pointer to the destination device memory area |
v | value to set for each byte of specified memory |
count | size in bytes to set |
Returns | a tf::cudaTask handle |
A memset task fills the first count bytes of the device memory area pointed to by dst with the byte value v.
cudaTask tf:: cudaFlow:: memcpy(void* tgt,
const void* src,
size_t bytes)
creates a memcpy task that copies untyped data in bytes
Parameters | |
---|---|
tgt | pointer to the target memory block |
src | pointer to the source memory block |
bytes | bytes to copy |
Returns | a tf::cudaTask handle |
A memcpy task transfers bytes of data from a source location to a target location. The direction can be arbitrary among CPUs and GPUs.
template<typename T, std::enable_if_t<is_pod_v<T> && (sizeof(T)==1||sizeof(T)==2||sizeof(T)==4), void>* = nullptr>
cudaTask tf:: cudaFlow:: zero(T* dst,
size_t count)
creates a memset task that sets a typed memory block to zero
Template parameters | |
---|---|
T | element type (size of T must be either 1, 2, or 4) |
Parameters | |
dst | pointer to the destination device memory area |
count | number of elements |
Returns | a tf::cudaTask handle |
A zero task zeroes the first count elements of type T in the device memory area pointed to by dst.
template<typename T, std::enable_if_t<is_pod_v<T> && (sizeof(T)==1||sizeof(T)==2||sizeof(T)==4), void>* = nullptr>
cudaTask tf:: cudaFlow:: fill(T* dst,
T value,
size_t count)
creates a memset task that fills a typed memory block with a value
Template parameters | |
---|---|
T | element type (size of T must be either 1, 2, or 4) |
Parameters | |
dst | pointer to the destination device memory area |
value | value to fill for each element of type T |
count | number of elements |
Returns | a tf::cudaTask handle |
A fill task fills the first count elements of type T with value in the device memory area pointed to by dst. The value to fill is interpreted in type T rather than byte.
template<typename T, std::enable_if_t<!std::is_same_v<T, void>, void>* = nullptr>
cudaTask tf:: cudaFlow:: copy(T* tgt,
const T* src,
size_t num)
creates a memcopy task that copies typed data
Template parameters | |
---|---|
T | element type (non-void) |
Parameters | |
tgt | pointer to the target memory block |
src | pointer to the source memory block |
num | number of elements to copy |
Returns | a tf::cudaTask handle |
A copy task transfers num*sizeof(T) bytes of data from a source location to a target location. The direction can be arbitrary among CPUs and GPUs.
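A common host-to-device, compute, device-to-host pipeline using typed copy tasks might look like this sketch; d_vec, h_vec, N, and scale_kernel are hypothetical placeholders:

```cpp
// Sketch only: d_vec is device memory of N floats, h_vec is host memory,
// and scale_kernel is a hypothetical kernel.
taskflow.emplace([&](tf::cudaFlow& cf){
  tf::cudaTask h2d = cf.copy(d_vec, h_vec, N);  // N*sizeof(float) bytes
  tf::cudaTask k   = cf.kernel(grid, block, 0, scale_kernel, d_vec, N);
  tf::cudaTask d2h = cf.copy(h_vec, d_vec, N);
  h2d.precede(k);
  k.precede(d2h);
});
```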
template<typename P>
void tf:: cudaFlow:: offload_until(P&& predicate)
offloads the cudaFlow onto a GPU and repeatedly runs it until the predicate becomes true
Template parameters | |
---|---|
P | predicate type (a binary callable) |
Parameters | |
predicate | a binary predicate (returns true for stop) |
Immediately offloads the present cudaFlow onto a GPU and repeatedly runs it until the predicate returns true.
An offloaded cudaFlow forces the underlying graph to be instantiated. After the instantiation, you should not modify the graph topology but update node parameters.
By default, if users do not offload the cudaFlow, the executor will offload it once.
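As a sketch, a standalone cudaFlow can combine instantiation and repeated runs; step_kernel and d_state are hypothetical, and the predicate here is assumed to be invoked with no arguments:

```cpp
// Sketch only: step_kernel and d_state are placeholders.
tf::cudaFlow cf;  // standalone cudaFlow, not tied to a taskflow
cf.kernel(grid, block, 0, step_kernel, d_state);

size_t iters = 0;
// run the instantiated graph until the predicate returns true
cf.offload_until([&](){ return ++iters > 100; });
```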
void tf:: cudaFlow:: offload_n(size_t N)
offloads the cudaFlow and executes it the given number of times
Parameters | |
---|---|
N | number of executions |
template<typename C>
void tf:: cudaFlow:: update_host(cudaTask task,
C&& callable)
updates parameters of a host task created from tf::cudaFlow::host
The method updates the parameters of the host callable associated with the given task.
template<typename... ArgsT>
void tf:: cudaFlow:: update_kernel(cudaTask task,
dim3 g,
dim3 b,
size_t shm,
ArgsT && ... args)
updates parameters of a kernel task created from tf::cudaFlow::kernel
The method updates the parameters of the kernel associated with the given task. The kernel function itself cannot be changed.
template<typename T, std::enable_if_t<!std::is_same_v<T, void>, void>* = nullptr>
void tf:: cudaFlow:: update_copy(cudaTask task,
T* tgt,
const T* src,
size_t num)
updates parameters of a memcpy task to form a copy task
The method updates the parameters of a copy task. The source/destination memory may have different address values but must be allocated from the same contexts as the original source/destination memory.
void tf:: cudaFlow:: update_memcpy(cudaTask task,
void* tgt,
const void* src,
size_t bytes)
updates parameters of a memcpy task
The method updates the parameters of a memcpy task. The source/destination memory may have different address values but must be allocated from the same contexts as the original source/destination memory.
void tf:: cudaFlow:: update_memset(cudaTask task,
void* dst,
int ch,
size_t count)
updates parameters of a memset task
The method updates the parameters of a memset task. The source/destination memory may have different address values but must be allocated from the same contexts as the original source/destination memory.
template<typename T, std::enable_if_t<is_pod_v<T> && (sizeof(T)==1||sizeof(T)==2||sizeof(T)==4), void>* = nullptr>
void tf:: cudaFlow:: update_fill(cudaTask task,
T* dst,
T value,
size_t count)
updates parameters of a memset task to form a fill task
The method updates the parameters of a fill task. The given arguments and type must comply with the rules of tf::cudaFlow::fill.
template<typename T, std::enable_if_t<is_pod_v<T> && (sizeof(T)==1||sizeof(T)==2||sizeof(T)==4), void>* = nullptr>
void tf:: cudaFlow:: update_zero(cudaTask task,
T* dst,
size_t count)
updates parameters of a memset task to form a zero task
The method updates the parameters of a zero task. The given arguments and type must comply with the rules of tf::cudaFlow::zero.
template<typename C>
cudaTask tf:: cudaFlow:: single_task(C&& callable)
runs a callable with only a single kernel thread
Template parameters | |
---|---|
C | callable type |
Parameters | |
callable | callable to run by a single kernel thread |
Returns | a tf::cudaTask handle |
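A typical use of single_task is updating a single device-resident scalar. This sketch assumes d_flag is a device pointer set up elsewhere and that extended __device__ lambdas are enabled in the build:

```cpp
// Sketch only: d_flag is assumed to point to one int in device memory.
taskflow.emplace([&](tf::cudaFlow& cf){
  cf.single_task([d_flag] __device__ () { *d_flag = 1; });
});
```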
template<typename I, typename C>
cudaTask tf:: cudaFlow:: for_each(I first,
I last,
C&& callable)
applies a callable to each dereferenced element of the data array
Template parameters | |
---|---|
I | iterator type |
C | callable type |
Parameters | |
first | iterator to the beginning (inclusive) |
last | iterator to the end (exclusive) |
callable | a callable object to apply to the dereferenced iterator |
Returns | a tf::cudaTask handle |
This method is equivalent to the parallel execution of the following loop on a GPU:
for(auto itr = first; itr != last; itr++) {
  callable(*itr);
}
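As a usage sketch (assuming d_vec points to N ints in device memory and extended __device__ lambdas are enabled), a for_each task could increment every element:

```cpp
// Sketch only: d_vec and N are placeholders for device data.
taskflow.emplace([&](tf::cudaFlow& cf){
  cf.for_each(d_vec, d_vec + N, [] __device__ (int& x) { x += 1; });
});
```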
template<typename I, typename C>
cudaTask tf:: cudaFlow:: for_each_index(I first,
I last,
I step,
C&& callable)
applies a callable to each index in the range with the step size
Template parameters | |
---|---|
I | index type |
C | callable type |
Parameters | |
first | beginning index |
last | last index |
step | step size |
callable | the callable to apply to each element in the data array |
Returns | a tf::cudaTask handle |
This method is equivalent to the parallel execution of the following loop on a GPU:
// step is positive [first, last)
for(auto i=first; i<last; i+=step) {
  callable(i);
}

// step is negative [first, last)
for(auto i=first; i>last; i+=step) {
  callable(i);
}
template<typename I, typename C, typename... S>
cudaTask tf:: cudaFlow:: transform(I first,
I last,
C&& callable,
S... srcs)
applies a callable to a source range and stores the result in a target range
Template parameters | |
---|---|
I | iterator type |
C | callable type |
S | source types |
Parameters | |
first | iterator to the beginning (inclusive) |
last | iterator to the end (exclusive) |
callable | the callable to apply to each element in the range |
srcs | iterators to the source ranges |
Returns | a tf::cudaTask handle |
This method is equivalent to the parallel execution of the following loop on a GPU:
while (first != last) {
  *first++ = callable(*src1++, *src2++, *src3++, ...);
}
template<typename I, typename T, typename C>
cudaTask tf:: cudaFlow:: reduce(I first,
I last,
T* result,
C&& op)
performs parallel reduction over a range of items
Template parameters | |
---|---|
I | input iterator type |
T | value type |
C | callable type |
Parameters | |
first | iterator to the beginning (inclusive) |
last | iterator to the end (exclusive) |
result | pointer to the result with an initialized value |
op | binary reduction operator |
Returns | a tf::cudaTask handle |
This method is equivalent to the parallel execution of the following loop on a GPU:
while (first != last) {
  *result = op(*result, *first++);
}
template<typename I, typename T, typename C>
cudaTask tf:: cudaFlow:: uninitialized_reduce(I first,
I last,
T* result,
C&& op)
similar to tf::cudaFlow::reduce but does not assume any initial value to reduce
This method is equivalent to the parallel execution of the following loop on a GPU:
*result = *first++;  // no initial values participate in the loop
while (first != last) {
  *result = op(*result, *first++);
}
template<typename C>
cudaTask tf:: cudaFlow:: capture(C&& callable)
constructs a subflow graph through tf::cudaFlowCapturer
Template parameters | |
---|---|
C | callable type constructible from std::function<void(tf::cudaFlowCapturer&)> |
Parameters | |
callable | the callable to construct a capture flow |
Returns | a tf::cudaTask handle |
A captured subflow forms a sub-graph to the cudaFlow and can be used to capture custom (or third-party) kernels that cannot be directly constructed from the cudaFlow.
Example usage:
taskflow.emplace([&](tf::cudaFlow& cf){

  tf::cudaTask my_kernel = cf.kernel(my_arguments);

  // create a flow capturer to capture custom kernels
  tf::cudaTask my_subflow = cf.capture([&](tf::cudaFlowCapturer& capturer){
    capturer.on([&](cudaStream_t stream){
      invoke_custom_kernel_with_stream(stream, custom_arguments);
    });
  });

  my_kernel.precede(my_subflow);
});