class tf::cudaFlowCapturer
class for building a CUDA task dependency graph through stream capture
Contents
A cudaFlowCapturer inherits all the base methods from tf::cudaFlowCapturerBase. The example below captures two kernel tasks, task_1 and task_2, where task_1 runs before task_2.
taskflow.emplace([](tf::cudaFlowCapturer& capturer){
  // capture my_kernel_1 through the given stream managed by the capturer
  auto task_1 = capturer.on([&](cudaStream_t stream){
    my_kernel_1<<<grid_1, block_1, shm_size_1, stream>>>(my_parameters_1);
  });
  // capture my_kernel_2 through the given stream managed by the capturer
  auto task_2 = capturer.on([&](cudaStream_t stream){
    my_kernel_2<<<grid_2, block_2, shm_size_2, stream>>>(my_parameters_2);
  });
  task_1.precede(task_2);
});
Similar to tf::cudaFlow, a cudaFlowCapturer can be created from a taskflow or used as a standalone object. Please refer to GPUTaskingcudaFlowCapturer for details.
Constructors, destructors, conversion operators
- cudaFlowCapturer()
- constructs a standalone cudaFlowCapturer
- ~cudaFlowCapturer() virtual
- destructs the cudaFlowCapturer
Public functions
- auto empty() const -> bool
- queries the emptiness of the graph
- void dump(std::ostream& os) const
- dumps the capture graph into a DOT format through an output stream
- template<typename T, typename... ArgsT> auto make_capturer(ArgsT&&... args) -> T*
- creates a custom capturer derived from tf::cudaFlowCapturerBase
- template<typename OPT, typename... ArgsT> auto make_optimizer(ArgsT&&... args) -> OPT&
- enables different optimization algorithms
- template<typename C, std::enable_if_t<std::is_invocable_r_v<void, C, cudaStream_t>, void>* = nullptr> auto on(C&& callable) -> cudaTask
- captures a sequence of CUDA operations from the given callable
- auto memcpy(void* dst, const void* src, size_t count) -> cudaTask
- copies data between host and device asynchronously through a stream
- template<typename T, std::enable_if_t<!std::is_same_v<T, void>, void>* = nullptr> auto copy(T* tgt, const T* src, size_t num) -> cudaTask
- captures a copy task of typed data
- auto memset(void* ptr, int v, size_t n) -> cudaTask
- initializes or sets GPU memory to the given value byte by byte
- template<typename F, typename... ArgsT> auto kernel(dim3 g, dim3 b, size_t s, F&& f, ArgsT&&... args) -> cudaTask
- captures a kernel
- template<typename C> auto single_task(C&& callable) -> cudaTask
- captures a kernel that runs the given callable with only one thread
- template<typename I, typename C> auto for_each(I first, I last, C&& callable) -> cudaTask
- captures a kernel that applies a callable to each dereferenced element of the data array
- template<typename I, typename C> auto for_each_index(I first, I last, I step, C&& callable) -> cudaTask
- captures a kernel that applies a callable to each index in the range with the step size
- template<typename I, typename C, typename... S> auto transform(I first, I last, C&& callable, S... srcs) -> cudaTask
- captures a kernel that applies a callable to a source range and stores the result in a target range
- template<typename I, typename T, typename C> auto reduce(I first, I last, T* result, C&& op) -> cudaTask
- captures a kernel that performs parallel reduction over a range of items
- template<typename I, typename T, typename C> auto uninitialized_reduce(I first, I last, T* result, C&& op) -> cudaTask
- similar to tf::cudaFlowCapturerBase::reduce but does not assume any initial value to reduce
- template<typename C, std::enable_if_t<std::is_invocable_r_v<void, C, cudaStream_t>, void>* = nullptr> auto rebind_on(cudaTask task, C&& callable) -> cudaTask
- rebinds a capture task to another sequence of CUDA operations
- auto rebind_memcpy(cudaTask task, void* dst, const void* src, size_t count) -> cudaTask
- rebinds a capture task to a memcpy operation
- template<typename T, std::enable_if_t<!std::is_same_v<T, void>, void>* = nullptr> auto rebind_copy(cudaTask task, T* tgt, const T* src, size_t num) -> cudaTask
- rebinds a capture task to a copy operation
- auto rebind_memset(cudaTask task, void* ptr, int value, size_t n) -> cudaTask
- rebinds a capture task to a memset operation
- template<typename F, typename... ArgsT> auto rebind_kernel(cudaTask task, dim3 g, dim3 b, size_t s, F&& f, ArgsT&&... args) -> cudaTask
- rebinds a capture task to a kernel operation
- template<typename P> void offload_until(P&& predicate)
- offloads the captured cudaFlow onto a GPU and repeatedly runs it until the predicate becomes true
- void offload_n(size_t n)
- offloads the captured cudaFlow and executes it n times
- void offload()
- offloads the captured cudaFlow and executes it once
Function documentation
tf:: cudaFlowCapturer:: cudaFlowCapturer()
constructs a standalone cudaFlowCapturer
A standalone cudaFlow capturer does not go through any taskflow and can be run by the caller thread using explicit offload methods (e.g., tf::cudaFlowCapturer::offload).
template<typename T, typename... ArgsT>
T* tf:: cudaFlowCapturer:: make_capturer(ArgsT && ... args)
creates a custom capturer derived from tf::cudaFlowCapturerBase
Template parameters | |
---|---|
T | custom capturer type |
ArgsT | argument types |
Parameters | |
args | arguments to forward to construct the custom capturer |
Returns | a pointer to the custom capturer |
Each cudaFlow capturer keeps a list of custom capturers and manages their lifetimes. The lifetime of each custom capturer is the same as that of the owning capturer.
template<typename OPT, typename... ArgsT>
OPT& tf:: cudaFlowCapturer:: make_optimizer(ArgsT && ... args)
enables different optimization algorithms
Template parameters | |
---|---|
OPT | optimizer type |
ArgsT | argument types |
Parameters | |
args | arguments to forward to construct the optimizer |
Returns | a reference to the optimizer |
We currently support the following optimization algorithms to capture a user-described cudaFlow:
template<typename C, std::enable_if_t<std::is_invocable_r_v<void, C, cudaStream_t>, void>* = nullptr>
cudaTask tf:: cudaFlowCapturer:: on(C&& callable)
captures a sequence of CUDA operations from the given callable
Template parameters | |
---|---|
C | callable type constructible with std::function<void(cudaStream_t)> |
Parameters | |
callable | a callable to capture CUDA operations with the stream |
This method applies a stream created by the flow capturer to capture a sequence of CUDA operations defined in the callable.
cudaTask tf:: cudaFlowCapturer:: memcpy(void* dst,
const void* src,
size_t count)
copies data between host and device asynchronously through a stream
Parameters | |
---|---|
dst | destination memory address |
src | source memory address |
count | size in bytes to copy |
The method captures a cudaMemcpyAsync operation through an internal stream.
template<typename T, std::enable_if_t<!std::is_same_v<T, void>, void>* = nullptr>
cudaTask tf:: cudaFlowCapturer:: copy(T* tgt,
const T* src,
size_t num)
captures a copy task of typed data
Template parameters | |
---|---|
T | element type (non-void) |
Parameters | |
tgt | pointer to the target memory block |
src | pointer to the source memory block |
num | number of elements to copy |
Returns | cudaTask handle |
A copy task transfers num*sizeof(T) bytes of data from a source location to a target location. The direction can be arbitrary among CPUs and GPUs.
cudaTask tf:: cudaFlowCapturer:: memset(void* ptr,
int v,
size_t n)
initializes or sets GPU memory to the given value byte by byte
Parameters | |
---|---|
ptr | pointer to GPU memory |
v | value to set for each byte of the specified memory |
n | size in bytes to set |
The method captures a cudaMemsetAsync operation through an internal stream to fill the first n bytes of the memory area pointed to by ptr with the constant byte value v.
template<typename F, typename... ArgsT>
cudaTask tf:: cudaFlowCapturer:: kernel(dim3 g,
dim3 b,
size_t s,
F&& f,
ArgsT && ... args)
captures a kernel
Template parameters | |
---|---|
F | kernel function type |
ArgsT | kernel function parameters type |
Parameters | |
g | configured grid |
b | configured block |
s | configured shared memory size in bytes |
f | kernel function |
args | arguments to forward to the kernel function by copy |
Returns | cudaTask handle |
template<typename C>
cudaTask tf:: cudaFlowCapturer:: single_task(C&& callable)
captures a kernel that runs the given callable with only one thread
Template parameters | |
---|---|
C | callable type |
Parameters | |
callable | callable to run by a single kernel thread |
template<typename I, typename C>
cudaTask tf:: cudaFlowCapturer:: for_each(I first,
I last,
C&& callable)
captures a kernel that applies a callable to each dereferenced element of the data array
Template parameters | |
---|---|
I | iterator type |
C | callable type |
Parameters | |
first | iterator to the beginning (inclusive) |
last | iterator to the end (exclusive) |
callable | a callable object to apply to the dereferenced iterator |
Returns | cudaTask handle |
This method is equivalent to the parallel execution of the following loop on a GPU:
for(auto itr = first; itr != last; itr++) {
  callable(*itr);
}
template<typename I, typename C>
cudaTask tf:: cudaFlowCapturer:: for_each_index(I first,
I last,
I step,
C&& callable)
captures a kernel that applies a callable to each index in the range with the step size
Template parameters | |
---|---|
I | index type |
C | callable type |
Parameters | |
first | beginning index |
last | last index |
step | step size |
callable | the callable to apply to each element in the data array |
Returns | cudaTask handle |
This method is equivalent to the parallel execution of the following loop on a GPU:
// step is positive [first, last)
for(auto i=first; i<last; i+=step) {
  callable(i);
}

// step is negative [first, last)
for(auto i=first; i>last; i+=step) {
  callable(i);
}
template<typename I, typename C, typename... S>
cudaTask tf:: cudaFlowCapturer:: transform(I first,
I last,
C&& callable,
S... srcs)
captures a kernel that applies a callable to a source range and stores the result in a target range
Template parameters | |
---|---|
I | iterator type |
C | callable type |
S | source types |
Parameters | |
first | iterator to the beginning (inclusive) |
last | iterator to the end (exclusive) |
callable | the callable to apply to each element in the range |
srcs | iterators to the source ranges |
Returns | cudaTask handle |
This method is equivalent to the parallel execution of the following loop on a GPU:
while (first != last) {
  *first++ = callable(*src1++, *src2++, *src3++, ...);
}
template<typename I, typename T, typename C>
cudaTask tf:: cudaFlowCapturer:: reduce(I first,
I last,
T* result,
C&& op)
captures a kernel that performs parallel reduction over a range of items
Template parameters | |
---|---|
I | input iterator type |
T | value type |
C | callable type |
Parameters | |
first | iterator to the beginning (inclusive) |
last | iterator to the end (exclusive) |
result | pointer to the result with an initialized value |
op | binary reduction operator |
Returns | a tf::cudaTask handle |
This method is equivalent to the parallel execution of the following loop on a GPU:
while (first != last) {
  *result = op(*result, *first++);
}
template<typename I, typename T, typename C>
cudaTask tf:: cudaFlowCapturer:: uninitialized_reduce(I first,
I last,
T* result,
C&& op)
similar to tf::cudaFlowCapturerBase::reduce but does not assume any initial value to reduce
This method is equivalent to the parallel execution of the following loop on a GPU:
*result = *first++;  // no initial value participates in the loop
while (first != last) {
  *result = op(*result, *first++);
}
template<typename C, std::enable_if_t<std::is_invocable_r_v<void, C, cudaStream_t>, void>* = nullptr>
cudaTask tf:: cudaFlowCapturer:: rebind_on(cudaTask task,
C&& callable)
rebinds a capture task to another sequence of CUDA operations
The method is similar to cudaFlowCapturerBase::on but with an additional argument on a previously created capture task.
cudaTask tf:: cudaFlowCapturer:: rebind_memcpy(cudaTask task,
void* dst,
const void* src,
size_t count)
rebinds a capture task to a memcpy operation
The method is similar to cudaFlowCapturerBase::memcpy but with an additional argument on a previously created capture task.
template<typename T, std::enable_if_t<!std::is_same_v<T, void>, void>* = nullptr>
cudaTask tf:: cudaFlowCapturer:: rebind_copy(cudaTask task,
T* tgt,
const T* src,
size_t num)
rebinds a capture task to a copy operation
The method is similar to cudaFlowCapturerBase::copy but with an additional argument on a previously created capture task.
cudaTask tf:: cudaFlowCapturer:: rebind_memset(cudaTask task,
void* ptr,
int value,
size_t n)
rebinds a capture task to a memset operation
The method is similar to cudaFlowCapturerBase::memset but with an additional argument on a previously created capture task.
template<typename F, typename... ArgsT>
cudaTask tf:: cudaFlowCapturer:: rebind_kernel(cudaTask task,
dim3 g,
dim3 b,
size_t s,
F&& f,
ArgsT && ... args)
rebinds a capture task to a kernel operation
The method is similar to cudaFlowCapturerBase::kernel but with an additional argument on a previously created capture task.
template<typename P>
void tf:: cudaFlowCapturer:: offload_until(P&& predicate)
offloads the captured cudaFlow onto a GPU and repeatedly runs it until the predicate becomes true
Template parameters | |
---|---|
P | predicate type (a binary callable) |
Parameters | |
predicate | a binary predicate (returns true for stop) |
Immediately offloads the cudaFlow captured so far onto a GPU and repeatedly runs it until the predicate returns true.
By default, if users do not offload the cudaFlow capturer, the executor will offload it once.
void tf:: cudaFlowCapturer:: offload_n(size_t n)
offloads the captured cudaFlow and executes it n times
Parameters | |
---|---|
n | number of executions |