Copyright (c) 2022-2023 Intel Corporation. All rights reserved.
Note: Khronos® is a registered trademark, and SYCL™ and SPIR™ are trademarks of The Khronos Group Inc. OpenCL™ is a trademark of Apple Inc. used by permission by Khronos.
To report problems with this extension, please open a new issue at:
Pablo Reble, Intel
Julian Miller, Intel
John Pennycook, Intel
Guo Yejun, Intel
Dan Holmes, Intel
Greg Lueck, Intel
Steffen Larsen, Intel
Jaime Arteaga Molina, Intel
Andrei Elovikov, Intel
Ewan Crawford, Codeplay
Ben Tracy, Codeplay
Duncan McBain, Codeplay
Peter Žužek, Codeplay
Ruyman Reyes, Codeplay
Gordon Brown, Codeplay
Erik Tomusk, Codeplay
Bjoern Knafla, Codeplay
Lukas Sommer, Codeplay
Maxime France-Pillois, Codeplay
Jack Kirk, Codeplay
Ronan Keryell, AMD
Andrey Alekseenko, KTH Royal Institute of Technology
Fábio Mestre, Codeplay
Konrad Kusiak, Codeplay
This extension is written against the SYCL 2020 revision 6 specification. All references below to the "core SYCL specification" or to section numbers in the SYCL specification refer to that revision.
This is an experimental extension specification, intended to provide early access to features and gather community feedback. Interfaces defined in this specification are implemented in DPC++, but they are not finalized and may change incompatibly in future versions of DPC++ without prior notice. Shipping software products should not rely on APIs defined in this specification.
With command groups SYCL is already able to create an implicit dependency graph (in the form of a directed acyclic graph) of kernel execution at runtime, as a command group object defines a set of requisites (edges) which must be satisfied for commands (nodes) to be executed. However, because command-group submission is tied to execution on the queue, without having a prior construction step before starting execution, optimization opportunities are missed from the runtime not being made aware of a defined dependency graph ahead of execution.
The following benefits would become possible if the user could define a dependency graph to the SYCL runtime prior to execution:
- Reduction in runtime overhead by only submitting a single graph object, rather than many individual command groups.
- Enabling more work to be done ahead of time to improve runtime performance. This early work could be done in a setup phase of the program prior to repeated executions of the graph. Alternatively, a future offline AOT compiler in a different process could be run prior to the execution of the application.
- Unlocking DMA hardware features through graph analysis by the runtime.
- Graph optimizations become available, including but not limited to:
  - Kernel fusion/fission.
  - Inter-node memory reuse from data staying resident on device.
  - Identification of the peak intermediate output memory requirement, used for more optimal memory allocation.

As well as benefits to the SYCL runtime, there are also advantages for users developing SYCL applications, as repetitive workloads no longer have to redundantly issue the same sequence of commands. Instead, a graph is constructed once and submitted for execution as many times as necessary, changing only the data in input buffers or USM allocations. This is particularly useful for applications in domains such as machine learning, where the same command-group pattern is run repeatedly for different inputs.
In order to achieve the goals described in previous sections, the following requirements were considered:
1. Ability to update parameters of the graph between submissions, without changing the overall graph structure.
2. Enable low-effort porting of existing applications to use the extension.
3. Profiling, debugging, and tracing functionality at the granularity of graph nodes.
4. Integrate sub-graphs (previously constructed graphs) when constructing a new graph.
5. Support the USM model of memory as well as the buffer/accessor model.
6. Compatibility with other SYCL extensions and features, e.g. kernel fusion and built-in kernels.
7. Ability to record a graph with commands submitted to different devices in the same context.
8. Capability to serialize graphs to a binary format which can then be de-serialized and executed. This is helpful for AOT cases where a graph can be created by an offline tool to be loaded and run without the end-user incurring the overheads of graph creation.
9. Backend interoperability: the ability to retrieve a native graph object from the graph and use it in a native backend API.
To allow prototype implementations of this extension to be developed quickly for evaluation, the scope of this proposal was limited to a subset of these requirements. In particular, the serialization functionality (8), backend interoperability (9), and a profiling/debugging interface (3) were omitted, as these are not easy to abstract over several backends without significant investigation. It is also hoped that these features can be exposed as additive changes to the API, and thus introduced in future versions of the extension.
Another reason for deferring a serialize/deserialize API (8) is that its scope could extend from emitting the graph in a binary format, to emitting a standardized IR format that enables further device specific graph optimizations.
Multi-device support (7) is something that we are considering introducing into the extension in later revisions, which may result in API changes. It has been planned for only to the extent that the definition of a graph node is device-specific; currently, however, all nodes in a graph must target the same device provided to the graph constructor.
This extension contains two different API mechanisms for constructing a graph of commands:
- Explicit graph building API: allows users to specify the exact nodes and edges they want to add to the graph.
- Queue recording API (aka "Record & Replay"): introduces state to a sycl::queue such that, rather than being scheduled immediately for execution, commands are added to the graph object instead, with edges captured from the dependencies of each command group.
Each of these mechanisms for constructing a graph has its own advantages, so having both APIs available allows users to pick the one which is most suitable for them. The queue recording API allows quicker porting of existing applications and can capture external work that is submitted to a queue, for example via library function calls. The explicit API, by contrast, can better express which data is internal to the graph for optimization purposes, and dependencies do not need to be inferred.
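As an illustration, recording an existing queue-based workload might look like the following sketch (assuming a DPC++ build with this extension; the kernel body is a placeholder):

```cpp
#include <sycl/sycl.hpp>
namespace sycl_ext = sycl::ext::oneapi::experimental;

int main() {
  sycl::queue q;
  sycl_ext::command_graph graph(q.get_context(), q.get_device());

  // While recording, submissions to q become graph nodes instead of
  // being scheduled for immediate execution.
  graph.begin_recording(q);
  q.submit([&](sycl::handler& cgh) {
    cgh.single_task([]() { /* placeholder kernel */ });
  });
  graph.end_recording();
  // graph now holds one node; later sections cover finalization.
}
```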
It is valid to combine these two mechanisms; however, it is invalid to modify a graph using the explicit API while that graph is currently recording commands from any queue. For example:
graph.begin_recording(queue);
graph.add(/*command group*/); // Invalid as graph is recording a queue
graph.end_recording();
This extension provides a feature-test macro as described in the core SYCL
specification section 6.3.3 "Feature test macros". Therefore, an
implementation supporting this extension must predefine the macro
SYCL_EXT_ONEAPI_GRAPH
to one of the values defined in the table below.
Applications can test for the existence of this macro to determine if the
implementation supports this feature, or applications can test the macro’s
value to determine which of the extension’s APIs the implementation supports.
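A guard on the feature-test macro might look like this sketch (the fallback path is application-specific):

```cpp
#include <sycl/sycl.hpp>

#if defined(SYCL_EXT_ONEAPI_GRAPH)
// Safe to use sycl::ext::oneapi::experimental::command_graph here.
#else
// Fall back to submitting command groups directly to a queue.
#endif
```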
Table 1. Values of the SYCL_EXT_ONEAPI_GRAPH macro.
Value | Description |
---|---|
1 |
Initial extension version. Base features are supported. |
Table 2. Terminology.
Concept | Description |
---|---|
Graph |
A directed acyclic graph (DAG) of commands (nodes) and their dependencies
(edges), represented by the command_graph class. |
Node |
A command, which can have different attributes, targeting a specific device. |
Edge |
Dependency between commands as a happens-before relationship. |
When using the explicit graph building API to construct a graph, nodes and edges are captured as follows.
Table 3. Explicit Graph Definition.
Concept | Description |
---|---|
Node |
In the explicit graph building API nodes are created by the user invoking methods on a modifiable graph passing a command-group function (CGF). Each node represents either a command-group or an empty operation. |
Edge |
In the explicit graph building API edges are primarily defined by the user
through newly added interfaces: either with the make_edge() member function,
or with the property::node::depends_on property passed when a node is added.
Edges can also be created when explicitly adding nodes to the graph through
existing SYCL mechanisms for expressing dependencies: data dependencies from
accessors to existing nodes in the graph are captured as an edge.
|
When using the record & replay API to construct a graph by recording a queue, nodes and edges are captured as follows.
Table 4. Recorded Graph Definition.
Concept | Description |
---|---|
Node |
A node in a queue recorded graph represents a command-group submission to the
device associated with the queue being recorded. Nodes are constructed from
the command-group functions (CGF) passed to queue::submit() or the queue shortcut functions. |
Edge |
An edge in a queue recorded graph is expressed through command-group
dependencies in one of three ways. Firstly, through buffer accessors that
represent data dependencies between two command groups captured as nodes.
Secondly, by using handler::depends_on() with an event returned from a
command submitted to a queue in the recording state. Thirdly, when recording
an in-order queue, where each submitted command depends on the previously
submitted command. |
A node in a graph can take the form of a nested sub-graph. This occurs when
a command-group submission that invokes handler::ext_oneapi_graph()
with an
executable graph object is added to the graph as a node. The child graph node is
scheduled in the parent graph as-if edges are created to connect the root nodes
of the child graph with the dependent nodes of the parent graph.
Adding an executable graph as a sub-graph does not affect its existing node dependencies, such that it could be submitted in future without any side effects of prior uses as a sub-graph.
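Assuming subExec is an executable graph finalized earlier, adding it as a sub-graph node might look like this sketch (mainGraph and subExec are hypothetical variables):

```cpp
namespace sycl_ext = sycl::ext::oneapi::experimental;

// subExec: a previously finalized
// command_graph<graph_state::executable> (hypothetical).
auto subGraphNode = mainGraph.add([&](sycl::handler& cgh) {
  cgh.ext_oneapi_graph(subExec); // embeds subExec as a sub-graph node
});
```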
Due to the experimental nature of the extension, support is not available across all devices.
Table 5. Device Support Aspect.
Device Descriptor | Description |
---|---|
aspect::ext_oneapi_graph |
Indicates that the device supports all the APIs described in this extension. |
aspect::ext_oneapi_limited_graph |
Indicates that the device supports all the APIs described in this extension except for those described in the Executable Graph Update section. This is a temporary aspect that we intend to remove once devices with full graph support are more prevalent. |
namespace sycl::ext::oneapi::experimental {
enum class node_type {
empty,
subgraph,
kernel,
memcpy,
memset,
memfill,
prefetch,
memadvise,
ext_oneapi_barrier,
host_task,
};
class node {
public:
node() = delete;
node_type get_type() const;
std::vector<node> get_predecessors() const;
std::vector<node> get_successors() const;
static node get_node_from_event(event nodeEvent);
template <int Dimensions>
void update_nd_range(nd_range<Dimensions> executionRange);
template <int Dimensions>
void update_range(range<Dimensions> executionRange);
};
} // namespace sycl::ext::oneapi::experimental
Node is a class that encapsulates tasks, such as SYCL kernel functions or memory operations, for deferred execution. A graph must be created first; its structure is then defined by adding nodes and edges.
The node
class provides the common reference semantics.
Table 6. Member functions of the node
class.
Member Function | Description |
---|---|
node_type get_type() const; |
Returns a value representing the type of command this node represents. |
std::vector<node> get_predecessors() const; |
Returns a list of the predecessor nodes which this node directly depends on. |
std::vector<node> get_successors() const; |
Returns a list of the successor nodes which directly depend on this node. |
static node get_node_from_event(event nodeEvent); |
Finds the node associated with an event created from a submission to a queue in the recording state. Parameters:
Returns: Graph node that was created when the command that returned
Exceptions:
|
template <int Dimensions>
void update_nd_range(nd_range<Dimensions> executionRange); |
Updates the ND-range for this node with a new value. This new value will not affect any executable graphs this node is part of until it is passed to the executable graph’s update function. See Executable Graph Update for more information about updating kernel nodes. Parameters:
Exceptions:
|
template <int Dimensions>
void update_range(range<Dimensions> executionRange); |
Updates the execution range for this node with a new value. This new value will not affect any executable graphs this node is part of until it is passed to the executable graph’s update function. See Executable Graph Update for more information about updating kernel nodes. Parameters:
Exceptions:
|
namespace sycl::ext::oneapi::experimental {
template <typename ValueT>
class dynamic_parameter {
public:
dynamic_parameter(command_graph<graph_state::modifiable> graph, const ValueT &initialValue);
void update(const ValueT& newValue);
};
}
Dynamic parameters are arguments to a node’s command-group which can be updated by the user after the node has been added to a graph. Updating the value of a dynamic parameter will be reflected in the modifiable graph which contains this node. These updated nodes can then be passed to an executable graph to update it with new values.
The type of the underlying object a dynamic parameter represents is set at compile time using a template parameter. This underlying type can be an accessor, a pointer to a USM allocation, scalar passed by value, or a raw byte representation of the argument. The raw byte representation is intended to enable updating arguments set using sycl_ext_oneapi_raw_kernel_arg.
Dynamic parameters are registered with nodes in a modifiable graph, with each
registration associating one or more node arguments to the dynamic parameter
instance. Registration happens inside the command-group that the node
represents, and is done when dynamic parameters are set as parameters to the
kernel using handler::set_arg()
/handler::set_args()
. It is valid for a node
argument to be registered with more than one dynamic parameter instance.
See Executable Graph Update for more information about updating node parameters.
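The registration flow might look like the following sketch, where ctx, dev, q, and builtinKernel (a sycl::kernel obtained from a kernel bundle) are hypothetical, and argument index 0 is assumed to be a USM pointer:

```cpp
namespace sycl_ext = sycl::ext::oneapi::experimental;

sycl_ext::command_graph graph(ctx, dev);
int* ptrA = sycl::malloc_device<int>(1024, q);

// Dynamic parameter representing the pointer argument,
// with an initial value of ptrA.
sycl_ext::dynamic_parameter dynPtr(graph, ptrA);

auto node = graph.add([&](sycl::handler& cgh) {
  cgh.set_arg(0, dynPtr); // registers dynPtr with argument 0 of this node
  cgh.parallel_for(sycl::range{1024}, builtinKernel);
});
```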
The dynamic_parameter
class provides the common reference semantics.
Table 7. Member functions of the dynamic_parameter
class.
Member Function | Description |
---|---|
dynamic_parameter(command_graph<graph_state::modifiable> graph,
const ValueT &initialValue); |
Constructs a dynamic parameter object that can be registered with command graph nodes with an initial value. Parameters:
|
void update(const ValueT& newValue); |
Updates parameters in all nodes registered with this dynamic parameter to
It is not an error if Parameters:
|
namespace sycl::ext::oneapi::experimental {
class dynamic_command_group {
public:
dynamic_command_group(
command_graph<graph_state::modifiable> &graph,
const std::vector<std::function<void(handler &)>>& cgfList);
size_t get_active_index() const;
void set_active_index(size_t cgfIndex);
};
} // namespace sycl::ext::oneapi::experimental
Dynamic command-groups can be added as nodes to a graph. They provide a mechanism for updating the command-group function of a node after the graph is finalized. There is always exactly one command-group function in the dynamic command-group that is set as active; this is the kernel which will execute for the node when the graph is finalized into an executable state command_graph, while all the other command-group functions in cgfList are ignored. The executable command_graph node can then be updated to a different kernel in cgfList by selecting a new active index on the dynamic command-group object and calling the update(node& node) method on the executable command_graph.
The dynamic_command_group
class provides the common reference semantics.
See Executable Graph Update for more information about updating command-groups.
Dynamic command-groups can only contain kernel operations. Trying to construct a dynamic command-group with functions that contain other operations will result in an error.
All the command-group functions in a dynamic command-group must have identical dependencies.
It is not allowed for a dynamic command-group to have command-group functions that would result in a change to the graph topology when set to active. In practice, this means that any calls to handler::depends_on() must be identical for all the command-group functions in a dynamic command-group, and the dependencies created by buffer accessors must also create identical node dependencies across all of the command-group functions.
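Putting this together, a sketch of swapping the kernel of a node after finalization (graph, q, and ptr are hypothetical; index 0 is assumed active initially):

```cpp
namespace sycl_ext = sycl::ext::oneapi::experimental;

auto cgfA = [&](sycl::handler& cgh) {
  cgh.parallel_for(sycl::range{1024}, [=](sycl::id<1> i) { ptr[i] = 1; });
};
auto cgfB = [&](sycl::handler& cgh) {
  cgh.parallel_for(sycl::range{1024}, [=](sycl::id<1> i) { ptr[i] = 2; });
};

sycl_ext::dynamic_command_group dynCG(graph, {cgfA, cgfB});
auto dynNode = graph.add(dynCG); // cgfA (index 0) assumed active

auto exec = graph.finalize({sycl_ext::property::graph::updatable{}});
// Later: switch the node to cgfB and commit to the executable graph.
dynCG.set_active_index(1);
exec.update(dynNode);
```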
Table 8. Member functions of the dynamic_command_group
class.
Member Function | Description |
---|---|
dynamic_command_group(
command_graph<graph_state::modifiable> &graph,
const std::vector<std::function<void(handler &)>>& cgfList); |
Constructs a dynamic command-group object that can be added as a node to a Parameters:
Exceptions:
|
size_t get_active_index() const; |
Returns the index of the currently active command-group function in this
|
void set_active_index(size_t cgfIndex); |
Sets the command-group function with index This change will be reflected immediately in the modifiable graph which contains this
Setting Parameters:
Exceptions:
|
namespace sycl::ext::oneapi::experimental::property::node {
class depends_on {
public:
template<typename... NodeTN>
depends_on(NodeTN... nodes);
};
}
The API for explicitly adding nodes to a command_graph includes a property_list parameter. This extension defines the depends_on property to be passed here. depends_on specifies node objects which the created node depends on, and therefore forms an edge with. These nodes are in addition to the dependent nodes identified from the command-group requisites of the created node.
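A sketch of forming an edge via the property when adding nodes explicitly (graph is a hypothetical modifiable command_graph):

```cpp
namespace sycl_ext = sycl::ext::oneapi::experimental;

auto nodeA = graph.add([&](sycl::handler& cgh) {
  cgh.single_task([]() { /* kernel A */ });
});
// nodeB depends on nodeA, forming the edge A -> B.
auto nodeB = graph.add(
    [&](sycl::handler& cgh) { cgh.single_task([]() { /* kernel B */ }); },
    {sycl_ext::property::node::depends_on(nodeA)});
```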
namespace sycl::ext::oneapi::experimental::property::node {
class depends_on_all_leaves {
public:
depends_on_all_leaves() = default;
};
}
The API for explicitly adding nodes to a command_graph includes a property_list parameter. This extension defines the depends_on_all_leaves property to be passed here. depends_on_all_leaves provides a shortcut for adding all the current leaves of a graph as dependencies of the created node.
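For instance, an empty node can act as a join point over the whole graph (a sketch; graph is a hypothetical modifiable command_graph):

```cpp
namespace sycl_ext = sycl::ext::oneapi::experimental;

// Empty node that depends on every current leaf node of the graph,
// giving later nodes a single connection point.
auto joinNode =
    graph.add({sycl_ext::property::node::depends_on_all_leaves{}});
```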
namespace sycl::ext::oneapi::experimental {
// State of a graph
enum class graph_state {
modifiable,
executable
};
// New object representing graph
template<graph_state State = graph_state::modifiable>
class command_graph {};
template<>
class command_graph<graph_state::modifiable> {
public:
command_graph(const context& syclContext, const device& syclDevice,
const property_list& propList = {});
command_graph(const queue& syclQueue,
const property_list& propList = {});
command_graph<graph_state::executable>
finalize(const property_list& propList = {}) const;
void begin_recording(queue& recordingQueue, const property_list& propList = {});
void begin_recording(const std::vector<queue>& recordingQueues, const property_list& propList = {});
void end_recording();
void end_recording(queue& recordingQueue);
void end_recording(const std::vector<queue>& recordingQueues);
node add(const property_list& propList = {});
template<typename T>
node add(T cgf, const property_list& propList = {});
node add(dynamic_command_group& dynamicCG, const property_list& propList = {});
void make_edge(node& src, node& dest);
void print_graph(std::string path, bool verbose = false) const;
std::vector<node> get_nodes() const;
std::vector<node> get_root_nodes() const;
};
template<>
class command_graph<graph_state::executable> {
public:
command_graph() = delete;
void update(node& node);
void update(const std::vector<node>& nodes);
void update(const command_graph<graph_state::modifiable>& graph);
};
} // namespace sycl::ext::oneapi::experimental
This extension adds a new command_graph
object which follows the
common reference semantics of other SYCL runtime objects.
A command_graph
represents a directed acyclic graph of nodes, where each node
represents a single command for a specific device or a sub-graph. The execution
of a graph completes when all its nodes have completed.
A command_graph
is built up by either recording queue submissions or
explicitly adding nodes, then once the user is happy that the graph is complete,
the graph instance is finalized into an executable variant which can have no
more nodes added to it. Finalization may be a computationally expensive
operation as the runtime can perform optimizations based on the graph
structure. After finalization the graph can be submitted for execution on a
queue one or more times with reduced overhead.
A command_graph
can be submitted to both in-order and out-of-order queues. Any
dependencies between the graph and other command-groups submitted to the same
queue will be respected. However, the in-order and out-of-order properties of the
queue have no effect on how the nodes within the graph are executed (e.g. the graph
nodes without dependency edges may execute out-of-order even when using an in-order
queue). For further information about how the properties of a queue affect graphs
see the section on Queue Properties.
An instance of a command_graph
object can be in one of two states:
- Modifiable: the graph is under construction and new nodes may be added to it.
- Executable: the graph topology is fixed after finalization and the graph is ready to be submitted for execution.
A command_graph
object is constructed in the modifiable state and is made
executable by the user invoking command_graph::finalize()
to create a
new executable instance of the graph. An executable graph cannot be converted
to a modifiable graph. After finalizing a graph in the modifiable state, it is
valid for a user to add additional nodes and finalize again to create subsequent
executable graphs. The state of a command_graph
object is made explicit by
templating on state to make the class strongly typed, with the default template
argument being graph_state::modifiable
to reduce code verbosity on
construction.
graph LR
Modifiable -->|Finalize| Executable
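The full lifecycle might be exercised as in this sketch, assuming the queue shortcut for submitting an executable graph:

```cpp
#include <sycl/sycl.hpp>
namespace sycl_ext = sycl::ext::oneapi::experimental;

int main() {
  sycl::queue q;
  sycl_ext::command_graph graph(q.get_context(), q.get_device());

  graph.begin_recording(q);
  // ... submit command groups to q ...
  graph.end_recording();

  // Potentially expensive: the runtime may optimize the fixed topology.
  auto exec = graph.finalize();

  for (int i = 0; i < 100; ++i) {
    q.ext_oneapi_graph(exec); // replay with reduced per-submission overhead
  }
  q.wait();
}
```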
A graph in the executable state can have the configuration of its nodes modified using a concept called graph update. This avoids a user having to rebuild and finalize a new executable graph when only the parameters of graph nodes change between submissions.
Updates to a graph will be scheduled after any in-flight executions of the same graph and will not affect previous submissions of the same graph. The user is not required to wait on any previous submissions of a graph before updating it.
To update an executable graph, the property::graph::updatable property must have been passed when the executable graph was created during finalization; otherwise, an exception will be thrown if a user tries to update it. This guarantee allows the backend to provide a more optimized implementation, if possible.
The only types of nodes that are currently able to be updated in a graph are kernel execution nodes.
There are two different APIs that can be used to update a graph:
- Individual Node Update, which allows updating individual nodes of a command-graph.
- Whole Graph Update, which allows updating the entire graph simultaneously by using another graph as a reference.
The aspects of a kernel execution node that can be changed during update are different depending on the API used to perform the update:
- For the Individual Node Update API, it is possible to update the kernel function, the parameters to the kernel, and the ND-range.
- For the Whole Graph Update API, only the parameters of the kernel and the ND-range can be updated.
Individual nodes of an executable graph can be updated directly. Depending on the attribute of the node that requires updating, different APIs should be used:
Parameters to individual nodes in a graph in the executable
state can be
updated between graph executions using dynamic parameters. A dynamic_parameter
object is created with a modifiable state graph and an initial value for the
parameter. Dynamic parameters can then be registered with nodes in that graph
when passed to calls to set_arg()/set_args()
.
Parameter updates are performed using a dynamic_parameter
instance by calling
dynamic_parameter::update()
to update all the parameters of nodes to which the
dynamic_parameter
is registered. Updates will not affect any nodes which were
not registered, even if they use the same parameter value as a
dynamic_parameter
.
Since the structure of the graph is fixed at finalization, updating parameters on a node will not change the already defined dependencies between nodes. This is important to note when updating the buffer parameters of a node, since no edges will be automatically created or removed based on this change. Care should be taken that updates of buffer parameters do not change the behavior of a graph when executed.
For example, if two nodes (NodeA and NodeB) are connected by an edge due to a dependency on the same buffer, both nodes must have this buffer parameter updated to the new value. This maintains the correct data dependency and prevents unexpected behavior. To achieve this, one dynamic parameter for the buffer can be registered with all the nodes which use the buffer as a parameter; a single dynamic_parameter::update() call will then maintain the graph's data dependencies.
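A sketch of this pattern using an accessor-typed dynamic parameter (graph is hypothetical; accessor construction and registration details are elided):

```cpp
namespace sycl_ext = sycl::ext::oneapi::experimental;

sycl::buffer<int> bufA{sycl::range{1024}};
sycl::accessor accA{bufA}; // placeholder accessor, bound inside the CGFs

// One dynamic parameter registered (via set_arg) with both NodeA and NodeB.
sycl_ext::dynamic_parameter dynAcc(graph, accA);

// ... add NodeA and NodeB, calling cgh.set_arg(..., dynAcc) in each ...

// Later, retarget both nodes to a new buffer in one call, preserving
// the NodeA -> NodeB data dependency.
sycl::buffer<int> bufB{sycl::range{1024}};
sycl::accessor accB{bufB};
dynAcc.update(accB);
```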
Another configuration that can be updated is the execution range of the kernel. This can be set through node::update_nd_range() or node::update_range() and does not require any prior registration.
An alternative way to update the execution range of a node is to do so while updating command-groups, as described in the next section. Using this mechanism lifts the restriction of node::update_nd_range() / node::update_range() that the execution range can only be updated within the same dimension, since tying the update to a change in command-group means that the updated kernel code may be defined as operating in a different dimension.
The command-groups of a kernel node can be updated using dynamic command-groups. Dynamic command-groups allow replacing the command-group function of a kernel node with a different one. This effectively allows updating the kernel function and/or the kernel execution range.
Command-group updates are performed by creating an instance of the
dynamic_command_group
class. A dynamic command-group is created with a modifiable
state graph and a list of possible command-group functions. Command-group functions
within a dynamic command-group can then be set to active by using the member function
dynamic_command_group::set_active_index()
.
Dynamic command-groups are compatible with dynamic parameters. This means that dynamic parameters can be used in command-group functions that are part of dynamic command-groups. Updates to such dynamic parameters will be reflected in the command-group functions once they are activated.
Note that the execution range is tied to the command-group, therefore updating
the range of a node which uses a dynamic command-group will update the
execution range of the currently active command-group. If the dynamic
command-group is shared by another node, it will also update the execution
range of the other nodes sharing that dynamic command-group. Activating a
command-group with set_active_index
to a command-group that previously had
its execution range updated with node::update_range()
or
node::update_nd_range()
will not reset the execution range to the original
value, but instead use the most recently updated value.
Updating a node using the methods mentioned above will take effect immediately for nodes in modifiable command-graphs. However, for graphs in the executable state, in order to commit the update, the updated nodes must be passed to command_graph<graph_state::executable>::update(node& node) or command_graph<graph_state::executable>::update(const std::vector<node>& nodes).
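For example, updating the ND-range of a kernel node and committing it might look like this sketch (kernelNode and execGraph are hypothetical variables):

```cpp
// kernelNode: a node from the modifiable graph used to create execGraph.
kernelNode.update_nd_range(
    sycl::nd_range{sycl::range{64}, sycl::range{8}});

// The change only reaches the executable graph once committed:
execGraph.update(kernelNode);
```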
A graph in the executable state can have all of its nodes updated using the
command_graph<graph_state::executable>::update(graph)
method. This method
takes a source graph in the modifiable state and updates the nodes in the target
executable state graph to reflect any changes made to the nodes in the source
graph. The characteristics which will be updated are detailed in the section on
Executable Graph Update.
Both the source and target graphs for the update must satisfy the following conditions:
- Both graphs must have been created with the same device and context.
- Both graphs must be topologically identical. The graphs are considered topologically identical when:
  - Both graphs have the same number of nodes and edges.
  - Internal edges are between corresponding nodes in each graph.
  - Nodes were added in the same order in the two graphs. Nodes may be added via command_graph::add, or for a recorded queue via queue::submit or the queue shortcut functions.
  - Corresponding nodes in each graph are kernels of the same type:
    - When the kernel is defined as a lambda, the lambda must be the same.
    - When the kernel is defined as a named function object, the kernel class must be the same.
    - When the kernel is defined as a plain function, the function must be the same.
  - Edge dependencies for each node in the two graphs were created in the same order, using the same API invocation to create each edge. See the terminology section for an exhaustive definition of how edges are defined in a graph for each of the two graph construction APIs.

Attempting to use whole-graph update with source or target graphs which do not satisfy these conditions of topological identity results in undefined behavior, as it may prevent the runtime from pairing nodes in the source and target graphs.
It is valid to use nodes that contain dynamic parameters in whole graph updates. If a node containing a dynamic parameter is updated through the whole graph update API, then any previous updates to the dynamic parameter will be reflected in the new graph.
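A sketch of the whole-graph flow, assuming graphA and graphB were built to be topologically identical and q is an existing queue:

```cpp
namespace sycl_ext = sycl::ext::oneapi::experimental;

// graphA, graphB: modifiable graphs with identical topology but
// different kernel arguments (hypothetical).
auto execA = graphA.finalize({sycl_ext::property::graph::updatable{}});
sycl::event e = q.ext_oneapi_graph(execA);
e.wait();

// Retarget execA's nodes to the parameters recorded in graphB.
execA.update(graphB);
q.ext_oneapi_graph(execA); // next submission uses graphB's parameters
```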
namespace sycl::ext::oneapi::experimental::property::graph {
class no_cycle_check {
public:
no_cycle_check() = default;
};
}
The property::graph::no_cycle_check property disables checking whether a newly added dependency will lead to a cycle in a specific command_graph, and can be passed to a command_graph on construction via the property list parameter. As a result, no errors are reported when a function tries to create a cyclic dependency; it is the user's responsibility to create an acyclic graph for execution when this property is set. Creating a cycle in a command_graph puts that command_graph into an undefined state, and any further operations performed on a command_graph in this state will result in undefined behavior.
namespace sycl::ext::oneapi::experimental::property::graph {
class assume_buffer_outlives_graph {
public:
assume_buffer_outlives_graph() = default;
};
}
The property::graph::assume_buffer_outlives_graph
property disables
restrictions on using buffers in a command_graph
and
can be passed to a command_graph
on construction via the property list
parameter. This property represents a promise from the user that any buffer
which is used in a graph will be kept alive on the host for the lifetime of the
graph. Destroying that buffer during the lifetime of a command_graph
constructed with this property results in undefined behavior.
namespace sycl::ext::oneapi::experimental::property::graph {
class updatable {
public:
updatable() = default;
};
}
The property::graph::updatable
property enables updating a command_graph
when passed on finalization of a modifiable command_graph
. For further
information see the section on Executable Graph
Update.
namespace sycl::ext::oneapi::experimental::property::graph {
class enable_profiling {
public:
enable_profiling() = default;
};
}
The property::graph::enable_profiling
property enables profiling events
returned from submissions of the executable graph. Passing this property
implies disabling certain optimizations. As a result, the execution time of a
graph finalized with profiling enabled is longer than that of a graph without
profiling capability. An error will be thrown when attempting to profile an
event from a graph submission that was created without this property.
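Reading timestamps from a graph submission might look like this sketch (it is assumed here that q was also created with sycl::property::queue::enable_profiling):

```cpp
namespace sycl_ext = sycl::ext::oneapi::experimental;

auto exec = graph.finalize({sycl_ext::property::graph::enable_profiling{}});

sycl::event ev = q.ext_oneapi_graph(exec);
ev.wait();

// Standard SYCL event profiling queries apply to the graph submission.
auto start =
    ev.get_profiling_info<sycl::info::event_profiling::command_start>();
auto end =
    ev.get_profiling_info<sycl::info::event_profiling::command_end>();
```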
Table 9. Constructors of the command_graph class.
Constructor | Description |
---|---|
command_graph(const context& syclContext,
const device& syclDevice,
const property_list& propList = {}); |
Creates a SYCL Constraints:
Parameters:
Exceptions:
|
command_graph(const queue& syclQueue,
const property_list& propList = {}); |
Simplified constructor form where Constraints:
Parameters:
Exceptions:
|
Table 10. Member functions of the command_graph class.
Member function | Description |
---|---|
node add(const property_list& propList = {}); |
This creates an empty node which contains no command. Its intended use is to make a connection point inside a graph between groups of nodes, which can significantly reduce the number of edges (O(n) vs. O(n²)). Constraints:
Parameters:
Returns: The empty node which has been added to the graph. Exceptions:
|
template<typename T>
node add(T cgf, const property_list& propList = {}); |
The Constraints:
Parameters:
Returns: The command-group function object node which has been added to the graph. Exceptions:
|
node add(dynamic_command_group& dynamicCG, const property_list& propList = {}); |
Adds the dynamic command-group The current active command-group function in Constraints:
Parameters:
Returns: The dynamic command-group object node which has been added to the graph. Exceptions:
|
void make_edge(node& src, node& dest); |
Creates a dependency between two nodes representing a happens-before relationship. Constraints:
Parameters:
Exceptions:
|
command_graph<graph_state::executable>
finalize(const property_list& propList = {}) const; |
Synchronous operation that creates a new graph in the executable state with a fixed topology that can be submitted for execution on any queue sharing the context associated with the graph. It is valid to call this method multiple times to create subsequent executable graphs. It is also valid to continue to add new nodes to the modifiable graph instance after calling this function. It is valid to finalize an empty graph instance with no recorded commands. Constraints:
Parameters:
Returns: A new executable graph object which can be submitted to a queue. |
void
print_graph(std::string path, bool verbose = false) const; |
Synchronous operation that writes a DOT formatted description of the graph to the provided path. By default, this includes the graph topology, node types, node id, and kernel names. Verbose can be set to true to write more detailed information about each node type such as kernel arguments, copy source, and destination addresses. At the moment DOT format is the only supported format. The name of the output file must therefore match this extension, i.e. "<filename>.dot". Parameters:
Exceptions:
|
std::vector<node> get_nodes() const; |
Returns a list of all the nodes present in the graph in the order that they were added. |
std::vector<node> get_root_nodes() const; |
Returns a list of all nodes in the graph which have no dependencies. |
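The explicit-API member functions above can be combined as in the following sketch (hypothetical kernels; `sycl_ext` aliases the experimental namespace):

```cpp
#include <sycl/sycl.hpp>

namespace sycl_ext = sycl::ext::oneapi::experimental;

int main() {
  sycl::queue q;
  sycl_ext::command_graph g{q.get_context(), q.get_device()};

  // Add two kernel nodes via command-group functions.
  auto n1 = g.add([&](sycl::handler& cgh) {
    cgh.single_task([=] { /* produce */ });
  });
  auto n2 = g.add([&](sycl::handler& cgh) {
    cgh.single_task([=] { /* consume */ });
  });

  // Create an edge so that n1 happens-before n2.
  g.make_edge(n1, n2);

  // Finalize into an executable graph and submit it.
  auto exec = g.finalize();
  q.ext_oneapi_graph(exec).wait();
}
```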
Table 11. Member functions of the command_graph class for graph update.
Member function | Description |
---|---|
void update(node& node); |
Updates an executable graph node that corresponds to Updating these values will not change the structure of the graph. The implementation may perform a blocking wait during this call on any in-flight executions of that same graph if required by the backend. Constraints:
Parameters:
Exceptions:
|
void update(const std::vector<node>& nodes); |
Updates all executable graph nodes that correspond to the nodes contained in
Updating these values will not change the structure of the graph. The implementation may perform a blocking wait during this call on any in-flight executions of that same graph if required by the backend. Constraints:
Parameters:
Exceptions:
|
void
update(const command_graph<graph_state::modifiable>& source); |
Updates all of the nodes in the target graph with parameters from a topologically identical source graph in the modifiable state. The full definition of what constitutes a topologically identical graph can be found in the Whole-Graph Update section. Violating any of these topology requirements results in undefined behaviour. The characteristics in the executable graph which will be updated are detailed in the section on Executable Graph Update. It is not an error to update an executable graph such that all parameters of
nodes in The implementation may perform a blocking wait during this call on any in-flight executions of that same graph if required by the backend. This function may only be called if the graph was created with the Constraints:
Parameters:
Exceptions:
|
Table 12. Member functions of the command_graph class for queue recording.
Member function | Description |
---|---|
void
begin_recording(queue& recordingQueue,
const property_list& propList = {}) |
Synchronously changes the state of Parameters:
Exceptions:
|
void
begin_recording(const std::vector<queue>& recordingQueues,
const property_list& propList = {}) |
Synchronously changes the state of each queue in Parameters:
Exceptions:
|
void end_recording() |
Synchronously finishes recording on all queues that are recording to the graph and sets their state to queue_state::executing. |
void end_recording(queue& recordingQueue) |
Synchronously changes the state of Parameters:
Exceptions:
|
void end_recording(const std::vector<queue>& recordingQueues) |
Synchronously changes the state of each queue in Parameters:
Exceptions:
|
namespace sycl {
namespace ext::oneapi::experimental {
enum class queue_state {
executing,
recording
};
} // namespace ext::oneapi::experimental
// New methods added to the sycl::queue class
using namespace ext::oneapi::experimental;
class queue {
public:
ext::oneapi::experimental::queue_state
ext_oneapi_get_state() const;
ext::oneapi::experimental::command_graph<graph_state::modifiable>
ext_oneapi_get_graph() const;
/* -- graph convenience shortcuts -- */
event ext_oneapi_graph(command_graph<graph_state::executable>& graph);
event ext_oneapi_graph(command_graph<graph_state::executable>& graph,
event depEvent);
event ext_oneapi_graph(command_graph<graph_state::executable>& graph,
const std::vector<event>& depEvents);
};
} // namespace sycl
This extension modifies the SYCL queue class such that state is introduced to queue objects, allowing an instance to be put into a mode where command-groups are recorded to a graph rather than submitted immediately for execution.
Three new member functions are also added to the sycl::queue class in this extension as queue shortcuts for handler::ext_oneapi_graph().
The sycl::queue object can be in either of two states. The default queue_state::executing state is where the queue has its normal semantics of submitted command-groups being immediately scheduled for asynchronous execution. The alternative queue_state::recording state is used for graph construction. Instead of being scheduled for execution, command-groups submitted to the queue are recorded to a graph object as new nodes for each submission. After recording has finished and the queue returns to the executing state, the recorded commands are not executed; they are transparent to any following queue operations. The state of a queue can be queried with queue::ext_oneapi_get_state().
graph LR
Executing -->|Begin Recording| Recording
Recording -->|End Recording| Executing
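These transitions can be observed directly, as in this sketch (`sycl_ext` aliases the experimental namespace):

```cpp
#include <cassert>
#include <sycl/sycl.hpp>

namespace sycl_ext = sycl::ext::oneapi::experimental;

int main() {
  sycl::queue q;
  sycl_ext::command_graph g{q.get_context(), q.get_device()};

  assert(q.ext_oneapi_get_state() == sycl_ext::queue_state::executing);

  g.begin_recording(q);
  assert(q.ext_oneapi_get_state() == sycl_ext::queue_state::recording);

  g.end_recording();
  assert(q.ext_oneapi_get_state() == sycl_ext::queue_state::executing);
}
```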
Submitting a command-group to a queue in the executing state can implicitly change its state to queue_state::recording. This will occur when the command-group depends on an event that has been returned by a queue in the recording state. The change of state happens before the command-group is submitted to the device (i.e. a new graph node will be created for that command-group).

A queue whose state has been set to queue_state::recording using this mechanism will behave as if it had been passed as an argument to command_graph::begin_recording(). In particular, its state will not change again until command_graph::end_recording() is called.

The recording properties of the queue whose event triggered the state change will also be inherited (i.e. any properties passed to the original call of command_graph::begin_recording() will be inherited by the queue whose state is being transitioned).
// q1 state is set to recording.
graph.begin_recording(q1);
// Node is added to the graph by submitting to a recording queue.
auto e1 = q1.single_task(...);
// Since there is a dependency on e1 which was created by a queue being
// recorded, q2 immediately enters record mode, and a new node is created
// with an edge between e1 and e2.
auto e2 = q2.single_task(e1, ...);
// Ends recording on q1 and q2.
graph.end_recording();
There are two properties defined by the core SYCL specification that can be passed to a sycl::queue on construction via the property list parameter. They interact with this extension in the following ways:

- property::queue::in_order - When a queue is created with the in-order property, recording its operations results in a straight-line graph, as each operation has an implicit dependency on the previous operation. However, a graph submitted to an in-order queue will keep its existing structure, such that the complete graph executes in-order with respect to the other command-groups submitted to the queue. The SYCL runtime automatically adds an implicit dependency before and after the graph execution, as if the graph execution is one command-group submitted to the in-order queue.
- property::queue::enable_profiling - This property has no effect on graph recording. When set on the queue a graph is submitted to, however, it allows profiling information to be obtained from the event returned by a graph submission. The executable graph used for this submission must have been created with the enable_profiling property, see Enable-Profiling for more details. As it is not defined how a submitted graph will be split up for scheduling at runtime, the uint64_t timestamp reported from a profiling query on a graph execution event has the following semantics, which may be pessimistic about execution time on device:
  - info::event_profiling::command_submit - Timestamp when the graph is submitted to the queue.
  - info::event_profiling::command_start - Timestamp when the first command-group node begins running.
  - info::event_profiling::command_end - Timestamp when the last command-group node completes execution.
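Under those semantics, a coarse elapsed-time measurement might be obtained as follows (sketch; assumes `q` was created with property::queue::enable_profiling and `profExec` was finalized with the graph enable_profiling property):

```cpp
#include <cstdint>
#include <sycl/sycl.hpp>

namespace sycl_ext = sycl::ext::oneapi::experimental;

// Assumes a profiling-enabled queue and executable graph.
double graphElapsedMs(
    sycl::queue& q,
    sycl_ext::command_graph<sycl_ext::graph_state::executable>& profExec) {
  sycl::event e = q.ext_oneapi_graph(profExec);
  e.wait();

  // First node start to last node end, in nanoseconds; may be
  // pessimistic about actual device execution time.
  uint64_t start =
      e.get_profiling_info<sycl::info::event_profiling::command_start>();
  uint64_t end =
      e.get_profiling_info<sycl::info::event_profiling::command_end>();
  return (end - start) * 1e-6;
}
```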
Table 13. Additional member functions of the sycl::queue class.
Member function | Description |
---|---|
queue_state
queue::ext_oneapi_get_state() const; |
Query the recording state of the queue. Returns: queue_state::executing if the queue is in the default state where commands are scheduled immediately for execution, otherwise queue_state::recording. |
command_graph<graph_state::modifiable>
queue::ext_oneapi_get_graph() const; |
Query the underlying command graph of a queue when recording. Returns: The graph object that the queue is recording commands into. Exceptions:
|
event
queue::ext_oneapi_graph(command_graph<graph_state::executable>& graph) |
Queue shortcut function that is equivalent to submitting a command-group
containing The command status of the event returned will be
The queue should be associated with a device and context that are the same as the device and context used on creation of the graph. |
event
queue::ext_oneapi_graph(command_graph<graph_state::executable>& graph,
event depEvent); |
Queue shortcut function that is equivalent to submitting a command-group
containing The command status of the event returned will be
The queue should be associated with a device and context that are the same as the device and context used on creation of the graph. |
event
queue::ext_oneapi_graph(command_graph<graph_state::executable>& graph,
const std::vector<event>& depEvents); |
Queue shortcut function that is equivalent to submitting a command-group
containing The command status of the event returned will be
The queue should be associated with a device and context that are the same as the device and context used on creation of the graph. |
Table 14. Additional member functions of the sycl::handler class.
Member function | Description |
---|---|
void
handler::ext_oneapi_graph(command_graph<graph_state::executable>& graph) |
Invokes the execution of a graph. Only one instance of Parameters:
Exceptions:
|
template <typename DataT, int Dims, access::mode AccMode, access::target
AccTarget, access::placeholder IsPlaceholder> void
handler::require(ext::oneapi::experimental::dynamic_parameter<
accessor<DataT, Dims, AccMode, AccTarget, IsPlaceholder>>
dynamicParamAcc) |
Requires access to a memory object associated with an accessor contained in a dynamic parameter. Parameters:
Exceptions:
|
template <typename T>
void handler::set_arg(int argIndex,
ext::oneapi::experimental::dynamic_parameter<T> &dynamicParam); |
Sets an argument to a kernel based on the value inside a dynamic parameter, and registers that dynamic parameter with the graph node encapsulating the submission of the command-group that calls this function. Parameters:
Exceptions:
|
The new functions in this extension are thread-safe, the same as member functions of classes in the base SYCL specification. If user code does not perform synchronization between two threads accessing the same queue, there is no strong ordering between events on that queue, and the kernel submissions, recording and finalization will happen in an undefined order.
When one thread ends recording on a queue while another thread is submitting work, which kernels will be part of the subsequent graph is undefined. If user code enforces a total order on the queue events, then the behavior is well-defined, and will match the observable total order.
The returned value from queue::ext_oneapi_get_state() should be considered immediately stale in multi-threaded usage, as another thread could have preemptively changed the state of the queue.
In addition to the destruction semantics provided by the SYCL common reference semantics, when the last copy of a modifiable command_graph is destroyed, recording is ended on any queues that are recording to that graph, equivalent to this->end_recording().

As a result, users don’t need to manually wrap queue recording code in a try / catch block to reset the state of recording queues on an exception back to the executing state. Instead, an uncaught exception destroying the modifiable graph will perform this action, which is useful in RAII pattern usage.
While not disallowed by the SYCL specification, it should be noted that it is not possible to capture arbitrary C++ code which is inside a CGF (Command-Group Function) used to create a graph node. This code will be evaluated once during the call to queue::submit() or command_graph::add(), along with the calls to handler functions, and this will not be reflected on future executions of the graph.

Similarly, any command-group function inside a dynamic_command_group will be evaluated once, in index order, when submitted to the graph using command_graph::add().
Any code like this should be moved to a separate host-task and added to the graph via the recording or explicit APIs in order to be compatible with this extension.
A host task is a native C++ callable, scheduled according to SYCL dependency rules. It is valid to record a host task as part of a graph, though it may lead to sub-optimal graph performance because a host task node may prevent the SYCL runtime from submitting the entire executable command_graph to the device at once.
auto node = graph.add([&](sycl::handler& cgh){
// Host code here is evaluated during the call to add()
cgh.host_task([=](){
// Code here is evaluated as part of executing the command graph node
});
});
When a queue is placed in recording mode via a call to command_graph::begin_recording, some features of the queue are no longer available because the commands are not executed during this mode. The general philosophy is to throw an exception at runtime when a feature is not available, so that there is an obvious indication of failure. The following list describes the behavior that changes during recording mode. Features not listed below behave the same in recording mode as they do in non-recording mode.
For queue submissions that are being recorded to a modifiable command_graph, the only events that can be used as parameters to handler::depends_on(), or as dependent events for queue shortcuts like queue::parallel_for(), are events that have been returned from queue submissions recorded to the same modifiable command_graph.
Other limitations on the events returned from a submission to a queue in the recording state are:
- Calling event::get_info<info::event::command_execution_status>() or event::get_profiling_info() will throw synchronously with error code invalid.
- A host-side wait on the event will throw synchronously with error code invalid.
- Using the event outside of the recording scope will throw synchronously with error code invalid.
A host-side wait on a queue in the recording state is an error and will throw synchronously with error code invalid.
The use of buffers inside a command_graph is restricted unless the user creates the graph with the Assume-Buffer-Outlives-Graph property. Buffer lifetimes are not extended by a command_graph in which they are used, so the user must ensure that their lifetimes exceed that of the command_graph. Attempting to use a buffer in a command_graph without this property will result in a synchronous error being thrown with error code invalid.
There are also restrictions on using a buffer which has been created with a host data pointer in commands recorded to a command_graph. Because of the delayed execution of a command_graph, data may not be copied to the device immediately when commands using these buffers are submitted to the graph; therefore the host data must also outlive the graph to ensure correct behavior.
Because of the delayed execution of a recorded graph, it is not possible to support captured code which relies on the copy-back on destruction behavior of buffers. Typically, applications would rely on this behavior to do work on the host which cannot inherently be captured inside a command graph.
- Thus, when recording to a graph it is an error to submit a command which has an accessor on a buffer which would cause a write-back to happen. Using an incompatible buffer in this case will result in a synchronous error being thrown with error code invalid.
- The copy-back mechanism can be disabled explicitly for buffers with attached host storage using either buffer::set_final_data(nullptr) or buffer::set_write_back(false).
- It is also an error to create a host accessor to a buffer which is used in commands which are currently being recorded to a command graph. Attempting to construct a host accessor to an incompatible buffer will result in a synchronous error being thrown with error code invalid.
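A sketch of disabling copy-back so such a buffer can legally be used in recorded commands (hypothetical data; set_final_data(nullptr) would work equally well):

```cpp
#include <vector>
#include <sycl/sycl.hpp>

void makeGraphCompatibleBuffer() {
  std::vector<float> hostData(1024, 1.0f);
  sycl::buffer<float> buf{hostData.data(), sycl::range{1024}};

  // Disable copy-back on destruction, which cannot be captured by a
  // graph's deferred execution model.
  buf.set_write_back(false);
  // Equivalently: buf.set_final_data(nullptr);
}
```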
When a queue is in recording mode asynchronous exceptions will not be generated, as no device execution is occurring. Synchronous errors specified as being thrown in the default queue executing state, will still be thrown when a queue is in the recording state. Queue query methods operate as usual in recording mode, as opposed to throwing.
This section defines the interaction of sycl_ext_oneapi_graph with other extensions.

ext_codeplay_enqueue_native_command, defined in sycl_ext_codeplay_enqueue_native_command, cannot be used in graph nodes. A synchronous exception will be thrown with error code invalid if a user tries to add them to a graph.

Removing this restriction is something we may look at for future revisions of sycl_ext_oneapi_graph.
The compute index queue property defined by sycl_ext_intel_queue_index is ignored during queue recording.
Using this information is something we may look at for future revisions of sycl_ext_oneapi_graph.
The new handler methods, and queue shortcuts, defined by sycl_ext_oneapi_bindless_images cannot be used in graph nodes. A synchronous exception will be thrown with error code invalid if a user tries to add them to a graph.

Removing this restriction is something we may look at for future revisions of sycl_ext_oneapi_graph.
The new handler methods, and queue shortcuts, defined by sycl_ext_oneapi_device_global cannot be used in graph nodes. A synchronous exception will be thrown with error code invalid if a user tries to add them to a graph.

Removing this restriction is something we may look at for future revisions of sycl_ext_oneapi_graph.
When recording a sycl::queue which has been created with the ext::oneapi::property::queue::discard_events property, it is invalid to use the events returned from queue submissions to create graph edges. This is in keeping with the sycl_ext_oneapi_discard_queue_events specification wording that handler::depends_on() throws an exception when passed an invalid event.
The new handler methods, and queue shortcuts, defined by sycl_ext_oneapi_enqueue_barrier can only be used in graph nodes created using the Record & Replay API, as barriers rely on events to enforce dependencies.
A synchronous exception will be thrown with error code invalid if a user tries to add a barrier command to a graph using the explicit API. Empty nodes created with the node::depends_on_all_leaves property can be used instead of barriers when a user is building a graph with the explicit API.
The semantics of barriers are defined in sycl_ext_oneapi_enqueue_barrier for a single command-queue, and correlate as follows to a graph that may contain nodes that are recorded from multiple queues and/or added by the explicit API:
- Barriers with an empty wait list parameter will only depend on the leaf nodes that were added to the graph from the queue the barrier command is being recorded from.
- The only commands which have an implicit dependency on the barrier command are those recorded from the same queue the barrier command was submitted to.
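With the explicit API, the barrier-like pattern can be expressed with an empty node, as in this sketch (assumes a modifiable graph `g`; the property is assumed to live under property::node in the experimental namespace):

```cpp
#include <sycl/sycl.hpp>

namespace sycl_ext = sycl::ext::oneapi::experimental;

// Assumes `g` is a populated modifiable command_graph.
void addBarrierLikeNode(
    sycl_ext::command_graph<sycl_ext::graph_state::modifiable>& g) {
  // An empty node depending on all current leaves behaves like a
  // barrier for nodes subsequently connected to it by edges.
  auto barrierNode =
      g.add({sycl_ext::property::node::depends_on_all_leaves{}});
}
```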
The command submission functions defined in sycl_ext_oneapi_enqueue_functions can be used to add nodes to a graph when creating a graph from queue recording. New methods are also defined that enable submitting an executable graph, e.g. directly to a queue without returning an event.
Free function kernels, defined in sycl_ext_oneapi_free_function_kernels, can be used with SYCL Graphs.
The kernels loaded using sycl_ext_oneapi_kernel_compiler_spirv behave as normal when used in graph nodes.
The new handler methods, and queue shortcuts, defined by sycl_ext_oneapi_kernel_properties can be used in graph nodes in the same way as they are used in normal queue submission.
Allocating local memory inside a graph kernel node with sycl::ext::oneapi::group_local_memory() or sycl::ext::oneapi::group_local_memory_for_overwrite() is supported. These methods are defined by sycl_ext_oneapi_local_memory.
The new handler methods, and queue shortcuts, defined by sycl_ext_oneapi_memcpy2d cannot be used in graph nodes. A synchronous exception will be thrown with error code invalid if a user tries to add them to a graph.

Removing this restriction is something we may look at for future revisions of sycl_ext_oneapi_graph.
The new sycl::queue::ext_oneapi_prod() method added by sycl_ext_oneapi_prod behaves as normal during queue recording and is not captured to the graph. Recorded commands are not counted as submitted for the purposes of its operation.
The queue::ext_oneapi_empty() query defined by the sycl_ext_oneapi_queue_empty extension behaves as normal during queue recording and is not captured to the graph. Recorded commands are not counted as submitted for the purposes of this query.
The queue priority property defined by sycl_ext_oneapi_queue_priority is ignored during queue recording.
Using the work_group_memory object defined in sycl_ext_oneapi_work_group_memory inside graph kernel nodes is supported.
The new property defined by sycl_ext_oneapi_work_group_scratch_memory cannot be used in graph nodes. A synchronous exception will be thrown with error code invalid if a user tries to add it to a graph.

Removing this restriction is something we may look at for future revisions of sycl_ext_oneapi_graph.
Detailed code examples and usage guidelines are provided in the SYCL Graph Usage Guide.
This section contains both features of the specification which have been fully developed, but are not yet implemented, as well as features which are still in development.
Fully developed features will be moved to the main specification once they have been implemented.
The lifetime of any buffer recorded as part of a submission to a command graph will be extended in keeping with the common reference semantics and buffer synchronization rules in the SYCL specification. It will be extended either for the lifetime of the graph (including both modifiable graphs and the executable graphs created from them) or until the buffer is no longer required by the graph (such as after being replaced through executable graph update).
If a buffer created with a host data pointer is recorded as part of a submission to a command graph, the lifetime of that host data will also be extended by taking a copy of that data inside the buffer. To illustrate, consider the following example:
void foo(queue q /* queue in recording mode */ ) {
float data[NUM];
buffer buf{data, range{NUM}};
q.submit([&](handler &cgh) {
accessor acc{buf, cgh, read_only};
cgh.single_task([] {
// use "acc"
});
});
// "data" goes out of scope
}
In this example, the implementation extends the lifetime of the buffer because it is used in the recorded graph. Because the buffer uses the host memory data, the implementation also makes an internal copy of that host data. As illustrated above, that host memory might go out of scope before the recorded graph goes out of scope, or before the data has been copied to the device.
The default behavior is to always copy the host data in a case like this, but this is not necessary if the user knows that the lifetime of the host data outlives the lifetime of the recorded graph. If the user knows this is the case, they may use the graph::assume_data_outlives_buffer property to avoid the internal copy. Passing the property to begin_recording() will prevent host copies only for commands recorded before end_recording() is called for a given queue. Passing the property to the command_graph constructor will prevent host copies for all commands recorded to the graph.
The implementation guarantees that the host memory will not be copied internally if all the commands accessing this buffer use access_mode::write or the no_init property, because the host memory is not needed in these cases. Note, however, that these cases require the application to disable copy-back as described in Buffer Limitations.
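As a sketch of how this property might eventually be passed (this is a feature described above as not yet implemented, and the exact property namespace is an assumption):

```cpp
#include <sycl/sycl.hpp>

namespace sycl_ext = sycl::ext::oneapi::experimental;

// NOTE: future-feature sketch, not yet implemented in DPC++;
// the property spelling below is an assumption.
void futureSketch(sycl::queue& q) {
  sycl_ext::command_graph g{
      q.get_context(), q.get_device(),
      {sycl_ext::property::graph::assume_data_outlives_buffer{}}};
}
```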
There is no provided interface for users to define a USM allocation/free operation belonging to the scope of the graph. It would be error prone and non-performant to allocate or free memory as a node executed during graph submission. Instead, such a memory allocation API needs to provide a way to return a pointer which won’t be valid until the allocation is made on graph finalization, as allocating at finalization is the only way to benefit from the known graph scope for optimal memory allocation, and even optimize to eliminate some allocations entirely.
Such a deferred allocation strategy presents challenges, however, and as a result we recommend instead that prior to graph construction users perform core SYCL USM allocations to be used in the graph submission. Before coming to this recommendation we considered the following explicit graph building interfaces for adding a memory allocation owned by the graph:
- Allocation function returning a reference to the raw pointer, i.e. void*&, which will be instantiated on graph finalization with the location of the allocated USM memory.
- Allocation function returning a handle to the allocation. Applications use the handle in node command-group functions to access memory when allocated.
- Allocation function returning a pointer to a virtual allocation, only backed with an actual allocation when the graph is finalized or submitted.
Design 1) has the drawback of forcing users to keep the user pointer variable alive so that the reference is valid, which is unintuitive and is likely to result in bugs.
Design 2) introduces a handle object which has the advantages of being a less error prone way to provide the pointer to the deferred allocation. However, it requires kernel changes and introduces an overhead above the raw pointers that are the advantage of USM.
Design 3) needs specific backend support for deferred allocation.
A modifiable state command_graph contains nodes targeting specific devices, rather than being a device agnostic representation only tied to devices on finalization. This allows the implementation to process nodes which require device information when the command group function is evaluated. For example, a SYCL reduction implementation may desire the work-group/sub-group size, which is normally gathered by the runtime from the device associated with the queue.
This design also enables the future capability for a user to compose a graph with nodes targeting different devices, allowing the benefits of defining an execution graph ahead of submission to be extended to multi-device platforms. Without this capability a user currently has to submit individual single-device graphs and use events for dependencies, which is a usage model this extension is aiming to optimize. Automatic load balancing of commands across devices is not a problem this extension currently aims to solve, it is the responsibility of the user to decide the device each command will be processed for, not the SYCL runtime.
Support updating arguments to types of nodes other than kernel execution commands.
UNRESOLVED Should be added for at least memory copy nodes and host-tasks. However, the full scope of support needs to be designed and implemented.
It has been suggested that updatable graphs could remove the dependencies generated between graphs upon resubmission while a previous submission of the same graph is still executing. However, this requires further design discussion to ensure this is desired and makes sense to users.
UNRESOLVED Needs more discussion.
Currently the only way to update command-groups in a graph is to use the Explicit API. There is a limitation in some backends that requires all the command-groups used for updating to be specified before the graph is finalized. This restriction makes it hard to implement the Record & Replay API in a performant manner.
UNRESOLVED Needs more discussion.
Allow an executable graph to contain nodes targeting different devices.
UNRESOLVED: Trending "yes". This feature is something that we are considering introducing into the extension in later revisions. It has been planned for to the extent that the definition of a graph node is device specific.
We would like to provide an API that allows graph scope memory to be allocated and used in nodes, such that optimizations can be done on the allocation. No mechanism is currently provided, but see the section on Memory Allocation Nodes for some designs being considered.
UNRESOLVED: Trending "yes". Design is under consideration.
Explicit API could support device-agnostic graphs that can be submitted through queues to a particular device. This issue is related to multi-device graphs.
UNRESOLVED: Trending "no". Because of current runtime limitations this can’t be implemented with a reasonable effort.
Current proposal contains extensive extensions to existing API in SYCL. Can we achieve something similar with user control over the flush behavior of a queue and providing a handler that can be replayed?
UNRESOLVED: Trending "no". Needs reconsideration of the design and possible restrictions.
For specific workloads it could be beneficial to provide hints to the runtime on how to schedule a command graph onto a device. This information could affect the scheduling policy, such as breadth-first or depth-first, or a combination with a block size.
UNRESOLVED: Trending "yes". A new property could be added to the finalize call either extending the basic command graph proposal or layered as a separate extension proposal.
The following features are not yet supported, and an exception will be thrown if they are used in application code.

- Using reductions in a graph node.
- Using SYCL streams in a graph node.
- Synchronization between multiple executions of the same command-buffer must be handled on the host for the Level Zero backend, which may involve extra latency for subsequent submissions.