Copyright © 2023 Intel Corporation. All rights reserved.
Khronos® is a registered trademark and SYCL™ and SPIR™ are trademarks of The Khronos Group Inc. OpenCL™ is a trademark of Apple Inc. used by permission by Khronos.
To report problems with this extension, please open a new issue at:
This extension is written against the SYCL 2020 revision 7 specification. All references below to the "core SYCL specification" or to section numbers in the SYCL specification refer to that revision.
This extension also depends on the following other SYCL extensions:
This is an experimental extension specification, intended to provide early access to features and gather community feedback. Interfaces defined in this specification are implemented in DPC, but they are not finalized and may change incompatibly in future versions of DPC without prior notice. Shipping software products should not rely on APIs defined in this specification.
Many devices targeted by SYCL support software prefetch operations, which act as asynchronous memory reads intended to populate cache(s). Prefetches are generally used to hide memory latency, and their deployment may be a critical component for software tuning on some hardware.
The multi_ptr::prefetch
function provided by SYCL 2020 is insufficient to
cover many real use-cases — it does not acknowledge the existence of a cache
hierarchy, and does not provide a mechanism for cooperative (i.e. group)
prefetches.
This proposal addresses these shortcomings by introducing a set of free functions that groups of work-items can use to prefetch data into specific levels of cache (controlled via compile-time properties).
This extension provides a feature-test macro as described in the core SYCL
specification. An implementation supporting this extension must predefine the
macro SYCL_EXT_ONEAPI_PREFETCH
to one of the values defined in the table
below. Applications can test for the existence of this macro to determine if
the implementation supports this feature, or applications can test the macro’s
value to determine which of the extension’s features the implementation
supports.
Value | Description |
---|---|
1 |
The APIs of this experimental extension are not versioned, so the feature-test macro always has this value. |
The level of cache targeted by a prefetch is specified using compile-time
properties. This extension defines hints for four levels of cache,
corresponding to the four levels of cache currently defined in SYCL (e.g.
as used by info::partition_affinity_domain
).
Note
|
Not all devices targeted by SYCL will have four levels of cache. Some of these hints may have no meaning on some devices. However, implementations are encouraged to map these constants to the closest level of cache available. |
When multiple cache levels are specified, the lowest level takes precedence (e.g. a request to prefetch into L1 and L4 is treated as a request to prefetch into L1). When no cache levels are specified, the default behavior is to prefetch into the lowest level cache (i.e. L1).
Note
|
Future hints may alter the default behavior of prefetches with respect to cache levels. Any such alterations are expected to be documented in the definition of those new hints. |
namespace sycl::ext::oneapi::experimental {
enum class cache_level {
L1,
L2,
L3,
L4,
};
struct nontemporal;
struct prefetch_hint_key {
template <cache_level Level, typename Hint>
using value_t = property_value<prefetch_hint_key,
std::integral_constant<cache_level, Level>,
Hint>;
};
template <cache_level Level, typename Hint>
inline constexpr prefetch_hint_key::value_t<Level, Hint> prefetch_hint;
inline constexpr prefetch_hint_key::value_t<cache_level::L1, void> prefetch_hint_L1;
inline constexpr prefetch_hint_key::value_t<cache_level::L2, void> prefetch_hint_L2;
inline constexpr prefetch_hint_key::value_t<cache_level::L3, void> prefetch_hint_L3;
inline constexpr prefetch_hint_key::value_t<cache_level::L4, void> prefetch_hint_L4;
inline constexpr prefetch_hint_key::value_t<cache_level::L1, nontemporal> prefetch_hint_L1_nt;
inline constexpr prefetch_hint_key::value_t<cache_level::L2, nontemporal> prefetch_hint_L2_nt;
inline constexpr prefetch_hint_key::value_t<cache_level::L3, nontemporal> prefetch_hint_L3_nt;
inline constexpr prefetch_hint_key::value_t<cache_level::L4, nontemporal> prefetch_hint_L4_nt;
} // namespace sycl::ext::oneapi::experimental
If the nontemporal
type is used as the value of the Hint
template
parameter, this acts as a hint that the data being prefetched will not be
reused.
Note
|
The implementation of |
The functions in this section allow individual work-items to prefetch data.
namespace sycl::ext::oneapi::experimental {
template <typename Properties = empty_properties_t>
void prefetch(void* ptr, Properties properties = {});
template <typename Properties = empty_properties_t>
void prefetch(void* ptr, size_t bytes, Properties properties = {});
template <typename T, typename Properties = empty_properties_t>
void prefetch(T* ptr, Properties properties = {});
template <typename T, typename Properties = empty_properties_t>
void prefetch(T* ptr, size_t count, Properties properties = {});
// Only available if AddressSpace == global_space || AddressSpace == generic_space
template <access::address_space AddressSpace, access::decorated IsDecorated,
typename Properties = empty_properties_t>
void prefetch(multi_ptr<void, AddressSpace, IsDecorated> ptr, Properties properties = {});
// Only available if AddressSpace == global_space || AddressSpace == generic_space
template <access::address_space AddressSpace, access::decorated IsDecorated,
typename Properties = empty_properties_t>
void prefetch(multi_ptr<void, AddressSpace, IsDecorated> ptr, size_t bytes, Properties properties = {});
// Only available if AddressSpace == global_space || AddressSpace == generic_space
template <typename T, access::address_space AddressSpace, access::decorated IsDecorated,
typename Properties = empty_properties_t>
void prefetch(multi_ptr<T, AddressSpace, IsDecorated> ptr, Properties properties = {});
// Only available if AddressSpace == global_space || AddressSpace == generic_space
template <typename T, access::address_space AddressSpace, access::decorated IsDecorated,
typename Properties = empty_properties_t>
void prefetch(multi_ptr<T, AddressSpace, IsDecorated> ptr, size_t count, Properties properties = {});
// Only available if Dimensions > 0 && (AccessMode == read || AccessMode == read_write)
template <typename DataT, int Dimensions,
access_mode AccessMode, access::placeholder IsPlaceholder,
typename Properties = empty_properties_t>
void prefetch(accessor<DataT, Dimensions, AccessMode, target::device, IsPlaceholder> acc,
id<Dimensions> offset, Properties properties = {});
// Only available if Dimensions > 0 && (AccessMode == read || AccessMode == read_write)
template <typename DataT, int Dimensions,
access_mode AccessMode, access::placeholder IsPlaceholder,
typename Properties = empty_properties_t>
void prefetch(accessor<DataT, Dimensions, AccessMode, target::device, IsPlaceholder> acc,
id<Dimensions> offset, size_t count, Properties properties = {});
} // namespace sycl::ext::oneapi::experimental
template <typename Properties = empty_properties_t>
void prefetch(void* ptr, Properties properties = {});
Constraints: Available only if is_property_list_v<std::decay_t<Properties>>
is true
.
Preconditions: ptr
must point to an object in global memory.
Effects: Acts as a hint to the implementation that the cacheline containing
the byte at ptr
should be prefetched into the levels of cache specified by
properties
.
template <typename Properties = empty_properties_t>
void prefetch(void* ptr, size_t bytes, Properties properties = {});
Constraints: Available only if is_property_list_v<std::decay_t<Properties>>
is true
.
Preconditions: ptr
must point to an object in global memory.
Effects: Acts as a hint to the implementation that the cachelines containing
the bytes
bytes starting at ptr
should be prefetched into the levels of
cache specified by properties
.
template <typename T, typename Properties = empty_properties_t>
void prefetch(T* ptr, Properties properties = {});
Constraints: Available only if is_property_list_v<std::decay_t<Properties>>
is true
.
Preconditions: ptr
must point to an object in global memory.
Effects: Equivalent to prefetch((void*) ptr, sizeof(T), properties)
.
template <typename T, typename Properties = empty_properties_t>
void prefetch(T* ptr, size_t count, Properties properties = {});
Constraints: Available only if is_property_list_v<std::decay_t<Properties>>
is true
.
Preconditions: ptr
must point to an object in global memory.
Effects: Equivalent to prefetch((void*) ptr, count * sizeof(T), properties)
.
template <access::address_space AddressSpace, access::decorated IsDecorated,
typename Properties = empty_properties_t>
void prefetch(multi_ptr<void, AddressSpace, IsDecorated> ptr, Properties properties = {});
Constraints: Available only if AddressSpace == global_space || AddressSpace
== generic_space
is true
and is_property_list_v<std::decay_t<Properties>>
is true
.
Preconditions: ptr
must point to an object in global memory.
Effects: Equivalent to prefetch(ptr.get(), properties)
.
template <access::address_space AddressSpace, access::decorated IsDecorated,
typename Properties = empty_properties_t>
void prefetch(multi_ptr<void, AddressSpace, IsDecorated> ptr, size_t bytes, Properties properties = {});
Constraints: Available only if AddressSpace == global_space || AddressSpace
== generic_space
is true
and is_property_list_v<std::decay_t<Properties>>
is true
.
Preconditions: ptr
must point to an object in global memory.
Effects: Equivalent to prefetch(ptr.get(), bytes, properties)
.
template <typename T, access::address_space AddressSpace, access::decorated IsDecorated,
typename Properties = empty_properties_t>
void prefetch(multi_ptr<T, AddressSpace, IsDecorated> ptr, Properties properties = {});
Constraints: Available only if AddressSpace == global_space || AddressSpace
== generic_space
is true
and is_property_list_v<std::decay_t<Properties>>
is true
.
Preconditions: ptr
must point to an object in global memory.
Effects: Equivalent to prefetch(ptr.get(), properties)
.
template <typename T, access::address_space AddressSpace, access::decorated IsDecorated,
typename Properties = empty_properties_t>
void prefetch(multi_ptr<T, AddressSpace, IsDecorated> ptr, size_t count, Properties properties = {});
Constraints: Available only if AddressSpace == global_space || AddressSpace
== generic_space
is true
and is_property_list_v<std::decay_t<Properties>>
is true
.
Preconditions: ptr
must point to an object in global memory.
Effects: Equivalent to prefetch(ptr.get(), count, properties)
.
template <typename DataT, int Dimensions,
access_mode AccessMode, access::placeholder IsPlaceholder,
typename Properties = empty_properties_t>
void prefetch(accessor<DataT, Dimensions, AccessMode, target::device, IsPlaceholder> acc,
id<Dimensions> offset, Properties properties = {});
Constraints: Available only if Dimensions > 0 && (AccessMode == read ||
AccessMode == read_write)
is true
and
is_property_list_v<std::decay_t<Properties>>
is true
.
Effects: Equivalent to prefetch((void*) &acc[offset], sizeof(DataT),
properties)
.
template <typename DataT, int Dimensions,
access_mode AccessMode, access::placeholder IsPlaceholder,
typename Properties = empty_properties_t>
void prefetch(accessor<DataT, Dimensions, AccessMode, target::device, IsPlaceholder> acc,
size_t offset, size_t count, Properties properties = {});
Constraints: Available only if Dimensions > 0 && (AccessMode == read ||
AccessMode == read_write)
is true
and
is_property_list_v<std::decay_t<Properties>>
is true
.
Effects: Equivalent to prefetch((void*) &acc[offset], count * sizeof(DataT),
properties)
.
namespace syclex = sycl::ext::oneapi::experimental;
q.parallel_for(N, [=](auto i) {
for (int j = 0; j < M; ++j) {
syclex::prefetch(&data[j + 10], syclex::properties{syclex::prefetch_hint_L1});
syclex::prefetch(&data[j + 100], syclex::properties{syclex::prefetch_hint_L3});
foo(data[j]);
}
});
namespace syclex = sycl::ext::oneapi::experimental;
q.parallel_for(N, [=](auto i) {
for (int j = 0; j < M; ++j) {
syclex::prefetch(&data[j + 10], syclex::properties{syclex::prefetch_hint<syclex::cache_level::L1, syclex::nontemporal>});
foo(data[j]);
}
});
The functions in this section allow groups of work-items to cooperatively prefetch the same data. These functions are all group functions, as defined in Section 4.17.3 of the SYCL specification.
Note
|
Although calling |
namespace sycl::ext::oneapi::experimental {
template <typename Group, typename Properties = empty_properties_t>
void joint_prefetch(Group g, void* ptr, Properties properties = {});
template <typename Group, typename Properties = empty_properties_t>
void joint_prefetch(Group g, void* ptr, size_t bytes, Properties properties = {});
template <typename Group, typename T, typename Properties = empty_properties_t>
void joint_prefetch(Group g, T* ptr, Properties properties = {});
template <typename Group, typename T, typename Properties = empty_properties_t>
void joint_prefetch(Group g, T* ptr, size_t count, Properties properties = {});
// Only available if AddressSpace == global_space || AddressSpace == generic_space
template <typename Group, access::address_space AddressSpace, access::decorated IsDecorated,
typename Properties = empty_properties_t>
void joint_prefetch(Group g, multi_ptr<void, AddressSpace, IsDecorated> ptr,
Properties properties = {});
// Only available if AddressSpace == global_space || AddressSpace == generic_space
template <typename Group, access::address_space AddressSpace, access::decorated IsDecorated,
typename Properties = empty_properties_t>
void joint_prefetch(Group g, multi_ptr<void, AddressSpace, IsDecorated> ptr, size_t bytes,
Properties properties = {});
// Only available if AddressSpace == global_space || AddressSpace == generic_space
template <typename Group, typename T,
access::address_space AddressSpace, access::decorated IsDecorated,
typename Properties = empty_properties_t>
void joint_prefetch(Group g, multi_ptr<T, AddressSpace, IsDecorated> ptr,
Properties properties = {});
// Only available if AddressSpace == global_space || AddressSpace == generic_space
template <typename Group, typename T,
access::address_space AddressSpace, access::decorated IsDecorated,
typename Properties = empty_properties_t>
void joint_prefetch(Group g, multi_ptr<T, AddressSpace, IsDecorated> ptr, size_t count,
Properties properties = {});
// Only available if Dimensions > 0 && (AccessMode == read || AccessMode == read_write)
template <typename Group, typename DataT, int Dimensions,
access_mode AccessMode, access::placeholder IsPlaceholder,
typename Properties = empty_properties_t>
void joint_prefetch(Group g, accessor<DataT, Dimensions, AccessMode, target::device, IsPlaceholder> acc,
size_t offset, Properties properties = {});
// Only available if Dimensions > 0 && (AccessMode == read || AccessMode == read_write)
template <typename Group, typename DataT, int Dimensions,
access_mode AccessMode, access::placeholder IsPlaceholder,
typename Properties = empty_properties_t>
void joint_prefetch(Group g, accessor<DataT, Dimensions, AccessMode, target::device, IsPlaceholder> acc,
size_t offset, size_t count, Properties properties = {});
} // namespace sycl::ext::oneapi::experimental
template <typename Group, typename Properties = empty_properties_t>
void joint_prefetch(Group g, void* ptr, Properties properties = {});
Constraints: Available only if sycl::is_group_v<std::decay_t<Group>>
is
true
and is_property_list_v<std::decay_t<Properties>>
is true
.
Preconditions: ptr
must point to an object in global memory. ptr
and
properties
must be the same for all work-items in group g
.
Effects: Acts as a hint to the implementation that the cacheline containing
the byte at ptr
should be prefetched into the levels of cache specified by
properties
.
template <typename Group, typename Properties = empty_properties_t>
void joint_prefetch(Group g, void* ptr, size_t bytes, Properties properties = {});
Constraints: Available only if sycl::is_group_v<std::decay_t<Group>>
is
true
and is_property_list_v<std::decay_t<Properties>>
is true
.
Preconditions: ptr
must point to an object in global memory. ptr
, bytes
and properties
must be the same for all work-items in group g
.
Effects: Acts as a hint to the implementation that the cachelines containing
the bytes
bytes starting at ptr
should be prefetched into the levels of
cache specified by properties
.
template <typename Group, typename T, typename Properties = empty_properties_t>
void joint_prefetch(Group g, T* ptr, Properties properties = {});
Constraints: Available only if sycl::is_group_v<std::decay_t<Group>>
is
true
and is_property_list_v<std::decay_t<Properties>>
is true
.
Preconditions: ptr
must point to an object in global memory. ptr
and
properties
must be the same for all work-items in group g
.
Effects: Equivalent to joint_prefetch(g, (void*) ptr, sizeof(T),
properties)
.
template <typename Group, typename T, typename Properties = empty_properties_t>
void joint_prefetch(Group g, T* ptr, size_t count, Properties properties = {});
Constraints: Available only if sycl::is_group_v<std::decay_t<Group>>
is
true
and is_property_list_v<std::decay_t<Properties>>
is true
.
Preconditions: ptr
must point to an object in global memory. ptr
, count
and properties
must be the same for all work-items in group g
.
Effects: Equivalent to joint_prefetch(g, (void*) ptr, count * sizeof(T),
properties)
.
template <typename Group, access::address_space AddressSpace, access::decorated IsDecorated,
typename Properties = empty_properties_t>
void joint_prefetch(Group g, multi_ptr<void, AddressSpace, IsDecorated> ptr,
Properties properties = {});
Constraints: Available only if sycl::is_group_v<std::decay_t<Group>>
is
true
and AddressSpace == global_space || AddressSpace == generic_space
is
true
and is_property_list_v<std::decay_t<Properties>>
is true
.
Preconditions: ptr
must point to an object in global memory. ptr
and
properties
must be the same for all work-items in group g
.
Effects: Equivalent to joint_prefetch(g, ptr.get(), properties)
.
template <typename Group, access::address_space AddressSpace, access::decorated IsDecorated,
typename Properties = empty_properties_t>
void joint_prefetch(Group g, multi_ptr<void, AddressSpace, IsDecorated> ptr, size_t bytes,
Properties properties = {});
Constraints: Available only if sycl::is_group_v<std::decay_t<Group>>
is
true
and AddressSpace == global_space || AddressSpace == generic_space
is
true
and is_property_list_v<std::decay_t<Properties>>
is true
.
Preconditions: ptr
must point to an object in global memory. ptr
, bytes
and properties
must be the same for all work-items in group g
.
Effects: Equivalent to joint_prefetch(g, ptr.get(), bytes, properties)
.
template <typename Group, typename T,
access::address_space AddressSpace, access::decorated IsDecorated,
typename Properties = empty_properties_t>
void joint_prefetch(Group g, multi_ptr<T, AddressSpace, IsDecorated> ptr,
Properties properties = {});
Constraints: Available only if sycl::is_group_v<std::decay_t<Group>>
is
true
and AddressSpace == global_space || AddressSpace == generic_space
is
true
and is_property_list_v<std::decay_t<Properties>>
is true
.
Preconditions: ptr
must point to an object in global memory. ptr
and
properties
must be the same for all work-items in group g
.
Effects: Equivalent to joint_prefetch(g, ptr.get(), properties)
.
template <typename Group, typename T,
access::address_space AddressSpace, access::decorated IsDecorated,
typename Properties = empty_properties_t>
void joint_prefetch(Group g, multi_ptr<T, AddressSpace, IsDecorated> ptr, size_t count,
Properties properties = {});
Constraints: Available only if sycl::is_group_v<std::decay_t<Group>>
is
true
and is_property_list_v<std::decay_t<Properties>>
is true
.
Preconditions: ptr
must point to an object in global memory. ptr
, count
and properties
must be the same for all work-items in group g
.
Effects: Equivalent to joint_prefetch(g, ptr.get(), count, properties)
.
template <typename Group, typename DataT, int Dimensions,
access_mode AccessMode, access::placeholder IsPlaceholder,
typename Properties = empty_properties_t>
void joint_prefetch(Group g, accessor<DataT, Dimensions, AccessMode, target::device, IsPlaceholder> acc,
size_t offset, Properties properties = {});
Constraints: Available only if sycl::is_group_v<std::decay_t<Group>>
is
true
and Dimensions > 0 && (AccessMode == read || AccessMode ==
read_write)
is true
and is_property_list_v<std::decay_t<Properties>>
is
true
.
Preconditions: acc
, offset
and properties
must be the same for all
work-items in group g
.
Effects: Equivalent to joint_prefetch(g, (void*) &acc[offset],
sizeof(DataT), properties)
.
template <typename Group, typename DataT, int Dimensions,
access_mode AccessMode, access::placeholder IsPlaceholder,
typename Properties = empty_properties_t>
void joint_prefetch(Group g, accessor<DataT, Dimensions, AccessMode, target::device, IsPlaceholder> acc,
size_t offset, size_t count, Properties properties = {});
Constraints: Available only if sycl::is_group_v<std::decay_t<Group>>
is
true
and Dimensions > 0 && (AccessMode == read || AccessMode ==
read_write)
is true
and is_property_list_v<std::decay_t<Properties>>
is
true
.
Preconditions: acc
, offset
, count
and properties
must be the same for
all work-items in group g
.
Effects: Equivalent to joint_prefetch(g, (void*) &acc[offset], count *
sizeof(DataT), properties)
.
namespace syclex = sycl::ext::oneapi::experimental;
q.parallel_for(sycl::nd_range{N, L}, [=](sycl::nd_item<1> it) {
auto sg = it.get_sub_group();
for (int j = sg.get_local_id(); j < M; j += sg.get_max_local_range()) {
syclex::joint_prefetch(sg, &data[j + 100], sg.get_max_local_range(), syclex::properties{syclex::prefetch_hint_L3});
foo(sg, data[j]);
}
});
-
Which level of cache should be targeted for an empty property list?
UNRESOLVED: Defaulting to the lowest level of cache may be expected by some users, who would like the prefetch to place data as close to the compute units as possible. Defaulting to the highest level of cache may be expected by other users, since that level typically has the highest capacity and may contain data from all other levels — naive usage of prefetches in this case would be less likely to cause thrashing across multiple levels of cache.
The current draft of this extension sets the default as the lowest level, consistent with the behavior of the
prefetch
pragmas proposed for OpenMP. Developers who want to prefetch data into specific levels of cache can simply override this behavior, and can prefetch into the last level of cache by specifingprefetch_hint_L4
. -
How should multi-dimensional prefetches be handled?
UNRESOLVED: Some developers think of multi-dimensional accessors in terms of the underlying (linearized) memory, and would expect to describe prefetches in terms of scalar counts. Other developers might expect prefetches using multi-dimensional accessors to accept counts described using
range
objects.