@@ -71,61 +71,103 @@ Hyesun Hong,
71
71
72
72
* PIM/PNM technology enables computation directly on memory
73
73
* Prevents data movement improving performance and reducing consumption
74
- * PIM operates directly on memory banks by reading and storing on rows and columns
74
+ * Operates directly on memory banks by reading and storing on rows and columns
75
75
* Aquabolt-XL is the first demonstrator
76
76
* Can be dropped in on any memory controller
77
77
* CXL-PNM is the CXL variant for PNM, can work with multiple PIM
78
78
79
79
SYCL Extension for PIM/PNM
80
- * Goals
81
- * Seamlessly integrate PIM/PNM operation into SYCL
82
- * Allow combination of xGPU and PIM/PNM in one device kernel
83
- * Not specific to one hardware
84
- * Design
85
- * Vector operation seem like natural fit, but no convergence guarantee and vector size explicit
86
- * Model as special function unit
87
- * Aligns with trends to model special functional units inside accelerators
88
- * Compiler automatic mapping often not possible
89
- * joint_matrix
90
- * Group functions
91
- * Easy to use
92
- * Can easily be combined with device code
93
- * Give necessary convergence guarantees
94
- * Recap of SYCL work-item, work-group and group functions
95
- * Group functions must be encountered in converged control flow
80
+ * Work in collaboration with Codeplay Software team
81
+ * Goals
82
+
83
+ * Seamlessly integrate PIM/PNM operation into SYCL
84
+ * Allow combination of xGPU and PIM/PNM in one device kernel
85
+ * Not specific to one hardware
86
+
87
+ * Design
88
+
89
+ * Vector operations seem like a natural fit
90
+ * no convergence guarantee and vector size explicit
91
+
92
+ * Model as special function unit
93
+
94
+ * Aligns with trends to model special functional units inside accelerators
95
+ * Compiler automatic mapping often not possible
96
+ * joint_matrix-like interface
97
+
98
+
99
+ * Group functions
100
+
101
+ * Easy to use
102
+ * Can easily be combined with device code
103
+ * Give necessary convergence guarantees
104
+
105
+
106
+ * Recap of SYCL work-item, work-group and group functions
107
+
108
+ * Group functions must be encountered in converged control flow
109
+
96
110
* Extension
97
- * Extended group functions with additional overload of joint_reduce and new joint_transform and joint_inner_product
98
- * Block size as template parameter, number of blocks as runtime parameter -> allows calculation of number of elements to process
111
+
112
+ * Extended group functions with additional overload of joint_reduce
113
+ * and new joint_transform and joint_inner_product
114
+ * Block size as template parameter, number of blocks as runtime parameter
115
+ * allows calculation of number of elements to process
116
+
99
117
* Extension for PNM
100
- * Added new overloads of joint_exclusive_scan, joint_inclusive_scan, reduce_over_group
101
- * PNM standalone has less opportunity for parallelism, also limited by memory controller
102
- * -> Combine PNM and PIM, PNM generates commands for PIM blocks
118
+
119
+ * Added new overloads of joint_exclusive_scan,
120
+ * joint_inclusive_scan, reduce_over_group
121
+
122
+ * PNM standalone has less opportunity for parallelism
123
+
124
+ * limited by memory controller
125
+ * -> Combine PNM and PIM, PNM generates commands for PIM blocks
126
+
103
127
* Two modes
128
+
104
129
* PIM mode: PIM blocks can operate independently, can choose number of blocks
105
130
* PNM mode: Synchronized execution on multiple PIM blocks
131
+
106
132
* Mapping
133
+
107
134
* Every PIM block is one work-item
108
135
* PNM with attached PIM blocks forms one work-group
136
+
109
137
* Execution
110
- * Work-item operations map to PIM operation
111
- * Group functions map to PNM operation
138
+
139
+ * Work-item operations map to PIM operation
140
+ * Group functions map to PNM operation
141
+
112
142
* Example
143
+
113
144
* work-item execution maps to PIM
114
145
* group function maps to PNM
146
+
115
147
* Conclusion
148
+
116
149
* Integrate support for PIM/PNM into SYCL
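
The mapping described above (each PIM block is one work-item, a PNM unit with its attached PIM blocks forms one work-group) can be illustrated with a small host-side emulation. This is only a sketch in plain C++: it uses no SYCL or PIM/PNM API, and the names `pim_block_partial_sum`, `pnm_group_reduce` and `work_group_sum` are hypothetical, chosen for illustration. It assumes a sum reduction as the example operation.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Host-side emulation only. All names are hypothetical; nothing here is
// part of SYCL or of the proposed extension.

// "Work-item" step: one PIM block operates on its own slice of memory.
inline float pim_block_partial_sum(const std::vector<float>& data,
                                   std::size_t block,
                                   std::size_t block_size) {
    const auto first =
        data.begin() + static_cast<std::ptrdiff_t>(block * block_size);
    const auto last = first + static_cast<std::ptrdiff_t>(block_size);
    return std::accumulate(first, last, 0.0f);
}

// "Group function" step: the PNM unit combines the results of the PIM
// blocks attached to it.
inline float pnm_group_reduce(const std::vector<float>& partials) {
    return std::accumulate(partials.begin(), partials.end(), 0.0f);
}

// One emulated work-group: num_blocks work-items followed by one group
// function. Assumes data.size() >= num_blocks * block_size.
inline float work_group_sum(const std::vector<float>& data,
                            std::size_t num_blocks,
                            std::size_t block_size) {
    std::vector<float> partials(num_blocks);
    for (std::size_t b = 0; b < num_blocks; ++b) {  // one iteration per work-item
        partials[b] = pim_block_partial_sum(data, b, block_size);
    }
    return pnm_group_reduce(partials);
}
```

In this emulation the per-block loop stands in for independent PIM execution and the final reduction stands in for the synchronized PNM step. On real hardware the work-items would run in parallel rather than in a loop.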
117
150
118
151
Q&A
119
- * Are the proposed functions specific to PIM or could also be used with other HW?
120
- * Can also be used with other hardware. Semantics not PIM-specific, but translation of C++ to SYCL
121
- * Can also map nicely to other types of hardware, for example vector processor
152
+ * Are the proposed functions specific to PIM, or could they also be used with other HW?
153
+
154
+ * Can also be used with other hardware.
155
+ * Semantics not PIM-specific, but translation of C++ to SYCL
156
+ * Can also map nicely to other types of hardware, e.g. vector processor
157
+
122
158
* Why have the user explicitly specify a block-size?
123
- * Not a hardware detail
124
- * Rather a promise by the user that data-blocks will always be at least that big
125
- * Promise allows device compiler to perform optimizations, efficient looping inside PIM unit
126
- * Could num_blocks runtime parameter be replaced by iterator, requiring to be divisable by block-size
127
- * Yes, that is possible, mainly a design question
128
- * Current version might have additional implications regarding alignment
159
+
160
+ * Not a hardware detail
161
+ * Rather a promise by the user that data-blocks
162
+ will always be at least that big
163
+ * Promise allows device compiler to perform optimizations,
164
+ efficient looping inside PIM unit
165
+
166
+ * Could num_blocks runtime parameter be replaced by iterator?
167
+
168
+ * would require the length to be divisible by block-size
169
+ * Yes, that is possible, mainly a design question
170
+ * Current version might have additional implications regarding alignment
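
The two interface shapes discussed in this Q&A can be compared in a short host-side sketch. It is plain C++, not the proposed SYCL extension: `blocked_transform_n` and `blocked_transform` are hypothetical names, and the sketch assumes every block holds exactly `BlockSize` elements, so the element count is `BlockSize * num_blocks`.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <vector>

// Hypothetical host-side stand-ins; not the proposed extension's API.

// Shape 1: block size as template parameter, number of blocks at run time.
// The number of elements processed is BlockSize * num_blocks.
template <std::size_t BlockSize, typename InIt, typename OutIt, typename UnaryOp>
OutIt blocked_transform_n(InIt first, std::size_t num_blocks, OutIt out,
                          UnaryOp op) {
    for (std::size_t b = 0; b < num_blocks; ++b) {
        // Fixed trip count known at compile time, which is what lets a
        // compiler generate an efficient inner loop per block.
        for (std::size_t i = 0; i < BlockSize; ++i) {
            *out++ = op(*first++);
        }
    }
    return out;
}

// Shape 2: iterator pair, as raised in the Q&A. The range length has to be
// divisible by BlockSize; here that is checked with an assertion.
template <std::size_t BlockSize, typename InIt, typename OutIt, typename UnaryOp>
OutIt blocked_transform(InIt first, InIt last, OutIt out, UnaryOp op) {
    const auto n = static_cast<std::size_t>(std::distance(first, last));
    assert(n % BlockSize == 0 && "range length must be divisible by BlockSize");
    return blocked_transform_n<BlockSize>(first, n / BlockSize, out, op);
}
```

The first form encodes the divisibility requirement in the signature itself, while the second has to check it, at run time in this sketch. That is one way to read the remark that the choice is mainly a design question.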
129
171
130
172
131
173
2023-06-05