@@ -23,10 +23,114 @@ Potential Topics
* Function pointers revisited
* oneDPL C++ standard library support
+ 2023-09-19
+ ==========
+
+ * Ruyman Reyes (Intel/Codeplay)
+ * Lukas Sommer (Codeplay Software Ltd)
+ * Benie (Codeplay Software Ltd)
+ * Hyesun Hong (Samsung SAIT)
+ * Julian Oppermann (Codeplay Software Ltd)
+ * Mehdi Goli (Codeplay Software Ltd)
+ * Lueck, Gregory (Intel)
+ * Jesus Labarta (BSC) (Guest)
+ * Brodman, James (Intel)
+ * Hanwoong Jung (Samsung SAIT)
+ * Brice Goglin (Guest)
+ * Plaska, Oskar (Contractor, Cognizant)
+ * Tom Deakin (Univ. of Bristol)
+ * Marcin (N/A)
+ * Victor Lomuller (Codeplay Software Ltd)
+ * Biagio COSENZA (Università degli Studi di Salerno)
+ * Voss, Michael J (Intel)
+ * Kukanov, Alexey (Intel)
+ * Richards, Alison L (Intel)
+ * Adam Kuźniar (Mobica)
+ * Slavova, Gergana S (Intel)
+ * bongjun kim (Samsung SAIT)
+ * Keryell, Ronan (XILINX LABS)
+ * Juan Fumero (University of Manchester)
+ * Gordon Brown (Codeplay Software Ltd)
+ * Tim (N/A)
+ * Kinsner, Michael (Intel)
+ * Petersen, Paul (Intel)
+ * Videau, Brice (ANL)
+ * Holmes, Daniel John (Intel)
+ * Frank Brill (Cadence)
+ * Mrozek, Michal (Intel)
+ * Reble, Pablo (Intel)
+ * Andrew Richards (Intel/Codeplay)
+ * Smith, Timmie (Intel)
+
+
+ SYCL Extension Proposal for PIM/PNM
+ --------------------------------------
+
+ Hyesun Hong,
+ `Slides <presentation/2023-09-19-HS-sycl-pim-extensions.pdf>`__
+
+ * PIM/PNM technology enables computation directly in or near memory
+ * Avoids data movement, improving performance and reducing energy consumption
+ * PIM operates directly on memory banks, reading and writing rows and columns
+ * Aquabolt-XL is the first demonstrator
+ * Can be used as a drop-in replacement with any memory controller
+ * CXL-PNM is the CXL-based variant of PNM and can work with multiple PIM units
+
+ SYCL Extension for PIM/PNM
+
+ * Goals
+
+   * Seamlessly integrate PIM/PNM operations into SYCL
+   * Allow combining xGPU and PIM/PNM operations in one device kernel
+   * Not specific to one kind of hardware
+
+ * Design
+
+   * Vector operations seem like a natural fit, but they give no convergence guarantee and make the vector size explicit
+   * Model PIM/PNM as a special function unit
+
+     * Aligns with the trend of modeling special function units inside accelerators
+     * Automatic mapping by the compiler is often not possible
+
+   * joint_matrix
+   * Group functions
+
+     * Easy to use
+     * Can easily be combined with other device code
+     * Give the necessary convergence guarantees
+
+ * Recap of SYCL work-items, work-groups and group functions
+
+   * Group functions must be encountered in converged control flow
+
+ * Extension
+
+   * Extended the group functions with an additional overload of joint_reduce and new joint_transform and joint_inner_product functions
+   * Block size as a template parameter and number of blocks as a runtime parameter; together they determine the number of elements to process
+
+ * Extension for PNM
+
+   * Added new overloads of joint_exclusive_scan, joint_inclusive_scan and reduce_over_group
+   * PNM alone offers less opportunity for parallelism and is also limited by the memory controller
+   * Therefore PNM and PIM are combined: PNM generates commands for the PIM blocks
+
+ * Two modes
+
+   * PIM mode: PIM blocks operate independently and the number of blocks can be chosen
+   * PNM mode: synchronized execution on multiple PIM blocks
+
+ * Mapping
+
+   * Every PIM block is one work-item
+   * A PNM unit with its attached PIM blocks forms one work-group
+
+ * Execution
+
+   * Work-item operations map to PIM operations
+   * Group functions map to PNM operations
+
+ * Example
+
+   * Work-item execution maps to PIM
+   * Group function maps to PNM
+
+ * Conclusion
+
+   * Integrate support for PIM/PNM into SYCL
+
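The blocked group-function semantics from the talk (block size fixed at compile time, number of blocks supplied at run time) can be approximated on the host with plain C++. This is a minimal sketch under stated assumptions. The function names ``blocked_transform``, ``blocked_reduce`` and ``blocked_inner_product`` are hypothetical and are not the proposed SYCL signatures. The code runs sequentially on the CPU. It only illustrates that ``BlockSize * num_blocks`` elements are processed, and that each block produces a partial result which is then combined, loosely mirroring the split between per-block PIM work and PNM combination.

```cpp
// Host-side sketch of blocked group-function semantics. The names
// blocked_transform, blocked_reduce and blocked_inner_product are
// hypothetical; they are NOT the signatures of the proposed SYCL extension.
#include <cassert>     // <cassert>, <functional> and <vector> support calling
#include <cstddef>     // code, e.g. the usage example below
#include <functional>
#include <vector>

// BlockSize: compile-time promise about the data layout.
// num_blocks: known only at run time. Elements processed = BlockSize * num_blocks.
template <std::size_t BlockSize, typename T, typename UnaryOp>
void blocked_transform(const T* in, T* out, std::size_t num_blocks, UnaryOp op) {
  static_assert(BlockSize > 0, "BlockSize must be positive");
  for (std::size_t b = 0; b < num_blocks; ++b) {   // one block per iteration
    for (std::size_t i = 0; i < BlockSize; ++i) {  // fixed trip count
      out[b * BlockSize + i] = op(in[b * BlockSize + i]);
    }
  }
}

// Each block first produces a partial result; the partials are then combined.
// This loosely mirrors "work-item (PIM block) computes, group function (PNM) combines".
template <std::size_t BlockSize, typename T, typename BinaryOp>
T blocked_reduce(const T* in, std::size_t num_blocks, T init, BinaryOp op) {
  static_assert(BlockSize > 0, "BlockSize must be positive");
  T result = init;
  for (std::size_t b = 0; b < num_blocks; ++b) {
    T partial = in[b * BlockSize];
    for (std::size_t i = 1; i < BlockSize; ++i) {
      partial = op(partial, in[b * BlockSize + i]);
    }
    result = op(result, partial);
  }
  return result;
}

// Dot product computed block by block, then summed into init.
template <std::size_t BlockSize, typename T>
T blocked_inner_product(const T* a, const T* b, std::size_t num_blocks, T init) {
  static_assert(BlockSize > 0, "BlockSize must be positive");
  T result = init;
  for (std::size_t blk = 0; blk < num_blocks; ++blk) {
    T partial{};
    for (std::size_t i = 0; i < BlockSize; ++i) {
      const std::size_t idx = blk * BlockSize + i;
      partial += a[idx] * b[idx];
    }
    result += partial;
  }
  return result;
}
```

For example, for a ``std::vector<int> v`` holding the values 1 to 12, ``blocked_reduce<4>(v.data(), 3, 0, std::plus<int>())`` processes 4 * 3 = 12 elements and returns 78.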
+ Q&A
+
+ * Are the proposed functions specific to PIM, or could they also be used with other hardware?
+
+   * They can also be used with other hardware. The semantics are not PIM-specific; the functions are a translation of C++ standard algorithms to SYCL
+   * They can also map nicely to other types of hardware, for example vector processors
+
+ * Why have the user explicitly specify a block size?
+
+   * It is not a hardware detail
+   * Rather, it is a promise by the user that data blocks will always be at least that big
+   * The promise allows the device compiler to perform optimizations, such as efficient looping inside the PIM unit
+
+ * Could the num_blocks runtime parameter be replaced by an iterator, requiring the range to be divisible by the block size?
+
+   * Yes, that is possible; it is mainly a design question
+   * The current version might have additional implications regarding alignment
+
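The iterator alternative raised in the last question can be sketched the same way. In this hypothetical host-only variant, ``blocked_reduce_range`` is an illustrative name and is not part of the proposal. The caller passes a range instead of ``num_blocks``. The number of blocks is then derived from the range length, which must be divisible by the block size. SYCL device code cannot throw exceptions, so a real implementation would need another way to report a violated precondition.

```cpp
// Hypothetical iterator-based variant of a blocked reduction: the caller
// passes [first, last) instead of num_blocks, and the range length must be
// a multiple of BlockSize. Host-only sketch; not the proposed SYCL API.
#include <cassert>      // <cassert>, <functional> and <vector> support calling
#include <cstddef>      // code, e.g. the usage example below
#include <functional>
#include <iterator>
#include <stdexcept>
#include <vector>

template <std::size_t BlockSize, typename RandomIt, typename T, typename BinaryOp>
T blocked_reduce_range(RandomIt first, RandomIt last, T init, BinaryOp op) {
  static_assert(BlockSize > 0, "BlockSize must be positive");
  using diff_t = typename std::iterator_traits<RandomIt>::difference_type;
  const auto n = static_cast<std::size_t>(std::distance(first, last));
  if (n % BlockSize != 0) {
    throw std::invalid_argument("range length must be a multiple of BlockSize");
  }
  const std::size_t num_blocks = n / BlockSize;  // derived, not passed in
  T result = init;
  for (std::size_t b = 0; b < num_blocks; ++b) {
    // BlockSize is a compile-time constant, so this loop has a fixed trip
    // count that a compiler is free to unroll.
    for (std::size_t i = 0; i < BlockSize; ++i) {
      result = op(result, first[static_cast<diff_t>(b * BlockSize + i)]);
    }
  }
  return result;
}
```

For a ``std::vector<int>`` holding 1 to 8, ``blocked_reduce_range<4>(v.begin(), v.end(), 0, std::plus<int>())`` returns 36. Passing only the first six elements throws ``std::invalid_argument``. The trade-off is that the divisibility check moves to run time, whereas an explicit ``num_blocks`` argument states the block count directly in the call.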
+
2023-06-05
==========
-
* Ruyman Reyes
* Rod Burns
* Cohn, Robert S