Joint Matrix: A Unified SYCL Extension for Matrix Hardware Programming
------------------------------------------------------------------------

Dounia Khaldi,
`Slides <presentations/2023-06-07-DK-matrix-oneapi-language.pdf.pdf>`_

* Great community reception, with contributions to MLIR dialects upstream
* Different levels of abstraction are exposed to different users.
  Joint matrix aims for the middle: it breaks GEMM down into primitives,
  so it is low level but portable across targets.
* This presentation will cover both the SPIR-V and the SYCL extension;
  both are needed for different targets
* Joint matrix relies on various abstractions for code generation
  (PTX ISA, SPIR-V, GPU intrinsics, ...)
* Joint matrix is not a replacement for frameworks and libraries;
  it is useful when implementing new operations
  or optimizing unexpected combinations of operations
* It is also useful for library developers,
  who need to write code that is portable
* Intel PVC has two stacks with slices; each slice has 16 Xe cores,
  each with 8 vector engines and 8 XMX engines (GEMM accelerators)
* Intel Xeon processors codenamed Sapphire Rapids have AMX extensions,
  which are GEMM accelerators
* NVIDIA and other hardware vendors have their own GEMM accelerators
* You would need a lot of different intrinsics to target all of them
* The SYCL matrix extension is an experimental SYCL API at this point,
  so it may change from one release to the other
* The joint matrix type takes a group (only sub-group is supported today),
  a use (matrix A, B or the accumulator for GEMM),
  and then you specify the shape (rows, columns) and the layout.
* There are various operations supported: fill, load, store
* (Slide shows an example of using the extension;
  a rough sketch is included after these notes)
* Q (Ronan): can you do negative strides or is it just unsigned?
* A: Stride is a positive number.
* Same example and source can run across Intel CPU, GPU and NVIDIA GPU.
* Additional functions to pass row/col. This is Intel specific;
  NVIDIA cannot support this on Tensor Cores
* Q (Ruyman): Any restrictions on which element-wise operations are supported?
* A (Dounia): No restriction, any SYCL kernel code is valid
* Size combinations are different between AMX and XMX,
  and even between generations of XMX. NVIDIA has different numbers.
* How do we write portable code? There is a query interface, static and dynamic
* Static queries require hardware architecture checks.
  Basic code is similar between SYCL joint matrix and CUDA fragments
* CUDA code migration to SYCL is simple as it is very close to the
  wmma operations
* Joint matrix extension in MLIR generates SPIR-V code for multiple backends
* Currently: full support of the SYCL joint matrix extension
  on AMX, XMX and NVIDIA Tensor Cores
* Next steps: standardization of joint matrix in SYCL and SPIR-V
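
To make the bullets above concrete, here is a rough sketch of a single-tile
multiply with the experimental joint_matrix API. Because the extension is
experimental, namespaces and exact signatures have shifted between DPC++
releases; the tile shape, the bfloat16/float element types and the
``one_tile`` helper are illustrative choices, not something shown in the talk.

.. code-block:: cpp

   // Hedged sketch only: the extension is experimental and details vary by release.
   #include <sycl/sycl.hpp>
   #include <cstddef>

   namespace jm = sycl::ext::oneapi::experimental::matrix;
   using bf16   = sycl::ext::oneapi::bfloat16;

   // One sub-group computes a single TM x TN tile of C = A * B.
   // The shape must be a combination the target supports
   // (AMX, XMX generations and Tensor Cores all differ, hence the query interface).
   constexpr std::size_t TM = 8, TN = 16, TK = 16;

   void one_tile(sycl::queue &q, bf16 *A, bf16 *B, float *C) {
     constexpr std::size_t SG = 16;  // required sub-group size is also target specific
     q.parallel_for(
          sycl::nd_range<1>{SG, SG},
          [=](sycl::nd_item<1> it) [[sycl::reqd_sub_group_size(SG)]] {
            auto sg = it.get_sub_group();

            // Group (only sub_group today), element type, use (a / b / accumulator),
            // shape (rows x cols) and layout: the knobs listed in the notes above.
            jm::joint_matrix<sycl::sub_group, bf16, jm::use::a, TM, TK,
                             jm::layout::row_major> tA;
            jm::joint_matrix<sycl::sub_group, bf16, jm::use::b, TK, TN,
                             jm::layout::row_major> tB;  // Intel XMX may want a packed layout
            jm::joint_matrix<sycl::sub_group, float, jm::use::accumulator, TM, TN> tC;

            // Wrap the USM pointers as global-space multi_ptrs for load/store.
            auto pA = sycl::address_space_cast<sycl::access::address_space::global_space,
                                               sycl::access::decorated::no>(A);
            auto pB = sycl::address_space_cast<sycl::access::address_space::global_space,
                                               sycl::access::decorated::no>(B);
            auto pC = sycl::address_space_cast<sycl::access::address_space::global_space,
                                               sycl::access::decorated::no>(C);

            jm::joint_matrix_fill(sg, tC, 0.0f);        // fill
            jm::joint_matrix_load(sg, tA, pA, TK);      // load A; stride is a positive number
            jm::joint_matrix_load(sg, tB, pB, TN);      // load B
            tC = jm::joint_matrix_mad(sg, tA, tB, tC);  // mad (some releases use a 5-arg form)
            jm::joint_matrix_store(sg, tC, pC, TN, jm::layout::row_major);  // store
          }).wait();
   }

A full GEMM would tile the whole problem over many sub-groups and use the
static/dynamic query interface mentioned above to pick shapes the target
actually supports.
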

Joint matrix in NVIDIA Tensor Cores
-----------------------------------

Mehdi Goli, `Slides <presentations/2023-06-07_JointMatrix_NVIDIA.pdf.pdf>`_

* GEMMs are used everywhere and it is very important that we optimize them
* Presentation about joint matrix performance analysis,
  showing support for SM72 and SM80 (Jetson and Ampere)
* We use the joint matrix extension on both;
  we can achieve 97% of cuDNN on Jetson
* On SM80 / A100 we use different sizes and see mixed results
  (very good on small sizes, really bad on large sizes)
  in the performance comparison with CUTLASS and cuDNN.
* SYCL-BLAS half and TF32 performance is slightly better for small sizes but
  gets much worse for bigger sizes in the comparison with CUTLASS and cuDNN
* NVIDIA uses ldmatrix and cp.async (shared store from global load) to get
  higher performance.
  These instructions allow bypassing the cache and applying prefetching
* Tensor Core support has evolved across different NVIDIA architectures,
  and they have added new instructions that support some advanced
  features using a different part of the PTX ISA (wmma vs mma).
* WMMA is a higher-level instruction that maps to multiple HMMA instructions
  in SASS.
* MMA instructions map to a single HMMA wherever possible,
  or, for backwards compatibility, break down into multiple HMMA instructions
  for previous generations
* WMMA is exposed in CUDA and is what we use today for the joint_matrix extension,
  whereas MMA is what CUTLASS and others use via hard-coded assembly
  (a rough WMMA sketch follows at the end of this section)
* Results from NVIDIA suggest WMMA is slower than MMA.
* The performance gap in our joint matrix numbers is due to the
  lack of cp.async, which needs to be added to DPC++.
* We need somehow to expose the mma instruction in DPC++ so that we can fix
  the performance gap.
* Q (Ruyman): you mean supporting it within the joint_matrix extension?
* A (Mehdi): Yes, it should be possible
* Q (Geoff): Why don't we load this into local memory?
* A (Mehdi): It is not supported in our backend
* Q (Geoff): If we preloaded the data in SLM wouldn't we get more performance?
* A (Mehdi): Our backend does not support it; this is one of the key factors
  in the performance problems we see.
* Q (Dounia): Are there technical challenges in the implementation?
* A (Mehdi): It is a lot of different configurations and maintenance in the
  backend. Individual mapping of builtins is difficult.
* A (Dounia): ATS and PVC sizes are different, that is why we have the query.
  The implementation is bigger but it is transparent;
  the user has to specify which hardware they have.
* Q (Geoff): Any matrix multiplication should tile itself onto SLM, but it seems
  that is not the case? Why should joint matrix be 3 times slower?
  They have a nice feature to do it in the ISA,
  but you can do that yourself, right?
* A (Mehdi): They use a different instruction to implement the loading
  that gives better performance;
  we cannot use that instruction in the DPC++ backend yet
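
For comparison, this is roughly what the CUDA WMMA fragment API looks like,
which, per the notes above, is what joint_matrix maps to on NVIDIA today,
while CUTLASS reaches the faster mma/cp.async path through hard-coded
assembly. This is a hedged sketch of one warp computing a 16x16x16 tile;
the shape, row-major layouts and kernel name are illustrative choices.

.. code-block:: cuda

   // Hedged CUDA sketch: one warp computes a 16x16x16 tile with the WMMA fragment API.
   #include <mma.h>
   #include <cuda_fp16.h>
   using namespace nvcuda;

   __global__ void wmma_tile(const half *A, const half *B, float *C,
                             int lda, int ldb, int ldc) {
     // Fragments play the role of joint_matrix: use (a / b / accumulator),
     // shape (16x16x16 here) and layout are template parameters.
     wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
     wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
     wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

     wmma::fill_fragment(c_frag, 0.0f);                             // ~ joint_matrix_fill
     wmma::load_matrix_sync(a_frag, A, lda);                        // ~ joint_matrix_load
     wmma::load_matrix_sync(b_frag, B, ldb);
     wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);                // ~ joint_matrix_mad
     wmma::store_matrix_sync(C, c_frag, ldc, wmma::mem_row_major);  // ~ joint_matrix_store
   }

Exposing the mma path in DPC++, as proposed above, would let joint_matrix
reach the same instructions without hand-written assembly.
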
(Meeting adjourned, out of time)