This repository was archived by the owner on Feb 5, 2024. It is now read-only.

Commit 63bdcee
Trailing whitespaces
1 parent e6b8a20 commit 63bdcee

language/README.rst (+43 -43)

@@ -61,33 +61,33 @@ Potential Topics
Joint Matrix: A Unified SYCL Extension for Matrix Hardware Programming
------------------------------------------------------------------------

Dounia Khaldi,
`Slides <presentations/2023-06-07-DK-matrix-oneapi-language.pdf.pdf>`_

* Great community reception, with contributions to MLIR dialects upstream
* Different levels of abstraction are exposed to different users.
  Joint matrix aims for the middle: it breaks GEMM down into primitives,
  so it is low level, but it is portable across targets.
* This presentation will cover both the SPIR-V and the SYCL extension;
  both are needed for different targets
* Joint matrix relies on various abstractions for code generation
  (PTX ISA, SPIR-V, GPU intrinsics, ...)
* Joint matrix is not a replacement for the frameworks and the libraries;
  it is useful when implementing new operations
  or optimizing unexpected combinations of operations
* This is also useful for library developers,
  who need to write code that is portable
* Intel PVC has two stacks with slices; each slice has 16 Xe cores,
  each with 8 vector engines and 8 XMX engines (GEMM accelerators)
* Intel Xeon processors codenamed Sapphire Rapids have AMX extensions,
  which are GEMM accelerators
* NVIDIA and other hardware vendors have their own GEMM accelerators
* You would need a lot of different intrinsics to target all of them
* The SYCL matrix extension is an experimental SYCL API at this point,
  so it may change from one release to the next
* The joint matrix is parameterized by a type of group (only sub-group is
  supported) and a use (matrix A, B or accumulator for GEMM);
  then you specify the shape (rows, columns) and the layout.
* There are various operations supported: fill, load, store
* (Slide shows an example of using the extension; a rough usage sketch
  follows below)
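
The slide's example is not reproduced in the minutes; the following is only a
rough sketch of what single-tile joint_matrix usage looks like, based on the
experimental ``sycl_ext_oneapi_matrix`` extension. Because the API is
experimental, the namespaces, the template-parameter order and the
``joint_matrix_mad`` signature have changed between DPC++ releases, and the
8x16x16 bfloat16 shape and the row-major layouts are assumptions for
illustration (real shapes and layouts should come from the query interface
discussed below).

.. code-block:: c++

   #include <sycl/sycl.hpp>

   using namespace sycl;
   using namespace sycl::ext::oneapi::experimental::matrix;
   using bf16 = sycl::ext::oneapi::bfloat16;

   // Assumed tile shape for illustration only.
   constexpr size_t TM = 8, TN = 16, TK = 16;

   // Computes one TMxTN tile: C = A (TMxTK) * B (TKxTN), from USM device pointers.
   void one_tile_gemm(queue &q, bf16 *A, bf16 *B, float *C) {
     q.parallel_for(nd_range<1>{16, 16},
                    [=](nd_item<1> it) [[sycl::reqd_sub_group_size(16)]] {
        sub_group sg = it.get_sub_group();

        // A joint_matrix is parameterized by group, element type, use,
        // shape and layout, as described above.
        joint_matrix<sub_group, bf16, use::a, TM, TK, layout::row_major> tA;
        joint_matrix<sub_group, bf16, use::b, TK, TN, layout::row_major> tB;
        joint_matrix<sub_group, float, use::accumulator, TM, TN> tC;

        auto pA = address_space_cast<access::address_space::global_space,
                                     access::decorated::no>(A);
        auto pB = address_space_cast<access::address_space::global_space,
                                     access::decorated::no>(B);
        auto pC = address_space_cast<access::address_space::global_space,
                                     access::decorated::no>(C);

        joint_matrix_fill(sg, tC, 0.0f);        // accumulator = 0
        joint_matrix_load(sg, tA, pA, TK);      // stride = row length of A
        joint_matrix_load(sg, tB, pB, TN);
        joint_matrix_mad(sg, tC, tA, tB, tC);   // tC = tA * tB + tC
        joint_matrix_store(sg, tC, pC, TN, layout::row_major);
     }).wait();
   }
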
@@ -96,19 +96,19 @@ Dounia Khaldi,
* Q (Ronan): can you do negative strides, or is it just unsigned?
* A: Stride is a positive number.
* The same example and source can run across Intel CPU, GPU and NVIDIA GPU.
* There are additional functions to pass the row/col. This is Intel specific;
  NVIDIA cannot support this on Tensor Cores
* Q(Ruyman): Any restrictions on the element-wise operations supported?
* A(Dounia): No restriction, any SYCL kernel code is valid
* Size combinations are different between AMX and XMX,
  and even between generations of XMX. NVIDIA has different numbers.
* How do we write portable code? There is a query interface, static and
  dynamic (see the sketch after this list)
* Static queries require hardware architecture checks.
  Basic code is similar between SYCL joint matrix and CUDA fragments
* CUDA code migration to SYCL is simple, as it is very close to the
  wmma operations
* The joint matrix extension in MLIR generates SPIR-V code for multiple backends
* Currently: full support of the SYCL joint matrix extension
  on AMX, XMX and NVIDIA Tensor Cores
* Next steps: standardization of joint matrix in SYCL and SPIR-V
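
The static/dynamic query mentioned above is what makes the same source
portable. The snippet below is only an illustration of the static-query idea
(a hand-written compile-time table), not the extension's actual query API,
whose class and member names may differ; all of the listed shape combinations
are assumptions for illustration.

.. code-block:: c++

   #include <cstddef>
   #include <sycl/sycl.hpp>
   using bf16 = sycl::ext::oneapi::bfloat16;

   // Hypothetical compile-time table mapping (architecture, element types)
   // to a supported tile shape; the real extension exposes this through its
   // query interface instead.
   enum class target { intel_amx, intel_xmx, nvidia_sm80 };

   template <target T, typename Ta, typename Tc> struct tile_shape;

   template <> struct tile_shape<target::intel_xmx, bf16, float> {
     static constexpr std::size_t M = 8, N = 16, K = 16;   // assumed
   };
   template <> struct tile_shape<target::intel_amx, bf16, float> {
     static constexpr std::size_t M = 16, N = 16, K = 32;  // assumed
   };
   template <> struct tile_shape<target::nvidia_sm80, bf16, float> {
     static constexpr std::size_t M = 16, N = 16, K = 16;  // assumed
   };

   // Kernel code is written once against tile_shape<T, ...>::M/N/K, so the
   // same joint_matrix source recompiles for AMX, XMX or Tensor Cores.
   template <target T> void gemm_for_target() {
     using shape = tile_shape<T, bf16, float>;
     constexpr std::size_t TM = shape::M, TN = shape::N, TK = shape::K;
     // ... declare joint_matrix<sub_group, bf16, use::a, TM, TK, ...> here ...
     (void)TM; (void)TN; (void)TK;
   }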

@@ -118,32 +118,32 @@ Joint matrix in NVIDIA Tensor Cores
Mehdi Goli, `Slides <presentations/2023-06-07_JointMatrix_NVIDIA.pdf.pdf>`_

* GEMMs are used everywhere and it is very important that we optimize them
* Presentation about joint matrix performance analysis,
  showing support for SM72 and SM80 (Jetson and Ampere)
* We use the joint matrix extension on both;
  we can achieve 97% of cuDNN on Jetson
* On SM80 / A100 we use different sizes and see mixed results
  (very good on small sizes, really bad on large sizes)
  in the performance comparison with CUTLASS and cuDNN.
* SYCL-BLAS half and TF32 performance is slightly better for small sizes but
  gets much worse for bigger sizes in the performance comparison with CUTLASS
  and cuDNN
* NVIDIA uses ldmatrix and cp.async (Shared Store From Global Load) to get
  higher performance.
  These instructions allow bypassing the cache and applying prefetching
* Tensor Core support has evolved across different NVIDIA architectures,
  and they have added new instructions that support some advanced
  features using a different part of the PTX ISA (wmma vs mma).
* WMMA is a higher-level instruction that maps to multiple HMMA instructions
  in the SASS.
* MMA instructions map to a single HMMA wherever possible, or, for backwards
  compatibility, break down into multiple HMMA instructions
  on previous generations
* WMMA is exposed in CUDA and is what we use today for the joint_matrix
  extension, whereas MMA is what CUTLASS and others use via hard-coded assembly.
* Results from NVIDIA suggest WMMA is slower than MMA.
* The performance gap in our joint matrix numbers is due to the
  lack of cp.async, which needs to be added to DPC++.
* We need somehow to expose the mma instruction to DPC++ so that we can fix
  the performance gap.
* Q(Ruyman): you mean supporting it within the joint_matrix extension?
* A(Mehdi): Yes, it should be possible
@@ -152,20 +152,20 @@ Mehdi Goli, `Slides <presentations/2023-06-07_JointMatrix_NVIDIA.pdf.pdf>`
* Q(Geoff): Why don't we load this into local memory?
* A(Mehdi): It is not supported in our backend
* Q(Geoff): If we preload the data into SLM, wouldn't we get more performance?
* A(Mehdi): Our backend does not support it; this is one of the key factors
  in the performance problems we see.
* Q(Dounia): Are there technical challenges in the implementation?
* A(Mehdi): It is a lot of different configurations and maintenance for the
  backend. Individual mapping of builtins is difficult.
* A(Dounia): ATS and PVC sizes are different, that's why we have the query.
  The implementation is bigger but it is transparent;
  the user has to specify which hardware they have.
* Q(Geoff): Any matrix multiplication should tile itself onto SLM, but it
  seems that is not the case? Why should joint matrix be 3 times slower?
  They have a nice feature to do it in the ISA,
  but you can do that yourself, right?
* A(Mehdi): They use a different instruction to implement the loading
  that gives better performance;
  we cannot use that instruction in the DPC++ backend yet

(Meeting adjourned, out of time)
