
[CPU] Add fp16 support to sparse attention #24015

Open
wants to merge 5 commits into base: main

Conversation

fajin-corp (Contributor)

Description

Add fp16 support to sparse attention

Motivation and Context

Generalize models so they run on both CPU and GPU.

q, head_size, k, head_size, output, total_seq_len,
MLFloat16(alpha).val, static_cast<uint16_t>(0) /*beta*/, nullptr);
} else {
size_t bytes = head_size * (sequence_length + total_seq_len) * sizeof(float);

Check failure

Code scanning / CodeQL

Multiplication result converted to larger type High

Multiplication result may overflow 'int' before it is converted to 'unsigned long'.

Copilot Autofix AI about 11 hours ago

To fix the problem, cast one of the operands to size_t before the multiplication, so that the arithmetic is carried out in the wider type and cannot overflow int.

  • Cast one of the operands to size_t before performing the multiplication.
  • Specifically, cast head_size to size_t in the multiplication expression on line 236.
  • No additional methods, imports, or definitions are needed to implement this change.
Suggested changeset 1
onnxruntime/contrib_ops/cpu/sparse/sparse_attention_base.h

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/onnxruntime/contrib_ops/cpu/sparse/sparse_attention_base.h b/onnxruntime/contrib_ops/cpu/sparse/sparse_attention_base.h
--- a/onnxruntime/contrib_ops/cpu/sparse/sparse_attention_base.h
+++ b/onnxruntime/contrib_ops/cpu/sparse/sparse_attention_base.h
@@ -235,3 +235,3 @@
         } else {
-          size_t bytes = head_size * (sequence_length + total_seq_len) * sizeof(float);
+          size_t bytes = static_cast<size_t>(head_size) * (sequence_length + total_seq_len) * sizeof(float);
           auto q_k_fp32 = allocator->Alloc(bytes);
EOF
BufferUniquePtr scratch_buffer(q_k_fp32, BufferDeleter(allocator));

float* q_fp32 = static_cast<float*>(q_k_fp32);
MlasConvertHalfToFloatBuffer(q, q_fp32, head_size * sequence_length);

Check failure

Code scanning / CodeQL

Multiplication result converted to larger type High

Multiplication result may overflow 'int' before it is converted to 'size_t'.

Copilot Autofix AI about 11 hours ago

To fix the problem, cast one of the operands to size_t before the multiplication, so the arithmetic is carried out in size_t, which has a larger range than int.

Specifically, cast head_size to size_t before multiplying it by sequence_length.

Suggested changeset 1
onnxruntime/contrib_ops/cpu/sparse/sparse_attention_base.h

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/onnxruntime/contrib_ops/cpu/sparse/sparse_attention_base.h b/onnxruntime/contrib_ops/cpu/sparse/sparse_attention_base.h
--- a/onnxruntime/contrib_ops/cpu/sparse/sparse_attention_base.h
+++ b/onnxruntime/contrib_ops/cpu/sparse/sparse_attention_base.h
@@ -240,3 +240,3 @@
           float* q_fp32 = static_cast<float*>(q_k_fp32);
-          MlasConvertHalfToFloatBuffer(q, q_fp32, head_size * sequence_length);
+          MlasConvertHalfToFloatBuffer(q, q_fp32, static_cast<size_t>(head_size) * sequence_length);
 
EOF
MlasConvertHalfToFloatBuffer(q, q_fp32, head_size * sequence_length);

float* k_fp32 = q_fp32 + head_size * sequence_length;
MlasConvertHalfToFloatBuffer(k, k_fp32, head_size * total_seq_len);

Check failure

Code scanning / CodeQL

Multiplication result converted to larger type High

Multiplication result may overflow 'int' before it is converted to 'size_t'.

Copilot Autofix AI about 11 hours ago

To fix the problem, cast one of the operands to size_t before the multiplication, so the arithmetic is carried out in the larger type and cannot overflow int.

  • Cast one of the operands (head_size or total_seq_len) to size_t before the multiplication.
  • This change should be made on line 244 where the multiplication occurs.
Suggested changeset 1
onnxruntime/contrib_ops/cpu/sparse/sparse_attention_base.h

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/onnxruntime/contrib_ops/cpu/sparse/sparse_attention_base.h b/onnxruntime/contrib_ops/cpu/sparse/sparse_attention_base.h
--- a/onnxruntime/contrib_ops/cpu/sparse/sparse_attention_base.h
+++ b/onnxruntime/contrib_ops/cpu/sparse/sparse_attention_base.h
@@ -243,3 +243,3 @@
           float* k_fp32 = q_fp32 + head_size * sequence_length;
-          MlasConvertHalfToFloatBuffer(k, k_fp32, head_size * total_seq_len);
+          MlasConvertHalfToFloatBuffer(k, k_fp32, static_cast<size_t>(head_size) * total_seq_len);
 
EOF
v, head_size, output_current, hidden_size,
MLFloat16(1.0f).val, static_cast<uint16_t>(0) /*beta*/, nullptr);
} else {
size_t bytes = head_size * total_seq_len * sizeof(float);

Check failure

Code scanning / CodeQL

Multiplication result converted to larger type High

Multiplication result may overflow 'int' before it is converted to 'unsigned long'.

Copilot Autofix AI about 11 hours ago

To fix the problem, cast head_size to size_t before the multiplication on line 450, so the arithmetic is carried out in size_t and cannot overflow int.

Suggested changeset 1
onnxruntime/contrib_ops/cpu/sparse/sparse_attention_base.h

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/onnxruntime/contrib_ops/cpu/sparse/sparse_attention_base.h b/onnxruntime/contrib_ops/cpu/sparse/sparse_attention_base.h
--- a/onnxruntime/contrib_ops/cpu/sparse/sparse_attention_base.h
+++ b/onnxruntime/contrib_ops/cpu/sparse/sparse_attention_base.h
@@ -449,3 +449,3 @@
             } else {
-              size_t bytes = head_size * total_seq_len * sizeof(float);
+              size_t bytes = static_cast<size_t>(head_size) * total_seq_len * sizeof(float);
               auto v_fp32 = allocator->Alloc(bytes);
EOF
BufferUniquePtr scratch_buffer(v_fp32, BufferDeleter(allocator));

float* v_fp32_ptr = static_cast<float*>(v_fp32);
MlasConvertHalfToFloatBuffer(v, v_fp32_ptr, head_size * total_seq_len);

Check failure

Code scanning / CodeQL

Multiplication result converted to larger type High

Multiplication result may overflow 'int' before it is converted to 'size_t'.

Copilot Autofix AI about 11 hours ago

To fix the problem, cast head_size to size_t before the multiplication on line 455 of onnxruntime/contrib_ops/cpu/sparse/sparse_attention_base.h, so the arithmetic is carried out in size_t, which has a larger range than int.

Suggested changeset 1
onnxruntime/contrib_ops/cpu/sparse/sparse_attention_base.h

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/onnxruntime/contrib_ops/cpu/sparse/sparse_attention_base.h b/onnxruntime/contrib_ops/cpu/sparse/sparse_attention_base.h
--- a/onnxruntime/contrib_ops/cpu/sparse/sparse_attention_base.h
+++ b/onnxruntime/contrib_ops/cpu/sparse/sparse_attention_base.h
@@ -454,3 +454,3 @@
               float* v_fp32_ptr = static_cast<float*>(v_fp32);
-              MlasConvertHalfToFloatBuffer(v, v_fp32_ptr, head_size * total_seq_len);
+              MlasConvertHalfToFloatBuffer(v, v_fp32_ptr, static_cast<size_t>(head_size) * total_seq_len);
 
EOF