Regarding getting different prompt response for CPU vs my Custom kernel #12519
Closed
akapoor3518 started this conversation in General
Replies: 1 comment 3 replies
-
It's normal for results from different backends to differ slightly, due to small floating-point calculation differences. In this case, though, it looks like a bug. Your best option for finding it is to dump the data after each operator and compare the two backends, in order to see where they first diverge. Also make sure that |
-
Hi @ggerganov,
I have spent a considerable amount of time trying to understand why we get two different responses, even with --temp 0.0 (which should be deterministic), when running on two different backends, and I am blocked. Soon we need to add many other kernel operations and prepare a customer demo, so I would really appreciate help from you or anyone on your team. I have tried every option I can think of.
The same setup works fine with a small model like
tinyllama-vo-5m-para.gguf: for the example below, the two backends (CPU vs custom hardware) produce consistent responses, even with a larger --n-predict of 10, 20 or 30. I am very happy with the response comparison there.
Thanks in advance!
Below are the details of my analysis.
#######################
Below is an example where I run a prompt and get 5 tokens on CPU vs my custom hardware (it supports only GGML_OP_ADD, GGML_OP_SUB, GGML_OP_MUL and GGML_OP_DIV; those operations are offloaded to my custom hardware, and the rest run on the CPU).
ggml_backend_sched_split_graph --- this part is clear to me: I can see how the graph gets split, and I don't see any issue here.
ggml_backend_sched_compute_splits --- here the graph is computed split by split, left to right. Since I don't have any graph planning, all my custom-hardware graph compute runs on the thread of ggml_backend_sched_compute_splits, while the CPU compute for the other ops runs on its own threads. I also have a synchronize function, so that when a tensor copy happens into a tensor on my custom hardware, it holds the final data from the CPU and is not overwritten.
Still, I am not able to figure out where the two responses diverge, or why my response comes back as junk characters.
./build/bin/llama-cli -p "my cat name" -m ./models/Tiny-Llama-v0.3-FP32-1.1B-F32.gguf --device none -c 12288 --temp 0.0 --n-predict 10 --repeat-penalty 1.5 -b 1024 --top-k 50 --top-p 0.9 --repeat-last-n 5 --no-warmup
########
my cat name.
I'm a cat. I like
vs
./build/bin/llama-cli -p "my cat name" -m ./models/Tiny-Llama-v0.3-FP32-1.1B-F32.gguf --device custom-hardware -c 12288 --temp 0.0 --n-predict 10 --repeat-penalty 1.5 -b 1024 --top-k 50 --top-p 0.9 --repeat-last-n 5 --no-warmup
my cat name mach
.-
a.
llama_perf_sampler_print: sampling time = 20.69 ms / 14 runs ( 1.48 ms per token, 676.59 tokens per second)
llama_perf_context_print: load time = 2817.93 ms
llama_perf_context_print: prompt eval time = 2339.53 ms / 4 tokens ( 584.88 ms per token, 1.71 tokens per second)
llama_perf_context_print: eval time = 4820.91 ms / 9 runs ( 535.66 ms per token, 1.87 tokens per second)
llama_perf_context_print: total time = 7663.02 ms / 13 tokens
##########
Below are my stats for my custom kernel:
ADD Operation, total tensors: 10 Number of kernel calls: 320 Number of tensors split: 10 Min num of elems 2048 Max num of elems 2048
SUB Operation, total tensors: 0 Number of kernel calls: 0 Number of tensors split: 0 Min num of elems 0 Max num of elems 0
MUL Operation, total tensors: 450 Number of kernel calls: 14400 Number of tensors split: 450 Min num of elems 2048 Max num of elems 8192
DIV Operation, total tensors: 0 Number of kernel calls: 0 Number of tensors split: 0 Min num of elems 0 Max num of elems 0