Regarding getting different prompt response for CPU vs my Custom kernel #12519
Closed
akapoor3518 started this conversation in General
Replies: 1 comment 3 replies
-
It's normal for results from different backends to differ slightly, due to small floating-point calculation differences. In this case, though, it looks like a bug. Your best option for finding it is to dump the data after each operator and compare the two backends, in order to see where they first diverge. Also make sure that |
-
Hi @ggerganov,
I have spent a considerable amount of time trying to understand why we get two different responses, even with --temp 0.0 (which should be deterministic), when running on two different backends, and I am blocked. Soon we need to add many other kernel operations and prepare a customer demo, so I would really appreciate help from you or anyone on your team. I have tried every option I can think of.
The same setup works fine with a small model like
tinyllama-vo-5m-para.gguf: for the example below, the two backends (CPU vs custom hardware) produce consistent responses, even with a larger --n-predict of 10, 20 or 30. I am very happy with the response comparison there.
Thanks in advance!
Below are the details of my analysis.
#######################
Below is an example where I run a prompt and get 5 tokens on CPU vs my custom hardware (it supports only GGML_OP_ADD, GGML_OP_SUB, GGML_OP_MUL and GGML_OP_DIV; those operations are offloaded to my custom hardware, and the rest run on the CPU).
ggml_backend_sched_split_graph --- this part is clear to me: I can see how the graph gets split, and I don't see any issue here.
ggml_backend_sched_compute_splits --- here the graph is computed split by split, left to right. Since I don't have any graph planning, all my custom-hardware graph compute runs on the thread of ggml_backend_sched_compute_splits, while the CPU compute for the other ops runs on its own threads. I also have a synchronize function, so that when a tensor copy happens into a tensor on my custom hardware, it holds the final data from the CPU and is not overwritten.
Still, I am not able to figure out where the two responses diverge, or why my response comes back as junk characters.
./build/bin/llama-cli -p "my cat name" -m ./models/Tiny-Llama-v0.3-FP32-1.1B-F32.gguf --device none -c 12288 --temp 0.0 --n-predict 10 --repeat-penalty 1.5 -b 1024 --top-k 50 --top-p 0.9 --repeat-last-n 5 --no-warmup
########
my cat name.
I'm a cat. I like
vs
./build/bin/llama-cli -p "my cat name" -m ./models/Tiny-Llama-v0.3-FP32-1.1B-F32.gguf --device custom-hardware -c 12288 --temp 0.0 --n-predict 10 --repeat-penalty 1.5 -b 1024 --top-k 50 --top-p 0.9 --repeat-last-n 5 --no-warmup
my cat name mach
.-
a.
llama_perf_sampler_print: sampling time = 20.69 ms / 14 runs ( 1.48 ms per token, 676.59 tokens per second)
llama_perf_context_print: load time = 2817.93 ms
llama_perf_context_print: prompt eval time = 2339.53 ms / 4 tokens ( 584.88 ms per token, 1.71 tokens per second)
llama_perf_context_print: eval time = 4820.91 ms / 9 runs ( 535.66 ms per token, 1.87 tokens per second)
llama_perf_context_print: total time = 7663.02 ms / 13 tokens
##########
Below are my stats for my custom kernel:
ADD Operation, total tensors: 10 Number of kernel calls: 320 Number of tensors split: 10 Min num of elems 2048 Max num of elems 2048
SUB Operation, total tensors: 0 Number of kernel calls: 0 Number of tensors split: 0 Min num of elems 0 Max num of elems 0
MUL Operation, total tensors: 450 Number of kernel calls: 14400 Number of tensors split: 450 Min num of elems 2048 Max num of elems 8192
DIV Operation, total tensors: 0 Number of kernel calls: 0 Number of tensors split: 0 Min num of elems 0 Max num of elems 0