Replies: 2 comments
-
You are running out of memory. Try with a lower resolution image.
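For reference, the failed allocation reported in the log below is 245,969,298,432 bytes, roughly 229 GiB, far beyond the 32 GB of system RAM: since the Qwen2-VL vision encoder processes the image at close to its native resolution, the compute buffer grows with the number of image patches, so a full-resolution camera photo can easily exceed available memory. A minimal sketch of one way to work around this, assuming Pillow is installed and using a placeholder output file name (this script is not part of llama.cpp), is to downscale the photo before passing it to llama-qwen2vl-cli:

from PIL import Image   # pip install pillow

src = r"D:\images\GOPR3056.JPG"          # original high-resolution photo
dst = r"D:\images\GOPR3056_small.jpg"    # downscaled copy to pass via --image

img = Image.open(src)
# Shrink in place so the longest side is at most ~1280 px, keeping the aspect ratio.
img.thumbnail((1280, 1280), Image.LANCZOS)
img.save(dst, quality=90)
print("saved", dst, "at", img.size)

Then point --image at the downscaled copy. The 1280 px cap is only a guess; smaller values reduce memory use further.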
-
Hi @slaren, thanks for the info. As a blind person I don't know how big the image should be. In the old days I used LLaVA 1.5/1.6 and it was able to deal with high-resolution images. Which model is best at the moment for image descriptions? It also seems that chatting with the model about the image is no longer possible. Greetings and thanks,
-
Hi all,
Is this a bug or am I doing something wrong?
I use the llama.cpp b4575 AVX2 build on Windows 11 Pro 24H2.
The build comes from the releases page.
My CPU is an Intel i5-12500 and I have 32 GB of RAM.
No GPU, just the Intel UHD Graphics 770 onboard chip, which I guess is of no use with llama.cpp.
I used this model:
https://huggingface.co/bartowski/Qwen2-VL-7B-Instruct-GGUF
Other text-generating models work.
I thought I might try this one for image description purposes.
Output of llama-qwen2vl-cli is:
$ "c:\ai\bin\llama-qwen2vl-cli.exe" -fa --mlock -m "c:\ai\models\Qwen2-VL-7B-Instruct-Q4_K_M.gguf" --mmproj "c:\ai\models\mmproj-Qwen2-VL-7B-Instruct-f16.gguf" --image "D:\images\GOPR3056.JPG" --temp 0.1 -p "describe the image in detail."
build: 4575 (cae9fb4) with MSVC 19.42.34436.0 for x64
llama_model_loader: loaded meta data with 37 key-value pairs and 339 tensors from c:\ai\models\Qwen2-VL-7B-Instruct-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2vl
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Qwen2 VL 7B Instruct
llama_model_loader: - kv 3: general.finetune str = Instruct
llama_model_loader: - kv 4: general.basename str = Qwen2-VL
llama_model_loader: - kv 5: general.size_label str = 7B
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: general.base_model.count u32 = 1
llama_model_loader: - kv 8: general.base_model.0.name str = Qwen2 VL 7B
llama_model_loader: - kv 9: general.base_model.0.organization str = Qwen
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen2-VL-7B
llama_model_loader: - kv 11: general.tags arr[str,2] = ["multimodal", "image-text-to-text"]
llama_model_loader: - kv 12: general.languages arr[str,1] = ["en"]
llama_model_loader: - kv 13: qwen2vl.block_count u32 = 28
llama_model_loader: - kv 14: qwen2vl.context_length u32 = 32768
llama_model_loader: - kv 15: qwen2vl.embedding_length u32 = 3584
llama_model_loader: - kv 16: qwen2vl.feed_forward_length u32 = 18944
llama_model_loader: - kv 17: qwen2vl.attention.head_count u32 = 28
llama_model_loader: - kv 18: qwen2vl.attention.head_count_kv u32 = 4
llama_model_loader: - kv 19: qwen2vl.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 20: qwen2vl.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 21: general.file_type u32 = 15
llama_model_loader: - kv 22: qwen2vl.rope.dimension_sections arr[i32,4] = [16, 24, 24, 0]
llama_model_loader: - kv 23: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 24: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 25: tokenizer.ggml.tokens arr[str,152064] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 26: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 27: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 30: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 31: tokenizer.chat_template str = {% set image_count = namespace(value=...
llama_model_loader: - kv 32: general.quantization_version u32 = 2
llama_model_loader: - kv 33: quantize.imatrix.file str = /models_out/Qwen2-VL-7B-Instruct-GGUF...
llama_model_loader: - kv 34: quantize.imatrix.dataset str = /training_dir/calibration_datav3.txt
llama_model_loader: - kv 35: quantize.imatrix.entries_count i32 = 196
llama_model_loader: - kv 36: quantize.imatrix.chunks_count i32 = 128
llama_model_loader: - type f32: 141 tensors
llama_model_loader: - type q4_K: 169 tensors
llama_model_loader: - type q6_K: 29 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 4.36 GiB (4.91 BPW)
load: special tokens cache size = 14
load: token to piece cache size = 0.9309 MB
print_info: arch = qwen2vl
print_info: vocab_only = 0
print_info: n_ctx_train = 32768
print_info: n_embd = 3584
print_info: n_layer = 28
print_info: n_head = 28
print_info: n_head_kv = 4
print_info: n_rot = 128
print_info: n_swa = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 7
print_info: n_embd_k_gqa = 512
print_info: n_embd_v_gqa = 512
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: n_ff = 18944
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 8
print_info: rope scaling = linear
print_info: freq_base_train = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 32768
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 7B
print_info: model params = 7.62 B
print_info: general.name = Qwen2 VL 7B Instruct
print_info: vocab type = BPE
print_info: n_vocab = 152064
print_info: n_merges = 151387
print_info: BOS token = 151643 '<|endoftext|>'
print_info: EOS token = 151645 '<|im_end|>'
print_info: EOT token = 151645 '<|im_end|>'
print_info: PAD token = 151643 '<|endoftext|>'
print_info: LF token = 148848 'ÄĬ'
print_info: EOG token = 151643 '<|endoftext|>'
print_info: EOG token = 151645 '<|im_end|>'
print_info: max token length = 256
load_tensors: offloading 0 repeating layers to GPU
load_tensors: offloaded 0/29 layers to GPU
load_tensors: CPU_Mapped model buffer size = 4460.45 MiB
clip_model_load: model name: Qwen2-VL-7B-Instruct
clip_model_load: description: image encoder for Qwen2VL
clip_model_load: GGUF version: 3
clip_model_load: alignment: 32
clip_model_load: n_tensors: 521
clip_model_load: n_kv: 20
clip_model_load: ftype: f16
clip_model_load: loaded meta data with 20 key-value pairs and 521 tensors from c:\ai\models\mmproj-Qwen2-VL-7B-Instruct-f16.gguf
clip_model_load: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
clip_model_load: - kv 0: general.architecture str = clip
clip_model_load: - kv 1: general.description str = image encoder for Qwen2VL
clip_model_load: - kv 2: general.file_type u32 = 1
clip_model_load: - kv 3: clip.has_text_encoder bool = false
clip_model_load: - kv 4: clip.has_vision_encoder bool = true
clip_model_load: - kv 5: clip.has_qwen2vl_merger bool = true
clip_model_load: - kv 6: clip.projector_type str = qwen2vl_merger
clip_model_load: - kv 7: clip.use_silu bool = false
clip_model_load: - kv 8: clip.use_gelu bool = false
clip_model_load: - kv 9: clip.vision.patch_size u32 = 14
clip_model_load: - kv 10: clip.vision.image_size u32 = 560
clip_model_load: - kv 11: clip.vision.embedding_length u32 = 1280
clip_model_load: - kv 12: clip.vision.projection_dim u32 = 3584
clip_model_load: - kv 13: clip.vision.attention.head_count u32 = 16
clip_model_load: - kv 14: clip.vision.attention.layer_norm_epsilon f32 = 0.000001
clip_model_load: - kv 15: clip.vision.block_count u32 = 32
clip_model_load: - kv 16: clip.vision.feed_forward_length u32 = 0
clip_model_load: - kv 17: general.name str = Qwen2-VL-7B-Instruct
clip_model_load: - kv 18: clip.vision.image_mean arr[f32,3] = [0.481455, 0.457828, 0.408211]
clip_model_load: - kv 19: clip.vision.image_std arr[f32,3] = [0.268630, 0.261303, 0.275777]
clip_model_load: - type f32: 325 tensors
clip_model_load: - type f16: 196 tensors
clip_model_load: CLIP using CPU backend
clip_model_load: text_encoder: 0
clip_model_load: vision_encoder: 1
clip_model_load: llava_projector: 0
clip_model_load: minicpmv_projector: 0
clip_model_load: model size: 1289.95 MB
clip_model_load: metadata size: 0.18 MB
clip_model_load: params backend buffer size = 1289.95 MB (521 tensors)
key clip.vision.image_grid_pinpoints not found in file
key clip.vision.mm_patch_merge_type not found in file
key clip.vision.image_crop_resolution not found in file
clip_model_load: compute allocated memory: 198.93 MB
llama_init_from_model: n_seq_max = 1
llama_init_from_model: n_ctx = 4096
llama_init_from_model: n_ctx_per_seq = 4096
llama_init_from_model: n_batch = 2048
llama_init_from_model: n_ubatch = 512
llama_init_from_model: flash_attn = 1
llama_init_from_model: freq_base = 1000000.0
llama_init_from_model: freq_scale = 1
llama_init_from_model: n_ctx_per_seq (4096) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 4096, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 28, can_shift = 1
llama_kv_cache_init: CPU KV buffer size = 224.00 MiB
llama_init_from_model: KV self size = 224.00 MiB, K (f16): 112.00 MiB, V (f16): 112.00 MiB
llama_init_from_model: CPU output buffer size = 0.58 MiB
llama_init_from_model: CPU compute buffer size = 304.00 MiB
llama_init_from_model: graph nodes = 875
llama_init_from_model: graph splits = 1
ggml_backend_cpu_buffer_type_alloc_buffer: failed to allocate buffer of size 245969298432
ggml_gallocr_reserve_n: failed to allocate CPU buffer of size 245969298432
D:\a\llama.cpp\llama.cpp\ggml\src\ggml-backend.cpp:262: GGML_ASSERT(buf != NULL && "tensor buffer not set") failed
Maybe someone could help me understand what is happening here?
There seems to be plenty of RAM available.
Greetings and thanks,
Simon