Support importing GGUF files #1187

Open · richardanaya opened this issue Jan 29, 2024 · 22 comments
Labels: feature (The feature request)

@richardanaya commented Jan 29, 2024

I apologize if this seems too far-fetched, but it seemed in line with how ONNX generation works.

@antimora (Collaborator)

If GGUF contains the model graph information, then we can use the burn-import ONNX facility. In burn-import, we convert the ONNX graph to an IR (intermediate representation) (see this doc). So it would be possible to convert the model graph to IR and generate source code + weights.

If GGUF contains only weights, we can go the burn-import PyTorch route, where we only load the weights.

@antimora (Collaborator) commented Jan 29, 2024

From my brief research, the GGUF format contains metadata + tensor weights. This aligns with the burn-import PyTorch route and not the burn-import ONNX route. This means the model needs to be constructed in Burn first, and then the weights are loaded into it.

Here is one Rust lib to parse GGUF file: https://github.com/Jimexist/gguf
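
For reference, here is a minimal sketch of what the existing PyTorch route looks like, which a GGUF route would presumably mirror: the model is defined in Burn first, and a recorder loads the external weights into it. MyModel / MyModelConfig are hypothetical placeholders for a model already written in Burn, and the recorder calls follow the burn-import PyTorch docs, so treat the exact API as subject to change between Burn versions.

    // Sketch of the existing "weights only" route (PyTorch .pt import) that a
    // GGUF importer would presumably mirror. `MyModel` / `MyModelConfig` are
    // hypothetical placeholders for a model you have already defined in Burn.
    use burn::module::Module;
    use burn::record::{FullPrecisionSettings, Recorder};
    use burn_import::pytorch::{LoadArgs, PyTorchFileRecorder};

    type B = burn::backend::NdArray; // any backend works here

    fn load_pretrained(device: &<B as burn::tensor::backend::Backend>::Device) -> MyModel<B> {
        // 1. The model structure must already exist in Burn.
        let model = MyModelConfig::new().init(device);

        // 2. A recorder reads the external weight file into a Burn record...
        let record = PyTorchFileRecorder::<FullPrecisionSettings>::default()
            .load(LoadArgs::new("weights.pt".into()), device)
            .expect("weights should load");

        // 3. ...and the record is applied to the pre-built model.
        model.load_record(record)
    }

A GGUF route would swap the recorder for something that reads the GGUF metadata + tensor data, but the overall shape would stay the same.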

@antimora (Collaborator)

GGUF spec: ggml-org/ggml#302

@antimora (Collaborator)

Parser in Rust: https://github.com/Jimexist/gguf

@antimora antimora changed the title Support generating burn models from GGUF files? Support importing GGUF files Mar 28, 2024
@antimora antimora added the feature The feature request label Mar 29, 2024
@leflambeur commented Jan 18, 2025

Hi, it has been about a year since this was last updated. Since then, pre-existing models on HF typically come in GGUF for quantised models or Safetensors for non-quantised ones.

I think it would be useful for people new to the space to understand how Burn can be leveraged with these formats, as they seem to be the most common starting points.

Specifically, importing quantised GGUF models, as I couldn't see much in the docs.

Candle is okay for this, but its support is a little spotty for quantised models, which are the more accessible option for people with fewer resources.

I saw in #1323 that some pieces were added for reconstructing config files, but I am wondering about simply ingesting a GGUF model and using it with Burn directly, similar to the import options for ONNX or PyTorch, without people having to reverse engineer what GGUF is doing under the hood with little guidance.

GGUF's single-file format seems like an ideal target for Burn's use case, and the format is much more universally accessible (similar to ONNX, on paper).

I am happy to contribute docs; I just need a bit of direction to start testing with the current capabilities, or an indication that it is even possible.

Edit: ref to the Candle issue I am seeing with Mistral-Nemo quantizations:

huggingface/candle#2727

@antimora (Collaborator)

I'll be happy to assist if you decide to submit a PR.

We can leverage Candle's GGUF reader, similar to the PyTorch pt reader, and use the existing burn-import infrastructure. It should be somewhat easier now that PyTorch pt import works.

@leflambeur commented Jan 20, 2025

I actually made a start last night using candle_core::quantized::gguf_file::Content, as this is decidedly quicker for building out the metadata and also gets you the layer/tensor structures and weight dimensions without loading the whole model. From there, I figured you could infer details like MQA vs GQA vs MHA from attention.head_count and attention.head_count_kv, and then map a consistent set of Burn modules or blocks (I am still figuring out which ones are appropriate) to the layers described in the GGUF spec with the correct weights, all of which are (mostly) consistently named in GGUF, without needing to do too much more.

Example names from the gguf spec that could be mapped:

tok_embd
attn_norm
attn_k
attn_q
attn_v
attn_output
ffn_norm
ffn_gate
ffn_up
ffn_down
output_norm
output

I am very new to Rust, so it's taking me a bit of time to figure out how to transform the format Content creates. Rather than exposing things directly as u32 or String, everything is stored as a variant like U32(VALUE) first, and transforming those and then mapping them to the right places to create Burn modules etc. takes a bit of time and effort.

@leflambeur

When I say "stored", here is an example of the key/value structure it uses:

"llama.attention.head_count_kv": U32(8)
"llama.context_length": U32(1024000)
"llama.attention.key_length": U32(128)
"llama.block_count": U32(40)
"general.size_label": String("12B")
"general.file_type": U32(7)
"general.type": String("model")
"llama.attention.value_length": U32(128)
"llama.attention.layer_norm_rms_epsilon": F32(1e-5)
"general.version": String("2407")
"llama.rope.dimension_count": U32(128)
"llama.vocab_size": U32(131072)
"llama.rope.freq_base": F32(1000000.0)
"llama.attention.head_count": U32(32)
"llama.embedding_length": U32(5120)
"llama.feed_forward_length": U32(14336)
"general.quantization_version": U32(2)

Rather than say:

llama.attention.head_count_kv: 8
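
A tiny sketch of unwrapping those variants with the accessor methods on candle's Value type (to_u32, to_f32, etc.), assuming the same candle_core::quantized::gguf_file reader as above; the exact method names may shift between candle versions:

    // Sketch: converting the gguf metadata `Value` variants (U32(..), F32(..),
    // String(..)) into plain Rust types. Assumes candle_core's gguf_file reader
    // as used above.
    use candle_core::quantized::gguf_file::Content;

    fn read_hparams(path: &str) -> candle_core::Result<()> {
        let mut file = std::fs::File::open(path)?;
        let content = Content::read(&mut file)?;

        // `metadata` is a HashMap<String, Value>; each accessor unwraps one variant.
        let n_heads = content.metadata["llama.attention.head_count"].to_u32()? as usize;
        let n_kv_heads = content.metadata["llama.attention.head_count_kv"].to_u32()? as usize;
        let rope_base = content.metadata["llama.rope.freq_base"].to_f32()?;

        println!("heads: {n_heads}, kv heads: {n_kv_heads}, rope base: {rope_base}");
        Ok(())
    }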

@leflambeur

I actually haven't used Burn at all until now, I only learnt detailed information about transformer architecture after posting my original comment two days ago, and I started with Rust like 3-4 weeks ago, so I will try my best, but I apologise in advance if I can't see it through.

That's partly my motivation for commenting: as someone new to the whole space, GGUF is really all I see, and I would love to make it more accessible to those of us who want to get started. From what I can tell, Burn is well placed for doing that. I also love that you have built-in WGPU support. My ambition in learning Rust for this comes from years ago, when I did a lot of embedded work; I have a load of RPi Picos and various other devices lying around, so I love that you have the demo for them, and I think your approach is fantastic for my goals.

Most of my career until now has been more DevOps oriented, and even then I have been more on the infrastructure and networking side than development, so I am out of my depth but trying.

I can figure out most things on my own, but any general pointers are always welcome; I will try and figure it out.

@leflambeur

I have been making a lot more progress than I anticipated, working this out slowly.

I have one quick question @antimora

In the recent Burn release notes it says there is now mixed-precision support in matmul.

It could still be my naivety, but does that mean matmul now supports int-type tensors?

I am currently looking at this

https://github.com/tracel-ai/burn/blob/main/crates/burn-import/src/burn/node/matmul.rs

@antimora (Collaborator)

> In the recent Burn release notes it says there is now mixed-precision support in matmul. It could still be my naivety, but does that mean matmul now supports int-type tensors? I am currently looking at https://github.com/tracel-ai/burn/blob/main/crates/burn-import/src/burn/node/matmul.rs

Probably something else in Burn's core and not related to ONNX import.

@antimora (Collaborator)

@leflambeur Nice progress. Let us know when you start dealing with the weights.

@leflambeur

I am ignoring the ONNX import for now; right now I am testing against the LlamaConfig in

https://github.com/tracel-ai/models/blob/main/llama-burn/src/llama.rs

e.g. In my code:

	.with_num_attention_heads(
		content.metadata.get("llama.attention.head_count").unwrap().to_u32()?.to_owned() as usize,
	)
	.with_num_key_value_heads(Some(
		content.metadata.get("llama.attention.head_count_kv").unwrap().to_u32()?.to_owned()
			as usize,
	))

where Candle's quantized reader:

	let path = "../../models/<model>.gguf";
	let mut file = std::fs::File::open(path).map_err(E::msg)?;
	let content = Content::read(&mut file).map_err(|e| e.with_path(path))?;

gets me the metadata:

llama.attention.head_count_kv: U32(8)
llama.block_count: U32(40)
llama.attention.head_count: U32(32)

The reason I ask about matmul is that, with quantization support, you could match the quantization scheme ahead of time from the metadata and ensure you have the right type of tensor (float/int). (general.file_type: U32(7) describes the predominant quantization scheme, which I believe maps to Q8_0 here, while general.quantization_version: U32(2) is the GGML quantization format version... I need to double check my maths, but I haven't gotten this far.)

When you load the weights completely, rather than just the metadata as I am doing above (which gets a description of the weights), you get a convenient map of each layer, including the matmuls to use.

I figured that if I had those, I could use the modules described in the burn part of burn-import to map directly without going through llama-burn, but it's baby steps at the moment, so I am starting with llama-burn and seeing how I go.

@leflambeur

Example output from loading the full weights using candle:

ModelWeights { tok_embeddings: Embedding { embeddings: Tensor[dims 131072, 5120; f32, metal:4294968751], hidden_size: 5120 }, layers: [LayerWeights { attention_wq: QMatMul { inner: QTensor(QTensor[[4096, 5120]; Q8_0]), span: Span { none: true } }, attention_wk: QMatMul { inner: QTensor(QTensor[[1024, 5120]; Q8_0]), span: Span { none: true } }, attention_wv: QMatMul { inner: QTensor(QTensor[[1024, 5120]; Q8_0]), span: Span { none: true } }, attention_wo: QMatMul { inner: QTensor(QTensor[[5120, 4096]; Q8_0]), span: Span { none: true } }, 

If you just use the metadata, you get a lot of the tensor/weight layouts without needing the whole thing, but it's not as granular (which may not be an issue with a consistent format):

   },
    "blk.0.attn_v.weight": TensorInfo {
        ggml_dtype: Q8_0,
        shape: [1024, 5120],
        offset: 1710202880,
    },
    "blk.25.ffn_down.weight": TensorInfo {
        ggml_dtype: Q8_0,
        shape: [5120, 14336],
        offset: 6640865280,
    },
    "blk.8.ffn_up.weight": TensorInfo {
        ggml_dtype: Q8_0,
        shape: [14336, 5120],
        offset: 12591042560,
    },

Loading the metadata is way quicker than loading the full weights; you just have to abstract/make assumptions about the weights and their behaviours.

I am not sure whether it would be more future-proof to use the more granular weight loading, or to just use the metadata and make assumptions, as that is quicker.

So for now I am testing that llama-burn works; slowly getting there.
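
For what it's worth, a rough sketch of the two levels of loading discussed above, using the same candle gguf_file reader (the tensor method materializes a single QTensor from its stored offset; the exact API may differ between candle versions):

    // Sketch: metadata-only inspection vs. materializing a single quantized
    // tensor, using candle_core's gguf_file reader as in the earlier snippets.
    use candle_core::quantized::gguf_file::Content;
    use candle_core::Device;

    fn inspect(path: &str) -> candle_core::Result<()> {
        let mut file = std::fs::File::open(path)?;
        let content = Content::read(&mut file)?;

        // (1) Metadata only: dtype/shape/offset for every tensor, no weight data read.
        for (name, info) in &content.tensor_infos {
            println!("{name}: {:?} {:?}", info.ggml_dtype, info.shape);
        }

        // (2) Full weights for one tensor: seeks to the stored offset and reads
        // the quantized block data into a QTensor.
        let device = Device::Cpu;
        let qtensor = content.tensor(&mut file, "blk.0.attn_v.weight", &device)?;
        println!("loaded {:?} as {:?}", qtensor.shape(), qtensor.dtype());
        Ok(())
    }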

@antimora (Collaborator)

CCing @laggui. He worked on the quantization feature of Burn and he's also familiar with burn-import. @laggui, your input is appreciated here.

@laggui (Member) commented Jan 29, 2025

Regarding the mixed precision matmul, that is only for floating-point types.

Quantization is still very much a WIP. I've only added simple per-tensor schemes for affine and symmetric (scale only) quantization for int8, so we can load quantized tensors but the operations all perform dequantize -> float op -> quantize.

I am not entirely familiar with the GGUF format but Q8_0 is int8 symmetric quantization with quantization parameters (scale only) for a block of weights (not per-tensor, so not the same).
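
To make the "per block" part concrete, here is a conceptual sketch of Q8_0 (scale-only, symmetric, one scale per block of 32 weights) based on my reading of the ggml docs; ggml actually stores the scale as f16 and packs the blocks, so this is an illustration rather than a bit-exact decoder:

    // Conceptual sketch of Q8_0: one scale per block of 32 weights (symmetric,
    // scale only), not one scale per tensor. ggml stores the scale as f16; an
    // f32 is used here for simplicity.
    const QK8_0: usize = 32; // block size used by Q8_0

    struct BlockQ80 {
        d: f32,           // per-block scale
        qs: [i8; QK8_0],  // quantized values
    }

    // Symmetric quantization: d = max(|x|) / 127, q_i = round(x_i / d).
    fn quantize_block(x: &[f32; QK8_0]) -> BlockQ80 {
        let amax = x.iter().fold(0f32, |m, v| m.max(v.abs()));
        let d = if amax == 0.0 { 1.0 } else { amax / 127.0 };
        let mut qs = [0i8; QK8_0];
        for (q, v) in qs.iter_mut().zip(x) {
            *q = (v / d).round() as i8;
        }
        BlockQ80 { d, qs }
    }

    // Dequantization is just x_i = d * q_i, block by block.
    fn dequantize_block(b: &BlockQ80) -> [f32; QK8_0] {
        let mut out = [0f32; QK8_0];
        for (o, q) in out.iter_mut().zip(&b.qs) {
            *o = b.d * f32::from(*q);
        }
        out
    }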

@leflambeur

@laggui I am probably wrong but if I am reading correctly:

https://github.com/ggerganov/ggml/blob/master/docs/gguf.md

general.file_type: uint32: An enumerated value describing the type of the majority of the tensors in the file.

I think this means that the majority, but not all, of the tensors are of the type this field describes (e.g. Q4_0).

But then each tensor is expected to have its own marked scheme:

{ inner: QTensor(QTensor[[4096, 5120]; Q8_0])

Or

   "blk.0.attn_v.weight": TensorInfo {
        ggml_dtype: Q8_0,
        shape: [1024, 5120],
        offset: 1710202880,
    },

Like you see above

So loading the general.file_type metadata lets you make an assumption about the majority of tensors; however, the specific scheme for each tensor is described in the TensorInfo part of the metadata, or by loading the full weights (again, I'm not an expert and am making some educated guesses).

I haven't had the chance to look at this since last Thursday, as I have had other things going on, but I am making some progress today. I have been working inside another project, but I will isolate this work from it and try to get it somewhere more public.
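
As a quick check of that assumption, a small sketch that tallies the per-tensor dtypes from the metadata (again assuming candle's gguf_file reader as in the earlier snippets), to compare the "majority" type implied by general.file_type with the per-tensor exceptions:

    // Sketch: counting how many tensors use each ggml dtype, to compare the
    // "majority" type implied by general.file_type with the per-tensor reality.
    use std::collections::HashMap;
    use candle_core::quantized::gguf_file::Content;

    fn dtype_histogram(path: &str) -> candle_core::Result<()> {
        let mut file = std::fs::File::open(path)?;
        let content = Content::read(&mut file)?;

        let mut counts: HashMap<String, usize> = HashMap::new();
        for info in content.tensor_infos.values() {
            *counts.entry(format!("{:?}", info.ggml_dtype)).or_insert(0) += 1;
        }

        // e.g. a "mostly Q8_0" file might still report a handful of F32 norms.
        println!("{counts:?}");
        Ok(())
    }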

@laggui (Member) commented Jan 29, 2025

You can visualize this on the HF hub. It really depends on the model; some have a lot of different precisions mixed in.

For example, this Q4_0 model has some parameters in F16, F32, Q8_0 and Q4_0. I guess the majority of the tensors will have the lowest precision described by the file type, but there could be any number of higher-precision tensors in the file.

@leflambeur

Yeah, I saw it on the HF hub as well.

I think the main thing I want to validate:

Is my approach a sensible one: using the metadata to generate/build a model in Burn (i.e. relying on the standardisation of GGUF) out of standard building blocks, importing the pretrained weights at the same time, without needing to define the whole model in advance?

My idea was that you could use the generic structure of GGUF to infer a consistent skeleton and fill in the details for the specific weights from the model metadata.

The user experience I am aiming for roughly is:

burn-import <gguf-file> <other-options i.e. device/backend etc>

Output:

./burn_model/<files> 

where the model files are built by burn-import with no extra input really needed.

I think this is a sensible-ish approach, and I am working to test the basic idea with the llama-burn model you made, doing the top level without touching individual tensors. I am sure I will probably get some details wrong, as I think I am trying to fit a square peg into a round hole at this early stage of testing: llama-burn seems to be tuned to a few specific models.

The next step would be doing it procedurally from just the model file and TensorInfo, without using llama-burn.

@antimora (Collaborator)

> Is my approach [...] using the metadata to generate/build a model in Burn out of standard building blocks, importing the pretrained weights at the same time, a sensible approach without needing to define the whole model in advance? [...] The next step would be doing it procedurally from just the model file and TensorInfo, without using llama-burn.

GGUF contains hyperparameter information to build a model. However, someone still needs to do the initial work of defining the model structure. You might be able to infer a model structure from the hyperparameter names, but there is no standard, so whatever logic you write will be specific to the exporter.

@leflambeur

Yeah, that makes sense. I will keep testing and feeding back. I think it's a challenge at the moment because it's not easy for people to figure this out without diving really deep, which makes the space less accessible despite the consumer awareness of AI right now.

I think there are some big opportunities here to make things more usable, so it's worth the effort to try to lower the barrier to entry, even if I am stumbling around in the dark a bit.

@antimora (Collaborator) commented Feb 1, 2025

@leflambeur yes, I agree. I suspect it would be a common operation for many.

If you have something working, please share your knowledge (in a comment or in a book section). That's how many people discover what's possible. I personally did not encounter this use case myself, which is why I didn't contribute; mine is currently limited to dealing with PyTorch files (pt). So I made the tools and docs generic and submitted them to Burn's repo, and they became very useful for others.
