High PPL with Quarot + GPTQ Method #333

Open

Kexin2000 opened this issue Mar 7, 2025 · 2 comments

I tested the Quarot + GPTQ method with W4A4 quantization.

For LLaMA 2-7B:
Quarot only: PPL = 48
Quarot + GPTQ: PPL = 9.8
However, Table 12 reports a PPL of 6.22 for Quarot + GPTQ (W4A4).

For LLaMA 3.1-8B:
Quarot only: PPL = 139
Quarot + GPTQ: PPL = 27.4
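
(For reference, the WikiText-2 perplexity here is the exponential of the mean per-token negative log-likelihood over fixed 2048-token windows, matching `seq_len: 2048` in the eval configs below. A minimal sketch of that evaluation follows; `wikitext2_ppl` is a hypothetical helper written for illustration and may differ in detail from llmc's evaluator.)

```python
import torch

@torch.no_grad()
def wikitext2_ppl(model, tokenizer, text, seq_len=2048, device="cuda"):
    # Tokenize the concatenated test split and slice it into fixed-size windows.
    ids = tokenizer(text, return_tensors="pt").input_ids.to(device)
    n_windows = ids.shape[1] // seq_len
    nlls = []
    for i in range(n_windows):
        chunk = ids[:, i * seq_len:(i + 1) * seq_len]
        # HF causal LMs shift labels internally and return the mean per-token loss.
        loss = model(chunk, labels=chunk).loss
        nlls.append(loss.float() * seq_len)
    # Perplexity = exp(average negative log-likelihood per token).
    return torch.exp(torch.stack(nlls).sum() / (n_windows * seq_len)).item()
```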

Below are the two config files:

  1. step_1_quarot.yml
{
    "base": {
        "seed": 0
    },
    "model": {
        "type": "Llama",
        "path": "/workspace/models/Llama-3.1-8b",
        "tokenizer_mode": "slow",
        "torch_dtype": "auto"
    },
    "eval": {
        "eval_pos": [
            "fake_quant"
        ],
        "name": "wikitext2",
        "download": true,
        "path": "/workspace/llmc/Datasets/wikitext2",
        "seq_len": 2048,
        "bs": 1,
        "inference_per_block": false
    },
    "quant": {
        "method": "Quarot",
        "weight": {
            "bit": 4,
            "symmetric": false,
            "granularity": "per_channel",
            "group_size": -1,
            "calib_algo": "minmax"
        },
        "act": {
            "bit": 4,
            "symmetric": false,
            "granularity": "per_token"
        },
        "special": {
            "rotate_mode": "hadamard",
            "fp32_had": true,
            "online_rotate": false
        }
    },
    "save": {
        "save_trans": true,
        "save_fake": false,
        "save_path": "/workspace/save_models/quarot_trans_for_gptq/llama-3.1-8b"
    }
}
  2. step_2_gptq.yml
{
    "base": {
        "seed": 0
    },
    "model": {
        "type": "Llama",
        "path": "/workspace/save_models/quarot_trans_for_gptq/llama-3.1-8b/transformed_model",
        "torch_dtype": "auto",
        "tokenizer_mode": "slow"
    },
    "calib": {
        "name": "wikitext2",
        "download": true,
        "path": "/workspace/llmc/Datasets/wikitext2",
        "n_samples": 128,
        "bs": 1,
        "seq_len": 2048,
        "preproc": "wikitext2_gptq",
        "seed": 0
    },
    "eval": {
        "eval_pos": [
            "fake_quant"
        ],
        "name": "wikitext2",
        "download": true,
        "path": "/workspace/llmc/Datasets/wikitext2",
        "seq_len": 2048,
        "bs": 1,
        "inference_per_block": false
    },
    "quant": {
        "method": "GPTQ",
        "weight": {
            "bit": 4,
            "symmetric": false,
            "granularity": "per_channel",
            "group_size": -1,
            "calib_algo": "mse"
        },
        "act": {
            "bit": 4,
            "symmetric": false,
            "granularity": "per_token",
            "calib_algo": "minmax"
        },
        "special": {
            "actorder": true,
            "static_groups": true,
            "percdamp": 0.01,
            "blocksize": 128,
            "true_sequential": true,
            "online_rotate": false,
            "fp32_had": true
        },
        "quant_out": true
    },
    "save": {
        "save_trans": false,
        "save_fake": false,
        "save_path": "/workspace/save_models/save_after_gptq/llama-3.1-8b"
    }
}
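
For context on the W4A4 settings above: `act: bit 4, symmetric False, granularity per_token` means each token's activation vector gets its own scale and zero-point computed from its min/max. A generic fake-quantization sketch of that scheme (an illustration only, not llmc's code):

```python
import torch

def fake_quant_per_token(x: torch.Tensor, n_bits: int = 4) -> torch.Tensor:
    """Asymmetric per-token fake quantization: each last-dim vector gets
    its own scale/zero-point derived from its min and max."""
    qmax = 2 ** n_bits - 1                      # 4-bit asymmetric range: 0..15
    lo = x.amin(dim=-1, keepdim=True)
    hi = x.amax(dim=-1, keepdim=True)
    scale = (hi - lo).clamp(min=1e-8) / qmax
    zero_point = (-lo / scale).round()
    q = (x / scale + zero_point).round().clamp(0, qmax)
    return (q - zero_point) * scale             # dequantize back to float
```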

The PPL results I obtained are significantly higher than expected. Is there any known issue with Quarot + GPTQ on LLaMA 3 models, or could I be missing some optimization steps?

Any insights or suggestions would be greatly appreciated!

@Harahan (Collaborator) commented Mar 9, 2025

You should enable online_rotate. If there is still a gap, the fastest way to reproduce the reported results is to use the version of the code from before August, since we have added many new features since then, which may have some impact.
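
For readers hitting the same issue: `online_rotate` refers to applying a Hadamard rotation to certain activations at inference time (e.g. the down_proj input), where the rotation cannot be folded into a preceding weight; `fp32_had` keeps that transform in fp32. The sketch below illustrates the general QuaRot-style idea, not llmc's exact implementation, and assumes the relevant dimension is a power of two:

```python
import torch
from scipy.linalg import hadamard  # assumes the dimension is a power of two

def rotate_linear_offline(layer: torch.nn.Linear) -> None:
    """Fold an orthogonal Hadamard matrix H into the weight: since H @ H.T = I,
    (x @ H) @ (W @ H).T == x @ W.T, so the layer's output is unchanged while
    the rotated activations x @ H have far fewer outliers and quantize better."""
    d = layer.in_features
    H = (torch.tensor(hadamard(d), dtype=torch.float64) / d ** 0.5).to(layer.weight.device)
    layer.weight.data = (layer.weight.data.double() @ H).to(layer.weight.dtype)

def online_hadamard(x: torch.Tensor, fp32_had: bool = True) -> torch.Tensor:
    """Online rotation for activations whose rotation cannot be absorbed into a
    preceding weight; fp32_had performs the transform in fp32."""
    d = x.shape[-1]
    H = torch.tensor(hadamard(d)) / d ** 0.5
    if fp32_had:
        return (x.float() @ H.to(x.device, torch.float32)).to(x.dtype)
    return x @ H.to(x.device, x.dtype)
```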

@gushiqiao (Contributor) commented Mar 10, 2025

You can use the latest code and follow these configurations for optimal results:
QUAROT:

quant:
    method: Quarot
    weight:
        bit: 4
        symmetric: False
        granularity: per_channel
        group_size: -1
        calib_algo: minmax
    act:
        bit: 4
        symmetric: False
        granularity: per_token
    special:
        rotate_mode: hadamard
        fp32_had: True
        online_rotate: True

GPTQ:

quant:
    method: GPTQ
    weight:
        bit: 4
        symmetric: False
        granularity: per_channel
        group_size: -1
        calib_algo: mse
        mse_b_num: 4
    act:
        bit: 4
        symmetric: False
        granularity: per_token
        calib_algo: minmax
    special:
        actorder: True
        static_groups: True
        percdamp: 0.01
        blocksize: 128
        true_sequential: True
        online_rotate: True
        fp32_had: True
    quant_out: True
With this setup, you should achieve the best results. The evaluation on the Wikitext2 dataset gives a perplexity (ppl) of 6.037587642669678.
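
As a side note on `calib_algo: mse` in the GPTQ weight section: an MSE-based calibration typically grid-searches a clipping ratio per weight row and keeps the ratio that minimizes the squared reconstruction error (`mse_b_num` presumably controls how that search is batched over blocks). The sketch below is a generic illustration of the idea, not llmc's routine:

```python
import torch

def mse_clip(w: torch.Tensor, n_bits: int = 4, steps: int = 100) -> torch.Tensor:
    """Per-row clipping search: shrink each row's min/max by a grid of ratios,
    fake-quantize, and keep the ratio with the lowest reconstruction MSE."""
    qmax = 2 ** n_bits - 1
    best_err = torch.full_like(w[:, :1], float("inf"))
    best_w = w.clone()
    for i in range(1, steps + 1):
        r = i / steps
        lo = w.amin(dim=1, keepdim=True) * r
        hi = w.amax(dim=1, keepdim=True) * r
        scale = (hi - lo).clamp(min=1e-8) / qmax
        zp = (-lo / scale).round()
        # Values outside [lo, hi] are clipped by the clamp to the 4-bit grid.
        q = ((w / scale + zp).round().clamp(0, qmax) - zp) * scale
        err = (q - w).pow(2).mean(dim=1, keepdim=True)
        improved = err < best_err
        best_err = torch.where(improved, err, best_err)
        best_w = torch.where(improved, q, best_w)
    return best_w
```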
