High PPL with Quarot + GPTQ Method #333

Open

Kexin2000 opened this issue Mar 7, 2025 · 2 comments

I tested the Quarot + GPTQ method with W4A4 quantization.

For LLaMA 2-7B:
Quarot only: PPL = 48
Quarot + GPTQ: PPL = 9.8
However, Table 12 reports a PPL of 6.22 for Quarot + GPTQ (W4A4).

For LLaMA 3.1-8B:
Quarot only: PPL = 139
Quarot + GPTQ: PPL = 27.4
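
(For reference, the WikiText-2 perplexity here is the exponential of the mean per-token negative log-likelihood over fixed 2048-token windows, matching `seq_len: 2048` in the eval configs below. A minimal sketch of that evaluation follows; `wikitext2_ppl` is a hypothetical helper written for illustration and may differ in detail from llmc's evaluator.)

```python
import torch

@torch.no_grad()
def wikitext2_ppl(model, tokenizer, text, seq_len=2048, device="cuda"):
    # Tokenize the concatenated test split and slice it into fixed-size windows.
    ids = tokenizer(text, return_tensors="pt").input_ids.to(device)
    n_windows = ids.shape[1] // seq_len
    nlls = []
    for i in range(n_windows):
        chunk = ids[:, i * seq_len:(i + 1) * seq_len]
        # HF causal LMs shift labels internally and return the mean per-token loss.
        loss = model(chunk, labels=chunk).loss
        nlls.append(loss.float() * seq_len)
    # Perplexity = exp(average negative log-likelihood per token).
    return torch.exp(torch.stack(nlls).sum() / (n_windows * seq_len)).item()
```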

Below are the two config files:

  1. step_1_quarot.yml
{
    "base": {
        "seed": 0
    },
    "model": {
        "type": "Llama",
        "path": "/workspace/models/Llama-3.1-8b",
        "tokenizer_mode": "slow",
        "torch_dtype": "auto"
    },
    "eval": {
        "eval_pos": [
            "fake_quant"
        ],
        "name": "wikitext2",
        "download": true,
        "path": "/workspace/llmc/Datasets/wikitext2",
        "seq_len": 2048,
        "bs": 1,
        "inference_per_block": false
    },
    "quant": {
        "method": "Quarot",
        "weight": {
            "bit": 4,
            "symmetric": false,
            "granularity": "per_channel",
            "group_size": -1,
            "calib_algo": "minmax"
        },
        "act": {
            "bit": 4,
            "symmetric": false,
            "granularity": "per_token"
        },
        "special": {
            "rotate_mode": "hadamard",
            "fp32_had": true,
            "online_rotate": false
        }
    },
    "save": {
        "save_trans": true,
        "save_fake": false,
        "save_path": "/workspace/save_models/quarot_trans_for_gptq/llama-3.1-8b"
    }
}
  2. step_2_gptq.yml
{
    "base": {
        "seed": 0
    },
    "model": {
        "type": "Llama",
        "path": "/workspace/save_models/quarot_trans_for_gptq/llama-3.1-8b/transformed_model",
        "torch_dtype": "auto",
        "tokenizer_mode": "slow"
    },
    "calib": {
        "name": "wikitext2",
        "download": true,
        "path": "/workspace/llmc/Datasets/wikitext2",
        "n_samples": 128,
        "bs": 1,
        "seq_len": 2048,
        "preproc": "wikitext2_gptq",
        "seed": 0
    },
    "eval": {
        "eval_pos": [
            "fake_quant"
        ],
        "name": "wikitext2",
        "download": true,
        "path": "/workspace/llmc/Datasets/wikitext2",
        "seq_len": 2048,
        "bs": 1,
        "inference_per_block": false
    },
    "quant": {
        "method": "GPTQ",
        "weight": {
            "bit": 4,
            "symmetric": false,
            "granularity": "per_channel",
            "group_size": -1,
            "calib_algo": "mse"
        },
        "act": {
            "bit": 4,
            "symmetric": false,
            "granularity": "per_token",
            "calib_algo": "minmax"
        },
        "special": {
            "actorder": true,
            "static_groups": true,
            "percdamp": 0.01,
            "blocksize": 128,
            "true_sequential": true,
            "online_rotate": false,
            "fp32_had": true
        },
        "quant_out": true
    },
    "save": {
        "save_trans": false,
        "save_fake": false,
        "save_path": "/workspace/save_models/save_after_gptq/llama-3.1-8b"
    }
}
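
For context on the W4A4 settings above: `act: bit 4, symmetric False, granularity per_token` means each token's activation vector gets its own scale and zero-point computed from its min/max. A generic fake-quantization sketch of that scheme (an illustration only, not llmc's code):

```python
import torch

def fake_quant_per_token(x: torch.Tensor, n_bits: int = 4) -> torch.Tensor:
    """Asymmetric per-token fake quantization: each last-dim vector gets
    its own scale/zero-point derived from its min and max."""
    qmax = 2 ** n_bits - 1                      # 4-bit asymmetric range: 0..15
    lo = x.amin(dim=-1, keepdim=True)
    hi = x.amax(dim=-1, keepdim=True)
    scale = (hi - lo).clamp(min=1e-8) / qmax
    zero_point = (-lo / scale).round()
    q = (x / scale + zero_point).round().clamp(0, qmax)
    return (q - zero_point) * scale             # dequantize back to float
```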

The PPL results I obtained are significantly higher than expected. Is there any known issue with Quarot + GPTQ on LLaMA 3 models, or could I be missing some optimization steps?

Any insights or suggestions would be greatly appreciated!

@Harahan (Collaborator) commented Mar 9, 2025

You should enable online_rotate. If there is still a gap, the fastest way to reproduce the reported results is to use the version of the code from before August, since we have added many new features since then, which may have some impact.
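
For readers hitting the same issue: `online_rotate` refers to applying a Hadamard rotation to certain activations at inference time (e.g. the down_proj input), where the rotation cannot be folded into a preceding weight; `fp32_had` keeps that transform in fp32. The sketch below illustrates the general QuaRot-style idea, not llmc's exact implementation, and assumes the relevant dimension is a power of two:

```python
import torch
from scipy.linalg import hadamard  # assumes the dimension is a power of two

def rotate_linear_offline(layer: torch.nn.Linear) -> None:
    """Fold an orthogonal Hadamard matrix H into the weight: since H @ H.T = I,
    (x @ H) @ (W @ H).T == x @ W.T, so the layer's output is unchanged while
    the rotated activations x @ H have far fewer outliers and quantize better."""
    d = layer.in_features
    H = (torch.tensor(hadamard(d), dtype=torch.float64) / d ** 0.5).to(layer.weight.device)
    layer.weight.data = (layer.weight.data.double() @ H).to(layer.weight.dtype)

def online_hadamard(x: torch.Tensor, fp32_had: bool = True) -> torch.Tensor:
    """Online rotation for activations whose rotation cannot be absorbed into a
    preceding weight; fp32_had performs the transform in fp32."""
    d = x.shape[-1]
    H = torch.tensor(hadamard(d)) / d ** 0.5
    if fp32_had:
        return (x.float() @ H.to(x.device, torch.float32)).to(x.dtype)
    return x @ H.to(x.device, x.dtype)
```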

@gushiqiao (Contributor) commented Mar 10, 2025

You can use the latest code and follow these configurations for optimal results:
QUAROT:

quant:
    method: Quarot
    weight:
        bit: 4
        symmetric: False
        granularity: per_channel
        group_size: -1
        calib_algo: minmax
    act:
        bit: 4
        symmetric: False
        granularity: per_token
    special:
        rotate_mode: hadamard
        fp32_had: True
        online_rotate: True

GPTQ:

quant:
    method: GPTQ
    weight:
        bit: 4
        symmetric: False
        granularity: per_channel
        group_size: -1
        calib_algo: mse
        mse_b_num: 4
    act:
        bit: 4
        symmetric: False
        granularity: per_token
        calib_algo: minmax
    special:
        actorder: True
        static_groups: True
        percdamp: 0.01
        blocksize: 128
        true_sequential: True
        online_rotate: True
        fp32_had: True
    quant_out: True
With this setup, you should achieve the best results. The evaluation on the Wikitext2 dataset gives a perplexity (ppl) of 6.037587642669678.
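
As a side note on `calib_algo: mse` in the GPTQ weight section: an MSE-based calibration typically grid-searches a clipping ratio per weight row and keeps the ratio that minimizes the squared reconstruction error (`mse_b_num` presumably controls how that search is batched over blocks). The sketch below is a generic illustration of the idea, not llmc's routine:

```python
import torch

def mse_clip(w: torch.Tensor, n_bits: int = 4, steps: int = 100) -> torch.Tensor:
    """Per-row clipping search: shrink each row's min/max by a grid of ratios,
    fake-quantize, and keep the ratio with the lowest reconstruction MSE."""
    qmax = 2 ** n_bits - 1
    best_err = torch.full_like(w[:, :1], float("inf"))
    best_w = w.clone()
    for i in range(1, steps + 1):
        r = i / steps
        lo = w.amin(dim=1, keepdim=True) * r
        hi = w.amax(dim=1, keepdim=True) * r
        scale = (hi - lo).clamp(min=1e-8) / qmax
        zp = (-lo / scale).round()
        # Values outside [lo, hi] are clipped by the clamp to the 4-bit grid.
        q = ((w / scale + zp).round().clamp(0, qmax) - zp) * scale
        err = (q - w).pow(2).mean(dim=1, keepdim=True)
        improved = err < best_err
        best_err = torch.where(improved, err, best_err)
        best_w = torch.where(improved, q, best_w)
    return best_w
```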
