LLMTabBench is a benchmarking framework for evaluating Large Language Models (LLMs) on binary classification of tabular data in a few-shot setting. The benchmark measures how well LLMs classify samples from tabular datasets when given only a small number of labeled examples (shots) as in-context demonstrations, or none at all in the zero-shot setting.
The framework supports:
- Local LLM inference via vLLM (Qwen, Gemma, DeepSeek model families)
- API-based LLM inference via OpenRouter (GPT)
- Classical ML baselines (Logistic Regression, KNN, Random Forest, Gradient Boosting) with Bayesian hyperparameter optimization
- Multiple data serialization formats (Markdown, HTML, JSON, CSV, LaTeX, natural language, and more)
- OpenML datasets and custom local datasets (.csv)
- Automated metric calculation (ROC-AUC, F1) with aggregation across multiple random seeds
For detailed results, see `detailed_results.md`.
| Dataset Type | Qwen3-1.7B | Qwen3-8B | Qwen3-14B | GPT-4o-mini | TabPFN (16-shot) | TabICL (16-shot) |
|---|---|---|---|---|---|---|
| Real Datasets | 0.640 ± 0.004 | 0.736 ± 0.000 | 0.733 ± 0.000 | 0.729 ± 0.007 | 0.767 ± 0.052 | 0.738 ± 0.048 |
| New Datasets | 0.587 ± 0.003 | 0.657 ± 0.001 | 0.657 ± 0.000 | 0.684 ± 0.006 | 0.699 ± 0.065 | 0.660 ± 0.050 |
| MLP-Synthetic | 0.544 ± 0.004 | 0.545 ± 0.018 | 0.554 ± 0.029 | 0.568 ± 0.033 | 0.651 ± 0.051 | 0.626 ± 0.048 |
| LLM-Synthetic | 0.569 ± 0.006 | 0.669 ± 0.000 | 0.657 ± 0.000 | 0.738 ± 0.005 | 0.762 ± 0.051 | 0.698 ± 0.051 |
- Python ≥ 3.12
- `uv` package manager (installed automatically by `make` if missing)
- NVIDIA GPU with CUDA support (for local LLM inference)
```bash
# 1. Clone the repository
git clone git@github.com:sb-ai-lab/llm4tab.git
cd llm4tab

# 2. Set up the environment (creates venv + installs all dependencies)
make setup

# 3. Create .env file and add your API keys (if using API models)
cp .env_example .env
# Edit .env and add your OPENROUTER_KEY if needed

# 4. Run the benchmark
make run
```

This will execute `main.py` using the settings defined in `config.yaml`.
All experiment settings are controlled via a single config.yaml file. Below is a section-by-section reference.
```yaml
data:
  DF_TYPE: 'openml'
  DF_FORMAT: 'csv'
  DATASET_NAME: 'airbnb'
  DF_TRANSFORMATION_REGIME: 'no'
  LOCAL_DATASET_PATH: 'datasets/'
  RESULT_PATH: 'results/experiments_result/'
  PROBS_PATH: 'results/llm_probs/'
```

| Parameter | Description |
|---|---|
| `DF_TYPE` | `'openml'` — fetch datasets from OpenML; `'custom'` — load CSVs from the `datasets/` folder |
| `DF_FORMAT` | File format for custom datasets (currently only `'csv'` is supported) |
| `DATASET_NAME` | Name of the dataset (auto-set during iteration) |
| `DF_TRANSFORMATION_REGIME` | Optional preprocessing: `'no'` (none), `'scaling'` (MinMax), `'words'` (numbers → words), `'shuffling'` (shuffle columns), `'change_units'` (dataset-specific unit conversion) |
| `LOCAL_DATASET_PATH` | Directory where custom CSV files are stored |
| `RESULT_PATH` | Directory to save experiment result CSVs |
| `PROBS_PATH` | Directory to save prediction probability pickle files |
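As an illustration of what two of the transformation regimes do to feature values, here is a minimal sketch of `'scaling'` (MinMax) and `'shuffling'` (column reordering). This is illustrative only, not the benchmark's actual implementation, and assumes a simple list-based representation of a column:

```python
import random

def minmax_scale(values):
    """Rescale a numeric column to [0, 1], as the 'scaling' regime (MinMax) would."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def shuffle_columns(columns, seed):
    """Return feature columns in a random order, as the 'shuffling' regime would."""
    rng = random.Random(seed)
    shuffled = list(columns)
    rng.shuffle(shuffled)
    return shuffled

print(minmax_scale([18, 39, 60]))  # [0.0, 0.5, 1.0]
print(shuffle_columns(['age', 'education', 'income'], seed=42))
```

Because the LLM sees feature values as text, such transformations probe how sensitive the model is to surface form rather than to the underlying signal.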
Set `DF_TYPE: 'openml'` in the `data` section and configure the `openml` section:
```yaml
data:
  DF_TYPE: 'openml'

openml:
  type: 'dataset'
  df_openml: ['compas', 'vote', 'kc1', 'irish']
  dataset: {
    telco: 42178,
    pc4: 1049,
    irish: 451,
    compas: 42193,
    vote: 56,
    cancer: 15,
    steel: 1504,
    kc1: 1067
  }
  task: {
    credit: 363626,
    transfusion: 363621,
    fitness: 363671,
    diabetes: 363629,
    biodegr: 363696,
    marketing: 363684
  }
```

| Parameter | Description |
|---|---|
| `type` | `'dataset'` — use OpenML dataset IDs; `'task'` — use OpenML task IDs (with predefined train/test splits) |
| `df_openml` | List of dataset names to run experiments on. Each name must exist as a key in the `dataset` or `task` dictionary |
| `dataset` | Mapping of dataset names → OpenML dataset IDs |
| `task` | Mapping of dataset names → OpenML task IDs |
Available OpenML datasets: `telco`, `pc4`, `tae`, `irish`, `compas`, `vote`, `cancer`, `steel`, `kc1`, `credit`, `transfusion`, `fitness`, `diabetes`, `biodegr`, `marketing`.
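The constraint that every name in `df_openml` must exist as a key in the ID mapping can be sketched with a small, hypothetical helper (not part of the codebase):

```python
# Hypothetical helper: resolve dataset names from df_openml to OpenML IDs,
# mirroring the config rule that each name must exist in the mapping.
dataset_ids = {'compas': 42193, 'vote': 56, 'kc1': 1067, 'irish': 451}

def resolve_ids(df_openml, mapping):
    missing = [name for name in df_openml if name not in mapping]
    if missing:
        raise KeyError(f"Names not found in mapping: {missing}")
    return {name: mapping[name] for name in df_openml}

print(resolve_ids(['compas', 'vote'], dataset_ids))  # {'compas': 42193, 'vote': 56}
```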
Set `DF_TYPE: 'custom'` in the `data` section:

```yaml
data:
  DF_TYPE: 'custom'
  DF_FORMAT: 'csv'
  LOCAL_DATASET_PATH: 'datasets/'
```

Custom dataset requirements:

- Place your CSV files in the `datasets/` directory
- Each CSV must have a target column named `label` (or `target` — it will be auto-renamed to `label`)
- The target column must be binary (values: `0` and `1`)
- For datasets not listed in LLMTabBench, define a task description in `source/prompts.py` in the `prompt_by_df` dictionary:

```python
prompt_by_df = {
    # ... existing entries ...
    'my_dataset': """For the given input features ... your task description ...
Your output should be just a single number: 0 or 1. Do not predict any other tokens, only 0 or 1.""",
}
```

When `DF_TYPE` is `'custom'`, the code automatically discovers all files in `datasets/` and iterates over them.
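For example, a compliant custom CSV with a binary `label` column can be produced with the standard library. This is illustrative only; the temporary directory stands in for `datasets/`:

```python
import csv
import os
import tempfile

# Illustrative rows with a binary 'label' column, as the benchmark expects.
rows = [
    {'age': 39, 'education': 'Bachelor', 'label': 0},
    {'age': 52, 'education': 'Masters', 'label': 1},
]

out_dir = tempfile.mkdtemp()  # in practice: 'datasets/'
path = os.path.join(out_dir, 'my_dataset.csv')
with open(path, 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['age', 'education', 'label'])
    writer.writeheader()
    writer.writerows(rows)

with open(path) as f:
    header = f.readline().strip()
print(header)  # age,education,label
```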
```yaml
experiment:
  baseline: False
  local_llm: True
  regime: 'local_gen'
  random_states_list: [864, 460, 142, 629, 761]
  serialization_list: ['feat_val', 'feat_val_masked', 'html', 'markdown', 'markdown_masked']
  shot_list: [4, 8, 16, 32, 64]
  RATIO: '50/50'
  N_SHOTS: 0
  SAMPLING_REGIME: 'default'
  RANDOM_STATE: 42
  SCHEMA: '{"class": "0 or 1"}'
  THINKING: False
  CONFIG_CODE: configuration_code
```

| Parameter | Description |
|---|---|
| `baseline` | `True` — run classical ML baselines; `False` — run LLM experiments |
| `local_llm` | `True` — use local vLLM inference; `False` — use API (OpenRouter) |
| `regime` | Inference mode: `'local_gen'` (local generation), `'local_nogen'` (forward pass only), `'api_gen'` (API generation) |
| `random_states_list` | List of random seeds; each experiment runs once per seed for stability |
| `serialization_list` | List of serialization formats to test (see below) |
| `shot_list` | List of few-shot sizes (e.g. `[4, 8, 16, 32, 64]`) |
| `RATIO` | Class ratio in few-shot examples, e.g. `'50/50'` for balanced |
| `SAMPLING_REGIME` | How to order few-shot examples: `'default'` (random), `'halves: zeros_first'`, `'halves: ones_first'`, `'cycle: zeros_first'`, `'cycle: ones_first'` |
| `SCHEMA` | JSON schema hint appended to the system prompt for local LLMs |
| `THINKING` | Enable/disable thinking mode in the tokenizer chat template |
| `CONFIG_CODE` | Tag appended to output file names for experiment tracking |
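The `SAMPLING_REGIME` names suggest two ordering strategies for the in-context demonstrations. The sketch below is a plausible interpretation, not the benchmark's actual code: `'halves'` puts all examples of one class before the other, while `'cycle'` alternates classes (assuming equally sized class halves):

```python
def order_shots(zeros, ones, regime):
    """Order few-shot examples by class; 'zeros'/'ones' are lists of examples."""
    if regime == 'halves: zeros_first':
        return zeros + ones
    if regime == 'halves: ones_first':
        return ones + zeros
    if regime.startswith('cycle'):
        first, second = (zeros, ones) if 'zeros_first' in regime else (ones, zeros)
        interleaved = []
        for a, b in zip(first, second):  # assumes equal halves, as with RATIO '50/50'
            interleaved += [a, b]
        return interleaved
    return zeros + ones  # 'default' would instead shuffle randomly

print(order_shots(['z1', 'z2'], ['o1', 'o2'], 'cycle: zeros_first'))
# ['z1', 'o1', 'z2', 'o2']
```

Demonstration order is known to affect in-context learning, which is why the benchmark exposes it as a knob.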
| Key | Format | Example |
|---|---|---|
| `feat_val` | Feature-value pairs | `Features are: age = 39, education = Bachelor. Answer is 0.` |
| `feat_val_masked` | Anonymized features | `Features are: x_1 = 39, x_2 = Bachelor. Answer is: y = 0.` |
| `feat_name` | Natural feature names | `Features are: the age is 39, the education is Bachelor. Answer is 0.` |
| `json` | JSON | `{"0": {"age": 39, "education": "Bachelor", "label": "0"}}` |
| `datamatrix` | Nested lists | `[['age', 'education', 'label'], [39, 'Bachelor', '0']]` |
| `markdown` | Markdown table | `\|age\|education\|label\|` |
| `markdown_masked` | Markdown (anonymized columns) | `\|feature_1\|feature_2\|label\|` |
| `html_new` | HTML table | `<table><thead>...</thead><tbody>...</tbody></table>` |
| `latex` | LaTeX table | `age & education & label \\` |
| `csv` | Comma-separated | `age, education, label` |
| `table` | Pipe-delimited | `\| 39 \| Bachelor \| 0 \|` |
| `dict` | Python dict | `{'age': [39], 'label': ['0']}` |
| `natural_language` | Natural language | Requires a `natural_language_data` column in the dataset |
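To make the idea concrete, here is how two of these formats might render a single row. This is a sketch, not the benchmark's exact serializers:

```python
def to_feat_val(features, label):
    """'feat_val'-style serialization: feature-value pairs plus the answer."""
    pairs = ', '.join(f"{k} = {v}" for k, v in features.items())
    return f"Features are: {pairs}. Answer is {label}."

def to_csv_row(features, label):
    """'csv'-style serialization: comma-separated values under a header line."""
    header = ', '.join(list(features) + ['label'])
    values = ', '.join(str(v) for v in list(features.values()) + [label])
    return header + '\n' + values

row = {'age': 39, 'education': 'Bachelor'}
print(to_feat_val(row, 0))
# Features are: age = 39, education = Bachelor. Answer is 0.
```

Since only the textual form changes while the information content stays fixed, comparing formats isolates how much LLM accuracy depends on serialization.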
```yaml
local_model:
  name: "Qwen/Qwen3-1.7B"
  temperature: 0
  gpu_memory_utilization: 0.12
  gpu_device: "0"
```

Supported model families:

- Qwen: e.g. `Qwen/Qwen3-1.7B`
- Google Gemma: e.g. `google/gemma-7b-it`
- DeepSeek: e.g. `deepseek-ai/deepseek-llm-7b-chat`
Each model family has its own prompt template and token handling logic.
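One plausible way to select per-family prompt handling is to dispatch on the model name prefix. This is a hypothetical sketch, not the repository's actual logic:

```python
# Hypothetical sketch: pick a prompt-template family from the HF model name.
FAMILY_PREFIXES = {
    'Qwen/': 'qwen',
    'google/gemma': 'gemma',
    'deepseek-ai/': 'deepseek',
}

def model_family(name):
    for prefix, family in FAMILY_PREFIXES.items():
        if name.startswith(prefix):
            return family
    raise ValueError(f"Unsupported model family: {name}")

print(model_family("Qwen/Qwen3-1.7B"))  # qwen
```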
```yaml
baseline_model:
  name: 'logreg'
  all_names: ['logreg', 'gboost', 'rf']
```

When `experiment.baseline: True`, the benchmark runs classical ML models instead of LLMs.
| Model name | Algorithm |
|---|---|
| `logreg` | Logistic Regression (with Bayesian hyperparameter search) |
| `knn` | K-Nearest Neighbors |
| `rf` | Random Forest |
| `gboost` | Histogram-based Gradient Boosting (with early stopping) |
| `naive_argmax` | Dummy classifier (most frequent class) |
All baseline models (except `naive_argmax`) use `BayesSearchCV` from scikit-optimize for hyperparameter tuning.
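The `naive_argmax` baseline predicts the most frequent class seen in training; a minimal self-contained sketch of that behavior:

```python
from collections import Counter

class NaiveArgmax:
    """Dummy classifier: always predicts the majority class observed in fit()."""

    def fit(self, X, y):
        self.majority_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.majority_] * len(X)

clf = NaiveArgmax().fit([[0], [1], [2]], [0, 1, 1])
print(clf.predict([[5], [6]]))  # [1, 1]
```

Such a baseline anchors the metric scale: any model that fails to beat it has learned nothing from the features.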
```yaml
# config.yaml
data:
  DF_TYPE: 'openml'

experiment:
  baseline: False
  local_llm: True
  regime: 'local_gen'
  shot_list: [4, 8, 16]
  serialization_list: ['markdown_new', 'json_new']

local_model:
  name: "Qwen/Qwen3-1.7B"
  temperature: 0
  gpu_memory_utilization: 0.12
  gpu_device: "0"

openml:
  type: 'dataset'
  df_openml: ['compas', 'vote']
```

Then run:
```bash
make run
```

```yaml
# config.yaml
data:
  DF_TYPE: 'openml'

experiment:
  baseline: True
  shot_list: [4, 8, 16, 32, 64]

baseline_model:
  name: 'logreg'
  all_names: ['logreg', 'gboost', 'rf']

openml:
  type: 'dataset'
  df_openml: ['compas', 'vote', 'kc1']
```

- Place your CSV files into `datasets/`:

  ```
  datasets/
  ├── my_data.csv
  └── another_data.csv
  ```
- Add task descriptions in `source/prompts.py` (if you run datasets not listed in LLMTabBench):

  ```python
  prompt_by_df = {
      ...
      'my_data': """For the given input features ... predict class 0 or 1. Do not predict any other tokens, only 0 or 1.""",
      'another_data': """For the given input features ... predict class 0 or 1. Do not predict any other tokens, only 0 or 1.""",
  }
  ```

- Update `config.yaml`:

  ```yaml
  data:
    DF_TYPE: 'custom'
    DF_FORMAT: 'csv'
    LOCAL_DATASET_PATH: 'datasets/'
  ```

- Run:

  ```bash
  make run
  ```

Results are saved in two directories:
```
results/experiments_result/
└── <dataset_name>/
    └── <N>_shots/
        └── <model_name>/
            └── rs_<regime>/
                ├── df_<N>fs_<model>_<regime>_<dataset>_<seed>_<config_code>.csv   # per-seed
                └── df_<N>fs_<model>_<regime>_<dataset>_<config_code>_agr.csv      # aggregated
```
The aggregated CSV contains `roc_auc_mean`, `roc_auc_std`, `f1_mean`, `f1_std`, `time_mean`, `time_std` across all random seeds.
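The per-seed → aggregated step amounts to computing a mean and standard deviation over the seeds in `random_states_list`. A sketch with made-up scores (whether the benchmark uses sample or population standard deviation is an assumption):

```python
import statistics

# Illustrative per-seed ROC-AUC scores (one per seed in random_states_list).
roc_auc_by_seed = [0.731, 0.736, 0.741, 0.728, 0.739]

roc_auc_mean = statistics.mean(roc_auc_by_seed)
roc_auc_std = statistics.stdev(roc_auc_by_seed)  # sample std; the benchmark may use population std

print(f"{roc_auc_mean:.3f} ± {roc_auc_std:.3f}")  # 0.735 ± 0.005
```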
```
results/llm_probs/
└── <df_type>_datasets/
    └── <dataset_name>/
        └── <N>_shots/
            └── <model_name>/
                └── rs_<regime>/
                    └── df_<N>fs_<model>_<regime>_<dataset>_<serialization>_<seed>_<config_code>.pkl
```
Each `.pkl` file contains a dictionary with `pred_probs`, `true_labels`, `pred_labels`, `timestamp`, `params`, and `metadata`.
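These files are ordinary pickles, so they can be inspected directly. A round-trip sketch with made-up field values (only the key names come from the description above):

```python
import os
import pickle
import tempfile

# Made-up example of the stored dictionary structure; values are illustrative.
record = {
    'pred_probs': [0.12, 0.87],
    'true_labels': [0, 1],
    'pred_labels': [0, 1],
    'timestamp': '2024-01-01T00:00:00',
    'params': {'n_shots': 16},
    'metadata': {'dataset': 'compas'},
}

path = os.path.join(tempfile.mkdtemp(), 'example.pkl')
with open(path, 'wb') as f:
    pickle.dump(record, f)

with open(path, 'rb') as f:
    loaded = pickle.load(f)
print(sorted(loaded))  # the stored dictionary's keys
```

Keeping raw probabilities alongside labels lets metrics like ROC-AUC be recomputed later without rerunning inference.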
To obtain ready-to-use LaTeX tables (broken down by shots and serializations) for all datasets:

1. In `config.yaml`, set the `TGEN_LOCAL_DATASET_PATH` parameter to the desired aggregation table:
   - `agr_all_prompt1` — for the first prompt configuration (default)
   - `agr_all_prompt2` — for the second prompt configuration
   - `agr_all_prompt3` — for the third prompt configuration

   All available aggregation tables are located in the `datasets/` folder.

2. Run the `run_gen` script from the `gen_tables` folder:

   ```bash
   python gen_tables/run_gen
   ```

The generated LaTeX tables will appear in the `results/` folder.

Note: All final tables are also collected in `RESULTS.md` in the repository root — you can view them directly in Markdown format.
- GPU memory: Adjust `local_model.gpu_memory_utilization` depending on your GPU. For small models like Qwen3-1.7B, `0.12` may be enough; for 7B models, increase to `0.45`+.
- Reproducibility: Results are aggregated across multiple random seeds defined in `random_states_list`. Using the same seeds ensures reproducibility.
- Custom datasets: Ensure your target column is binary (`0`/`1`). If the column is named `target`, it will be automatically renamed to `label`.