
Omni-I2C

A Holistic Benchmark for High-Fidelity Image-to-Code Generation

Jiawei Zhou1, *, Chi Zhang1, *, Xiang Feng1, Qiming Zhang2, Haibo Qiu2,

Lihuo He3,†, Dengpan Ye4,†, Xinbo Gao3, Jing Zhang1,†

1 Wuhan University, Wuhan, China, 2 Independent Researcher, China, 3 Xidian University, Xi'an, China, 4 Guangzhou University, Guangzhou, China

Corresponding authors: jingzhang.cv@gmail.com, lhhe@mail.xidian.edu.cn, yedp@gzhu.edu.cn

💥 Updates

🚀 Introduction

Through an extensive evaluation of 13 proprietary and open-weight LMMs, we reveal a profound performance gap in high-fidelity image-to-code generation. Even leading frontier models, such as Gemini 3 Pro and GPT-5.1, frequently falter in the challenging scenarios presented by our benchmark. These results highlight substantial room for improvement and position Omni-I2C as a challenging benchmark for advancing LMMs. Our contributions are summarized as follows:

  • We present Omni-I2C, a meticulously curated dataset of 1080 items, including 5 programming languages, 8 major subjects, and 45 distinct figure types. It serves as a rigorous testbed for evaluating the perception and coding capabilities of LMMs.
  • We propose an evaluation framework that assesses code-level integrity and image-level perceptual consistency, enabling more diagnostic and attributable analyses of model behavior than traditional heuristic metrics.
  • Our comprehensive analysis of SOTA LMMs exposes a significant performance gap in high-fidelity reconstruction, identifying critical failure modes and charting a path toward more precise, trusted multimodal agents.

🔖 Abstract

We present Omni-I2C, a comprehensive benchmark designed to evaluate the capability of Large Multimodal Models (LMMs) in converting complex, structured digital graphics into executable code. We argue that this task represents a non-trivial challenge for the current generation of LMMs: it demands an unprecedented synergy between high-fidelity visual perception—to parse intricate spatial hierarchies and symbolic details—and precise generative expression—to synthesize syntactically sound and logically consistent code. Unlike traditional descriptive tasks, Omni-I2C requires a holistic understanding where any minor perceptual hallucination or coding error leads to a complete failure in visual reconstruction.

Omni-I2C features 1.1k meticulously curated samples, defined by its breadth across subjects, image modalities, and programming languages. By incorporating authentic user-sourced cases, the benchmark spans a vast spectrum of digital content—from scientific visualizations to complex symbolic notations—each paired with executable reference code. To complement this diversity, our evaluation framework provides necessary depth; by decoupling performance into perceptual fidelity and symbolic precision, it transcends surface-level accuracy to expose the granular structural failures and reasoning bottlenecks of current LMMs. Our evaluation reveals a substantial performance gap among leading LMMs; even state-of-the-art models struggle to preserve structural integrity in complex scenarios, underscoring that multimodal code generation remains a formidable challenge.

🔍 Overview

📊 Benchmarks

Please refer to the Benchmark Results for more details.

1. Structure

We provide a detailed project structure for Omni-I2C. This project consists of two main components: VLMEvalKit_infer, which is adapted from VLMEvalKit for inference, and eval_pipeline, designed for the evaluation process. Please follow this structure to organize the project.

📁 Structure (Click to collapse)
Omni-I2C
├── doc
│   ├── images
│   └── results
├── eval_pipeline
│   ├── eval_image_prompt.py
│   ├── eval_prompts.py
│   ├── gt_data
│   ├── libs
│   ├── main_pipeline.py
│   ├── pipeline_config.py
│   ├── run_main.sh
│   ├── step1_execute.py
│   ├── step2_evaluate.py
│   └── step3_evaluate.py
├── README.md
├── requirements.txt
└── VLMEvalKit_infer
    ├── assets
    ├── docs
    ├── LICENSE
    ├── requirements
    ├── run_example.sh
    ├── run.py
    ├── scripts
    ├── setup.py
    └── vlmeval

2. Installation

We provide a detailed installation guide to create an environment for Omni-I2C. Please follow the steps below to set up the environment, especially for next-gen hardware support (e.g., RTX 5090). If you use a machine with the Blackwell architecture for inference, you will need to upgrade torch (>=2.9.1) and vllm (>=0.12.0).

⚙️ Installation (Click to collapse)
# 1. Clone Omni-I2C
git clone https://github.com/MiliLab/Omni-I2C.git
cd Omni-I2C/VLMEvalKit_infer

# 2. Create conda environment (Python 3.10.18 recommended)
conda create -n omni_i2c python=3.10.18 -y
conda activate omni_i2c

# 3. Install PyTorch & CUDA
# We recommend CUDA 12.8 for best driver support on next-gen hardware (e.g., RTX 5090).
pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 \
  --index-url https://download.pytorch.org/whl/cu128

# 4. Install Flash-Attention 2 (v2.8.3)
git clone -b v2.8.3 https://github.com/Dao-AILab/flash-attention.git
cd flash-attention

# [Option A] Standard installation
MAX_JOBS=4 pip install flash-attn --no-build-isolation

# [Option B] Optimized for Blackwell Architecture (Recommended for RTX 5090)
# export TORCH_CUDA_ARCH_LIST="12.0"
# MAX_JOBS=4 pip install flash-attn --no-build-isolation

cd .. # return to VLMEvalKit_infer directory

# 5. Install VLMEvalKit_infer
pip install -e .

# Install Acceleration Backend (Choose one)
# [LMDeploy]
pip install https://github.com/InternLM/lmdeploy/releases/download/v0.11.1/lmdeploy-0.11.1+cu128-cp310-cp310-manylinux2014_x86_64.whl \
  --extra-index-url https://download.pytorch.org/whl/cu128

# [vLLM]
uv pip install vllm==0.11.0

# 6. Install Omni-I2C (Evaluation Pipeline)
cd .. # Go back to project root (Omni-I2C)

# Install Python requirements for evaluation
pip install -r requirements.txt

# --- System Dependencies for Evaluation (Requires sudo) ---
# Update source
sudo apt-get update

# Install LaTeX environment (Required for TikZ & standalone)
sudo apt-get install -y \
  texlive-latex-base \
  texlive-latex-extra \
  texlive-pictures \
  texlive-fonts-recommended

# Install PDF to Image tools (Required for pdftocairo)
sudo apt-get install -y poppler-utils

# Install the Google Noto font package
sudo apt-get install -y fonts-noto fonts-noto-cjk fonts-noto-color-emoji

# --- Playwright Setup ---
# Install Chromium and system dependencies
playwright install chromium
playwright install-deps
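
After installing the system dependencies, it can help to verify that the render tools the evaluation pipeline relies on are actually on your PATH before launching a long run. The following is a minimal sketch (not part of the repo); the tool list is an assumption based on the packages installed above (TeX Live, poppler-utils, Playwright), so adjust it to your setup.

```python
import shutil

def check_render_tools(tools=("pdflatex", "pdftocairo", "playwright")):
    """Return the subset of required render tools missing from PATH.

    The default tool list is an assumption derived from the install
    steps above; it is not an official list from the Omni-I2C repo.
    """
    return [t for t in tools if shutil.which(t) is None]

missing = check_render_tools()
if missing:
    print(f"Missing render tools: {missing}")
else:
    print("All render tools found on PATH.")
```

If anything is reported missing, re-run the corresponding apt-get or playwright step before starting the evaluation.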

Reference

VLMEvalKit_infer is built upon the OpenCompass VLMEvalKit.

3. Data Preparation

Two sets of data need to be prepared: the inference data and the ground-truth data for evaluation.

Inference Data

First, download the inference data from Link. After downloading, you need to update the file path in the configuration:

  1. Open Omni-I2C/VLMEvalKit_infer/vlmeval/dataset/image2code.py.
  2. Locate line 18 and replace the default path Image2Code_Full.tsv with your actual local path.
# Omni-I2C/VLMEvalKit_infer/vlmeval/dataset/image2code.py

# ... (Previous code)
# Line 18: Change 'Image2Code_Full.tsv' to your local path
'Image2Code_Full': '<Your folder>/Image2Code_Full.tsv',
# ...
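
Before launching inference, a quick sanity check on the downloaded TSV can catch a wrong path or a corrupted download early. The sketch below only checks that the file exists and parses as tab-separated text; the actual column schema of Image2Code_Full.tsv is not assumed here, so the returned header is whatever the file contains.

```python
import csv
from pathlib import Path

def check_tsv(path):
    """Lightweight sanity check for a TSV file (a sketch, not part of
    the Omni-I2C codebase). Returns (num_data_rows, header) or raises
    FileNotFoundError if the path is wrong."""
    p = Path(path)
    if not p.is_file():
        raise FileNotFoundError(f"TSV not found: {p}")
    with p.open(newline="", encoding="utf-8") as f:
        reader = csv.reader(f, delimiter="\t")
        header = next(reader)           # first row is assumed to be the header
        n = sum(1 for _ in reader)      # count remaining data rows
    return n, header
```

Run it against the path you set in image2code.py; a full benchmark download should report on the order of 1.1k rows.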

Ground Truth Data

The ground truth (GT) data is already included in the repository (Omni-I2C/eval_pipeline/gt_data/gt_data.tar.gz). Please extract it before running the evaluation pipeline.

cd ./eval_pipeline/gt_data
tar -xzvf gt_data.tar.gz

4. Inference & Evaluation

4.1 Inference

The inference process is based on VLMEvalKit_infer.

  1. Configuration: Modify Omni-I2C/VLMEvalKit_infer/vlmeval/config.py to select the models you want to test. For a detailed configuration tutorial, please refer to the VLMEvalKit Quickstart.
  2. Execution: Run the example script.
cd Omni-I2C/VLMEvalKit_infer
bash run_example.sh

Note: The inference results will be saved in Omni-I2C/VLMEvalKit_infer/output.

4.2 Evaluation

After inference, you need to move the result files to the evaluation pipeline.

  1. Preparation: Create an infer folder inside eval_pipeline and move your inference results there.
  2. API Configuration: Open Omni-I2C/eval_pipeline/pipeline_config.py and configure the necessary API keys for evaluation.
  3. Execution: Run the main evaluation script.
cd Omni-I2C/eval_pipeline
mkdir -p infer

# [Important] Move your inference results from VLMEvalKit_infer/output into ./infer
# cp ../VLMEvalKit_infer/output/your_result.json ./infer/

bash run_main.sh

Evaluation Results:

  • Final Report: Located in Omni-I2C/eval_pipeline/output.
  • Intermediate Checkpoints: Located inside each model's folder in the working directory:
    • step1_checkpoint.jsonl: Render results.
    • step2_checkpoint.jsonl: Code-level evaluation results.
    • step3_checkpoint.jsonl: Image-level evaluation results.
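
For a quick look at intermediate results before the final report is aggregated, a small script can average a numeric field across a checkpoint file. This is a sketch, not part of the pipeline, and the field name score is a placeholder: the actual keys inside the step2/step3 checkpoints are defined by the eval scripts, so adapt it to the records you see in your own output.

```python
import json
from pathlib import Path

def summarize_checkpoint(path, score_key="score"):
    """Average a numeric field across a *_checkpoint.jsonl file.

    `score_key` is a hypothetical field name; inspect one line of your
    checkpoint file to find the real key before relying on this.
    """
    scores = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        if score_key in record:
            scores.append(float(record[score_key]))
    return sum(scores) / len(scores) if scores else None
```

The final per-model numbers should still be taken from the report in Omni-I2C/eval_pipeline/output; this is only for a quick mid-run glance.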
ℹ️ Evaluation Pipeline Details (Click to expand)

The evaluation pipeline consists of three main steps managed by main_pipeline.py.

File Structure & Functionality:

eval_pipeline
├── eval_image_prompt.py  # Prompts for Image-level evaluation
├── eval_prompts.py       # Prompts for Code-level evaluation
├── gt_data               # Ground Truth data (e.g., gt_data.tar.gz)
├── libs                  # Libraries for HTML rendering (echarts, jquery)
├── main_pipeline.py      # Main entry point for evaluation
├── pipeline_config.py    # Configuration file (Set API keys here)
├── run_main.sh           # Execution script
├── step1_execute.py      # Step 1: Render inference code to images
├── step2_evaluate.py     # Step 2: Code-level evaluation
└── step3_evaluate.py     # Step 3: Image-level evaluation

About

Official repo for "A Holistic Benchmark for High-Fidelity Image-to-Code Generation"
