GeoEyes: On-Demand Visual Focusing for Evidence-Grounded Understanding of Ultra-High-Resolution Remote Sensing Imagery
Fengxiang Wang1,
Mingshuo Chen2,
Yueying Li1,
Yajie Yang3,
Yifan Zhang4*,
Long Lan1
Xue Yang5,
Hongda Sun6*,
Yulin Wang7,
Di Wang8,
Jing Zhang8,
Jun Song*,
Bo Du8
1National University of Defense Technology,
2Beijing University of Posts and Telecommunications
3University of the Chinese Academy of Sciences,
4Chinese Academy of Sciences
5Shanghai Jiao Tong University,
6Renmin University of China,
7Tsinghua University,
8Wuhan University
- 📚Contents
- 🔍Overview
- 🌐UHR-CoZ Dataset
- 🛠️Methodology & Training
- 🚀Evaluation
- 🔗Citation
- 🤝Acknowledgement
Fig 1. Overview of the AdaZoom-GRPO Framework.
We introduce GeoEyes, a specialized MLLM for Ultra-High-Resolution (UHR) Remote Sensing. Current "thinking-with-images" models suffer from Tool Usage Homogenization—collapsing into rigid, one-size-fits-all zooming patterns that fail to address the task heterogeneity and low evidence density of UHR imagery.
To solve this, we propose a staged training framework:
- Cold-Start SFT: Initializing the model with UHR-CoZ, a dataset containing diverse "Chain-of-Zoom" trajectories (Global, Single-Zoom, Multi-Step).
- AdaZoom-GRPO: An Agentic Reinforcement Learning stage with a novel reward system designed to incentivize on-demand zooming and progressive focusing.
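The "Chain-of-Zoom" trajectories above interleave reasoning text with zoom tool calls. A minimal sketch of what one multi-step SFT sample might look like is shown below; all field names, the tool name `zoom_in`, and the image path are illustrative assumptions, not the released UHR-CoZ schema.

```python
# Illustrative multi-step "Chain-of-Zoom" trajectory (hypothetical schema).
trajectory = {
    "image": "uhr_scene_0001.png",  # hypothetical path
    "question": "How many aircraft are parked near the northern hangar?",
    "steps": [
        {"thought": "The hangar area is too small at full resolution; zoom in.",
         "tool_call": {"name": "zoom_in", "bbox": [1200, 300, 1800, 700]}},
        {"thought": "Two aircraft are partially visible; zoom further on the apron.",
         "tool_call": {"name": "zoom_in", "bbox": [1350, 380, 1650, 560]}},
        {"thought": "Three aircraft are now clearly distinguishable.",
         "tool_call": None},  # stop condition: answer without further zooming
    ],
    "answer": "3",
}
```

Global (no-zoom) samples would simply have a single step with no tool call, while multi-step samples chain several nested zoom boxes.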
Our method achieves 54.23% accuracy on XLRS-Bench, establishing a new state-of-the-art by outperforming larger models like Qwen2.5-VL-72B and domain-specific agents like DeepEyes.
We construct UHR Chain-of-Zoom (UHR-CoZ), the first large-scale interleaved image-text chain-of-thought dataset specifically for UHR remote sensing. It is built using an automated agentic pipeline (Fig 2) involving GLM-4.5V, which generates multi-round zoom-in trajectories cleaned by a semantic scorer.
Fig 2. Automated data construction pipeline for UHR-CoZ.
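The final cleaning step of the pipeline filters generated trajectories with a semantic scorer. A minimal sketch of this filtering logic, assuming a scoring function that maps a trajectory to [0, 1] and an illustrative threshold (neither is from the released pipeline):

```python
# Hedged sketch: keep only agent-generated trajectories whose semantic
# score clears a threshold. `score_fn` and `threshold` are assumptions.
def clean_trajectories(trajectories, score_fn, threshold=0.7):
    """Filter zoom-in trajectories with a semantic scorer."""
    return [t for t in trajectories if score_fn(t) >= threshold]

# Toy usage: score comes from a placeholder field here; in the real
# pipeline it would be produced by the scorer model.
raw = [{"id": "a", "score_hint": 0.9}, {"id": "b", "score_hint": 0.4}]
kept = clean_trajectories(raw, score_fn=lambda t: t["score_hint"])
```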
| Statistics | Value |
|---|---|
| Total Samples | 25,467 |
| Avg. Image Resolution | 2,178 × 2,051 |
| Zoom-in Depth 1 (No Zoom) | 6.4% |
| Zoom-in Depth 2 | 86.7% |
| Zoom-in Depth 3 | 6.9% |
| Avg. Reasoning Length | 157.8 tokens |
Our approach builds upon the DeepEyes framework, introducing a two-stage optimization process.
- UHR-CoZ: Download our constructed SFT dataset with interleaved zoom trajectories from Hugging Face.
- SuperRS-VQA: Used during the RL stage to enhance task diversity; it is included in the UHR-CoZ release.
- General RL Data: We utilize DeepEyes-47K for general reasoning stability.
The codebase is developed with PyTorch 2.6/2.8 (CUDA 12.8) and Python 3.10/3.11.
We perform Supervised Fine-Tuning (SFT) on UHR-CoZ to initialize the policy with basic tool-use capabilities and stop conditions.
# 1. Download and prepare sft data from huggingface
# please make sure to modify the absolute image paths in UHR-CoZ.json
# 2. SFT using llamafactory
# We use this specific commit: https://github.com/hiyouga/LlamaFactory/tree/2a822178dea4d1c05f595521dd883a8e4f4e2e77
# if you encounter a TypeError during dataset preprocessing, refer to https://github.com/hiyouga/LlamaFactory/issues/5613
# modify json paths in dataset_info.json and yaml file
llamafactory-cli train config.yaml
We optimize the model using Group Relative Policy Optimization (GRPO) with our specific reward formulation:
- Adaptive Efficiency Reward (Penalizes redundant tools on easy tasks).
- Chain-of-Focus Reward (Geometric containment reward for progressive zoom).
- Necessity-Aware Process Verification (LLM-based judge for logical rigor).
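The reward terms above can be sketched as follows. This is a hedged illustration, not the released implementation: the function names, weights, the containment threshold, and the efficiency penalty value are all assumptions; only the structure (containment-based focusing, a penalty for redundant zooms, an LLM-judge term) follows the description above.

```python
# Hedged sketch of the AdaZoom-GRPO reward terms; all weights and
# thresholds below are illustrative assumptions.

def box_containment(inner, outer):
    """Fraction of `inner`'s area lying inside `outer`.
    Boxes are (x1, y1, x2, y2) in pixel coordinates."""
    ix1, iy1 = max(inner[0], outer[0]), max(inner[1], outer[1])
    ix2, iy2 = min(inner[2], outer[2]), min(inner[3], outer[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = (inner[2] - inner[0]) * (inner[3] - inner[1])
    return inter / area if area > 0 else 0.0

def chain_of_focus_reward(zoom_boxes, thresh=0.8):
    """Reward progressive focusing: each zoom box should be (mostly)
    contained in the previous one."""
    if len(zoom_boxes) < 2:
        return 0.0
    ok = [box_containment(b, prev) >= thresh
          for prev, b in zip(zoom_boxes, zoom_boxes[1:])]
    return sum(ok) / len(ok)

def adaptive_efficiency_reward(num_tool_calls, needs_zoom):
    """Penalize redundant tool use on tasks answerable from the global view."""
    return 0.0 if needs_zoom else -0.1 * num_tool_calls  # illustrative penalty

def total_reward(correct, zoom_boxes, needs_zoom, judge_score):
    """Combine accuracy, focusing, efficiency, and process-verification terms.
    `judge_score` in [0, 1] stands in for the LLM-based process verifier."""
    r = 1.0 if correct else 0.0
    r += 0.2 * chain_of_focus_reward(zoom_boxes)
    r += adaptive_efficiency_reward(len(zoom_boxes), needs_zoom)
    r += 0.2 * judge_score
    return r
```

In GRPO, these scalar rewards would then be normalized within each sampled group of rollouts to form the advantage signal.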
# 1. first install DeepEyes following https://github.com/Visual-Agent/DeepEyes
# we also provide a clean requirements.txt without the torch package
# 2. download RL data, and modify parquet file paths in the training script/yaml file
# there are 3 parquets from DeepEyes-47k and 1 parquet file from UHR-CoZ HF repo
# 3. follow deepeyes to set LLM judge and start training using
# export LLM_AS_A_JUDGE_BASE="http://{IP}:{PORT}/v1"
python -m verl.trainer.main_ppo \
--config-path DeepEyes/config \
    --config-name deepeyes_coz
We evaluate on XLRS-Bench, focusing on Perception (e.g., Counting, Object Classification) and Reasoning (e.g., Route Planning, Anomaly Detection) tasks.
# 0. execute the prepare_xlrs_data.ipynb to preprocess the evaluation data
# 1. convert model from pt format to hf model
bash s1.sh
# 2. deploy model using vllm (or ray using `serve run ray.yaml`)
bash s21.sh
# 3. prompting vllm
bash s22.sh
# 4. calculate metrics
bash s232.sh
If you find our work helpful, please consider citing:
@article{wang2026geoeyes,
title={GeoEyes: On-Demand Visual Focusing for Evidence-Grounded Understanding of Ultra-High-Resolution Remote Sensing Imagery},
author={Wang, Fengxiang and Chen, Mingshuo and Li, Yueying and Yang, Yajie and Zhang, Yifan and Lan, Long and Yang, Xue and Sun, Hongda and Wang, Yulin and Wang, Di and others},
journal={arXiv preprint arXiv:2602.14201},
year={2026}
}
This repo benefits from DeepEyes and LLaMA-Factory. Thanks for their wonderful work.