Dongchen Si1,4,5*, Di Wang2*, Erzhong Gao4,5, Xiaolei Qin3, Liu Zhao4,5, Jing Zhang2, Minqiang Xu4,5†, Jianbo Zhan4,5†, Jianshe Wang4,5, Lin Liu4,5, Bo Du2, Liangpei Zhang3
1 Xinjiang University, China,
2 School of Computer Science, Wuhan University, China,
3 State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, China,
4 iFlytek Co., Ltd, China,
5 National Engineering Research Center of Speech and Language Information Processing, China,
2026.03.16
- The SPIE dataset and the corresponding SPEX code have been released.
2026.03.04
- The main paper has been published online! Please see here.
2026.02.25
- The paper has been accepted by IEEE TGRS!
2025.08.08
- We uploaded our work to arXiv.
Spectral information has long been recognized as a critical cue in remote sensing observations. Although numerous vision-language models have been developed for pixel-level interpretation, spectral information remains underutilized, resulting in suboptimal performance, particularly in multispectral scenarios. To address this limitation, we construct a vision-language instruction-following dataset named SPIE, which encodes spectral priors of land-cover objects into textual attributes recognizable by large language models (LLMs), based on classical spectral index computations. Leveraging this dataset, we propose SPEX, a multimodal LLM designed for instruction-driven land cover extraction. To this end, we introduce several carefully designed components and training strategies, including multiscale feature aggregation, token context condensation, and multispectral visual pre-training, to achieve precise and flexible pixel-level interpretation. To the best of our knowledge, SPEX is the first multimodal vision-language model dedicated to land cover extraction in spectral remote sensing imagery. Extensive experiments on five public multispectral datasets demonstrate that SPEX consistently outperforms existing state-of-the-art methods in extracting typical land cover categories such as vegetation, buildings, and water bodies. Moreover, SPEX is capable of generating textual explanations for its predictions, thereby enhancing interpretability and user-friendliness.
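The idea of encoding spectral priors into LLM-readable textual attributes can be sketched roughly as follows. The band layout, the index choices (NDVI/NDWI), and the thresholds below are illustrative assumptions for the sketch, not the actual SPIE construction:

```python
import numpy as np

def spectral_attributes(green: np.ndarray, red: np.ndarray, nir: np.ndarray) -> dict:
    """Compute classical spectral indices for an image patch and map them to
    coarse textual attributes an LLM can consume. Thresholds are illustrative
    assumptions, not values from the SPIE pipeline."""
    eps = 1e-6  # avoid division by zero on dark pixels
    ndvi = (nir - red) / (nir + red + eps)      # vegetation index
    ndwi = (green - nir) / (green + nir + eps)  # water index
    attrs = {
        "ndvi_mean": float(ndvi.mean()),
        "ndwi_mean": float(ndwi.mean()),
    }
    # Discretize the index means into words a language model can reason over.
    if attrs["ndvi_mean"] > 0.5:
        attrs["vegetation"] = "dense"
    elif attrs["ndvi_mean"] > 0.2:
        attrs["vegetation"] = "sparse"
    else:
        attrs["vegetation"] = "little"
    attrs["water"] = "present" if attrs["ndwi_mean"] > 0.3 else "absent"
    return attrs

# Toy 2x2 patch: NIR reflectance well above red suggests vegetation, no water.
green = np.full((2, 2), 0.1)
red = np.full((2, 2), 0.1)
nir = np.full((2, 2), 0.6)
print(spectral_attributes(green, red, nir))
```

Attributes like these can then be serialized into the instruction text, giving the language model an explicit spectral cue alongside the visual tokens.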
- Python 3.10 or above
- PyTorch >= 2.1.2 and torchvision >= 0.16.2 are recommended
- CUDA 12.1 or above is recommended (please follow the instructions here to install both PyTorch and torchvision)
- FlashAttention-2 is required for high-resolution usage
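Under those requirements, a typical environment setup might look like the following. The environment name and the exact package pins are assumptions for illustration, not an official install script:

```shell
# Create and activate a conda environment (the name "spex" is an assumption)
conda create -n spex python=3.10 -y
conda activate spex

# PyTorch and torchvision matching the recommended versions, built for CUDA 12.1
pip install torch==2.1.2 torchvision==0.16.2 --index-url https://download.pytorch.org/whl/cu121

# FlashAttention-2 for high-resolution usage
pip install flash-attn --no-build-isolation
```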
The model weights have been released. Link: weight, Access code: zMZp
The SPIE dataset has been released. Link: Datasets, Access code: iirm
For inference, please refer to demo.py.
- Installation 💻: Set up the SPEX conda environment, install dependencies, and clone the repo.
- Training 🏋️‍♂️: Run scripts/finetune.sh with DeepSpeed, modifying parameters like data and model paths for training.
- Inference 🎯: Execute demo.py to perform model inference, specifying the model path, image folder, and spectral prompt file. Update the paths as needed.
- `CUDA_VISIBLE_DEVICES=0 python demo.py --model_path <MODEL_PATH> --image_folder <IMAGE_FILE> --question_file <SPECTRAL_PROMPT.txt>`
If you find SPEX helpful, please consider giving this repo a ⭐ and citing:
@article{SPEX,
  author={Si, Dongchen and Wang, Di and Gao, Erzhong and Qin, Xiaolei and Zhao, Liu and Zhang, Jing and Xu, Minqiang and Zhan, Jianbo and Wang, Jianshe and Liu, Lin and Du, Bo and Zhang, Liangpei},
  journal={IEEE Transactions on Geoscience and Remote Sensing},
  title={SPEX: A Vision-Language Model for Land Cover Extraction on Spectral Remote Sensing Images},
  year={2026},
  volume={},
  number={},
  pages={1-1},
  keywords={Remote sensing;Land surface;Feature extraction;Visualization;Image segmentation;Data mining;Decoding;Adaptation models;Large language models;Indexes;Remote Sensing;Multispectral;Vision-Language Model;Instruction-Driven;Land Cover Extraction},
  doi={10.1109/TGRS.2026.3670308}
}

