- 2025.11.14: We have released MiMo-VL-Miloco-7B and MiMo-VL-Miloco-7B-GGUF. Enjoy!
Welcome to Xiaomi MiMo-VL-Miloco — the first open-source multimodal model built to actually understand what’s happening at home!
- Built on MiMo-VL-7B: a rock-solid vision–language backbone with reliable video understanding and instruction-following.
- Home-savvy by design: it spots everyday activities (esports, workouts, watching TV, reading, and more) and reads common hand gestures like the V sign, thumbs-up, open palm, OK, and even the shaka hand sign.
- Base skills intact: with a mixed SFT + RL training strategy, we boost home-scene smarts while keeping the model's generality and transferability in great shape.
We use a carefully tuned two-stage pipeline to nail home-scene skills without sacrificing general abilities.
The first stage, supervised fine-tuning (SFT), focuses on boosting the model's core capabilities in home scenarios. Even with a limited training set, we strike a good balance between sample-efficient learning and fast inference:
- Chain-of-thought supervision: we add chain-of-thought reasoning traces so the model learns structured knowledge about home scenarios.
- Token-budget-aware reasoning: training with “budgeted” reasoning encourages concise, straight-to-the-point answers at inference.
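To make these two ideas concrete, here is a hypothetical SFT sample combining a reasoning trace with a token-budget hint. The field names, the `<think>` delimiter, and the budget phrasing are illustrative assumptions, not the released training format:

```python
# Hypothetical SFT sample: chain-of-thought supervision plus a token-budget hint.
# Field names, the <think> delimiter, and the budget phrasing are assumptions
# for illustration; the actual MiMo-VL-Miloco training format may differ.
sft_sample = {
    "video": "living_room_clip_0421.mp4",  # placeholder clip name
    "prompt": (
        "What is the person in the video doing? "
        "Keep your reasoning within roughly 128 tokens."
    ),
    "response": (
        "<think>The person holds a game controller and faces a monitor showing "
        "a game HUD, so this is a gaming (esports) session.</think>"
        "The person is playing a video game."
    ),
}

print(sft_sample["response"])
```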
The second stage builds on the fine-tuned model and introduces GRPO-based reinforcement learning to enhance the model's overall performance:
- Efficient training data: we employed the Time-R1 data strategy (our work accepted at NeurIPS 2025) to build efficient training datasets across multiple domains.
- Keep-it-general: specialize for home tasks while preserving broad understanding and language generation.
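For reference, GRPO scores each response relative to the other responses sampled for the same prompt. The reward functions and data mixture used here are not spelled out in this README, so the sketch below only illustrates that group-relative normalization step, not our full RL pipeline:

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Group-relative advantages: each reward is normalized by the mean and std
    of the group of responses sampled for the same prompt.
    `rewards` has shape [num_prompts, group_size]."""
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)

# Example: two prompts, four sampled responses each, with scalar rewards.
rewards = np.array([[1.0, 0.0, 0.5, 1.0],
                    [0.0, 0.0, 1.0, 0.0]])
print(grpo_advantages(rewards))
```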
In short: Xiaomi MiMo-VL-Miloco is your friendly, sharp-eyed model roommate—great at recognizing what’s going on around the house, and still ready for the wider world.
Both versions of the MiMo-VL-Miloco-7B model are now open-sourced:
- MiMo-VL-Miloco-7B
  - Recommended for most users to experience and utilize.
- MiMo-VL-Miloco-7B-GGUF
  - This is the mixed-precision quantized version of MiMo-VL-Miloco-7B. It is recommended for evaluation and use in compute-constrained environments.
- MiMo-VL-Miloco-7B achieves leading performance in both gesture recognition and common household scene understanding.
In household scene understanding, we prioritize video and image perception alongside the model’s reasoning ability.
- Across three video benchmarks (Video-MME, Video-MMMU, Charades-STA), the model shows clear improvements over its base model.
- On MMMU-Pro, a general-capabilities benchmark, we also see a significant improvement over the base model (10%+).
- Surprisingly, as video and image understanding improved, we also observed corresponding gains on the text-only benchmark MMLU-Pro.
- We see a modest performance dip on tasks such as document understanding, OCR, and mathematics; this is in line with expectations and does not affect the model’s intended use cases.
We follow the same approach as MiMo-VL. Users can control the thinking mode by appending /no_think to queries:
- Thinking mode (default):
"Explain the relationships between the objects in the image and infer the likely next action." - Non-thinking mode:
"Transcribe the handwritten note exactly as shown. /no_think"
- Installation

```bash
pip install -r requirements.txt
```

- Deployment

```bash
cd demo
CKPT_PATH="checkpoint_path" python app.py
```

In the interface, you can click Smart Home mode to switch to the home scenario mode.
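Beyond the Gradio demo, you can also run the model programmatically. The sketch below is an assumption-laden example: it assumes the checkpoint loads through the standard Hugging Face image-text-to-text interface, and the repo id and image URL are placeholders; consult the model card for the official loading recipe.

```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

# Assumed repo id; replace with a local checkpoint path if you downloaded the weights.
model_id = "XiaoMi/MiMo-VL-Miloco-7B"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://example.com/living_room.jpg"},  # placeholder image
        {"type": "text", "text": "What activity is happening here? /no_think"},
    ],
}]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=256)
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```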
```bibtex
@misc{xiaomimimovlmiloco,
  author = {Jiaze Li and Yuxun Qu and Jingyang Chen and Shijie Xu and Zhenru Lin and Junyou Zhu and Boshen Xu and Wenhui Tan and Pei Fu and JianZhong Ju and Zhenbo Luo and Jian Luan},
  title = {Xiaomi MiMo-VL-Miloco},
  year = {2025},
  howpublished = {\url{https://github.com/XiaoMi/xiaomi-mimo-vl-miloco}},
}
```

Please contact us at [email protected] or open an issue if you have any questions.