Title | Venue | Date | Code |
---|---|---|---|
SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents | arXiv | 2024.1.17 | https://github.com/njucckevin/SeeClick |
ShowUI: One Vision-Language-Action Model for GUI Visual Agent | NeurIPS 2024 Open-World Agents Workshop | 2024.11.26 | https://github.com/showlab/ShowUI |
Aria-UI: Visual Grounding for GUI Instructions | arXiv | 2024.12.20 | https://github.com/AriaUI/Aria-UI |
Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents | arXiv | 2024.10.7 | https://github.com/OSU-NLP-Group/UGround |
Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction | arXiv | 2024.12.5 | https://github.com/xlang-ai/aguvis |
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents | arXiv | 2024.10.30 | https://github.com/OS-Copilot/OS-Atlas |
CogAgent: A Visual Language Model for GUI Agents | CVPR 2024 (Highlight, top 3%) | 2023.12.14; 2024.12.27 (v3) | https://github.com/THUDM/CogAgent |
InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection | arXiv | 2025.1.8 | https://github.com/Reallm-Labs/InfiGUIAgent |
UI-TARS: Pioneering Automated GUI Interaction with Native Agents | arXiv | 2025.1.21 | https://github.com/bytedance/UI-TARS |
GUI-Bee: Align GUI Action Grounding to Novel Environments via Autonomous Exploration | arXiv | 2025.1.23 | https://github.com/GUI-Bee/gui-bee.github.io |
Title | Venue | Date | Code | Note |
---|---|---|---|---|
Towards General Computer Control: A Multimodal Agent for Red Dead Redemption II as a Case Study | ICLR 2024 Workshop on Large Language Model (LLM) Agents | 2024.3.5 | https://github.com/BAAI-Agents/Cradle | |
PPTAgent: Generating and Evaluating Presentations Beyond Text-to-Slides | arXiv | 2025.1.7 | https://github.com/icip-cas/PPTAgent | No GUI grounding |
Title | Venue | Date | Code |
---|---|---|---|
Eko (Eko Keeps Operating) - Build Production-ready Agentic Workflow with Natural Language | | | https://github.com/FellouAI/eko |
Title | Venue | Date | Code |
---|---|---|---|
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments | NeurIPS 2024 | 2024.4.11 | https://github.com/xlang-ai/OSWorld |
Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale | arXiv | 2024.9.12 | https://github.com/microsoft/WindowsAgentArena |
Title | Venue | Date | Code |
---|---|---|---|
SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents | arXiv | 2024.1.17 | https://github.com/njucckevin/SeeClick |
ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use | | | https://github.com/likaixin2000/ScreenSpot-Pro-GUI-Grounding |
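The ScreenSpot-style grounding benchmarks above score a model by click accuracy: a prediction counts as correct when the predicted click point falls inside the ground-truth bounding box of the target element. A minimal sketch of that metric (the pixel-coordinate convention and names here are illustrative assumptions, not any repo's actual schema):

```python
# Minimal sketch of the click-accuracy metric used by ScreenSpot-style
# grounding benchmarks. Assumption: predictions are (x, y) pixel points and
# ground truth is (left, top, right, bottom) pixel boxes; actual repos may
# use normalized coordinates or different field names.
from typing import List, Tuple

Point = Tuple[float, float]               # (x, y)
BBox = Tuple[float, float, float, float]  # (left, top, right, bottom)

def point_in_bbox(point: Point, bbox: BBox) -> bool:
    """A click is a hit if it lands inside the target element's box."""
    x, y = point
    left, top, right, bottom = bbox
    return left <= x <= right and top <= y <= bottom

def grounding_accuracy(preds: List[Point], gt_boxes: List[BBox]) -> float:
    """Fraction of predicted click points that hit their target box."""
    if not preds:
        return 0.0
    hits = sum(point_in_bbox(p, b) for p, b in zip(preds, gt_boxes))
    return hits / len(preds)

if __name__ == "__main__":
    preds = [(105.0, 42.0), (300.0, 500.0)]
    gt_boxes = [(90.0, 30.0, 120.0, 55.0), (0.0, 0.0, 50.0, 50.0)]
    print(f"click accuracy = {grounding_accuracy(preds, gt_boxes):.2f}")  # 0.50
```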
Title | Venue | Date | Code | Note |
---|---|---|---|---|
Android in the Wild: A Large-Scale Dataset for Android Device Control | NeurIPS 2023 | 2023.7.19 | https://github.com/google-research/google-research/tree/master/android_in_the_wild | |
World of Bits: An Open-Domain Platform for Web-Based Agents | ICML 2017 (PMLR) | | | |
Reinforcement Learning on Web Interfaces Using Workflow-Guided Exploration | ICLR 2018 | 2018.2.24 | https://github.com/Farama-Foundation/miniwob-plusplus | Introduces MiniWoB++, an extension of the OpenAI MiniWoB benchmark |
Mind2Web: Towards a Generalist Agent for the Web | NeurIPS 2023 (Spotlight) | 2023.6.9 | https://github.com/OSU-NLP-Group/Mind2Web | Same authors as the UGround model |
OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis | arXiv | 2024.12.27 | https://github.com/OS-Copilot/OS-Genesis | |
GUI Odyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices | arXiv | 2024.6.13 | https://github.com/OpenGVLab/GUI-Odyssey | cross-app GUI navigation |