CogAgent: Automate Any GUI with Vision—No Code or HTML Needed

Paper & Code: CogAgent: A Visual Language Model for GUI Agents (2024), THUDM/CogAgent

Imagine giving a natural language instruction like “Mark all unread emails as read” or “Filter Amazon search results to show only Mastercraft doors under $200,” and having an AI agent carry it out—just by looking at your screen. That’s the promise of CogAgent, an open-source visual language model (VLM) purpose-built for understanding and interacting with graphical user interfaces (GUIs) on computers and smartphones.

Unlike conventional large language models (LLMs) such as ChatGPT—which can write emails but can’t click buttons or type into fields because they lack visual perception—CogAgent operates directly on screenshots. It requires no access to HTML, DOM trees, or backend APIs. This makes it uniquely suited for automating real-world digital tasks where code-level access simply doesn’t exist—think legacy enterprise software, mobile apps without developer tools, or closed-source desktop applications.

Developed by Tsinghua University and Zhipu AI, CogAgent represents a significant leap in GUI agent technology. The latest version, CogAgent-9B-20241220, delivers state-of-the-art performance across multiple benchmarks while remaining open-source and commercially usable under the Apache 2.0 license.

Why Traditional LLMs Fall Short on GUI Tasks

Most LLM-based automation tools rely on extracted textual representations of a screen—such as HTML or accessibility trees—to understand interface layout. But these representations often miss visual context, fail to capture dynamic elements (like pop-ups or overlays), and are unavailable on platforms like mobile apps or proprietary software.

CogAgent sidesteps this limitation entirely. By processing raw screenshots at high resolution (up to 1120×1120), it can detect tiny UI elements, read small text, and precisely localize interactive components. This visual-first approach enables it to outperform even HTML-augmented LLMs on real-world GUI navigation benchmarks like Mind2Web and AITW (Android in the Wild)—proving that seeing is believing, especially when automating user interfaces.

Key Capabilities That Make CogAgent Stand Out

High-Resolution GUI Perception

CogAgent’s dual-encoder architecture supports input resolution of 1120×1120 pixels, far exceeding most VLMs. This allows it to accurately interpret dense, text-heavy interfaces—such as email clients, e-commerce sites, or data dashboards—where small fonts and tightly packed buttons would confuse lower-resolution models.
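
Because the input is just an image, any screenshot source works. As a minimal sketch, here is one way to grab the current screen with Pillow's ImageGrab (assumed here purely for illustration; adb screencap, OS APIs, or a browser driver would do equally well), saved to disk as a convenient handoff to the inference scripts:

```python
# Minimal sketch: capture the current screen as the model's image input.
# Pillow's ImageGrab is an illustrative choice; any screenshot source that
# yields a PIL image (or an image file) works just as well.
from PIL import ImageGrab

def capture_screenshot(path: str = "screen.png"):
    image = ImageGrab.grab()   # full-screen capture on Windows/macOS
    image.save(path)           # saving to disk is a convenient handoff format
    return image

if __name__ == "__main__":
    img = capture_screenshot()
    print(f"Captured {img.size[0]}x{img.size[1]} screenshot")
```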

Strong Performance on Real GUI Agent Benchmarks

CogAgent consistently leads or matches top commercial systems on critical evaluation suites:

  • ScreenSpot: excels at GUI element grounding (i.e., “Where is the ‘Send’ button?”)
  • OmniACT: leads on single-step action prediction
  • OSWorld: achieves near-top results on complex, multi-step desktop tasks
  • CogAgentBench-basic-cn: sets the standard for Chinese-language GUI interaction

Notably, it outperforms LLM+HTML methods on both PC and Android tasks—despite using only screenshots as input.

Bilingual Support and Platform Flexibility

The model supports English and Chinese instructions and works across three major platforms:

  • Windows 10/11 (WIN)
  • macOS 14/15 (Mac)
  • Android 13–15 (Mobile)

This makes it valuable for global development teams and multilingual user bases.
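
When the agent runs on the same machine it controls, the platform tag can be derived at runtime. A small sketch follows; the WIN/Mac/Mobile strings come from the list above, while the detection logic itself is an assumption, not part of the official tooling:

```python
# Sketch: derive the platform tag used in the prompt from the host OS.
# Android devices are usually driven remotely (e.g., over adb), so "Mobile"
# is passed explicitly rather than detected.
import sys

def platform_tag(android: bool = False) -> str:
    if android:
        return "Mobile"
    if sys.platform.startswith("win"):
        return "WIN"
    if sys.platform == "darwin":
        return "Mac"
    raise ValueError("Unsupported platform for CogAgent")

print(platform_tag())
```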

Open Source and Commercially Usable

Unlike many proprietary GUI agents, CogAgent is fully open-sourced under the Apache 2.0 license, allowing free use in research and commercial products. Its weights and code are publicly available, fostering transparency and community-driven improvement.

Practical Use Cases Where CogAgent Shines

CogAgent isn’t just a research prototype—it solves tangible problems:

  • Automating Repetitive Workflows: Bulk-editing spreadsheets, sending templated emails, or filling out web forms without relying on brittle RPA bots that break whenever the UI changes.
  • Assistive Technology: Enabling voice- or text-controlled navigation for users with motor impairments.
  • Legacy System Integration: Interfacing with old desktop applications that lack APIs or documentation—by simply “watching” the screen.
  • UI Testing & Validation: Automatically verifying that button clicks lead to expected screens or that error messages appear under edge cases.
  • Mobile App Automation: Performing tasks on Android apps where instrumentation (e.g., via Appium) is restricted or unreliable.

In all these scenarios, CogAgent thrives where traditional automation fails: when you only have pixels, not code.

Getting Started: How CogAgent Works in Practice

Using CogAgent follows a clear, structured workflow:

  1. Input a screenshot of the current screen.
  2. Provide a task description in natural language (e.g., “Click ‘Checkout’ and apply promo code SAVE10”).
  3. Specify the platform (WIN, Mac, or Mobile).
  4. (Optional) Include past action history to support multi-step reasoning.

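As a rough sketch, those inputs could be assembled into a single query string like the helper below. The authoritative prompt template lives in the repo's inference scripts (inference/cli_demo.py), so the field names and wording here are illustrative assumptions rather than the official format:

```python
# Illustrative only: compose the task, platform, and optional history into one
# query string. The authoritative template is defined in the repo's inference
# scripts; the exact wording below is an assumption made for readability.
def build_query(task, platform, history=None, output_format="Action-Operation"):
    history_block = "\n".join(
        f"step {i}: {step}" for i, step in enumerate(history or [], start=1)
    ) or "(none)"
    return (
        f"Task: {task}\n"
        f"History steps:\n{history_block}\n"
        f"(Platform: {platform})\n"
        f"(Answer in {output_format} format.)"
    )

query = build_query(
    task="Click 'Checkout' and apply promo code SAVE10",
    platform="WIN",
    history=["Clicked the cart icon"],
)
print(query)
```

This query, together with the screenshot, is what the demo scripts feed to the model.
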
The model returns a strictly formatted output containing:

  • Action: Human-readable instruction (e.g., “Click the search bar”)
  • Grounded Operation: Machine-executable command with screen coordinates, such as CLICK(box=[[352,102,786,139]])
  • (Optional) Status, Plan, or Sensitivity tags based on the requested output format

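Because the grounded operation is plain text, downstream executors typically recover coordinates with a little string handling. A minimal sketch, assuming (per the model card) that box coordinates are normalized to a 0-1000 scale relative to the screenshot; verify that convention for your model version before relying on it:

```python
import re

# Sketch: turn a grounded operation such as CLICK(box=[[352,102,786,139]])
# into a pixel coordinate to click. The 0-1000 normalization is an assumption
# taken from the model card; confirm it for your model version.
BOX_RE = re.compile(r"CLICK\(box=\[\[(\d+),(\d+),(\d+),(\d+)\]\]\)")

def click_point(grounded_op: str, screen_w: int, screen_h: int):
    m = BOX_RE.search(grounded_op)
    if not m:
        raise ValueError(f"No CLICK box found in: {grounded_op}")
    x1, y1, x2, y2 = map(int, m.groups())
    cx = (x1 + x2) / 2 / 1000 * screen_w   # center of the box, scaled to pixels
    cy = (y1 + y2) / 2 / 1000 * screen_h
    return round(cx), round(cy)

print(click_point("CLICK(box=[[352,102,786,139]])", 1920, 1080))
```

From there, a desktop executor can pass the point to something like pyautogui.click, and an Android harness can issue the equivalent adb shell input tap; that execution layer sits outside the model itself.
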
For quick experimentation, users can run:

  • A CLI demo (inference/cli_demo.py) for one-off predictions
  • A web demo (inference/web_demo.py) for interactive, multi-turn sessions

The model supports multiple response formats (e.g., Action-Operation, Status-Plan-Action-Operation), allowing developers to tailor outputs to their agent pipeline.
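
Continuing the hypothetical build_query helper sketched earlier, switching pipelines is mostly a matter of changing the requested format string (use the exact format names from the repo or model card):

```python
# Usage sketch: request a richer output format for planning-style pipelines.
query = build_query(
    task="Mark all unread emails as read",
    platform="Mac",
    output_format="Status-Plan-Action-Operation",
)
```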

Important Limitations and Practical Considerations

While powerful, CogAgent has real-world constraints:

  • Hardware Requirements: Full-precision (BF16) inference needs ≥29GB VRAM (e.g., A100/H100). Lower-precision modes (INT4/INT8) reduce memory use but significantly hurt performance—not recommended for production.
  • Not a Chatbot: It’s a task-execution agent, not a conversational model. Each interaction is stateless unless history is explicitly provided.
  • Platform Limitations: Only Windows, macOS, and Android are officially supported. Other OSes may yield poor results.
  • Image Input Mandatory: Pure text prompts without screenshots will fail—visual context is essential.
  • Fine-Tuning Is Resource-Intensive: Supervised fine-tuning (SFT) requires 8×A100 GPUs, while LoRA tuning needs a single GPU with roughly 70GB of VRAM.

These factors mean CogAgent is best suited for teams with access to capable GPU infrastructure and clearly defined automation tasks.
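
For teams sizing that infrastructure, BF16 loading with Hugging Face transformers looks roughly like the sketch below. The model id follows the public release and trust_remote_code mirrors the repo's examples, but defer to the official inference scripts for the supported invocation:

```python
# Sketch: loading the released weights in BF16 with Hugging Face transformers.
# Check the official inference scripts for the exact, supported invocation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "THUDM/cogagent-9b-20241220"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,   # full-precision BF16: budget roughly 29GB of VRAM
    device_map="auto",            # spreads weights across available GPUs if needed
    trust_remote_code=True,
)
model.eval()
```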

Summary

CogAgent redefines what’s possible in GUI automation by combining high-resolution visual understanding with precise action generation—all from screenshots and natural language. It solves a critical gap left by LLMs: the inability to see and act in real digital environments.

For developers, researchers, and product teams seeking to automate tasks where APIs don’t exist or UIs change frequently, CogAgent offers a robust, open-source, and vision-native alternative. With strong benchmark performance, bilingual support, and commercial-friendly licensing, it stands out as one of the most practical GUI agents available today.

If your project involves automating desktop or mobile interfaces—especially without backend access—CogAgent is worth serious consideration.