ScreenEnv: Deploy Your Full-Stack Desktop Agent

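The library offers two flexible integration paths to suit different architectures: a direct Sandbox API for custom agent frameworks that need fine-grained programmatic control, and an MCP server integration tailored to AI systems that follow the Model Context Protocol. This post gives clear code examples for both approaches and shows how easy they are to use. It also provides a practical, step-by-step guide to building a custom AI-powered desktop agent with ScreenEnv and `smolagents`. The guide covers the essentials: choosing a backend vision-language model (VLM), defining custom desktop interaction tools (such as click, write, and press), and running the agent on complex, multi-step desktop tasks such as producing a report in LibreOffice. Looking ahead, ScreenEnv aims to extend its support beyond Linux to Android, macOS, and Windows, paving the way for truly cross-platform GUI automation and for highly reproducible environments for agent benchmarking and evaluation.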





TL;DR: ScreenEnv is a powerful Python library that lets you create isolated Ubuntu desktop environments in Docker containers for testing and deploying GUI Agents (aka Computer Use agents). With built-in support for the Model Context Protocol (MCP), it's never been easier to deploy desktop agents that can see, click, and interact with real applications.

What is ScreenEnv?



Imagine you need to automate desktop tasks, test GUI applications, or build an AI agent that can interact with software. This used to require complex VM setups and brittle automation frameworks.

ScreenEnv changes this by providing a sandboxed desktop environment that runs in a Docker container. Think of it as a complete virtual desktop session that your code can fully control - not just clicking buttons and typing text, but managing the entire desktop experience including launching applications, organizing windows, handling files, executing terminal commands, and recording the entire session.

  • 🖥️ Full Desktop Control: Complete mouse and keyboard automation, window management, application launching, file operations, terminal access, and screen recording
  • 🤖 Dual Integration Modes: Supports both the Model Context Protocol (MCP) for AI systems and a direct Sandbox API, adapting to any agent or backend logic
  • 🐳 Docker Native: No complex VM setup, just Docker. The environment is isolated, reproducible, and deployable anywhere in less than 10 seconds. Supports AMD64 and ARM64 architectures.

🎯 One-Line Setup

from screenenv import Sandbox
sandbox = Sandbox()  # That's it!

Two Integration Approaches

ScreenEnv provides two complementary ways to integrate with your agents and backend systems, giving you flexibility to choose the approach that best fits your architecture:

Option 1: Direct Sandbox API

Perfect for custom agent frameworks, existing backends, or when you need fine-grained control:

from io import BytesIO

from PIL import Image
from screenenv import Sandbox

# Direct programmatic control
sandbox = Sandbox(headless=False)
sandbox.launch("xfce4-terminal")
sandbox.write("echo 'Custom agent logic'")

# Decode the returned screenshot bytes with Pillow
screenshot_bytes = sandbox.screenshot()
image = Image.open(BytesIO(screenshot_bytes))
...
sandbox.close()
# If close() isn't called, you might need to shut down the container yourself.
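
Because a missed close() can leave a container running, one option is to wrap the session so that cleanup always happens. The sketch below uses contextlib.closing from the standard library. FakeSandbox is a hypothetical stand-in, not part of screenenv, so the snippet runs without Docker; it assumes only that the real Sandbox exposes close() as shown above.

```python
from contextlib import closing


class FakeSandbox:
    """Hypothetical stand-in for screenenv.Sandbox so this runs without Docker."""

    def __init__(self):
        self.closed = False
        self.typed = []

    def write(self, text):
        self.typed.append(text)

    def close(self):
        self.closed = True


def run_session(sandbox):
    # closing() calls sandbox.close() on exit, even if the body raises.
    with closing(sandbox) as sb:
        sb.write("echo 'Custom agent logic'")
        raise RuntimeError("simulated agent failure")


sandbox = FakeSandbox()  # with the real library: sandbox = Sandbox()
try:
    run_session(sandbox)
except RuntimeError:
    pass

print(sandbox.closed)
```

The same pattern should apply to any object with a close() method, so it could also wrap the MCP server object if that fits your setup.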

Option 2: MCP Server Integration

Ideal for AI systems that support the Model Context Protocol:

import asyncio
import base64
from io import BytesIO

from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client
from PIL import Image
from screenenv import MCPRemoteServer

# Start MCP server for AI integration
server = MCPRemoteServer(headless=False)
print(f"MCP Server URL: {server.server_url}")

# AI agents can now connect and control the desktop
async def mcp_session():
    # streamablehttp_client yields (read_stream, write_stream, get_session_id);
    # ClientSession only needs the two streams.
    async with streamablehttp_client(server.server_url) as (read_stream, write_stream, _):
        async with ClientSession(read_stream, write_stream) as session:
            await session.initialize()
            print(await session.list_tools())

            response = await session.call_tool("screenshot", {})
            image_bytes = base64.b64decode(response.content[0].data)
            image = Image.open(BytesIO(image_bytes))

# Run the session, then shut the server down even if the session fails
try:
    asyncio.run(mcp_session())
finally:
    server.close()
# If close() isn't called, you might need to shut down the container yourself.
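
The snippet above reads response.content[0] directly, which assumes the first content item is the image. A slightly more defensive sketch is shown below: it scans the content list for the first item whose type is "image" and decodes its base64 data. The helper name first_image_bytes and the fake response are hypothetical; the fake items only mimic the type, data, and mimeType fields of MCP content items so the example runs without a server.

```python
import base64
from types import SimpleNamespace


def first_image_bytes(content_items):
    """Return the decoded bytes of the first image item in a tool response."""
    for item in content_items:
        if getattr(item, "type", None) == "image":
            return base64.b64decode(item.data)
    raise ValueError("tool response contained no image content")


# Hypothetical response content, shaped like MCP text and image items.
fake_content = [
    SimpleNamespace(type="text", text="screenshot taken"),
    SimpleNamespace(
        type="image",
        data=base64.b64encode(b"fake-image-bytes").decode("ascii"),
        mimeType="image/png",
    ),
]

image_bytes = first_image_bytes(fake_content)
print(len(image_bytes))
```

In a real session you would pass response.content to the helper and hand the result to Image.open(BytesIO(...)) as before.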

This dual approach means ScreenEnv adapts to your existing infrastructure rather than forcing you to change your agent architecture.

✨ Create a Desktop Agent with screenenv and smolagents

screenenv natively supports smolagents, making it easy to build your own custom Desktop Agent for automation. Here’s how to create your own AI-powered Desktop Agent in just a few steps:

1. Choose Your Model

Pick the backend VLM you want to power your agent.

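import os

# Pick ONE of the connectors below; each block reassigns `model`.

# OpenAI
from smolagents import OpenAIServerModel
model = OpenAIServerModel(
    model_id="gpt-4.1",
    api_key=os.getenv("OPENAI_API_KEY"),
)

# Inference Endpoints
# Note: newer smolagents releases rename HfApiModel to InferenceClientModel.
from smolagents import HfApiModel
model = HfApiModel(
    model_id="Qwen/Qwen2.5-VL-7B-Instruct",
    token=os.getenv("HF_TOKEN"),
    provider="nebius",
)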

# Transformer models
from smolagents import TransformersModel
model = TransformersModel(
    model_id="Qwen/Qwen2.5-VL-7B-Instruct",
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True,
)

# Other providers
from smolagents import LiteLLMModel
model = LiteLLMModel(model_id="anthropic/claude-sonnet-4-20250514")

# See the smolagents documentation for the full list of available model connectors.
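
If you switch between these connectors often, a small selector keyed on the available credentials can keep that choice in one place. The sketch below is only an illustration: pick_backend and the labels it returns are hypothetical names, and the environment variables are the ones used in the snippets above. It returns a label rather than constructing a model, so it runs without smolagents installed.

```python
import os


def pick_backend(env=None):
    """Return a label for the first backend whose credential is present."""
    env = os.environ if env is None else env
    if env.get("OPENAI_API_KEY"):
        return "openai"
    if env.get("HF_TOKEN"):
        return "inference-endpoints"
    # No hosted credential found: fall back to a locally loaded model.
    return "transformers"


print(pick_backend({"HF_TOKEN": "hf_example"}))
```

You could then map each label to the matching constructor from the list above.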

2. Define Your Custom Desktop Agent

Inherit from DesktopAgentBase and implement the _setup_desktop_tools method to build your own action space!

from screenenv import DesktopAgentBase, Sandbox
from smolagents import Model, Tool, tool
from smolagents.monitoring import LogLevel
from typing import List

class CustomDesktopAgent(DesktopAgentBase):
    """Agent for desktop automation"""

    def __init__(
        self,
        model: Model,
        data_dir: str,
        desktop: Sandbox,
        tools: List[Tool] | None = None,
        max_steps: int = 200,
        verbosity_level: LogLevel = LogLevel.INFO,
        planning_interval: int | None = None,
        use_v1_prompt: bool = False,
        **kwargs,
    ):
        super().__init__(
            model=model,
            data_dir=data_dir,
            desktop=desktop,
            tools=tools,
            max_steps=max_steps,
            verbosity_level=verbosity_level,
            planning_interval=planning_interval,
            use_v1_prompt=use_v1_prompt,
            **kwargs,
        )

        # OPTIONAL: Add a custom prompt template - see src/screenenv/desktop_agent/desktop_agent_base.py for more details about the default prompt template
        # self.prompt_templates["system_prompt"] = CUSTOM_PROMPT_TEMPLATE.replace(
        #     "<<resolution_x>>", str(self.width)
        # ).replace("<<resolution_y>>", str(self.height))
        # Important: Adjust the prompt based on your action space to improve results.

    def _setup_desktop_tools(self) -> None:
        """Define your custom tools here."""
        
        
        @tool
        def click(x: int, y: int) -> str:
            """
            Clicks at the specified coordinates.
            Args:
                x: The x-coordinate of the click
                y: The y-coordinate of the click
            """
            self.desktop.left_click(x, y)
            # Optional: set self.click_coordinates = (x, y) to mark the click on the observation screenshot
            return f"Clicked at ({x}, {y})"
        
        self.tools["click"] = click
        

        @tool
        def write(text: str) -> str:
            """
            Types the specified text at the current cursor position.
            Args:
                text: The text to type
            """
            self.desktop.write(text, delay_in_ms=10)
            return f"Typed text: '{text}'"

        self.tools["write"] = write

        @tool
        def press(key: str) -> str:
            """
            Presses a keyboard key or combination of keys
            Args:
                key: The key to press (e.g. "enter", "space", "backspace") or a key combination such as "ctrl+a" or "ctrl+shift+a".
            """
            self.desktop.press(key)
            return f"Pressed key: {key}"

        self.tools["press"] = press
        
        @tool
        def open(file_or_url: str) -> str:
            """
            Directly opens a browser with the specified url or opens a file with the default application.
            Args:
                file_or_url: The URL or file to open
            """
            self.desktop.open(file_or_url)
            self.logger.log(f"Opening: {file_or_url}")
            return f"Opened: {file_or_url}"

        # Register the tool; without this line the agent cannot call it.
        self.tools["open"] = open

        @tool
        def launch_app(app_name: str) -> str:
            """
            Launches the specified application.
            Args:
                app_name: The name of the application to launch
            """
            self.desktop.launch(app_name)
            return f"Launched application: {app_name}"

        self.tools["launch_app"] = launch_app

        ... # Continue implementing your own action space.
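
When extending the action space, it can help to validate model-predicted coordinates before they reach the desktop, since a VLM may occasionally predict a point just outside the screen. The helper below is a minimal sketch of that idea. clamp_point is a hypothetical name, and whether clamping or rejecting the action is the better policy depends on your agent.

```python
def clamp_point(x, y, width, height):
    """Clamp (x, y) into the valid pixel range [0, width-1] x [0, height-1]."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    clamped_x = min(max(int(x), 0), width - 1)
    clamped_y = min(max(int(y), 0), height - 1)
    return clamped_x, clamped_y


# A model that predicts a point just past the edge of a 1920x1080 screen.
print(clamp_point(1925, -3, 1920, 1080))
```

Inside the click tool, one possible use is x, y = clamp_point(x, y, self.width, self.height) before calling self.desktop.left_click(x, y), assuming the agent exposes width and height as the commented prompt-template code above suggests.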

3. Run the Agent on a Desktop Task

from screenenv import Sandbox

# Define your sandbox environment
sandbox = Sandbox(headless=False, resolution=(1920, 1080))

# Create your agent
agent = CustomDesktopAgent(
    model=model,
    data_dir="data",
    desktop=sandbox,
)

# Run a task
task = "Open LibreOffice, write a report of approximately 300 words on the topic ‘AI Agent Workflow in 2025’, and save the document."

result = agent.run(task)
print(f"📄 Result: {result}")

sandbox.close()

If you encounter a Docker "access denied" (permission) error, try running the agent with sudo -E python test.py, or add your user to the docker group.
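
To tell in advance whether you will hit this permission error, you can check access to the Docker socket from Python. The sketch below assumes the default Linux socket path /var/run/docker.sock, which may differ on your system (for example if DOCKER_HOST points elsewhere). can_use_docker_socket is a hypothetical helper, not part of screenenv.

```python
import os


def can_use_docker_socket(path="/var/run/docker.sock"):
    """Return True if the current user can read and write the Docker socket."""
    return os.path.exists(path) and os.access(path, os.R_OK | os.W_OK)


if can_use_docker_socket():
    print("Docker socket is accessible")
else:
    print("No access to the Docker socket: use sudo -E or join the docker group")
```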

💡 For a comprehensive implementation, see this CustomDesktopAgent source on GitHub.

Get Started Today

# Install ScreenEnv
pip install screenenv

# Try the examples
git clone git@github.com:huggingface/screenenv.git
cd screenenv
python -m examples.desktop_agent
# Use `sudo -E python -m examples.desktop_agent` if you're not in the `docker` group

What's Next?

ScreenEnv aims to expand beyond Linux to support Android, macOS, and Windows, unlocking true cross-platform GUI automation. This will enable developers and researchers to build agents that generalize across environments with minimal setup.

These advancements pave the way for creating reproducible, sandboxed environments ideal for benchmarking and evaluation.

Repository: https://github.com/huggingface/screenenv


