Google DeepMind has released Genie 3, a breakthrough general-purpose world model capable of generating highly diverse, interactive environments. It supports real-time navigation at 24 frames per second while maintaining impressive consistency at 720p resolution for several minutes. Building on the previous Genie models, Genie 3 marks an important step toward AGI by enabling AI agents to predict how environments evolve and how their actions affect them within rich simulated settings. Its key capabilities include modelling physical properties, simulating natural and fictional worlds, and exploring historical settings, with technical breakthroughs enabling real-time interaction and long-horizon environmental consistency. Genie 3 also introduces “promptable world events,” which let users alter the generated world through text commands, enhancing interactivity and enabling complex “what if” scenarios for agent training. Despite these advanced capabilities, Genie 3 still faces limitations, such as a restricted agent action space, challenges with multi-agent interaction, and imperfect representation of real-world locations. DeepMind emphasizes responsible development and is releasing Genie 3 as a limited research preview for academics and creators to gather feedback.
Today we are announcing Genie 3, a general purpose world model that can generate an unprecedented diversity of interactive environments.
Given a text prompt, Genie 3 can generate dynamic worlds that you can navigate in real time at 24 frames per second, retaining consistency for a few minutes at a resolution of 720p.
Towards world simulation
At Google DeepMind, we have been pioneering research in simulated environments for over a decade, from training agents to master real-time strategy games to developing simulated environments for open-ended learning and robotics. This work motivated our development of world models, which are AI systems that can use their understanding of the world to simulate aspects of it, enabling agents to predict both how an environment will evolve and how their actions will affect it.
World models are also a key stepping stone on the path to AGI, since they make it possible to train AI agents in an unlimited curriculum of rich simulation environments. Last year we introduced the first foundation world models with Genie 1 and Genie 2, which could generate new environments for agents. We have also continued to push the state of the art in video generation with our models Veo 2 and Veo 3, which exhibit a deep understanding of intuitive physics.
Each of these models marks progress along different capabilities of world simulation. Genie 3 is our first world model to allow interaction in real time, while also improving consistency and realism compared to Genie 2.
Genie 3 can generate a consistent and interactive world over a longer horizon.
Genie 3’s capabilities include:
The following are recordings of real-time interactions with Genie 3.
Modelling physical properties of the world
Experience natural phenomena like water and lighting, and complex environmental interactions.
Simulating the natural world
Generate vibrant ecosystems, from animal behaviors to intricate plant life.
Modelling animation and fiction
Tap into imagination, creating fantastical scenarios and expressive animated characters.
Exploring locations and historical settings
Transcend geographical and temporal boundaries to explore places and past eras.
Pushing the frontier of real-time capabilities
Achieving a high degree of controllability and real-time interactivity in Genie 3 required significant technical breakthroughs. During the auto-regressive generation of each frame, the model has to take into account the previously generated trajectory that grows with time. For example, if the user is revisiting a location after a minute, the model has to refer back to the relevant information from a minute ago. To achieve real-time interactivity, this computation must happen multiple times per second in response to new user inputs as they arrive.
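As a rough illustration of the constraint described above, here is a minimal sketch of a real-time auto-regressive generation loop in Python. Everything in it (the Trajectory container, the generate_frame stand-in, the specific frame budget arithmetic) is hypothetical; Genie 3's actual architecture and interface are not public.

```python
# Hypothetical sketch of real-time auto-regressive frame generation.
import time
from dataclasses import dataclass, field

FPS = 24
FRAME_BUDGET = 1.0 / FPS  # each frame must be ready in ~41 ms


@dataclass
class Trajectory:
    """The generated history the model conditions on; it grows every frame."""
    frames: list = field(default_factory=list)
    actions: list = field(default_factory=list)


def generate_frame(trajectory: Trajectory, action: str) -> str:
    # Stand-in for the model's forward pass: it must condition on the
    # entire trajectory so far, e.g. to re-render a location the user
    # visited a minute ago.
    return f"frame_{len(trajectory.frames)}_after_{action}"


def run(num_frames: int = 5) -> None:
    traj = Trajectory()
    for _ in range(num_frames):
        start = time.monotonic()
        action = "move_forward"  # would come from live user input
        frame = generate_frame(traj, action)
        traj.frames.append(frame)
        traj.actions.append(action)
        # Real-time constraint: the whole step must fit in the frame budget,
        # even as the conditioning trajectory keeps growing.
        elapsed = time.monotonic() - start
        time.sleep(max(0.0, FRAME_BUDGET - elapsed))


if __name__ == "__main__":
    run()
```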
Environmental consistency over a long horizon
In order for AI generated worlds to be immersive, they have to stay physically consistent over long horizons. However, generating an environment auto-regressively is generally a harder technical problem than generating an entire video, since inaccuracies tend to accumulate over time. Despite the challenge, Genie 3 environments remain largely consistent for several minutes, with visual memory extending as far back as one minute ago.
The trees to the left of the building remain consistent throughout the interaction, even as they go in and out of view.
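As a back-of-the-envelope illustration, a visual memory reaching one minute into the past at 24 frames per second corresponds to a conditioning window on the order of 1,440 frames. The sketch below models that window as a simple bounded buffer; it is purely illustrative, and Genie 3's real memory mechanism is not public.

```python
# Illustrative bounded "visual memory": context reaching back ~1 minute.
from collections import deque

FPS = 24
MEMORY_SECONDS = 60
memory = deque(maxlen=FPS * MEMORY_SECONDS)  # keeps the 1,440 newest frames

for t in range(FPS * 120):  # two minutes of simulated frames
    memory.append(f"frame_{t}")

# Frames older than one minute have fallen out of the window.
print(memory[0])    # "frame_1440": the oldest frame still in memory
print(len(memory))  # 1440
```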
Genie 3’s consistency is an emergent capability. Other methods, such as NeRFs and Gaussian Splatting, also allow consistent navigable 3D environments, but they depend on an explicit 3D representation being supplied. By contrast, worlds generated by Genie 3 are far more dynamic and rich because they are created frame by frame, conditioned on the world description and the user’s actions.
Promptable world events
In addition to navigational inputs, Genie 3 also enables a more expressive form of text-based interaction, which we refer to as promptable world events.
Promptable world events make it possible to change the generated world, like altering weather conditions or introducing new objects and characters, enhancing the experience beyond what navigation controls alone allow.
This ability also increases the breadth of counterfactual, or “what if,” scenarios that agents learning from experience can use to handle unexpected situations.
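One way to picture promptable world events is as timed text prompts injected into the per-frame conditioning alongside navigation inputs. The sketch below is a hypothetical illustration of that idea; the WorldEvent structure and step function are assumptions for illustration, not Genie 3's actual interface.

```python
# Hypothetical sketch: text events injected alongside navigation inputs.
from dataclasses import dataclass


@dataclass
class WorldEvent:
    frame_index: int  # when the event should take effect
    prompt: str       # free-form text describing the change


def step(frame_index: int, action: str, events: list[WorldEvent]) -> str:
    """One generation step conditioning on navigation plus any pending event."""
    pending = [e.prompt for e in events if e.frame_index == frame_index]
    conditioning = action if not pending else f"{action} | {'; '.join(pending)}"
    return f"frame_{frame_index} <- {conditioning}"


events = [
    WorldEvent(frame_index=48, prompt="a storm rolls in"),
    WorldEvent(frame_index=96, prompt="a herd of deer crosses the path"),
]
for i in range(120):
    frame = step(i, "move_forward", events)
```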
Fueling embodied agent research
To test whether worlds created by Genie 3 are suitable for future agent training, we generated worlds for a recent version of SIMA, our generalist agent for 3D virtual settings. In each world we instructed the agent to pursue a set of distinct goals, which it aims to achieve by sending navigation actions to Genie 3. Like any other environment, Genie 3 is not aware of the agent's goal; instead, it simulates the future based on the agent's actions.
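The separation described above, where the agent knows the goal but the world model only ever sees actions, can be sketched as a simple loop. The WorldModel and Agent classes below are hypothetical stand-ins, not the SIMA or Genie 3 APIs.

```python
# Hypothetical agent-in-the-loop sketch: only actions cross the boundary.
class WorldModel:
    """Simulates the future from actions alone; it never sees the goal."""
    def __init__(self):
        self.position = 0

    def step(self, action: str) -> int:
        self.position += {"forward": 1, "back": -1}.get(action, 0)
        return self.position  # stand-in for the next rendered frame


class Agent:
    """Pursues a goal by choosing navigation actions from observations."""
    def __init__(self, goal_position: int):
        self.goal = goal_position

    def act(self, observation: int) -> str:
        return "forward" if observation < self.goal else "back"


world, agent = WorldModel(), Agent(goal_position=5)
obs = world.position
while obs != agent.goal:
    obs = world.step(agent.act(obs))  # the world model only receives the action
```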
Since Genie 3 is able to maintain consistency, it is now possible to execute longer sequences of actions and achieve more complex goals. We expect this technology to play a critical role as we push toward AGI and agents take on a greater role in the world.
Limitations
While Genie 3 pushes the boundaries of what world models can accomplish, it's important to acknowledge its current limitations:
- Limited action space. Although promptable world events allow for a wide range of environmental interventions, they are not necessarily performed by the agent itself. The range of actions agents can perform directly is currently constrained.
- Interaction and simulation of other agents. Accurately modeling complex interactions between multiple independent agents in shared environments is still an ongoing research challenge.
- Accurate representation of real-world locations. Genie 3 is currently unable to simulate real-world locations with perfect geographic accuracy.
- Text rendering. Clear and legible text is often only generated when provided in the input world description.
- Limited interaction duration. The model can currently support a few minutes of continuous interaction, rather than extended hours.
Responsibility
We believe foundational technologies require a deep commitment to responsibility from the very beginning. The technical innovations in Genie 3, particularly its open-ended and real-time capabilities, introduce new challenges for safety and responsibility. To address these unique risks while aiming to maximize the benefits, we have worked closely with our Responsible Development & Innovation Team.
At Google DeepMind, we're dedicated to developing our best-in-class models in a way that amplifies human creativity, while limiting unintended impacts. As we continue to explore the potential applications for Genie, we are announcing Genie 3 as a limited research preview, providing early access to a small cohort of academics and creators. This approach allows us to gather crucial feedback and interdisciplinary perspectives as we explore this new frontier and continue to build our understanding of risks and their appropriate mitigations. We look forward to working further with the community to develop this technology in a responsible way.
Next steps
We believe Genie 3 is a significant moment for world models, where they will begin to have an impact on many areas of both AI research and generative media. To that end, we're exploring how we can make Genie 3 available to additional testers in the future.
Genie 3 could create new opportunities for education and training, helping students learn and experts gain experience. Not only can it provide a vast space to train agents such as robots and autonomous systems, but it can also make it possible to evaluate agents’ performance and explore their weaknesses.
At every step, we’re exploring the implications of our work and developing it for the benefit of humanity, safely and responsibly.
Acknowledgments
Genie 3 was made possible due to key research and engineering contributions from Phil Ball, Jakob Bauer, Frank Belletti, Bethanie Brownfield, Ariel Ephrat, Shlomi Fruchter, Agrim Gupta, Kristian Holsheimer, Aleks Holynski, Jiri Hron, Christos Kaplanis, Marjorie Limont, Matt McGill, Yanko Oliveira, Jack Parker-Holder, Frank Perbet, Ben Poole, Guy Scully, Jeremy Shar, Stephen Spencer, Omer Tov, Ruben Villegas, Emma Wang and Jessica Yung.
We thank Cip Baetu, Jordi Berbel, David Bridson, Jake Bruce, Gavin Buttimore, Sarah Chakera, Bilva Chandra, Paul Collins, Alex Cullum, Bogdan Damoc, Vibha Dasagi, Maxime Gazeau, Charles Gbadamosi, Woohyun Han, Ed Hirst, Ashyana Kachra, Lucie Kerley, Kristian Kjems, Eva Knoepfel, Vika Koriakin, Jessica Lo, Cong Lu, Zeb Mehring, Alex Moufarek, Henna Nandwani, Valeria Oliveira, Fabio Pardo, Jane Park, Andrew Pierson, Ben Poole, Helen Ran, Nilesh Ray, Tim Salimans, Manuel Sanchez, Igor Saprykin, Amy Shen, Sailesh Sidhwani, Duncan Smith, Joe Stanton, Hamish Tomlinson, Dimple Vijaykumar, Luyu Wang, Piers Wingfield, Nat Wong, Keyang Xu, Christopher Yew, Nick Young and Vadim Zubov for their invaluable partnership in developing and refining key components of this project.
Thanks to Tim Rocktäschel, Satinder Singh, Adrian Bolton, Inbar Mosseri, Aäron van den Oord, Douglas Eck, Dumitru Erhan, Raia Hadsell, Zoubin Ghahramani, Koray Kavukcuoglu and Demis Hassabis for their insightful guidance and support throughout the research process.
The feature video was produced by Suz Chambers, Matthew Carey, Alex Chen, Andrew Rhee, JR Schmidt, Scotch Johnson, Heysu Oh, Kaloyan Kolev, Arden Schager, Sam Lawton, Hana Tanimura, Zach Velasco, Ben Wiley, and Dev Valladares. It includes samples generated by Signe Norly, Eleni Shaw, Andeep Toor, Gregory Shaw, and Irina Blok.
Finally, we extend our gratitude to Mohammad Babaeizadeh, Gabe Barth-Maron, Parker Beak, Jenny Brennan, Tim Brooks, Max Cant, Harris Chan, Jeff Clune, Kaspar Daugaard, Dumitru Erhan, Ashley Feden, Simon Green, Nik Hemmings, Michael Huber, Jony Hudson, Dirichi Ike-Njoku, Bonnie Li, Simon Osindero, Georg Ostrovski, Ryan Poplin, Alex Rizkowsky, Giles Ruscoe, Ana Salazar, Guy Simmons, Jeff Stanway, Metin Toksoz-Exley, Petko Yotov, Mingda Zhang and Martin Zlocha for their insights and support.
