Google I/O 2026 Steals the Show: World Models and AI Digital Watermarks Arrive! Gemini Officially Enters the "Agent Era"

In the early hours of May 20, Beijing time, Google’s annual developer conference, Google I/O 2026, kicked off in Mountain View, California. Google CEO Sundar Pichai declared: "We have entered the Gemini agent era."
Ten Years of AI-First Strategy: From Labs to Billions of People
"Ten years ago, we steered the company toward an AI-first strategy. To this day, we still believe artificial intelligence is the most profound way to fulfill our mission and improve people’s lives on a massive scale," Pichai said in his opening keynote.
He cited a series of striking figures to illustrate AI’s explosive growth over the past year:
Google’s monthly token volume has surged from 9.7 trillion two years ago to 480 trillion last year, and now exceeds 3.2 quadrillion (3,200 trillion), roughly a sevenfold increase over last year;
Monthly active users of the Gemini app have grown from 400 million last year to more than 900 million today, while daily requests are up sevenfold;
Launched just one year ago, AI Mode in Search has surpassed 1 billion monthly active users, making it one of the fastest-growing features in Google’s history;
More than 8.5 million developers use Google’s AI models to build applications every month.
All of this is underpinned by Google’s unprecedented infrastructure investment. Pichai disclosed that Google’s capital expenditure was 31 billion US dollars in 2022 and will reach 180 to 190 billion US dollars in 2026, nearly six times the 2022 level.
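As a quick sanity check, the multiples quoted above can be recomputed directly from the keynote figures. This minimal sketch uses only the numbers stated in this article; the helper name `growth_multiple` and the constant names are illustrative.

```python
def growth_multiple(before: float, after: float) -> float:
    """Return how many times larger `after` is than `before`."""
    return after / before


# Figures as quoted in the keynote.
TOKENS_LAST_YEAR_T = 480      # monthly tokens, trillions
TOKENS_NOW_T = 3_200          # 3.2 quadrillion = 3,200 trillion
GEMINI_MAU_LAST_YEAR_M = 400  # monthly active users, millions
GEMINI_MAU_NOW_M = 900        # "more than 900 million", used as a lower bound
CAPEX_2022_B = 31             # capital expenditure, billions of USD
CAPEX_2026_RANGE_B = (180, 190)

print(f"Tokens, last year -> now: {growth_multiple(TOKENS_LAST_YEAR_T, TOKENS_NOW_T):.1f}x")
print(f"Gemini MAU: {growth_multiple(GEMINI_MAU_LAST_YEAR_M, GEMINI_MAU_NOW_M):.2f}x")
low, high = (growth_multiple(CAPEX_2022_B, c) for c in CAPEX_2026_RANGE_B)
print(f"Capex, 2022 -> 2026: {low:.1f}x to {high:.1f}x")
```

The token figure works out to about 6.7x, which rounds to "sevenfold", and the capex range to roughly 5.8x to 6.1x. The user-count ratio (2.25x) is a separate metric from the sevenfold rise in daily requests.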
Debuting a "Dual-Chip Strategy": Tearing Down the Invisible Physical Walls of the Data Center
Facing global power shortages and single data centers approaching their physical capacity limits, Google introduced a "dual-chip strategy" for the first time, splitting its underlying hardware architecture into two independent tracks: TPU 8t, dedicated to large-scale pre-training, and TPU 8i, optimized for the extremes of high-concurrency inference.
As a training powerhouse, TPU 8t delivers nearly three times the raw compute of its predecessor. The more fundamental shift, however, lies in the software stack. By deeply reworking its core distributed frameworks, JAX and Pathways, Google has opened up a new possibility for the industry: model training is no longer confined within the walls of a single giant data center. Using network-wide scheduling across these two technologies, Google has achieved coordinated training across multiple physical sites, seamlessly linking more than one million TPUs worldwide.
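Google did not detail how JAX and Pathways coordinate training across sites. One common pattern for data-parallel training over links of very different speeds is a hierarchical reduction: chips first combine gradients over fast links inside a site, and only one partial result per site crosses the slower inter-site network. The plain-Python sketch below illustrates that general idea under this assumption; it is not Google's implementation, and the site names and gradient values are hypothetical.

```python
from typing import Dict, List

Gradient = List[float]


def local_reduce(grads: List[Gradient]) -> Gradient:
    """Sum the gradients from the chips inside one site (fast intra-site links)."""
    return [sum(col) for col in zip(*grads)]


def hierarchical_mean(sites: Dict[str, List[Gradient]]) -> Gradient:
    """Two-level reduction: sum inside each site first, then exchange only one
    partial sum per site over the slower cross-site links and divide by the
    total number of chips."""
    site_sums = [local_reduce(grads) for grads in sites.values()]
    total_chips = sum(len(grads) for grads in sites.values())
    global_sum = [sum(col) for col in zip(*site_sums)]
    return [value / total_chips for value in global_sum]


# Hypothetical example: two sites with different numbers of chips.
sites = {
    "site_a": [[1.0, 2.0], [3.0, 4.0]],
    "site_b": [[5.0, 6.0]],
}
print(hierarchical_mean(sites))  # prints [3.0, 4.0]
```

Exchanging one partial sum per site rather than one gradient per chip keeps slow cross-site traffic proportional to the number of sites, and summing before dividing makes the result mathematically identical to a flat average over every chip.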
This breakthrough effectively dismantles the old arms race of simply building ever-larger standalone data centers. For model builders, it means training cycles for ultra-large models shrink from months to weeks. TPU 8i, meanwhile, targets the biggest pain point in commercial AI deployment: latency. A core tenet Google distilled from 27 years of search engineering is fully embodied in this chip: in the AI era, latency still decides whether an application lives or dies. TPU 8i accelerates every small step of inference in hardware, making real-time responses possible for the agents running on top of it.
Gemini 3.5 and Gemini Omni: From Text Probabilities to a Physical-World Simulator
These infrastructure leaps have directly enabled a new generation of foundation model families. At the conference, Google unveiled the Gemini 3.5 series with a sharply targeted strategy: rather than chasing ever-larger parameter counts, it shifts the focus to balancing speed, cost efficiency and the ability to act.
Rolling out globally starting today as the default model, Gemini 3.5 Flash all but overturns industry assumptions about lightweight models. It outperforms the previous flagship, Gemini 3.1 Pro, across a range of benchmarks, with especially clear advantages in coding and on the newly introduced GDPVal, an evaluation of economic value.
While maintaining top-tier intelligence, Gemini 3.5 Flash generates four times as many output tokens per second as other frontier models in its class. This extreme cost-performance advantage is Google’s strategic trump card for overwhelming both open-source and closed-source competitors in the race for developers.
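The practical effect of a 4x gain in output tokens per second is easiest to see with a simple latency model: total response time is roughly the time to first token plus the number of output tokens divided by decode throughput. The sketch below uses hypothetical numbers, since no latency or throughput figures were given, to show that the speedup applies only to the decode portion.

```python
def response_time_s(ttft_s: float, output_tokens: int, tokens_per_s: float) -> float:
    """Approximate end-to-end latency: time to first token plus decode time."""
    return ttft_s + output_tokens / tokens_per_s


# Hypothetical values chosen only for illustration.
TTFT_S = 0.5                    # time to first token, seconds
OUTPUT_TOKENS = 1_000
BASELINE_TPS = 50.0             # tokens per second for a comparison model
FASTER_TPS = 4 * BASELINE_TPS   # "four times the token output speed"

baseline = response_time_s(TTFT_S, OUTPUT_TOKENS, BASELINE_TPS)
faster = response_time_s(TTFT_S, OUTPUT_TOKENS, FASTER_TPS)
print(f"baseline: {baseline:.1f} s, 4x throughput: {faster:.1f} s")
# prints "baseline: 20.5 s, 4x throughput: 5.5 s"
```

Because the time to first token is unchanged in this model, the end-to-end speedup here is about 3.7x rather than a full 4x; the longer the response, the closer it gets to 4x.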
The more disruptive breakthrough comes from the new Gemini Omni family. Google does not describe it as a conventional multimodal model but as a true "world model."
At its core, Gemini Omni is a unified network that can turn any input modality (text, image, video or audio) into any output modality. As the first product in the family, Gemini Omni Flash not only understands a wide range of audio-visual inputs but also has an intuitive grasp of the physical world. In live demonstrations, Google showed that the model can understand the laws of dynamics, kinetic energy conversion and gravitational effects, and that it is already being applied to training cutting-edge robots.
On the user-experience side, Gemini Omni Flash blurs the boundary between reasoning and content generation. In one demo, a user precisely edited a complex stop-motion short about amino acids through natural conversation alone: whether replacing the background, adjusting shots or changing the characters’ motion trajectories, the model rendered high-quality, cinematic video in real time in response to the dialogue. Omni effectively fuses the capabilities of standalone models such as Nano, Genie and Veo into one, completing the leap from multimodal to omni-modal AI.
Gemini Spark: 24/7 On-Duty Personal AI Agent
These model upgrades have upended interaction at the application layer. Google launched Gemini Spark for mainstream users: a personal AI agent that can run autonomously in the background around the clock.
Unlike the traditional passive pattern in which a user enters a prompt and the AI gives a single reply, Gemini Spark is built on Google’s new Antigravity development platform and is unusually proactive. Like a tireless digital secretary running quietly in the background, it keeps working through complex tasks in the cloud even after the user closes the laptop or locks the phone. It can go through last month’s bank statements to find and flag recurring subscription charges buried in them, or scan the whole family’s emails and calendars and produce a concise, action-oriented family briefing every morning.
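Google did not say how Spark identifies recurring charges. One simple heuristic, sketched here purely as an assumption, is to group transactions by merchant and amount and flag any group that repeats at roughly monthly intervals. The `Transaction` type, the merchant names, the amounts and the 25-to-35-day window are all hypothetical; amounts are stored as integer cents to avoid floating-point comparisons.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, List, NamedTuple, Tuple


class Transaction(NamedTuple):
    day: date
    merchant: str
    amount_cents: int


def find_recurring(transactions: List[Transaction],
                   min_gap_days: int = 25,
                   max_gap_days: int = 35,
                   min_occurrences: int = 2) -> List[Tuple[str, int]]:
    """Flag (merchant, amount) pairs charged repeatedly at roughly monthly intervals."""
    groups: Dict[Tuple[str, int], List[date]] = defaultdict(list)
    for tx in transactions:
        groups[(tx.merchant, tx.amount_cents)].append(tx.day)

    recurring = []
    for key, days in groups.items():
        if len(days) < min_occurrences:
            continue
        days.sort()
        gaps = [(later - earlier).days for earlier, later in zip(days, days[1:])]
        # Every gap between consecutive charges must fall inside the monthly window.
        if all(min_gap_days <= gap <= max_gap_days for gap in gaps):
            recurring.append(key)
    return sorted(recurring)


statement = [
    Transaction(date(2026, 3, 3), "StreamCo", 1299),
    Transaction(date(2026, 4, 3), "StreamCo", 1299),
    Transaction(date(2026, 4, 10), "Grocer", 5420),
    Transaction(date(2026, 4, 15), "CloudDrive", 299),
    Transaction(date(2026, 5, 15), "CloudDrive", 299),
    Transaction(date(2026, 5, 2), "Grocer", 3175),
]
print(find_recurring(statement))  # prints [('CloudDrive', 299), ('StreamCo', 1299)]
```

Matching on exact amounts keeps the sketch simple; a real system would also need to tolerate price changes and irregular billing dates.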
This deep autonomy rests on a broad ecosystem alliance. Beyond Google’s own Workspace apps, Gemini Spark integrates through MCP (Model Context Protocol) with more than 30 mainstream third-party apps, including Adobe, Asana, Dropbox, Lyft, Uber and Zillow. Freed from isolated app silos, the agent can chain complex actions across platforms.
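MCP messages are JSON-RPC 2.0 requests, and an agent invokes a tool that an app exposes with a `tools/call` request carrying the tool's name and arguments. The sketch below only builds such a request; transport (stdio or HTTP) is out of scope, and the tool name `create_task` and its arguments are hypothetical, not a description of any real integration with the apps named above.

```python
import json
from itertools import count
from typing import Any, Dict

_request_ids = count(1)  # JSON-RPC request ids must be unique per session


def make_tool_call(tool_name: str, arguments: Dict[str, Any]) -> str:
    """Serialize an MCP-style JSON-RPC 2.0 `tools/call` request."""
    request = {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    return json.dumps(request)


# Hypothetical tool exposed by a task-management MCP server.
message = make_tool_call("create_task", {"title": "Renew passport", "due": "2026-06-01"})
print(message)
```

The server replies with a JSON-RPC response carrying the same `id`, which lets an agent keep several tool calls in flight across different apps.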
To keep autonomous agents from running out of control while executing tasks, Google also released a companion compliance safeguard: the Agent Payments Protocol (AP2). As the top-level guardrail governing every interaction between Gemini Spark and external commercial interfaces, AP2 strictly blocks any unauthorized purchase or financial commitment made by the AI without the user’s explicit consent or knowledge, protecting users’ funds as AI moves toward greater autonomy.
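The article does not describe AP2's message formats, so the following is not AP2 itself. It is a minimal, hypothetical sketch of the guardrail idea described above: an agent may only complete a purchase that falls within limits the user has explicitly approved, and anything else is escalated back to the user. The `SpendingMandate` class, its fields and the merchant names are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import FrozenSet


@dataclass(frozen=True)
class SpendingMandate:
    """Hypothetical record of what the user has explicitly authorized."""
    max_amount_cents: int
    allowed_merchants: FrozenSet[str] = field(default_factory=frozenset)


def authorize(mandate: SpendingMandate, merchant: str, amount_cents: int) -> str:
    """Return 'approve' only for purchases inside the user's mandate;
    everything else goes back to the user for explicit confirmation."""
    if merchant not in mandate.allowed_merchants:
        return "ask_user"
    if amount_cents > mandate.max_amount_cents:
        return "ask_user"
    return "approve"


mandate = SpendingMandate(max_amount_cents=5_000, allowed_merchants=frozenset({"Lyft", "Uber"}))
print(authorize(mandate, "Uber", 2_350))     # within the mandate: approve
print(authorize(mandate, "Uber", 12_000))    # over the limit: ask_user
print(authorize(mandate, "ShoeStore", 900))  # merchant never authorized: ask_user
```

Defaulting to "ask the user" on any mismatch is the conservative design choice: the agent can never widen its own spending authority.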
On the developer side, Google Flow brings the concept of "Vibe Code" to the mainstream. Developers, and even creatives with no coding knowledge at all, can build complex visual tools, stop-motion animation layers and video-effects tools in real time inside Flow, through plain intent-driven conversation, hand-drawn sketches or layered audio-visual material. Development is no longer tedious syntax wrangling but the real-time capture and realization of ideas.
From Multimodal Search to AI Everywhere: Google’s Ambition
Across its wider ecosystem, Google signaled its intent to fully rethink the traditional gateways to the internet. AI Mode in Google Search now features a technology called Generative UI.
Now, when a user asks a complex, systemic question in the search box, such as how the orbital periods of the solar system’s planets compare, the search engine no longer returns a cold list of links plus a summary paragraph. Instead, it assembles and renders a fully interactive orbital-simulation component on the front end in real time, built around the question. Web pages are no longer rigid, predesigned layouts but bespoke applications generated on the fly from the user’s intent. Combined with a universal shopping cart and information agents, search is turning into a closed-loop engine for making decisions and carrying them out.
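An orbital-period widget like the one described needs very little underlying math. Assuming it relies on Kepler's third law (the article does not say how the component is built), a planet's period in Earth years equals its semi-major axis in astronomical units raised to the power 1.5. This sketch computes the values such a component might animate, using rounded semi-major axes.

```python
# Rounded semi-major axes in astronomical units (AU).
SEMI_MAJOR_AXIS_AU = {
    "Mercury": 0.387,
    "Venus": 0.723,
    "Earth": 1.000,
    "Mars": 1.524,
    "Jupiter": 5.203,
    "Saturn": 9.537,
    "Uranus": 19.19,
    "Neptune": 30.07,
}


def orbital_period_years(a_au: float) -> float:
    """Kepler's third law for bodies orbiting the Sun: T^2 = a^3 (T in years, a in AU)."""
    return a_au ** 1.5


for planet, a in SEMI_MAJOR_AXIS_AU.items():
    print(f"{planet:<8} {orbital_period_years(a):7.2f} years")
```

Mars comes out at about 1.88 years and Jupiter at about 11.87 years; a front-end component would only need to scale these periods into animation speeds.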
To counter the crisis of trust that a flood of generated content could cause, Google announced it is extending SynthID digital watermarking beyond its multimodal generation tools into the core of Google Search and Chrome. With Circle to Search or a simple right-click, users can pull up C2PA Content Credentials within milliseconds, sharply limiting the room for fake synthetic content to spread.
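C2PA Content Credentials attach a cryptographically signed manifest to an asset, and one of its assertions binds the manifest to the asset's bytes with a hash. The sketch below shows only that binding check in simplified form: it compares a SHA-256 digest recorded in a hypothetical manifest dictionary with a digest of the bytes in hand. Real C2PA validation also verifies the manifest's signature and certificate chain and uses the container format defined in the C2PA specification, none of which is reproduced here. SynthID, which embeds an imperceptible watermark in the content itself, is a separate mechanism and is not modeled.

```python
import hashlib
from typing import Dict


def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of the asset bytes."""
    return hashlib.sha256(data).hexdigest()


def hash_binding_matches(asset: bytes, manifest: Dict[str, str]) -> bool:
    """Simplified hard-binding check: does the digest recorded in the
    (hypothetical) manifest still match the bytes we were given?"""
    return manifest.get("alg") == "sha256" and manifest.get("hash") == sha256_hex(asset)


original = b"example image bytes"
manifest = {"alg": "sha256", "hash": sha256_hex(original)}  # recorded at creation time

print(hash_binding_matches(original, manifest))            # unmodified asset: True
print(hash_binding_matches(original + b"edit", manifest))  # bytes altered afterwards: False
```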
Google also introduced a range of new features for work and everyday high-frequency scenarios:
Docs Live: Users can speak their scattered thoughts aloud as casually as they like, and Gemini turns them in real time into well-structured, rigorously organized formal documents free of verbal clutter.
Google Pics: A brand-new core component for in-depth image generation and generative editing, which completely reconstructs the visual asset creation workflow within Workspace.
Daily Brief: An out-of-the-box resident agent that arranges priority schedules and core tasks for users every morning.
At the event, Google also announced a deep hardware partnership with Samsung and, together with eyewear brands Warby Parker and Gentle Monster, unveiled new smart glasses powered by Gemini Intelligence.
In the view of Large Model Home (大模型之家), what Google really wants to bring to market is a whole new way of working: models handle understanding, agents handle execution, and products embed that execution into every high-frequency scenario. Search, Gmail, Docs, YouTube, Shopping, Android and Chrome, once separate entry points, are being stitched together by a single Gemini logic.
AI competition today is no longer about who can generate the most human-like reply, but about who can quietly get things done before users even realize they need them. According to Google, these capabilities will roll out in stages by region and subscription tier. Still, judging by the density of its product lineup and the pace of its updates, Google has laid its cards on the table: it is competing not just for the top of the model leaderboards but for the operating-system-like gateway of the next era.
Source: The Paper