Cloudflare Adds Aggregations to R2 SQL to Support Data Analytics

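Cloudflare has announced a significant expansion of R2 SQL through the introduction of aggregation support. The enhancement includes aggregate functions such as SUM, COUNT, AVG, MIN, and MAX, along with SQL clauses such as GROUP BY and HAVING, letting developers run analytical workloads directly on data stored in R2. This removes the need for separate data warehouse tools and complex OLAP infrastructure. The article highlights that the new capabilities support quick data summarization, trend discovery, report generation, and log anomaly detection. In addition, R2 SQL now supports schema discovery commands (SHOW TABLES, DESCRIBE). Cloudflare explains the underlying distributed execution mechanism, which runs GROUP BY queries using scatter-gather and shuffling strategies. The article also mentions related updates, such as automatic snapshot expiration and compaction for Apache Iceberg tables in R2 Data Catalog, which further optimize query performance. Although R2 SQL is still in public beta, these updates show Cloudflare's continued effort to push data analytics capabilities to the edge, enabling its developer platform to support a wider range of workloads.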

Cloudflare recently announced support for aggregations in R2 SQL, a new feature that lets developers run SQL queries on data stored in R2. This enhancement expands R2 SQL beyond basic filtering and makes it more useful for analytical workloads without requiring separate data warehouse tools.

R2 SQL now supports SUM, COUNT, AVG, MIN, and MAX, as well as GROUP BY and HAVING clauses. These aggregation functions let developers run SQL analytics directly on data stored in R2 via the R2 Data Catalog, enabling them to quickly summarize data, spot trends, generate reports, and identify unusual patterns in logs. In addition to aggregations, the update introduces schema discovery commands, including SHOW TABLES and DESCRIBE.
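To make the shape of such a query concrete, the following is a minimal sketch that uses Python's built-in sqlite3 module as a local stand-in. It is not R2 SQL, and the exact R2 SQL grammar may differ. The table name http_logs and its columns are hypothetical and chosen only for illustration.

```python
import sqlite3

# SQLite in-memory database used as a stand-in; this is NOT R2 SQL.
conn = sqlite3.connect(":memory:")

# Hypothetical log table, invented for this example.
conn.execute("CREATE TABLE http_logs (status INTEGER, bytes INTEGER)")
conn.executemany(
    "INSERT INTO http_logs VALUES (?, ?)",
    [(200, 120), (500, 40), (200, 300), (404, 10), (500, 60), (200, 80)],
)

# Summarize traffic per status code, keeping only groups with at least 2 rows.
query = """
SELECT status,
       COUNT(*)   AS requests,
       SUM(bytes) AS total_bytes,
       AVG(bytes) AS avg_bytes,
       MIN(bytes) AS min_bytes,
       MAX(bytes) AS max_bytes
FROM http_logs
GROUP BY status
HAVING COUNT(*) >= 2
ORDER BY status
"""
rows = conn.execute(query).fetchall()

for row in rows:
    print(row)
```

The HAVING clause filters on the aggregated groups rather than on individual rows, which is why the single 404 row does not appear in the result.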

Jérôme Schneider, staff software engineer at Cloudflare, Nikita Lapkov, senior software engineer at Cloudflare, and Marc Selwan, senior product manager at Cloudflare, summarize:

Whether you are generating reports, monitoring high-volume logs for anomalies, or simply trying to spot trends in your data, you can now easily do it all within Cloudflare's Developer Platform without the overhead of managing complex OLAP infrastructure or moving data out of R2.

Jeremy Daly, director of research at CloudZero, comments in his newsletter:

Cloudflare continues to push data closer to the edge with aggregation support in R2 SQL, expanding the kinds of workloads developers can realistically run there.

Source: Cloudflare blog

Schneider, Lapkov, and Selwan explain how they built a distributed GROUP BY execution using scatter-gather and shuffling strategies to run analytics directly over the R2 Data Catalog:

Aggregate queries without "HAVING" and "ORDER BY" can be executed in a fashion similar to filter queries. For filter queries, R2 SQL picks one node to be the coordinator in query execution. This node analyzes the query and consults R2 Data Catalog to figure out which Parquet row groups may contain data relevant to the query. Each Parquet row group represents a relatively small piece of work that a single compute node can handle. Coordinator node distributes the work across many worker nodes and collects results to return them to the user.
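The coordinator and worker pattern described in the quote can be sketched in a few lines of Python. This is an illustrative model under stated assumptions, not Cloudflare's implementation: the row groups are plain in-memory lists standing in for Parquet row groups, threads stand in for worker nodes, and the partial-aggregate merge step (each worker returns per-key COUNT and SUM, which the coordinator combines) is an assumption about how such a design is commonly built. All names are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for Parquet row groups: each is a small batch of rows.
ROW_GROUPS = [
    [{"status": 200, "bytes": 120}, {"status": 500, "bytes": 40}],
    [{"status": 200, "bytes": 300}, {"status": 404, "bytes": 10}],
    [{"status": 500, "bytes": 60}, {"status": 200, "bytes": 80}],
]


def worker_partial_aggregate(row_group):
    """Worker step: compute partial [COUNT, SUM] per group key for one row group."""
    partial = defaultdict(lambda: [0, 0])
    for row in row_group:
        state = partial[row["status"]]
        state[0] += 1             # partial COUNT
        state[1] += row["bytes"]  # partial SUM
    return dict(partial)


def coordinator_group_by(row_groups):
    """Coordinator step: scatter row groups to workers, then gather and merge."""
    merged = defaultdict(lambda: [0, 0])
    with ThreadPoolExecutor(max_workers=4) as pool:
        for partial in pool.map(worker_partial_aggregate, row_groups):
            for key, (count, total) in partial.items():
                merged[key][0] += count
                merged[key][1] += total
    # AVG is derived at the end from the merged SUM and COUNT.
    return {
        key: {"count": count, "sum": total, "avg": total / count}
        for key, (count, total) in sorted(merged.items())
    }


result = coordinator_group_by(ROW_GROUPS)
print(result)
```

Workers return SUM and COUNT rather than AVG because averages of averages are not generally correct, whereas sums and counts merge exactly. The sketch covers only the scatter-gather case and does not model the shuffling strategy mentioned in the article.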

Cloudflare has separately announced that R2 Data Catalog now supports automatic snapshot expiration for Apache Iceberg tables, complementing automatic compaction, which optimizes query performance by combining small data files into larger ones. Selwan comments:

These go hand in hand because the metadata cleanup/management that snapshot expiration helps with will speed up performance of these aggregation queries, especially with compaction enabled.

Cloudflare has recently published a deep-dive article that documents how its distributed query engine works.

As R2 SQL is still in public beta, the supported SQL grammar may change over time. A documentation page covers the current limitations and best practices.
