
How Apache Fluss Achieves True Pruning in Streaming Storage

Yunhong Zheng
PPMC member of Apache Fluss (Incubating)


TL;DR:

Apache Kafka's "column pruning" is actually pseudo-pruning. All fields still cross the network, and clients discard unwanted ones after the fact. Apache Fluss redesigns the storage format, server-side read path, and write-side batching strategy from the ground up with Arrow IPC columnar storage, zero-copy server-side pruning, and client-side pre-shuffle batching. The result: pruning 90% of columns yields a 10x read throughput improvement, with performance scaling linearly with the pruning ratio.

Taobao Instant Commerce: Real-Time Decisions at Scale with Apache Fluss

Howie Wang
Data Engineering Expert of Taobao Instant Commerce

Every autumn in China, social media floods with posts about "The First Cup of Milk Tea in Autumn." With a tap on their phone, consumers expect their order delivered within 30 minutes. That effortless experience is no accident: it is the result of Taobao Instant Commerce making thousands of data-driven decisions every second.

Taobao Instant Commerce has scaled from a single-category food delivery service into a high-frequency platform spanning fresh produce, consumer electronics (3C), and beauty products. It operates under two very different modes: steady high-frequency daily transactions, and explosive traffic surges during promotional events where order volumes can multiply within minutes. Both demand the same thing: real-time responsiveness across hundreds of millions of SKUs.

Real-time is not a nice-to-have here; it is the lifeline for three critical functions:

  • Operations: Refresh conversion rates and funnels within 30 seconds.
  • Algorithms: Order prediction models must iterate at minute-level granularity.
  • Quality Assurance: Canary release anomalies must be detected within seconds and trigger instant alerts.

The existing pipeline (built on Kafka, Flink, Paimon, and StarRocks) handled this at its original scale.

Note: In Alibaba's internal infrastructure, TT (TimeTunnel) is the internal equivalent of Apache Kafka — a high-throughput distributed message queue. Throughout this post, "Kafka" refers to TT in the Taobao Instant Commerce context.

But as the business grew, three fundamental bottlenecks emerged: unbounded state growth from stream joins, mounting complexity in building multi-stream denormalized tables, and excessive resource consumption from lakehouse synchronization. Together they formed an impossible triangle: no matter how the team tuned the system, latency, consistency, and cost could not all be optimized at once.

Fluss broke this impasse. By replacing the fragmented stream-batch architecture with a unified storage layer, its features (Delta Join, Partial Update, Streaming-Lakehouse Unification, Column Pruning, and Auto-Increment Columns) systematically eliminated all three bottlenecks and fundamentally reshaped how Taobao Instant Commerce handles real-time decision-making at scale.

Real-Time Multi-Dimensional Unique Visitor Deduplication in Practice

Yang Wang
Apache Fluss (Incubating) Contributor

UV (Unique Visitors) measures the count of distinct users who visited a page or triggered an event within a given time window — unlike PV (Page Views), which counts every request regardless of who made it. For any product or platform, accurate real-time UV statistics across dimensions like channel, city, date, and hour are a core analytical requirement. Since each dimension can be either fixed or rolled up, four dimensions yield 2^4 = 16 grouping combinations; seven dimensions yield 2^7 = 128.

How can multi-dimensional deduplication be both accurate and flexible while maintaining real-time performance? Behind this challenge lie two very different computing paradigms: direct deduplication of raw data, or set operations based on bitmaps.
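The bitmap paradigm can be sketched in a few lines of Python. This is an illustration, not production code: plain Python ints stand in for RoaringBitmap-style structures, the toy events and the dense integer user ids are assumptions, and a real pipeline would maintain the per-cell bitmaps incrementally in a streaming job.

```python
# Toy events: (user_id, channel, city). User ids are dense integers,
# as a dictionary-encoding id service would provide.
events = [
    (0, "app", "Beijing"), (1, "app", "Beijing"),
    (0, "app", "Shanghai"), (2, "web", "Beijing"),
    (1, "web", "Shanghai"), (0, "app", "Beijing"),  # repeat visit
]

# Finest-grained cells: one bitmap (a Python int here) per
# (channel, city) combination. Setting bit u records that user u
# visited; repeat visits are absorbed for free.
cells = {}
for uid, channel, city in events:
    key = (channel, city)
    cells[key] = cells.get(key, 0) | (1 << uid)

def uv(channel=None, city=None):
    """UV for any dimension combination: OR the matching cell
    bitmaps, then count set bits. Raw events are never rescanned."""
    merged = 0
    for (ch, ct), bitmap in cells.items():
        if (channel is None or ch == channel) and (city is None or ct == city):
            merged |= bitmap
    return bin(merged).count("1")

print(uv())                # 3 distinct users overall
print(uv(channel="app"))   # 2
print(uv(city="Beijing"))  # 3

# Each dimension is either fixed or rolled up, giving 2**n groupings:
print(2**4, 2**7)          # 16 128
```

The key property is that UV is not additive across cells (the same user may appear in several), but bitmaps make the union exact and cheap, which is why one set of fine-grained cells can serve all 2^n groupings.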

Why Apache Fluss Chose Rust for Its Multi-Language SDK

Luo Yuxia
PPMC member of Apache Fluss (Incubating)
Keith Lee
Apache Fluss (Incubating) Committer
Anton Borisov
Contributor of Apache Fluss (Incubating)


If you maintain a data system that only speaks Java, you will eventually hear from someone who doesn't. A Python team building a feature store. A C++ service that needs sub-millisecond writes. An AI agent that wants to call your system through a tool binding. They all need the same capabilities (writes, reads, lookups) and none of them want to spin up a JVM to get them.

Apache Fluss, streaming storage for real-time analytics and AI, hit this exact inflection point. The Java client works well for Flink-based compute, where the JVM is already the world you live in. But outside that world, asking consumers to run a JVM sidecar just to write a record or look up a key creates friction that compounds across every service, every pipeline, every agent in the stack.

We could have written a separate client for each language. Maintain five copies of the wire protocol, five implementations of the batching logic, five sets of retry semantics and idempotence tracking. That path scales linearly with languages and ends predictably: the Java client gets features first, the Python client gets them six months later with slightly different edge-case behavior, and the C++ client is perpetually "almost done."

We took a different path, one that builds on the lessons of the projects that came before us.

Announcing Apache Fluss (Incubating) Rust, Python, and C++ Client 0.1.0 Release

Luo Yuxia
PPMC member of Apache Fluss (Incubating)
Keith Lee
Apache Fluss (Incubating) Committer
Anton Borisov
Contributor of Apache Fluss (Incubating)


We are excited to announce the release of fluss-rust clients 0.1.0, the first official release of the Rust, Python, and C++ clients for Apache Fluss. This 0.1.0 release represents the culmination of 210+ commits from the community, delivering a feature-rich multi-language client from the ground up.

Under the hood, all three clients share a single Rust core that handles protocol negotiation, batching, retries, and Apache Arrow-based data exchange, with thin language-specific bindings on top. This was a deliberate community decision to deliver native performance and feature parity across every language from day one.

What does Apache Fluss mean in the context of AI?

Giannis Polyzos
PPMC member of Apache Fluss (Incubating)

The Data Foundation for Real-Time Intelligent Systems

Apache Fluss (Incubating) started as streaming storage for real-time analytics, built to work closely with stream processors like Apache Flink. Its focus has always been on freshness, efficient analytical access, and continuous data, making fast-changing streams directly usable without forcing them through batch-oriented systems or log-only pipelines.

Over the last year, Fluss has expanded beyond this original framing. You’ll now see it described as streaming storage for real-time analytics and AI. This change reflects how data systems are being used today: more workloads depend on continuously updated data, low-latency access to evolving state, and the ability to reason over context as it changes.

In this context, “AI” does not mean training or serving models inside Fluss. It refers to the class of intelligent systems that rely on fresh features, evolving context, and real-time state to make decisions continuously. Whether those systems use traditional machine learning models, newer AI techniques, or a combination of both, they all depend on the same data foundations.

This shift explains the recent evolution of Apache Fluss. Investments in stateless compute, richer data types with zero-copy schema evolution, and vector support through Lance were driven by a single question:

What does a data foundation need to look like to support real-time intelligent systems reliably at scale?

The rest of this post answers that question. We’ll explain what AI means when viewed through the lens of Apache Fluss, and why a streaming-first foundation for features, context, and state is central to building the next generation of intelligent systems.

Apache Fluss (Incubating) 0.9 Release Announcement

Giannis Polyzos
PPMC member of Apache Fluss (Incubating)
Jark Wu
PPMC member of Apache Fluss (Incubating)


🌊 We are excited to announce the official release of Apache Fluss (Incubating) 0.9!

This release marks a major milestone for the project. Fluss 0.9 significantly expands Fluss’s capabilities as a streaming storage system for real-time analytics, AI, and state-heavy streaming workloads, with a strong focus on:

  • Richer and more flexible data models
  • Safe, zero-downtime schema evolution
  • Storage-level optimizations (aggregations, CDC, formats)
  • Stronger operational guarantees and scalability
  • A more mature ecosystem and developer experience

Whether you’re building unified stream & lakehouse architectures, real-time analytics, feature/context stores, or long-running stateful pipelines, Fluss 0.9 introduces powerful new primitives that make these systems easier, safer, and more efficient to operate at scale.


TL;DR: What Fluss 0.9 Unlocks

Features

  • Zero-copy schema evolution for evolving streaming jobs
  • Storage-level aggregations that further enhance zero-state processing
  • Change data feed for CDC, audit trails, point-in-time recovery, and ML reproducibility
  • Safer snapshot-based reads with consumer-aware lifecycle management
  • Operationally robust clusters with automatic rebalancing and safer maintenance workflows
  • Apache Spark integration, enabling unified batch and streaming analytics on Fluss
  • First-class Azure support, allowing Fluss to tier and operate seamlessly on Azure Blob Storage and ADLS Gen2

A fraud detection pipeline with Streamhouse

Jacopo Gardini
Big Data Engineer of Agile Lab SRL

Fraud detection is a mission-critical capability for businesses operating in financial services, e-commerce, and digital payments. Detecting suspicious transactions in real time can prevent significant losses and protect customers. This blog demonstrates how to build a streamhouse that processes bank transactions in real time, detects fraud, and serves data seamlessly across hot (sub‑second latency) and cold (minutes‑latency) layers. Real-time detection and historical analytics are combined, enabling businesses to act quickly while maintaining a complete audit trail.

Fluss × Iceberg (Part 1): Why Your Lakehouse Isn’t a Streamhouse Yet

Mehul Batra
PPMC member of Apache Fluss (Incubating)
Luo Yuxia
PPMC member of Apache Fluss (Incubating)

As software and data engineers, we've witnessed Apache Iceberg revolutionize analytical data lakes with ACID transactions, time travel, and schema evolution. Yet when we try to push Iceberg into real-time workloads such as sub-second streaming queries, high-frequency CDC updates, and primary key semantics, we hit fundamental architectural walls. This blog explores how Fluss × Iceberg integration works and delivers a true real-time lakehouse.

Apache Fluss represents a new architectural approach: the Streamhouse for real-time lakehouses. Instead of stitching together separate streaming and batch systems, the Streamhouse unifies them under a single architecture. In this model, Apache Iceberg continues to serve exactly the role it was designed for, a highly efficient and scalable cold storage layer for analytics, while Fluss supplies the missing piece: a hot streaming storage layer with sub-second latency, columnar storage, and built-in primary-key semantics.

After working on Fluss–Iceberg lakehouse integration and deploying this architecture at a massive scale, including Alibaba's 3 PB production deployment processing 40 GB/s, we're ready to share the architectural lessons learned. Specifically, why existing systems fall short, how Fluss and Iceberg naturally complement each other, and what this means for finally building true real-time lakehouses.


Announcing Apache Fluss (Incubating) 0.8: Streaming Lakehouse for Data + AI

Giannis Polyzos
PPMC member of Apache Fluss (Incubating)
Jark Wu
PPMC member of Apache Fluss (Incubating)


🌊 We are excited to announce the official release of Apache Fluss (Incubating) 0.8!

This is our first release under the incubator of the Apache Software Foundation, marking a significant milestone in our journey to provide a robust streaming storage platform for real-time analytics.

Over the past four months, the community has made tremendous progress, delivering nearly 400 commits that push the boundaries of the Streaming Lakehouse ecosystem. This release includes multiple stability optimizations and introduces deeper integrations, performance breakthroughs, and next-generation stream processing capabilities. Highlights:

  • 🧊 Enhanced Streaming Lakehouse capabilities with full support for Apache Iceberg and Lance
  • ⚡ Introduction of Delta Joins with Flink, a game-changing innovation that redefines efficiency in stream processing by minimizing state and maximizing speed
  • 🔧 Supports hot updates for both cluster configurations and table configurations

Apache Fluss 0.8 marks the beginning of a new era in streaming: real-time, unified, and zero-state, purpose-built to power the next generation of data platforms with low-latency performance, scalability, and architectural simplicity.