What does Apache Fluss mean in the context of AI?

Giannis Polyzos
PPMC member of Apache Fluss (Incubating)

The Data Foundation for Real-Time Intelligent Systems

Apache Fluss (Incubating) started as streaming storage for real-time analytics, built to work closely with stream processors like Apache Flink. Its focus has always been on freshness, efficient analytical access, and continuous data, making fast-changing streams directly usable without forcing them through batch-oriented systems or log-only pipelines.

Over the last year, Fluss has expanded beyond this original framing. You’ll now see it described as streaming storage for real-time analytics and AI. This change reflects how data systems are being used today: more workloads depend on continuously updated data, low-latency access to evolving state, and the ability to reason over context as it changes.

In this context, “AI” does not mean training or serving models inside Fluss. It refers to the class of intelligent systems that rely on fresh features, evolving context, and real-time state to make decisions continuously. Whether those systems use traditional machine learning models, newer AI techniques, or a combination of both, they all depend on the same data foundations.

This shift explains the recent evolution of Apache Fluss. Investments in stateless compute, richer data types with zero-copy schema evolution, and vector support through Lance were driven by a single question:

What does a data foundation need to look like to support real-time intelligent systems reliably at scale?

The rest of this post answers that question. We’ll explain what AI means when viewed through the lens of Apache Fluss, and why a streaming-first foundation for features, context, and state is central to building the next generation of intelligent systems.

Apache Fluss (Incubating) 0.9 Release Announcement

Giannis Polyzos
PPMC member of Apache Fluss (Incubating)
Jark Wu
PPMC member of Apache Fluss (Incubating)

🌊 We are excited to announce the official release of Apache Fluss (Incubating) 0.9!

This release marks a major milestone for the project. Fluss 0.9 significantly expands Fluss’s capabilities as a streaming storage system for real-time analytics, AI, and state-heavy streaming workloads, with a strong focus on:

  • Richer and more flexible data models
  • Safe, zero-downtime schema evolution
  • Storage-level optimizations (aggregations, CDC, formats)
  • Stronger operational guarantees and scalability
  • A more mature ecosystem and developer experience

Whether you’re building unified stream & lakehouse architectures, real-time analytics, feature/context stores, or long-running stateful pipelines, Fluss 0.9 introduces powerful new primitives that make these systems easier, safer, and more efficient to operate at scale.


TL;DR: What Fluss 0.9 Unlocks

Features

  • Zero-copy schema evolution for evolving streaming jobs
  • Storage-level aggregations that further enhance zero-state processing
  • Change data feed for CDC, audit trails, point-in-time recovery, and ML reproducibility
  • Safer snapshot-based reads with consumer-aware lifecycle management
  • Operationally robust clusters with automatic rebalancing and safer maintenance workflows
  • Apache Spark integration, enabling unified batch and streaming analytics on Fluss
  • First-class Azure support, allowing Fluss to tier and operate seamlessly on Azure Blob Storage and ADLS Gen2
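The zero-copy schema evolution bullet above can be illustrated with a toy sketch. This is plain Python and purely conceptual, not the Fluss storage internals: old rows are never rewritten; a reader projects them onto the newest schema version at read time, so adding a column costs no data copy.

```python
# Conceptual sketch of zero-copy schema evolution (illustrative only,
# not the Fluss implementation). Rows written under an old schema stay
# untouched; the reader fills in columns added later with a default.

SCHEMAS = {
    1: ["sensor_id", "temperature"],
    2: ["sensor_id", "temperature", "humidity"],  # column added in v2
}

def read_row(stored_row: dict, read_version: int) -> dict:
    """Project a stored row onto a (possibly newer) schema version."""
    result = {}
    for column in SCHEMAS[read_version]:
        # A column missing from the old row resolves to None (a default),
        # so historical data never needs to be rewritten.
        result[column] = stored_row.get(column)
    return result

old_row = {"sensor_id": "s1", "temperature": 21.5}   # written under v1
print(read_row(old_row, read_version=2))
# {'sensor_id': 's1', 'temperature': 21.5, 'humidity': None}
```

The key property the sketch captures is that evolution is metadata-only: only the schema registry changes, while every previously written row remains byte-for-byte as it was.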

A fraud detection pipeline with Streamhouse

Jacopo Gardini
Big Data Engineer of Agile Lab SRL

Fraud detection is a mission-critical capability for businesses operating in financial services, e-commerce, and digital payments. Detecting suspicious transactions in real time can prevent significant losses and protect customers. This blog demonstrates how to build a streamhouse that processes bank transactions in real time, detects fraud, and serves data seamlessly across hot (sub‑second latency) and cold (minutes‑latency) layers. Real-time detection and historical analytics are combined, enabling businesses to act quickly while maintaining a complete audit trail.

Fluss × Iceberg (Part 1): Why Your Lakehouse Isn’t a Streamhouse Yet

Mehul Batra
Apache Fluss (Incubating) Committer
Luo Yuxia
PPMC member of Apache Fluss (Incubating)

As software and data engineers, we've witnessed Apache Iceberg revolutionize analytical data lakes with ACID transactions, time travel, and schema evolution. Yet when we try to push Iceberg into real-time workloads such as sub-second streaming queries, high-frequency CDC updates, and primary key semantics, we hit fundamental architectural walls. This blog explores how Fluss × Iceberg integration works and delivers a true real-time lakehouse.

Apache Fluss represents a new architectural approach: the Streamhouse for real-time lakehouses. Instead of stitching together separate streaming and batch systems, the Streamhouse unifies them under a single architecture. In this model, Apache Iceberg continues to serve exactly the role it was designed for: a highly efficient, scalable cold storage layer for analytics, while Fluss fills the missing piece: a hot streaming storage layer with sub-second latency, columnar storage, and built-in primary-key semantics.

After working on Fluss–Iceberg lakehouse integration and deploying this architecture at a massive scale, including Alibaba's 3 PB production deployment processing 40 GB/s, we're ready to share the architectural lessons learned. Specifically, why existing systems fall short, how Fluss and Iceberg naturally complement each other, and what this means for finally building true real-time lakehouses.

Announcing Apache Fluss (Incubating) 0.8: Streaming Lakehouse for Data + AI

Giannis Polyzos
PPMC member of Apache Fluss (Incubating)
Jark Wu
PPMC member of Apache Fluss (Incubating)

🌊 We are excited to announce the official release of Apache Fluss (Incubating) 0.8!

This is our first release under the incubator of the Apache Software Foundation, marking a significant milestone in our journey to provide a robust streaming storage platform for real-time analytics.

Over the past four months, the community has made tremendous progress, delivering nearly 400 commits that push the boundaries of the Streaming Lakehouse ecosystem. This release includes multiple stability optimizations and introduces deeper integrations, performance breakthroughs, and next-generation stream processing capabilities. Highlights:

  • 🧊 Enhanced Streaming Lakehouse capabilities with full support for Apache Iceberg and Lance
  • ⚡ Introduction of Delta Joins with Flink, a new stream-processing primitive that minimizes state and maximizes speed
  • 🔧 Support for hot updates of both cluster configurations and table configurations

Apache Fluss 0.8 marks the beginning of a new era in streaming: real-time, unified, and zero-state, purpose-built to power the next generation of data platforms with low-latency performance, scalability, and architectural simplicity.
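The state-minimizing idea behind Delta Joins can be sketched in a few lines of toy Python (hypothetical names, not the Flink/Fluss API). In a regular stream-stream join, both inputs are buffered in operator state; in the delta-join model, each arriving record instead looks up the other side in Fluss storage, so the join operator itself holds little or no state.

```python
# Toy sketch of the delta-join idea (illustrative only). The dicts below
# stand in for Fluss tables queried at join time, NOT for Flink operator
# state, which is what the technique avoids.

users = {}             # stands in for a Primary Key Table (user_id -> row)
orders_by_user = {}    # stands in for an order table indexed by user_id
joined = []            # downstream output

def on_order(order):
    """An order arrives: index it in storage, then look up its user."""
    orders_by_user.setdefault(order["user_id"], []).append(order)
    user = users.get(order["user_id"])
    if user is not None:
        joined.append((order["order_id"], user["name"]))

def on_user(user):
    """A user row arrives: upsert it, then look up any waiting orders."""
    users[user["user_id"]] = user
    for order in orders_by_user.get(user["user_id"], []):
        joined.append((order["order_id"], user["name"]))

on_order({"order_id": "o1", "user_id": "u1"})   # no matching user yet
on_user({"user_id": "u1", "name": "alice"})     # joins o1
on_order({"order_id": "o2", "user_id": "u1"})   # joins immediately
print(joined)   # [('o1', 'alice'), ('o2', 'alice')]
```

Because the lookups hit durable storage rather than operator state, the join survives restarts without rebuilding gigabytes of state, which is the "zero-state" property the release highlights.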

Primary Key Tables: Unifying Log and Cache for 🚀 Streaming

Giannis Polyzos
PPMC member of Apache Fluss (Incubating)

Modern data platforms have traditionally relied on two foundational components: a log for durable, ordered event storage and a cache for low-latency access. Common architectures include combinations such as Kafka with Redis, or Debezium feeding changes into a key-value store. While these patterns underpin a significant portion of production infrastructure, they also introduce complexity, fragility, and operational overhead.

Apache Fluss (Incubating) addresses this challenge with an elegant solution: Primary Key Tables (PK Tables). These persistent state tables provide the same semantics as running both a log and a cache, without needing two separate systems. Every write produces a durable log entry and an immediately consistent key-value update. Snapshots and log replay guarantee deterministic recovery, while clients benefit from the simplicity of interacting with one system for reads, writes, and queries.
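The "log and cache in one system" guarantee can be made concrete with a minimal sketch. This is illustrative Python, not the Fluss storage engine: every upsert appends a changelog entry and updates a key-value view in the same step, and recovery is simply replaying the log.

```python
# Minimal sketch of the PK Table idea (illustrative only): one write
# produces both a durable log entry and an immediately consistent
# key-value update, so the two views can never drift apart.

class ToyPKTable:
    def __init__(self):
        self.changelog = []   # durable, ordered log of changes
        self.kv = {}          # immediately consistent key-value view

    def upsert(self, key, row):
        op = "+U" if key in self.kv else "+I"   # update vs. insert
        self.changelog.append((op, key, row))   # the "log" half
        self.kv[key] = row                      # the "cache" half

    @classmethod
    def recover(cls, changelog):
        """Deterministic recovery: replay the log to rebuild the view."""
        table = cls()
        for _op, key, row in changelog:
            table.kv[key] = row
        table.changelog = list(changelog)
        return table

t = ToyPKTable()
t.upsert("acct-1", {"balance": 100})
t.upsert("acct-1", {"balance": 80})
restored = ToyPKTable.recover(t.changelog)
print(restored.kv["acct-1"])   # {'balance': 80}
```

The point of the sketch is the atomicity: there is no moment where the log and the key-value view disagree, which is exactly the failure mode that separate Kafka-plus-Redis deployments must engineer around.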

In this post, we will explore how Fluss PK Tables work, why unifying log and cache into a persistent design is a critical advancement, and how this model resolves long-standing challenges of maintaining consistency across multiple systems.

How Taobao uses Apache Fluss (Incubating) for Real-Time Processing in Search and RecSys

Xinyu Zhang
Senior Data Development Engineer of Taotian Group
Lilei Wang
Data Development Engineer of Taotian Group

Streaming Storage More Suitable for Real-Time OLAP

Introduction

The Data Development Team of Taobao has built a new generation of real-time data warehouse based on Apache Fluss. Fluss solves the problems of redundant data transfer, difficulties in data profiling, and challenges in operating and maintaining large-scale stateful workloads. By combining columnar storage with real-time update capabilities, Fluss supports column pruning, key-value point lookups, Delta Join, and seamless lake–stream integration, thereby cutting I/O and compute overhead while enhancing job stability and profiling efficiency.

Already deployed on Taobao’s A/B-testing platform for critical services such as search and recommendation, the system proved its resilience during the 618 Grand Promotion: it handled tens of millions of requests with sub-second latency, lowered resource usage by 30%, and removed more than 100 TB from state storage. Looking ahead, the team will continue to extend Fluss within a Lakehouse architecture and broaden its use across AI-driven workloads.

From Stream to Lake: Hands-On with Fluss Tiering into Paimon on Minio

Yang Guo
Contributor of Apache Fluss (Incubating)

Fluss stores historical data in a lakehouse storage layer while keeping real-time data in the Fluss server. Its built-in tiering service continuously moves fresh events into the lakehouse, allowing various query engines to analyze both hot and cold data. The real magic happens with Fluss's union-read capability, which lets Flink jobs seamlessly query both the Fluss cluster and the lakehouse for truly integrated real-time processing.
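The union-read behavior described above can be sketched conceptually (plain Python, hypothetical names, not the actual Fluss/Flink mechanics): a query merges the tiered lakehouse snapshot, which is complete up to some offset, with the fresher rows still held only in the Fluss cluster.

```python
# Conceptual sketch of union read (illustrative only). Cold rows have
# been tiered into the lakehouse up to a known offset; hot rows beyond
# that offset still live only in the Fluss cluster. A reader stitches
# the two together into one seamless, fresh view.

cold_rows = [  # tiered into the lakehouse, offsets 0..1
    (0, {"id": 1, "status": "created"}),
    (1, {"id": 2, "status": "created"}),
]
hot_rows = [   # still only in the Fluss cluster
    (2, {"id": 1, "status": "paid"}),
]

def union_read(cold, hot, tiered_up_to):
    """Read the lakehouse snapshot, then append fresher rows from Fluss."""
    result = [row for offset, row in cold if offset < tiered_up_to]
    result += [row for offset, row in hot if offset >= tiered_up_to]
    return result

print(union_read(cold_rows, hot_rows, tiered_up_to=2))
```

The offset boundary is what keeps the read exactly-once: each row is served from precisely one tier, never duplicated and never skipped.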

In this hands-on tutorial, we'll walk you through setting up a local Fluss lakehouse environment, running some practical data operations, and getting first-hand experience with the complete Fluss lakehouse architecture. By the end, you'll have a working environment for experimenting with Fluss's powerful data processing capabilities.

Fluss Joins the Apache Incubator

Jark Wu
PPMC member of Apache Fluss (Incubating)

On June 5th, Fluss, the next-generation streaming storage project open-sourced and donated by Alibaba, successfully passed the vote and officially became an incubator project of the Apache Software Foundation (ASF). This marks a significant milestone in the development of the Fluss community, symbolizing that the project has entered a new phase that is more open, neutral, and standardized. Moving forward, Fluss will leverage the ASF ecosystem to accelerate the building of a global developer community, continuously driving innovation and adoption of next-generation real-time data infrastructure.

ASF

Apache Fluss Java Client: A Deep Dive

Giannis Polyzos
PPMC member of Apache Fluss (Incubating)

Introduction

Apache Fluss is a streaming data storage system built for real-time analytics, serving as a low-latency data layer in modern data Lakehouses. It supports sub-second streaming reads and writes, stores data in a columnar format for efficiency, and offers two flexible table types: append-only Log Tables and updatable Primary Key Tables. In practice, this means Fluss can ingest high-throughput event streams (using Log Tables) while also maintaining up-to-date reference data or state (using Primary Key Tables). This combination is ideal for scenarios like IoT, where you might stream sensor readings and look up information for those sensors in real time, without the need for external K/V stores.
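The IoT pattern just described can be sketched end to end in a few lines. This is a toy Python illustration with hypothetical names, not the Fluss Java client API: high-throughput readings are appended to a log, while per-sensor reference data lives in an updatable key-value table used to enrich each event.

```python
# Toy illustration of the Log Table + Primary Key Table pattern
# (illustrative only, not the Fluss client). Readings stream into an
# append-only log; sensor metadata is upserted into a keyed table and
# joined in at ingest time.

sensor_info = {}     # stands in for a Primary Key Table keyed by sensor_id
readings_log = []    # stands in for an append-only Log Table
enriched = []        # the enriched output stream

def upsert_sensor(sensor_id, meta):
    """Update reference data: last write per key wins."""
    sensor_info[sensor_id] = meta

def append_reading(reading):
    """Append an event, then enrich it with the sensor's current metadata."""
    readings_log.append(reading)
    meta = sensor_info.get(reading["sensor_id"], {})
    enriched.append({**reading, "location": meta.get("location")})

upsert_sensor("s1", {"location": "warehouse-a"})
append_reading({"sensor_id": "s1", "temperature": 21.5})
print(enriched)
# [{'sensor_id': 's1', 'temperature': 21.5, 'location': 'warehouse-a'}]
```

In Fluss both halves of this sketch are served by one system, which is the simplification the paragraph above is pointing at: no separate broker for the log and K/V store for the lookups.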