How Taobao uses Apache Fluss (Incubating) for Real-Time Processing in Search and RecSys

Xinyu Zhang
Senior Data Development Engineer of Taotian Group
Lilei Wang
Data Development Engineer of Taotian Group

Streaming Storage More Suitable for Real-Time OLAP

Introduction

The Data Development Team of Taobao has built a new generation of real-time data warehouse on Apache Fluss. Fluss addresses three problems the team faced: redundant data transfer, difficult data profiling, and the operational burden of large-scale stateful workloads. By combining columnar storage with real-time updates, Fluss supports column pruning, key-value point lookups, Delta Join, and seamless lake–stream integration, which cuts I/O and compute overhead while improving job stability and profiling efficiency.

Already deployed on Taobao’s A/B-testing platform for critical services such as search and recommendation, the system proved its resilience during the 618 Grand Promotion: it handled tens of millions of requests with sub-second latency, lowered resource usage by 30%, and removed more than 100 TB from state storage. Looking ahead, the team will continue to extend Fluss within a Lakehouse architecture and broaden its use across AI-driven workloads.

From Stream to Lake: Hands-On with Fluss Tiering into Paimon on Minio

Yang Guo
Contributor of Apache Fluss (Incubating)

Fluss stores historical data in a lakehouse storage layer while keeping real-time data in the Fluss server. Its built-in tiering service continuously moves fresh events into the lakehouse, allowing various query engines to analyze both hot and cold data. The real magic happens with Fluss's union-read capability, which lets Flink jobs seamlessly query both the Fluss cluster and the lakehouse for truly integrated real-time processing.

In this hands-on tutorial, we'll walk you through setting up a local Fluss lakehouse environment, running some practical data operations, and getting first-hand experience with the complete Fluss lakehouse architecture. By the end, you'll have a working environment for experimenting with Fluss's powerful data processing capabilities.
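
The union-read idea described above can be sketched in a few lines. This is only a toy model in plain Python, not the Fluss or Flink API: the function name `union_read`, the `lake_end_offset` boundary, and the `(offset, record)` log layout are assumptions made for illustration. It shows how records already tiered into the lakehouse can be combined with the fresher tail still held by the cluster without returning duplicates.

```python
from typing import Any, List, Tuple


def union_read(
    lake_rows: List[Any],
    lake_end_offset: int,
    fluss_log: List[Tuple[int, Any]],
) -> List[Any]:
    """Combine tiered (cold) rows with the hot tail of the log.

    lake_rows       -- records already tiered; they cover offsets
                       below lake_end_offset (hypothetical boundary)
    lake_end_offset -- first offset NOT yet present in the lake
    fluss_log       -- (offset, record) pairs still retained by the cluster
    """
    # Keep only log records the lake does not contain yet, so that
    # the overlapping range is not returned twice.
    hot_tail = [record for offset, record in fluss_log if offset >= lake_end_offset]
    return list(lake_rows) + hot_tail


# The lake holds offsets 0..3; the cluster still retains offsets 2..5.
lake_rows = ["e0", "e1", "e2", "e3"]
fluss_log = [(2, "e2"), (3, "e3"), (4, "e4"), (5, "e5")]
result = union_read(lake_rows, lake_end_offset=4, fluss_log=fluss_log)
```

In this sketch the overlapping offsets 2 and 3 are served from the lake only, so `result` contains each event exactly once, in order.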

Fluss Joins the Apache Incubator

Jark Wu
PPMC member of Apache Fluss (Incubating)

On June 5th, Fluss, the next-generation streaming storage project open-sourced and donated by Alibaba, successfully passed the vote and officially became an incubator project of the Apache Software Foundation (ASF). This marks a significant milestone in the development of the Fluss community, symbolizing that the project has entered a new phase that is more open, neutral, and standardized. Moving forward, Fluss will leverage the ASF ecosystem to accelerate the building of a global developer community, continuously driving innovation and adoption of next-generation real-time data infrastructure.

Apache Fluss Java Client: A Deep Dive

Giannis Polyzos
PPMC member of Apache Fluss (Incubating)

Introduction

Apache Fluss is a streaming data storage system built for real-time analytics, serving as a low-latency data layer in modern data Lakehouses. It supports sub-second streaming reads and writes, storing data in a columnar format for efficiency, and offers two flexible table types: append-only Log Tables and updatable Primary Key Tables. In practice, this means Fluss can ingest high-throughput event streams (using log tables) while also maintaining up-to-date reference data or state (using primary key tables), a combination ideal for scenarios like IoT, where you might stream sensor readings and look up information for those sensors in real-time, without the need for external K/V stores.
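
To make the difference between the two table types concrete, here is a minimal in-memory sketch in plain Python. It does not use the Fluss Java client; the classes `LogTable` and `PrimaryKeyTable` and their methods are hypothetical stand-ins that only mimic the semantics described above: a log table appends every record, while a primary key table keeps the latest row per key and serves point lookups.

```python
from typing import Any, Dict, List, Optional


class LogTable:
    """Append-only: every record is kept, in arrival order."""

    def __init__(self) -> None:
        self.records: List[Dict[str, Any]] = []

    def append(self, record: Dict[str, Any]) -> int:
        self.records.append(record)
        return len(self.records) - 1  # offset of the appended record

    def scan(self, from_offset: int = 0) -> List[Dict[str, Any]]:
        return self.records[from_offset:]


class PrimaryKeyTable:
    """Updatable: one current row per primary key."""

    def __init__(self, key_field: str) -> None:
        self.key_field = key_field
        self.rows: Dict[Any, Dict[str, Any]] = {}

    def upsert(self, row: Dict[str, Any]) -> None:
        self.rows[row[self.key_field]] = row  # replaces any earlier row

    def lookup(self, key: Any) -> Optional[Dict[str, Any]]:
        return self.rows.get(key)


# IoT example: sensor metadata is reference data, readings are events.
sensors = PrimaryKeyTable("sensor_id")
sensors.upsert({"sensor_id": "s1", "location": "hall"})
sensors.upsert({"sensor_id": "s1", "location": "roof"})  # sensor was moved

readings = LogTable()
readings.append({"sensor_id": "s1", "temp": 21.5})
readings.append({"sensor_id": "s1", "temp": 22.0})

# Enrich each reading with the current sensor metadata via a point lookup.
enriched = [
    {**r, "location": sensors.lookup(r["sensor_id"])["location"]}
    for r in readings.scan()
]
```

Both readings survive in the log, while the sensor table holds only the latest location, which is what the lookup returns during enrichment.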

Tiering Service Deep Dive

Yang Guo
Contributor of Apache Fluss (Incubating)

Background

At the core of Fluss’s Lakehouse architecture sits the Tiering Service: a smart, policy-driven data pipeline that seamlessly bridges your real-time Fluss cluster and your cost-efficient lakehouse storage. It continuously ingests fresh events from the Fluss cluster, automatically migrating older or less-frequently accessed data into colder storage tiers without interrupting ongoing queries. By balancing hot, warm, and cold storage according to configurable rules, the Tiering Service ensures that recent data remains instantly queryable while historical records are archived economically.

In this blog post we will take a deep dive and explore how Fluss’s Tiering Service orchestrates data movement, preserves consistency, and empowers scalable, high-performance analytics at optimized costs.

Announcing Fluss 0.7

Jark Wu
PPMC member of Apache Fluss (Incubating)

🌊 We are excited to announce the official release of Fluss 0.7!

This version has undergone extensive improvements in stability, architecture, performance optimization, and security, further enhancing its readiness for production environments. Over the past three months, we have completed more than 250 commits, making this release a significant milestone toward becoming a mature, production-grade streaming storage platform.

Understanding Partial Updates

Giannis Polyzos
PPMC member of Apache Fluss (Incubating)

Traditional streaming data pipelines often need to join many tables or streams on a primary key to create a wide view. For example, imagine you’re building a real-time recommendation engine for an e-commerce platform. To serve highly personalized recommendations, your system needs a complete 360° view of each user, including: user preferences, past purchases, clickstream behavior, cart activity, product reviews, support tickets, ad impressions, and loyalty status.

That’s at least 8 different data sources, each producing updates independently.
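
As a rough illustration of the alternative to a multi-way join, the sketch below lets each source write only its own columns into one wide row keyed by user. It is plain Python, not the Fluss API: the `WideTable` class, its `partial_upsert` method, and the column names are hypothetical and chosen only to mirror the example above.

```python
from typing import Any, Dict, Iterable


class WideTable:
    """Toy primary-key table that accepts partial-column upserts."""

    def __init__(self, columns: Iterable[str]) -> None:
        self.columns = list(columns)
        self.rows: Dict[Any, Dict[str, Any]] = {}

    def partial_upsert(self, key: Any, values: Dict[str, Any]) -> Dict[str, Any]:
        unknown = set(values) - set(self.columns)
        if unknown:
            raise KeyError(f"unknown columns: {sorted(unknown)}")
        # Create an all-None row on first sight of the key, then
        # overwrite only the columns this source provides.
        row = self.rows.setdefault(key, {c: None for c in self.columns})
        row.update(values)
        return dict(row)


table = WideTable(["preferences", "last_purchase", "cart_items", "loyalty_status"])

# Three independent sources update the same user at different times.
table.partial_upsert("user-1", {"preferences": "outdoor"})
table.partial_upsert("user-1", {"cart_items": 3})
table.partial_upsert("user-1", {"loyalty_status": "gold"})

profile = table.rows["user-1"]
```

Columns that no source has written yet stay empty (`None` here), and a later update to one column leaves the others untouched, so the wide view is assembled incrementally rather than by joining all streams.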

The Story of Fluss Logo

Jark Wu
PPMC member of Apache Fluss (Incubating)

Introducing the Little Otter

Today is World Otter Day, and we are thrilled to introduce the little otter to the Fluss community! 🎉

Since Fluss was open-sourced half a year ago, many community members and friends have asked us: "When will Fluss get a logo?" After more than a month of careful design work and over 30 iterations, we’re excited to finally unveil the official Fluss logo: a surfing otter! 🦦🌊

Announcing Fluss 0.6

Jark Wu
PPMC member of Apache Fluss (Incubating)

The Fluss community is pleased to announce the official release of Fluss 0.6.0. This version has undergone over three months of intensive development, bringing together the expertise and efforts of 45 contributors worldwide, with more than 200 code commits completed. Our heartfelt thanks go out to every contributor for their invaluable support!

Towards A Unified Streaming & Lakehouse Architecture

Luo Yuxia
PPMC member of Apache Fluss (Incubating)

The unification of Lakehouse and streaming storage represents a major trend in the future development of modern data lakes and streaming storage systems. Designed specifically for real-time analytics, Fluss has embraced a unified Streaming and Lakehouse architecture from its inception, enabling seamless integration into existing Lakehouse architectures.

Fluss is designed to address the demands of real-time analytics with the following key capabilities:

  • Real-Time Stream Reading and Writing: Supports millisecond-level end-to-end latency.
  • Columnar Stream: Optimizes storage and query efficiency.
  • Streaming Updates: Enables low-latency updates to data streams.
  • Changelog Generation: Supports changelog generation and consumption.
  • Real-Time Lookup Queries: Facilitates instant lookup queries on primary keys.
  • Streaming & Lakehouse Unification: Seamlessly integrates streaming and lakehouse storage for unified data processing.