Version: Next

Building a Streaming Lakehouse

This guide will help you set up a basic Streaming Lakehouse using Fluss with Paimon or Iceberg, and help you better understand the powerful feature of Union Read.

Environment Setup

Prerequisites

Before proceeding with this guide, ensure that Docker and the Docker Compose plugin are installed on your machine. All commands were tested with Docker version 27.4.0 and Docker Compose version v2.30.3.

note

We encourage you to use a recent version of Docker and Compose v2 (however, Compose v1 might work with a few adaptions).

Starting required components

Paimon
Iceberg

We will use docker compose to spin up the required components for this tutorial.

Create a working directory for this guide.

mkdir fluss-quickstart-paimon
cd fluss-quickstart-paimon

Create directories and download required jars:

mkdir -p lib opt

# Flink connectors
wget -O lib/flink-faker-0.5.3.jar https://github.com/knaufk/flink-faker/releases/download/v0.5.3/flink-faker-0.5.3.jar
wget -O "lib/fluss-flink-1.20-0.10-SNAPSHOT.jar" "https://repo1.maven.org/maven2/org/apache/fluss/fluss-flink-1.20/0.10-SNAPSHOT/fluss-flink-1.20-0.10-SNAPSHOT.jar"
wget -O "lib/paimon-flink-1.20-1.3.1.jar" "https://repo1.maven.org/maven2/org/apache/paimon/paimon-flink-1.20/1.3.1/paimon-flink-1.20-1.3.1.jar"

# Fluss lake plugin
wget -O "lib/fluss-lake-paimon-0.10-SNAPSHOT.jar" "https://repo1.maven.org/maven2/org/apache/fluss/fluss-lake-paimon/0.10-SNAPSHOT/fluss-lake-paimon-0.10-SNAPSHOT.jar"

# Paimon bundle jar
wget -O "lib/paimon-bundle-1.3.1.jar" "https://repo.maven.apache.org/maven2/org/apache/paimon/paimon-bundle/1.3.1/paimon-bundle-1.3.1.jar"

# Hadoop bundle jar
wget -O lib/flink-shaded-hadoop-2-uber-2.8.3-10.0.jar https://repo.maven.apache.org/maven2/org/apache/flink/flink-shaded-hadoop-2-uber/2.8.3-10.0/flink-shaded-hadoop-2-uber-2.8.3-10.0.jar

# Tiering service
wget -O "opt/fluss-flink-tiering-0.10-SNAPSHOT.jar" "https://repo1.maven.org/maven2/org/apache/fluss/fluss-flink-tiering/0.10-SNAPSHOT/fluss-flink-tiering-0.10-SNAPSHOT.jar"

info

You can add more jars to this lib directory based on your requirements:

Cloud storage support: For AWS S3 integration with Paimon, add the corresponding paimon-s3
Other catalog backends: Add jars needed for alternative Paimon catalog implementations (e.g., Hive, JDBC)

Create a docker-compose.yml file with the following content:

services:
  coordinator-server:
    image: apache/fluss:0.10-SNAPSHOT
    command: coordinatorServer
    depends_on:
      - zookeeper
    environment:
      - |
        FLUSS_PROPERTIES=
        zookeeper.address: zookeeper:2181
        bind.listeners: FLUSS://coordinator-server:9123
        remote.data.dir: /tmp/fluss/remote-data
        datalake.format: paimon
        datalake.paimon.metastore: filesystem
        datalake.paimon.warehouse: /tmp/paimon
    volumes:
      - shared-tmpfs:/tmp/paimon
      - shared-tmpfs:/tmp/fluss
  tablet-server:
    image: apache/fluss:0.10-SNAPSHOT
    command: tabletServer
    depends_on:
      - coordinator-server
    environment:
      - |
        FLUSS_PROPERTIES=
        zookeeper.address: zookeeper:2181
        bind.listeners: FLUSS://tablet-server:9123
        data.dir: /tmp/fluss/data
        remote.data.dir: /tmp/fluss/remote-data
        kv.snapshot.interval: 0s
        datalake.format: paimon
        datalake.paimon.metastore: filesystem
        datalake.paimon.warehouse: /tmp/paimon
    volumes:
      - shared-tmpfs:/tmp/paimon
      - shared-tmpfs:/tmp/fluss
  zookeeper:
    restart: always
    image: zookeeper:3.9.2
  jobmanager:
    image: flink:1.20-scala_2.12-java17
    ports:
      - "8083:8081"
    entrypoint: ["/bin/bash", "-c"]
    command: >
      "sed -i 's/exec $(drop_privs_cmd)//g' /docker-entrypoint.sh &&
       cp /tmp/jars/*.jar /opt/flink/lib/ 2>/dev/null || true;
       cp /tmp/opt/*.jar /opt/flink/opt/ 2>/dev/null || true;
       /docker-entrypoint.sh jobmanager"
    environment:
      - |
        FLINK_PROPERTIES=
        jobmanager.rpc.address: jobmanager
    volumes:
      - shared-tmpfs:/tmp/paimon
      - shared-tmpfs:/tmp/fluss
      - ./lib:/tmp/jars
      - ./opt:/tmp/opt
  taskmanager:
    image: flink:1.20-scala_2.12-java17
    depends_on:
      - jobmanager
    entrypoint: ["/bin/bash", "-c"]
    command: >
      "sed -i 's/exec $(drop_privs_cmd)//g' /docker-entrypoint.sh &&
       cp /tmp/jars/*.jar /opt/flink/lib/ 2>/dev/null || true;
       cp /tmp/opt/*.jar /opt/flink/opt/ 2>/dev/null || true;
       /docker-entrypoint.sh taskmanager"
    environment:
      - |
        FLINK_PROPERTIES=
        jobmanager.rpc.address: jobmanager
        taskmanager.numberOfTaskSlots: 10
        taskmanager.memory.process.size: 2048m
    volumes:
      - shared-tmpfs:/tmp/paimon
      - shared-tmpfs:/tmp/fluss
      - ./lib:/tmp/jars
      - ./opt:/tmp/opt

volumes:
  shared-tmpfs:
    driver: local
    driver_opts:
      type: "tmpfs"
      device: "tmpfs"

The Docker Compose environment consists of the following containers:

Fluss Cluster: a Fluss CoordinatorServer, a Fluss TabletServer and a ZooKeeper server.
Flink Cluster: a Flink JobManager and a Flink TaskManager container to execute queries.

To start all containers, run:

docker compose up -d

This command automatically starts all the containers defined in the Docker Compose configuration in detached mode.

Run

docker compose ps

to check whether all containers are running properly.

You can also visit http://localhost:8083/ to see if Flink is running normally.

note

If you want to additionally use an observability stack, follow one of the provided quickstart guides here and then continue with this guide.
If you want to run with your own Flink environment, remember to download the fluss-flink connector jar, flink-connector-faker, paimon-flink connector jar and then put them to FLINK_HOME/lib/.
All the following commands involving docker compose should be executed in the created working directory that contains the docker-compose.yml file.

Congratulations, you are all set!

We will use docker compose to spin up the required components for this tutorial.

Create a working directory for this guide.

mkdir fluss-quickstart-iceberg
cd fluss-quickstart-iceberg

Create directories and download required jars:

mkdir -p lib opt

# Flink connectors
wget -O lib/flink-faker-0.5.3.jar https://github.com/knaufk/flink-faker/releases/download/v0.5.3/flink-faker-0.5.3.jar
wget -O "lib/fluss-flink-1.20-0.10-SNAPSHOT.jar" "https://repo1.maven.org/maven2/org/apache/fluss/fluss-flink-1.20/0.10-SNAPSHOT/fluss-flink-1.20-0.10-SNAPSHOT.jar"
wget -O lib/iceberg-flink-runtime-1.20-1.10.1.jar "https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-flink-runtime-1.20/1.10.1/iceberg-flink-runtime-1.20-1.10.1.jar"

# Fluss lake plugin
wget -O "lib/fluss-lake-iceberg-0.10-SNAPSHOT.jar" "https://repo1.maven.org/maven2/org/apache/fluss/fluss-lake-iceberg/0.10-SNAPSHOT/fluss-lake-iceberg-0.10-SNAPSHOT.jar"

# Hadoop filesystem support
wget -O lib/hadoop-apache-3.3.5-2.jar https://repo1.maven.org/maven2/io/trino/hadoop/hadoop-apache/3.3.5-2/hadoop-apache-3.3.5-2.jar
wget -O lib/failsafe-3.3.2.jar https://repo1.maven.org/maven2/dev/failsafe/failsafe/3.3.2/failsafe-3.3.2.jar

# Tiering service
wget -O "opt/fluss-flink-tiering-0.10-SNAPSHOT.jar" "https://repo1.maven.org/maven2/org/apache/fluss/fluss-flink-tiering/0.10-SNAPSHOT/fluss-flink-tiering-0.10-SNAPSHOT.jar"

info

You can add more jars to this lib directory based on your requirements:

Cloud storage support: For AWS S3 integration with Iceberg, add the corresponding Iceberg bundle jars (e.g., iceberg-aws-bundle)
Custom Hadoop configurations: Add jars for specific HDFS distributions or custom authentication mechanisms
Other catalog backends: Add jars needed for alternative Iceberg catalog implementations (e.g., Rest, Hive, Glue)

Create a docker-compose.yml file with the following content:

services:
  coordinator-server:
    image: apache/fluss:0.10-SNAPSHOT
    command: coordinatorServer
    depends_on:
      - zookeeper
    environment:
      - |
        FLUSS_PROPERTIES=
        zookeeper.address: zookeeper:2181
        bind.listeners: FLUSS://coordinator-server:9123
        remote.data.dir: /tmp/fluss/remote-data
        datalake.format: iceberg
        datalake.iceberg.type: hadoop
        datalake.iceberg.warehouse: /tmp/iceberg
    volumes:
      - shared-tmpfs:/tmp/iceberg
      - shared-tmpfs:/tmp/fluss
      - ./lib/fluss-lake-iceberg-0.10-SNAPSHOT.jar:/opt/fluss/plugins/iceberg/fluss-lake-iceberg-0.10-SNAPSHOT.jar
      - ./lib/hadoop-apache-3.3.5-2.jar:/opt/fluss/plugins/iceberg/hadoop-apache-3.3.5-2.jar
  tablet-server:
    image: apache/fluss:0.10-SNAPSHOT
    command: tabletServer
    depends_on:
      - coordinator-server
    environment:
      - |
        FLUSS_PROPERTIES=
        zookeeper.address: zookeeper:2181
        bind.listeners: FLUSS://tablet-server:9123
        data.dir: /tmp/fluss/data
        remote.data.dir: /tmp/fluss/remote-data
        kv.snapshot.interval: 0s
        datalake.format: iceberg
        datalake.iceberg.type: hadoop
        datalake.iceberg.warehouse: /tmp/iceberg
    volumes:
      - shared-tmpfs:/tmp/iceberg
      - shared-tmpfs:/tmp/fluss
  zookeeper:
    restart: always
    image: zookeeper:3.9.2
  jobmanager:
    image: flink:1.20-scala_2.12-java17
    ports:
      - "8083:8081"
    entrypoint: ["/bin/bash", "-c"]
    command: >
      "sed -i 's/exec $(drop_privs_cmd)//g' /docker-entrypoint.sh &&
       cp /tmp/jars/*.jar /opt/flink/lib/ 2>/dev/null || true;
       cp /tmp/opt/*.jar /opt/flink/opt/ 2>/dev/null || true;
       /docker-entrypoint.sh jobmanager"
    environment:
      - |
        FLINK_PROPERTIES=
        jobmanager.rpc.address: jobmanager
    volumes:
      - shared-tmpfs:/tmp/iceberg
      - shared-tmpfs:/tmp/fluss
      - ./lib:/tmp/jars
      - ./opt:/tmp/opt
  taskmanager:
    image: flink:1.20-scala_2.12-java17
    depends_on:
      - jobmanager
    entrypoint: ["/bin/bash", "-c"]
    command: >
      "sed -i 's/exec $(drop_privs_cmd)//g' /docker-entrypoint.sh &&
       cp /tmp/jars/*.jar /opt/flink/lib/ 2>/dev/null || true;
       cp /tmp/opt/*.jar /opt/flink/opt/ 2>/dev/null || true;
       /docker-entrypoint.sh taskmanager"
    environment:
      - |
        FLINK_PROPERTIES=
        jobmanager.rpc.address: jobmanager
        taskmanager.numberOfTaskSlots: 10
        taskmanager.memory.process.size: 2048m
    volumes:
      - shared-tmpfs:/tmp/iceberg
      - shared-tmpfs:/tmp/fluss
      - ./lib:/tmp/jars
      - ./opt:/tmp/opt

volumes:
  shared-tmpfs:
    driver: local
    driver_opts:
      type: "tmpfs"
      device: "tmpfs"

The Docker Compose environment consists of the following containers:

Fluss Cluster: a Fluss CoordinatorServer, a Fluss TabletServer and a ZooKeeper server.
Flink Cluster: a Flink JobManager and a Flink TaskManager container to execute queries.

To start all containers, run:

docker compose up -d

This command automatically starts all the containers defined in the Docker Compose configuration in detached mode.

Run

docker compose ps

to check whether all containers are running properly.

You can also visit http://localhost:8083/ to see if Flink is running normally.

note

If you want to additionally use an observability stack, follow one of the provided quickstart guides here and then continue with this guide.
All the following commands involving docker compose should be executed in the created working directory that contains the docker-compose.yml file.

Congratulations, you are all set!

Enter into SQL-Client

Paimon
Iceberg

First, use the following command to enter the Flink SQL CLI Container:

docker compose exec jobmanager ./bin/sql-client.sh

To simplify this guide, we will create three temporary tables with faker connector to generate data:

Flink SQL
CREATE TEMPORARY TABLE source_order (
    `order_key` BIGINT,
    `cust_key` INT,
    `total_price` DECIMAL(15, 2),
    `order_date` DATE,
    `order_priority` STRING,
    `clerk` STRING
) WITH (
  'connector' = 'faker',
  'rows-per-second' = '10',
  'number-of-rows' = '10000',
  'fields.order_key.expression' = '#{number.numberBetween ''0'',''100000000''}',
  'fields.cust_key.expression' = '#{number.numberBetween ''0'',''20''}',
  'fields.total_price.expression' = '#{number.randomDouble ''3'',''1'',''1000''}',
  'fields.order_date.expression' = '#{date.past ''100'' ''DAYS''}',
  'fields.order_priority.expression' = '#{regexify ''(low|medium|high){1}''}',
  'fields.clerk.expression' = '#{regexify ''(Clerk1|Clerk2|Clerk3|Clerk4){1}''}'
);

Flink SQL
CREATE TEMPORARY TABLE source_customer (
    `cust_key` INT,
    `name` STRING,
    `phone` STRING,
    `nation_key` INT NOT NULL,
    `acctbal` DECIMAL(15, 2),
    `mktsegment` STRING,
    PRIMARY KEY (`cust_key`) NOT ENFORCED
) WITH (
  'connector' = 'faker',
  'number-of-rows' = '200',
  'fields.cust_key.expression' = '#{number.numberBetween ''0'',''20''}',
  'fields.name.expression' = '#{funnyName.name}',
  'fields.nation_key.expression' = '#{number.numberBetween ''1'',''5''}',
  'fields.phone.expression' = '#{phoneNumber.cellPhone}',
  'fields.acctbal.expression' = '#{number.randomDouble ''3'',''1'',''1000''}',
  'fields.mktsegment.expression' = '#{regexify ''(AUTOMOBILE|BUILDING|FURNITURE|MACHINERY|HOUSEHOLD){1}''}'
);

Flink SQL
CREATE TEMPORARY TABLE `source_nation` (
  `nation_key` INT NOT NULL,
  `name` STRING,
   PRIMARY KEY (`nation_key`) NOT ENFORCED
) WITH (
  'connector' = 'faker',
  'number-of-rows' = '100',
  'fields.nation_key.expression' = '#{number.numberBetween ''1'',''5''}',
  'fields.name.expression' = '#{regexify ''(CANADA|JORDAN|CHINA|UNITED|INDIA){1}''}'
);

Flink SQL
-- drop records silently if a null value would have to be inserted into a NOT NULL column
SET 'table.exec.sink.not-null-enforcer'='DROP';

First, use the following command to enter the Flink SQL CLI Container:

docker compose exec jobmanager ./bin/sql-client.sh

To simplify this guide, we will create three temporary tables with faker connector to generate data:

Flink SQL
CREATE TEMPORARY TABLE source_order (
    `order_key` BIGINT,
    `cust_key` INT,
    `total_price` DECIMAL(15, 2),
    `order_date` DATE,
    `order_priority` STRING,
    `clerk` STRING
) WITH (
  'connector' = 'faker',
  'rows-per-second' = '10',
  'number-of-rows' = '10000',
  'fields.order_key.expression' = '#{number.numberBetween ''0'',''100000000''}',
  'fields.cust_key.expression' = '#{number.numberBetween ''0'',''20''}',
  'fields.total_price.expression' = '#{number.randomDouble ''3'',''1'',''1000''}',
  'fields.order_date.expression' = '#{date.past ''100'' ''DAYS''}',
  'fields.order_priority.expression' = '#{regexify ''(low|medium|high){1}''}',
  'fields.clerk.expression' = '#{regexify ''(Clerk1|Clerk2|Clerk3|Clerk4){1}''}'
);

Flink SQL
CREATE TEMPORARY TABLE source_customer (
    `cust_key` INT,
    `name` STRING,
    `phone` STRING,
    `nation_key` INT NOT NULL,
    `acctbal` DECIMAL(15, 2),
    `mktsegment` STRING,
    PRIMARY KEY (`cust_key`) NOT ENFORCED
) WITH (
  'connector' = 'faker',
  'number-of-rows' = '200',
  'fields.cust_key.expression' = '#{number.numberBetween ''0'',''20''}',
  'fields.name.expression' = '#{funnyName.name}',
  'fields.nation_key.expression' = '#{number.numberBetween ''1'',''5''}',
  'fields.phone.expression' = '#{phoneNumber.cellPhone}',
  'fields.acctbal.expression' = '#{number.randomDouble ''3'',''1'',''1000''}',
  'fields.mktsegment.expression' = '#{regexify ''(AUTOMOBILE|BUILDING|FURNITURE|MACHINERY|HOUSEHOLD){1}''}'
);

Flink SQL
CREATE TEMPORARY TABLE `source_nation` (
  `nation_key` INT NOT NULL,
  `name`       STRING,
   PRIMARY KEY (`nation_key`) NOT ENFORCED
) WITH (
  'connector' = 'faker',
  'number-of-rows' = '100',
  'fields.nation_key.expression' = '#{number.numberBetween ''1'',''5''}',
  'fields.name.expression' = '#{regexify ''(CANADA|JORDAN|CHINA|UNITED|INDIA){1}''}'
);

Flink SQL
-- drop records silently if a null value would have to be inserted into a NOT NULL column
SET 'table.exec.sink.not-null-enforcer'='DROP';

Create Fluss Tables

Create Fluss Catalog

Use the following SQL to create a Fluss catalog:

Flink SQL
CREATE CATALOG fluss_catalog WITH (
    'type' = 'fluss',
    'bootstrap.servers' = 'coordinator-server:9123'
);

Flink SQL
USE CATALOG fluss_catalog;

info

By default, catalog configurations are not persisted across Flink SQL client sessions. For further information how to store catalog configurations, see Flink's Catalog Store.

Create Tables

Paimon
Iceberg

Running the following SQL to create Fluss tables to be used in this guide:

Flink SQL
CREATE TABLE fluss_order (
    `order_key` BIGINT,
    `cust_key` INT NOT NULL,
    `total_price` DECIMAL(15, 2),
    `order_date` DATE,
    `order_priority` STRING,
    `clerk` STRING,
    `ptime` AS PROCTIME(),
    PRIMARY KEY (`order_key`) NOT ENFORCED
);

Flink SQL
CREATE TABLE fluss_customer (
    `cust_key` INT NOT NULL,
    `name` STRING,
    `phone` STRING,
    `nation_key` INT NOT NULL,
    `acctbal` DECIMAL(15, 2),
    `mktsegment` STRING,
    PRIMARY KEY (`cust_key`) NOT ENFORCED
);

Flink SQL
CREATE TABLE fluss_nation (
  `nation_key` INT NOT NULL,
  `name`       STRING,
   PRIMARY KEY (`nation_key`) NOT ENFORCED
);

Running the following SQL to create Fluss tables to be used in this guide:

Flink SQL
CREATE TABLE fluss_order (
    `order_key` BIGINT,
    `cust_key` INT NOT NULL,
    `total_price` DECIMAL(15, 2),
    `order_date` DATE,
    `order_priority` STRING,
    `clerk` STRING,
    `ptime` AS PROCTIME()
);

Flink SQL
CREATE TABLE fluss_customer (
    `cust_key` INT NOT NULL,
    `name` STRING,
    `phone` STRING,
    `nation_key` INT NOT NULL,
    `acctbal` DECIMAL(15, 2),
    `mktsegment` STRING,
    PRIMARY KEY (`cust_key`) NOT ENFORCED
);

Flink SQL
CREATE TABLE fluss_nation (
  `nation_key` INT NOT NULL,
  `name`       STRING,
   PRIMARY KEY (`nation_key`) NOT ENFORCED
);

Streaming into Fluss

First, run the following SQL to sync data from source tables to Fluss tables:

Flink SQL
EXECUTE STATEMENT SET
BEGIN
    INSERT INTO fluss_nation SELECT * FROM `default_catalog`.`default_database`.source_nation;
    INSERT INTO fluss_customer SELECT * FROM `default_catalog`.`default_database`.source_customer;
    INSERT INTO fluss_order SELECT * FROM `default_catalog`.`default_database`.source_order;
END;

Lakehouse Integration

Start the Lakehouse Tiering Service

Paimon
Iceberg

To integrate with Apache Paimon, you need to start the Lakehouse Tiering Service. Open a new terminal, navigate to the fluss-quickstart-paimon directory, and execute the following command within this directory to start the service:

docker compose exec jobmanager \
    /opt/flink/bin/flink run \
    /opt/flink/opt/fluss-flink-tiering-0.10-SNAPSHOT.jar \
    --fluss.bootstrap.servers coordinator-server:9123 \
    --datalake.format paimon \
    --datalake.paimon.metastore filesystem \
    --datalake.paimon.warehouse /tmp/paimon

You should see a Flink Job to tier data from Fluss to Paimon running in the Flink Web UI.

To integrate with Apache Iceberg, you need to start the Lakehouse Tiering Service. Open a new terminal, navigate to the fluss-quickstart-iceberg directory, and execute the following command within this directory to start the service:

docker compose exec jobmanager \
    /opt/flink/bin/flink run \
    /opt/flink/opt/fluss-flink-tiering.jar \
    --fluss.bootstrap.servers coordinator-server:9123 \
    --datalake.format iceberg \
    --datalake.iceberg.type hadoop \
    --datalake.iceberg.warehouse /tmp/iceberg

You should see a Flink Job to tier data from Fluss to Iceberg running in the Flink Web UI.

Streaming into Fluss datalake-enabled tables

Paimon
Iceberg

By default, tables are created with data lake integration disabled, meaning the Lakehouse Tiering Service will not tier the table's data to the data lake.

To enable lakehouse functionality as a tiered storage solution for a table, you must create the table with the configuration option table.datalake.enabled = true. Return to the SQL client and execute the following SQL statement to create a table with data lake integration enabled:

Flink SQL
CREATE TABLE datalake_enriched_orders (
    `order_key` BIGINT,
    `cust_key` INT NOT NULL,
    `total_price` DECIMAL(15, 2),
    `order_date` DATE,
    `order_priority` STRING,
    `clerk` STRING,
    `cust_name` STRING,
    `cust_phone` STRING,
    `cust_acctbal` DECIMAL(15, 2),
    `cust_mktsegment` STRING,
    `nation_name` STRING,
    PRIMARY KEY (`order_key`) NOT ENFORCED
) WITH (
    'table.datalake.enabled' = 'true',
    'table.datalake.freshness' = '30s'
);

Next, perform streaming data writing into the datalake-enabled table, datalake_enriched_orders:

Flink SQL
-- insert tuples into datalake_enriched_orders
INSERT INTO datalake_enriched_orders
SELECT o.order_key,
       o.cust_key,
       o.total_price,
       o.order_date,
       o.order_priority,
       o.clerk,
       c.name,
       c.phone,
       c.acctbal,
       c.mktsegment,
       n.name
FROM fluss_order o
LEFT JOIN fluss_customer FOR SYSTEM_TIME AS OF `o`.`ptime` AS `c`
    ON o.cust_key = c.cust_key
LEFT JOIN fluss_nation FOR SYSTEM_TIME AS OF `o`.`ptime` AS `n`
    ON c.nation_key = n.nation_key;

By default, tables are created with data lake integration disabled, meaning the Lakehouse Tiering Service will not tier the table's data to the data lake.

Flink SQL
CREATE TABLE datalake_enriched_orders (
    `order_key` BIGINT,
    `cust_key` INT NOT NULL,
    `total_price` DECIMAL(15, 2),
    `order_date` DATE,
    `order_priority` STRING,
    `clerk` STRING,
    `cust_name` STRING,
    `cust_phone` STRING,
    `cust_acctbal` DECIMAL(15, 2),
    `cust_mktsegment` STRING,
    `nation_name` STRING
) WITH (
    'table.datalake.enabled' = 'true',
    'table.datalake.freshness' = '30s'
);

Next, perform streaming data writing into the datalake-enabled table, datalake_enriched_orders:

Flink SQL
-- insert tuples into datalake_enriched_orders
INSERT INTO datalake_enriched_orders
SELECT o.order_key,
       o.cust_key,
       o.total_price,
       o.order_date,
       o.order_priority,
       o.clerk,
       c.name,
       c.phone,
       c.acctbal,
       c.mktsegment,
       n.name
FROM fluss_order o
LEFT JOIN fluss_customer FOR SYSTEM_TIME AS OF `o`.`ptime` AS `c`
    ON o.cust_key = c.cust_key
LEFT JOIN fluss_nation FOR SYSTEM_TIME AS OF `o`.`ptime` AS `n`
    ON c.nation_key = n.nation_key;

Real-Time Analytics on Fluss datalake-enabled Tables

Paimon
Iceberg

The data for the datalake_enriched_orders table is stored in Fluss (for real-time data) and Paimon (for historical data).

When querying the datalake_enriched_orders table, Fluss uses a union operation that combines data from both Fluss and Paimon to provide a complete result set -- combines real-time and historical data.

If you wish to query only the data stored in Paimon—offering high-performance access without the overhead of unioning data—you can use the datalake_enriched_orders$lake table by appending the $lake suffix. This approach also enables all the optimizations and features of a Flink Paimon table source, including system table such as datalake_enriched_orders$lake$snapshots.

To query the snapshots directly from Paimon, use the following SQL:

Flink SQL
-- use tableau result mode
SET 'sql-client.execution.result-mode' = 'tableau';

Flink SQL
-- switch to batch mode
SET 'execution.runtime-mode' = 'batch';

Flink SQL
-- query snapshots in paimon
SELECT snapshot_id, total_record_count FROM datalake_enriched_orders$lake$snapshots;

Sample Output:

+-------------+--------------------+
| snapshot_id | total_record_count |
+-------------+--------------------+
|           1 |                650 |
+-------------+--------------------+

Note: Make sure to wait for the configured datalake.freshness (~30s) to complete before querying the snapshots, otherwise the result will be empty.

Run the following SQL to do analytics on Paimon data:

Flink SQL
-- to sum prices of all orders in paimon
SELECT sum(total_price) as sum_price FROM datalake_enriched_orders$lake;

Sample Output:

+------------+
|  sum_price |
+------------+
| 1669519.92 |
+------------+

To achieve results with sub-second data freshness, you can query the table directly, which seamlessly unifies data from both Fluss and Paimon:

Flink SQL
-- to sum prices of all orders in fluss and paimon
SELECT sum(total_price) as sum_price FROM datalake_enriched_orders;

The result looks like:

+------------+
|  sum_price |
+------------+
| 1777908.36 |
+------------+

You can execute the real-time analytics query multiple times, and the results will vary with each run as new data is continuously written to Fluss in real-time.

The files adhere to Paimon's standard format, enabling seamless querying with other engines such as Spark and Trino.

The data for the datalake_enriched_orders table is stored in Fluss (for real-time data) and Iceberg (for historical data).

When querying the datalake_enriched_orders table, Fluss uses a union operation that combines data from both Fluss and Iceberg to provide a complete result set -- combines real-time and historical data.

If you wish to query only the data stored in Iceberg—offering high-performance access without the overhead of unioning data—you can use the datalake_enriched_orders$lake table by appending the $lake suffix. This approach also enables all the optimizations and features of a Flink Iceberg table source, including system table such as datalake_enriched_orders$lake$snapshots.

Flink SQL
-- use tableau result mode
SET 'sql-client.execution.result-mode' = 'tableau';

Flink SQL
-- switch to batch mode
SET 'execution.runtime-mode' = 'batch';

Flink SQL
-- query snapshots in iceberg
SELECT snapshot_id, operation FROM datalake_enriched_orders$lake$snapshots;

Sample Output:

+---------------------+-----------+
|         snapshot_id | operation |
+---------------------+-----------+
| 7792523713868625335 |    append |
| 7960217942125627573 |    append |
+---------------------+-----------+

Note: Make sure to wait for the configured datalake.freshness (~30s) to complete before querying the snapshots, otherwise the result will be empty.

Run the following SQL to do analytics on Iceberg data:

Flink SQL
-- to sum prices of all orders in iceberg
SELECT sum(total_price) as sum_price FROM datalake_enriched_orders$lake;

Sample Output:

+-----------+
| sum_price |
+-----------+
| 432880.93 |
+-----------+

To achieve results with sub-second data freshness, you can query the table directly, which seamlessly unifies data from both Fluss and Iceberg:

Flink SQL
-- to sum prices of all orders (combining fluss and iceberg data)
SELECT sum(total_price) as sum_price FROM datalake_enriched_orders;

Sample Output:

+-----------+
| sum_price |
+-----------+
| 558660.03 |
+-----------+

You can execute the real-time analytics query multiple times, and the results will vary with each run as new data is continuously written to Fluss in real-time.

The files adhere to Iceberg's standard format, enabling seamless querying with other engines such as Spark and Trino.

Clean up

After finishing the tutorial, run exit to exit Flink SQL CLI Container and then run

docker compose down -v

to stop all containers.

Environment Setup​

Prerequisites​

Starting required components​

Enter into SQL-Client​

Create Fluss Tables​

Create Fluss Catalog​

Create Tables​

Streaming into Fluss​

Lakehouse Integration​

Start the Lakehouse Tiering Service​

Streaming into Fluss datalake-enabled tables​

Real-Time Analytics on Fluss datalake-enabled Tables​

Clean up​

Environment Setup

Prerequisites

Starting required components

Enter into SQL-Client

Create Fluss Tables

Create Fluss Catalog

Create Tables

Streaming into Fluss

Lakehouse Integration

Start the Lakehouse Tiering Service

Streaming into Fluss datalake-enabled tables

Real-Time Analytics on Fluss datalake-enabled Tables

Clean up