Version: Next

Lakekeeper

Introduction

Lakekeeper is an Apache-licensed Iceberg REST Catalog written in Rust. Because it implements the standard Iceberg REST Catalog specification, tables tiered from Fluss remain discoverable and queryable by any Iceberg-compatible engine.

This guide explains how to configure Fluss to use Lakekeeper as its Iceberg catalog. For general Iceberg integration details (table mapping, data types, limitations), see Iceberg.

Quick Start with Docker Compose

A complete Docker Compose example including Fluss, the tiering service, and Lakekeeper is available in the Lakekeeper examples directory. It demonstrates table creation, data ingestion, tiering, and querying with DuckDB via the REST catalog.

How It Works

When Fluss is configured with Lakekeeper as its Iceberg REST catalog:

  1. Fluss creates and manages Iceberg table metadata through Lakekeeper's REST API
  2. The tiering service writes data to object storage and commits snapshots via Lakekeeper
  3. Any Iceberg-compatible engine (Flink, Spark, Trino, StarRocks, etc.) can discover and query the tiered tables through Lakekeeper
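As an illustration of step 3, the sketch below builds the REST endpoints a catalog client touches when discovering tables. The paths follow the Iceberg REST catalog specification; the host and warehouse values are placeholders matching the configuration used later in this guide.

```python
from urllib.parse import urlencode, quote

BASE = "http://lakekeeper:8181/catalog"  # same value as datalake.iceberg.uri

def config_url(base: str, warehouse: str) -> str:
    # First call a REST catalog client makes: fetch server-side property
    # defaults/overrides for the chosen warehouse (GET /v1/config).
    return f"{base}/v1/config?{urlencode({'warehouse': warehouse})}"

def namespaces_url(base: str) -> str:
    # Table discovery starts by listing namespaces (GET /v1/namespaces)...
    return f"{base}/v1/namespaces"

def tables_url(base: str, namespace: str) -> str:
    # ...then listing tables in a namespace (GET /v1/namespaces/{ns}/tables).
    return f"{base}/v1/namespaces/{quote(namespace)}/tables"

print(config_url(BASE, "my-warehouse"))
# http://lakekeeper:8181/catalog/v1/config?warehouse=my-warehouse
```

Any engine with REST catalog support issues the same requests under the hood, which is why no engine-specific integration is needed.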

Prerequisites

Running Lakekeeper Instance

You need a running Lakekeeper instance with a warehouse configured. Refer to the Lakekeeper Getting Started guide for deployment instructions.

After deploying Lakekeeper, create a warehouse that points to your object storage (e.g., S3):

# Bootstrap Lakekeeper (first-time setup)
curl -X POST http://<lakekeeper-host>:8181/management/v1/bootstrap \
  -H "Content-Type: application/json" \
  -d '{"accept-terms-of-use": true}'

# Create a warehouse
curl -X POST http://<lakekeeper-host>:8181/management/v1/warehouse \
  -H "Content-Type: application/json" \
  -d '{
    "warehouse-name": "my-warehouse",
    "project-id": "00000000-0000-0000-0000-000000000000",
    "storage-profile": {
      "type": "s3",
      "bucket": "<your-bucket>",
      "key-prefix": "iceberg",
      "endpoint": "http://<s3-endpoint>:9000",
      "region": "us-east-1",
      "path-style-access": true
    },
    "storage-credential": {
      "type": "s3",
      "credential-type": "access-key",
      "aws-access-key-id": "<access-key>",
      "aws-secret-access-key": "<secret-key>"
    }
  }'

NOTE: Adjust the storage-profile and storage-credential to match your storage backend. Lakekeeper supports S3, GCS, Azure, and local storage. See Lakekeeper Storage for details.
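If you prefer scripting the management API over raw curl, the warehouse-creation payload can be assembled programmatically. This is a minimal sketch mirroring the request body above (the helper name is illustrative, and the POST itself is left commented out):

```python
import json

def s3_warehouse_payload(name, bucket, endpoint, access_key, secret_key,
                         key_prefix="iceberg", region="us-east-1",
                         project_id="00000000-0000-0000-0000-000000000000"):
    # Same body as the curl example; swap the storage-profile and
    # storage-credential fields to match your backend (S3, GCS, Azure, ...).
    return {
        "warehouse-name": name,
        "project-id": project_id,
        "storage-profile": {
            "type": "s3",
            "bucket": bucket,
            "key-prefix": key_prefix,
            "endpoint": endpoint,
            "region": region,
            "path-style-access": True,
        },
        "storage-credential": {
            "type": "s3",
            "credential-type": "access-key",
            "aws-access-key-id": access_key,
            "aws-secret-access-key": secret_key,
        },
    }

body = json.dumps(s3_warehouse_payload(
    "my-warehouse", "my-bucket", "http://minio:9000", "ak", "sk"))
# POST this body to http://<lakekeeper-host>:8181/management/v1/warehouse
```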

Configure Fluss with Lakekeeper

Cluster Configuration

Add the following to your server.yaml:

datalake.format: iceberg
datalake.iceberg.type: rest
datalake.iceberg.uri: http://<lakekeeper-host>:8181/catalog
datalake.iceberg.warehouse: <warehouse-name>

Fluss strips the datalake.iceberg. prefix and passes the remaining properties to the Iceberg REST catalog client. You can add any additional Iceberg REST catalog properties using the same prefix. For example:

# Optional: pass additional REST catalog properties
datalake.iceberg.header.Authorization: Bearer <token>
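The prefix handling described above can be sketched as a pure function (a conceptual illustration, not Fluss's actual implementation):

```python
PREFIX = "datalake.iceberg."

def to_catalog_props(fluss_conf: dict) -> dict:
    # Keep only the keys under the prefix, strip it, and pass the rest
    # through to the Iceberg REST catalog client unchanged.
    return {k[len(PREFIX):]: v
            for k, v in fluss_conf.items() if k.startswith(PREFIX)}

server_yaml = {
    "datalake.format": "iceberg",
    "datalake.iceberg.type": "rest",
    "datalake.iceberg.uri": "http://lakekeeper:8181/catalog",
    "datalake.iceberg.warehouse": "my-warehouse",
    "datalake.iceberg.header.Authorization": "Bearer <token>",
}
props = to_catalog_props(server_yaml)
# props now holds type, uri, warehouse, header.Authorization -- exactly
# what the REST catalog client receives; datalake.format is not passed.
```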

Hadoop Dependencies

Some Iceberg FileIO implementations require Hadoop classes on the classpath. Download the pre-bundled Hadoop JAR into FLUSS_HOME/plugins/iceberg/:

wget -P ${FLUSS_HOME}/plugins/iceberg/ \
  https://repo1.maven.org/maven2/io/trino/hadoop/hadoop-apache/3.3.5-2/hadoop-apache-3.3.5-2.jar

See Iceberg - Hadoop Dependencies for alternative approaches.

Start Tiering Service

Follow the Iceberg tiering service setup to prepare the required JARs and start the tiering service. Use REST catalog parameters when launching the Flink tiering job:

${FLINK_HOME}/bin/flink run /path/to/fluss-flink-tiering-1.0-SNAPSHOT.jar \
  --fluss.bootstrap.servers <coordinator-host>:9123 \
  --datalake.format iceberg \
  --datalake.iceberg.type rest \
  --datalake.iceberg.uri http://<lakekeeper-host>:8181/catalog \
  --datalake.iceberg.warehouse <warehouse-name>
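The tiering job takes the same catalog properties as CLI flags that server.yaml takes as keys, so if you keep them in one place the argument list can be generated. A small illustrative helper (the function name is an assumption, not part of Fluss):

```python
def tiering_args(bootstrap_servers: str, catalog_props: dict) -> list:
    # Render the '--key value' pairs passed to the flink run command,
    # re-applying the datalake.iceberg. prefix to each catalog property.
    args = ["--fluss.bootstrap.servers", bootstrap_servers,
            "--datalake.format", "iceberg"]
    for key, value in sorted(catalog_props.items()):
        args += [f"--datalake.iceberg.{key}", value]
    return args

print(" ".join(tiering_args(
    "coordinator:9123",
    {"type": "rest",
     "uri": "http://lakekeeper:8181/catalog",
     "warehouse": "my-warehouse"})))
```

Keeping the cluster configuration and the tiering-job flags in sync matters: both sides must resolve to the same Lakekeeper catalog and warehouse.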

Usage Example

Create a Datalake-Enabled Table

Flink SQL
USE CATALOG fluss_catalog;

CREATE TABLE orders (
  `order_id` BIGINT,
  `customer_id` INT NOT NULL,
  `total_price` DECIMAL(15, 2),
  `order_date` DATE,
  `status` STRING,
  PRIMARY KEY (`order_id`) NOT ENFORCED
) WITH (
  'table.datalake.enabled' = 'true',
  'table.datalake.freshness' = '30s'
);

Once the tiering service is running, Fluss automatically creates the corresponding Iceberg table in Lakekeeper and begins tiering data.

Query Data

Flink SQL
SET 'execution.runtime-mode' = 'batch';

-- Union read: combines fresh data in Fluss with historical data in Iceberg
SELECT COUNT(*) FROM orders;
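Conceptually, a union read starts from the tiered Iceberg snapshot and overlays the fresher rows still held in Fluss, with the primary key deciding conflicts. The sketch below illustrates only the merge semantics; real union reads operate on snapshot and log-offset boundaries, not whole-table dictionaries:

```python
def union_read(iceberg_rows: dict, fluss_rows: dict) -> dict:
    # Historical, tiered data forms the base...
    merged = dict(iceberg_rows)
    # ...and fresh rows from Fluss win on primary-key conflicts.
    merged.update(fluss_rows)
    return merged

historical = {1: ("alice", 10.00)}            # already tiered to Iceberg
fresh = {1: ("alice", 12.50), 2: ("bob", 3.00)}  # not yet tiered
result = union_read(historical, fresh)
# COUNT(*) over the union sees 2 rows, with order 1 at its latest value.
```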

For details on union reads, streaming reads, and reading with other engines, see Iceberg - Read Tables.

Further Reading