Graceful Shutdown
Apache Fluss provides a comprehensive graceful shutdown mechanism to ensure data integrity and proper resource cleanup when stopping servers or services.
This guide describes the shutdown procedures, configuration options, and best practices for each Fluss component.
Overview
Graceful shutdown in Fluss ensures that:
- All ongoing operations complete safely
- Resources are properly released
- Data consistency is maintained
- Network connections are cleanly closed
- Background tasks are terminated properly
These guarantees prevent data corruption and ensure smooth restarts of the system.
Server Shutdown
Coordinator Server Shutdown
The Coordinator Server uses a multi-stage shutdown process to safely terminate all services in the correct order.
Shutdown Process
- Shutdown Hook Registration: The server registers a JVM shutdown hook that triggers graceful shutdown on process termination.
- Service Termination: All services are stopped in a specific order to maintain consistency.
  Coordinator Server Shutdown Order:
  - Server Metric Group → Metric Registry (async)
  - Auto Partition Manager → IO Executor (5s timeout)
  - Coordinator Event Processor → Coordinator Channel Manager
  - RPC Server (async) → Coordinator Service
  - Coordinator Context → Lake Table Tiering Manager
  - ZooKeeper Client → Authorizer
  - Dynamic Config Manager → Lake Catalog Dynamic Loader
  - RPC Client → Client Metric Group
- Resource Cleanup: Executors, connections, and other resources are properly closed.
# Graceful shutdown via SIGTERM
kill -TERM <coordinator-pid>
# Or using the shutdown script (if available)
./bin/stop-coordinator.sh
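When a stop is scripted, it can help to wait for the process to exit rather than assume it has stopped as soon as the signal is sent. The following is a minimal sketch, assuming a POSIX-style shell. `graceful_stop` is a hypothetical helper and is not part of the Fluss distribution.

```shell
# Hypothetical helper: send SIGTERM and wait for the process to exit.
# Usage: graceful_stop <pid> <timeout-seconds>
# Returns 0 if the process is gone within the timeout, 1 otherwise.
graceful_stop() {
  local pid="$1"
  local timeout="$2"
  local waited=0
  # SIGTERM lets the JVM run its shutdown hooks; SIGKILL would skip them.
  kill -TERM "$pid" 2>/dev/null || return 0   # process already gone
  # kill -0 sends no signal; it only checks that the process still exists.
  while kill -0 "$pid" 2>/dev/null; do
    if [ "$waited" -ge "$timeout" ]; then
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}
```

A caller might run `graceful_stop "$pid" 60 || echo "still running after 60s"` and only then decide whether to investigate or escalate.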
Tablet Server Shutdown
The Tablet Server supports a controlled shutdown process designed to minimize data unavailability and ensure leadership handover before termination.
Shutdown Order:
- Tablet Server Metric Group → Metric Registry (async)
- RPC Server (async) → Tablet Service
- ZooKeeper Client → RPC Client → Client Metric Group
- Scheduler → KV Manager → Remote Log Manager
- Log Manager → Replica Manager
- Authorizer → Dynamic Config Manager → Lake Catalog Dynamic Loader
Controlled Shutdown Process
- Leadership Transfer: The server attempts to transfer leadership of all buckets it leads to other replicas.
- Retry Logic: If leadership transfer fails, the server retries at a configurable interval.
- Timeout Handling: Once the maximum number of attempts is exhausted, the server proceeds with an unclean shutdown.
# Initiate controlled shutdown
kill -TERM <tablet-server-pid>
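The retry behavior described above can be sketched as a bounded loop. This is an illustration of the control flow only, not Fluss's implementation. `retry_transfer` and the command passed to it, which stands in for one leadership-transfer attempt, are hypothetical.

```shell
# Illustrative retry loop; not Fluss's actual code.
# Usage: retry_transfer <max-attempts> <interval-seconds> <command...>
# Returns 0 as soon as one attempt succeeds, 1 once all attempts have failed.
retry_transfer() {
  local max_attempts="$1"
  local interval="$2"
  shift 2
  local attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    # "$@" is the hypothetical command performing one transfer attempt.
    if "$@"; then
      return 0
    fi
    # Wait before the next attempt, but not after the last one.
    if [ "$attempt" -lt "$max_attempts" ]; then
      sleep "$interval"
    fi
    attempt=$((attempt + 1))
  done
  # All attempts failed: the caller would fall back to an unclean shutdown.
  return 1
}
```

A non-zero return here corresponds to the point at which the server gives up on leadership transfer and shuts down anyway.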
Configuration Options
The controlled shutdown process can be configured using the following options:
tablet-server.controlled-shutdown.max-retries
: Maximum number of attempts to transfer leadership before proceeding with unclean shutdown (default: 3)
tablet-server.controlled-shutdown.retry-interval
: Time interval between retry attempts (default: 1000ms)
Example Configuration:
# server.yaml
tablet-server:
  controlled-shutdown:
    max-retries: 5
    retry-interval: 2000ms
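A rough estimate of how long retries can delay shutdown may help when choosing an external stop timeout. The sketch below assumes at most one wait of `retry-interval` per attempt and ignores the time each attempt itself takes, so it is an approximation rather than a guarantee. `retry_wait_budget_ms` is a hypothetical helper.

```shell
# Approximate total time spent waiting between attempts, in milliseconds.
# Assumption: at most one retry-interval wait per attempt.
# Usage: retry_wait_budget_ms <max-retries> <retry-interval-ms>
retry_wait_budget_ms() {
  local max_retries="$1"
  local retry_interval_ms="$2"
  echo $((max_retries * retry_interval_ms))
}

retry_wait_budget_ms 5 2000   # prints 10000
```

With the example configuration, any script that waits for the Tablet Server to exit would therefore want a timeout comfortably above 10 seconds.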
Monitoring Shutdown
Logging
Fluss logs shutdown progress at the following levels:
- INFO: Normal shutdown progress
- WARN: Retry attempts or timeout warnings
- ERROR: Shutdown failures or exceptions
Metrics
Monitor shutdown-related metrics:
- Shutdown duration
- Failed shutdown attempts
- Resource cleanup status
Troubleshooting
Common Issues
Issue | Possible Causes | Recommended Actions |
---|---|---|
Hanging shutdown | Blocking operations, thread pool misconfiguration, or deadlocks | Check for blocking calls without timeouts, inspect thread dumps |
Resource leaks | Unclosed resources or connections | Verify all AutoCloseable resources and file handles are closed |
Data loss | Unclean shutdown or failed leadership transfer | Always use controlled shutdown for Tablet Servers and verify replication factor |
Debug Steps
- Enable debug logging for shutdown components
- Monitor JVM thread dumps during shutdown
- Check system resource usage
- Verify network connection states
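For the thread-dump step, taking several dumps a few seconds apart makes it easier to tell a thread that is stuck from one that is merely slow. The sketch below assumes the JDK's `jstack` tool is on the `PATH`. `capture_thread_dumps` and the `DUMP_CMD` override variable are hypothetical names introduced for illustration.

```shell
# Capture several thread dumps from a JVM that appears stuck during shutdown.
# Usage: capture_thread_dumps <pid> <count> <interval-seconds> <output-dir>
# DUMP_CMD (hypothetical) overrides the dump tool; it defaults to jstack.
capture_thread_dumps() {
  local pid="$1"
  local count="$2"
  local interval="$3"
  local out_dir="$4"
  local dump_cmd="${DUMP_CMD:-jstack}"
  local i=1
  mkdir -p "$out_dir" || return 1
  while [ "$i" -le "$count" ]; do
    # One file per dump, numbered in capture order.
    "$dump_cmd" "$pid" > "$out_dir/threads-$pid-$i.txt" 2>&1
    if [ "$i" -lt "$count" ]; then
      sleep "$interval"
    fi
    i=$((i + 1))
  done
}
```

For example, `capture_thread_dumps <tablet-server-pid> 3 5 /tmp/fluss-dumps` would write three dumps five seconds apart. Threads that show the same stack in every dump are the first candidates to inspect.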
Configuration Reference
Configuration | Description | Default |
---|---|---|
tablet-server.controlled-shutdown.max-retries | Maximum number of leadership transfer attempts before proceeding with unclean shutdown | 3 |
tablet-server.controlled-shutdown.retry-interval | Time interval between retry attempts | 1000ms |
shutdown.timeout.ms | General shutdown timeout, in milliseconds | 30000 |