
turbolytics/sql-flow: DuckDB for streaming data
SQLFlow enables SQL-based stream processing, powered by DuckDB. SQLFlow embeds DuckDB and supports Kafka stream processing using pure SQL.
SQLFlow executes SQL against streaming data, such as Kafka or webhooks. Think of SQLFlow as a way to run SQL against a continuous stream of data; the output can be shipped to a sink such as Kafka.
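The transformation itself is plain SQL executed by the embedded DuckDB engine. As a rough sketch, the aggregation used in the quick start below could be written along these lines (the relation name `batch` is an assumption about what SQLFlow exposes to the handler):

```sql
-- Count events per city within each micro-batch of the stream;
-- emits rows like {"city": "New York", "city_count": 28672}.
SELECT
    city,
    count(*) AS city_count
FROM batch
GROUP BY city
```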
- Streaming data transformation: Clean up data and types and publish the new data (example configuration).
- Stream enrichment: Add data to the input stream and publish the new data (example configuration).
- Data aggregation: Aggregate batches of input data to reduce the data volume (example configuration).
- Tumbling window aggregation: Bucket data into arbitrary time windows, such as "hour" or "10 minutes" (example configuration).
- Run SQL against the Bluesky firehose: Execute SQL against any webhook source, such as the Bluesky firehose (example configuration).
- Streaming SQL: Execute SQL against a variety of input streams, including Kafka and WebSockets (Bluesky firehose).
- Custom serialization and encoding: Supports multiple formats, such as JSON and Parquet.
- High throughput: Optimized to handle tens of thousands of messages per second using DuckDB, librdkafka, and the confluent-kafka Python client.
- Tumbling window aggregation: Perform aggregations over fixed time intervals and emit the results once each interval closes, enabling hourly or 10-minute summaries.
- Static table joins: Join streaming data against static datasets, such as CSV files (see the sketch below).
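Because SQLFlow embeds DuckDB, a static table join can lean directly on DuckDB's file readers. A hypothetical handler query (the `batch` relation and the CSV path are illustrative, not taken from the repository):

```sql
-- Enrich each streamed event with static reference data from a CSV file,
-- read with DuckDB's read_csv_auto table function.
SELECT
    b.*,
    c.population
FROM batch AS b
JOIN read_csv_auto('/tmp/cities.csv') AS c
    ON b.city = c.name
```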
Docker is the easiest way to get started.
- Pull the sql-flow docker image
docker pull turbolytics/sql-flow:latest
- Validate a configuration by invoking it on test data
docker run -v $(PWD)/dev:/tmp/conf -v /tmp/sqlflow:/tmp/sqlflow turbolytics/sql-flow:latest dev invoke /tmp/conf/config/examples/basic.agg.yml /tmp/conf/fixtures/simple.json
['{"city":"New York","city_count":28672}', '{"city":"Baltimore","city_count":28672}']
- Start kafka locally using docker
docker-compose -f dev/kafka-single.yml up -d
- Publish test messages to kafka
python3 cmd/publish-test-data.py --num-messages=10000 --topic="topic-local-docker"
- Start the kafka consumer from within the docker-compose container
docker exec -it kafka1 kafka-console-consumer --bootstrap-server=kafka1:9092 --topic=output-local-docker
- Start SQLFlow, which consumes from kafka and runs the configured pipeline
docker run -v $(PWD)/dev:/tmp/conf -v /tmp/sqlflow:/tmp/sqlflow turbolytics/sql-flow:latest run /tmp/conf/config/local.docker.yml
- Verify output in kafka consumer
...
...
{"city":"San Francisco504","city_count":1}
{"city":"San Francisco735","city_count":1}
{"city":"San Francisco533","city_count":1}
{"city":"San Francisco556","city_count":1}
The `dev invoke` command allows testing a SQLFlow pipeline configuration on a batch of test data. This enables fast feedback during local development, before starting a SQLFlow consumer that reads from kafka.
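Its general shape, inferred from the example above (angle-bracket arguments are placeholders):

```sh
docker run -v $(PWD)/dev:/tmp/conf turbolytics/sql-flow:latest dev invoke <pipeline-config.yml> <test-data.json>
```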
The core of SQLFlow is the pipeline configuration file. Each configuration specifies:
- Kafka configuration
- Pipeline configuration
  - source: input configuration
  - handler: SQL transformation
  - sink: output configuration
Each instance of SQLFlow requires a pipeline configuration file.
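A minimal sketch of that shape, using the topic and broker names from the quick start above (the key names are illustrative; see the repository's example configurations for the exact schema):

```yaml
kafka:                      # kafka configuration
  brokers:
    - kafka1:9092
pipeline:
  source:                   # input configuration
    type: kafka
    topics:
      - topic-local-docker
  handler:                  # SQL transformation
    sql: |
      SELECT city, count(*) AS city_count
      FROM batch
      GROUP BY city
  sink:                     # output configuration
    type: kafka
    topic: output-local-docker
```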
SQLFlow supports DuckDB over WebSocket. Running SQL against the Bluesky firehose is as simple as a configuration file:
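A sketch of such a configuration (the jetstream URI is one of Bluesky's public endpoints; the field names are illustrative rather than the exact SQLFlow schema):

```yaml
pipeline:
  source:
    type: websocket                                      # read from a websocket instead of kafka
    uri: wss://jetstream2.us-west.bsky.network/subscribe # Bluesky firehose
  handler:
    sql: |
      SELECT kind, count(*) AS n
      FROM batch
      GROUP BY kind
  sink:
    type: console                                        # print results to stdout
```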
Invoke sql-flow with the configuration listed above:
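For example (the configuration path here is hypothetical):

```sh
docker run -v $(PWD)/dev:/tmp/conf turbolytics/sql-flow:latest run /tmp/conf/config/examples/bluesky.yml
```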
Development setup docs are coming soon; until then, check out:
pip install -r requirements.txt
pip install -r requirements.dev.txt
C_INCLUDE_PATH=/opt/homebrew/Cellar/librdkafka/2.3.0/include LIBRARY_PATH=/opt/homebrew/Cellar/librdkafka/2.3.0/lib pip install confluent-kafka
The table below shows the performance for different test scenarios:
Name | Throughput | Max RSS memory | Peak memory usage
---|---|---|---
Simple aggregation (memory) | 45,000 messages/second | 230 MiB | 130 MiB
Simple aggregation (disk) | 36,000 messages/second | 256 MiB | 102 MiB
Enrichment | 13,000 messages/second | 368 MiB | 124 MiB
CSV disk join | 11,500 messages/second | 312 MiB | 152 MiB
CSV memory join | 33,200 messages/second | 300 MiB | 107 MiB
In-memory tumbling window | 44,000 messages/second | 198 MiB | 96 MiB
For more information on benchmarking, see the wiki.
Like SQLFlow? Using SQLFlow? Feature request? Please tell us! danny@turbolytics.io