turbolytics/sql-flow: DuckDB for streaming data
January 3, 2025

SQLFlow provides SQL-based stream processing, powered by DuckDB. It embeds DuckDB and supports Kafka stream processing using pure SQL logic.

SQLFlow executes SQL against streaming data, such as Kafka or webhooks. Think of SQLFlow as a way to run SQL against a continuous flow of data. Output can be sent to a sink such as Kafka.

  • Streaming SQL: Execute SQL against various input streams, including Kafka and websockets (for example, the Bluesky firehose).
  • Custom serialization and encoding: Supports multiple formats, such as JSON and Parquet.
  • High throughput: Optimized to handle tens of thousands of messages per second, built on DuckDB, librdkafka, and Python.
  • Tumbling window aggregation: Execute aggregations over fixed time intervals and emit the results once each interval closes, enabling hourly or 10-minute summaries (see the sketch just below this list).
  • Static table joins: Join streaming data against static datasets, such as CSV files.
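
As a rough illustration of the tumbling-window feature, the handler sketch below counts events per city in 10-minute buckets. The YAML keys, table name, and column names are assumptions for illustration only, not SQLFlow's documented schema; the example profiles shipped in the repository show the real format.

handler:
  # Illustrative only: aggregate each 10-minute window of events per city.
  # The table name ("events") and columns are assumed for this sketch.
  sql: |
    SELECT
      time_bucket(INTERVAL '10 minutes', event_time) AS bucket,
      city,
      COUNT(*) AS city_count
    FROM events
    GROUP BY bucket, city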

Docker is the easiest way to get started.

  • Pull the sql-flow Docker image
docker pull turbolytics/sql-flow:latest
  • Verify the configuration by invoking it against test data
docker run -v $(PWD)/dev:/tmp/conf -v /tmp/sqlflow:/tmp/sqlflow turbolytics/sql-flow:latest dev invoke /tmp/conf/config/examples/basic.agg.yml /tmp/conf/fixtures/simple.json

['{"city":"New York","city_count":28672}', '{"city":"Baltimore","city_count":28672}']
  • Start Kafka locally using Docker
docker-compose -f dev/kafka-single.yml up -d
  • Publish test messages to Kafka
python3 cmd/publish-test-data.py --num-messages=10000 --topic="topic-local-docker"
  • Start the Kafka consumer from within the docker-compose container
docker exec -it kafka1 kafka-console-consumer --bootstrap-server=kafka1:9092 --topic=output-local-docker
  • Start SQLFlow, reading from the local Kafka and writing results back to Kafka
docker run -v $(PWD)/dev:/tmp/conf -v /tmp/sqlflow:/tmp/sqlflow turbolytics/sql-flow:latest run /tmp/conf/config/local.docker.yml
  • Verify output in the Kafka consumer
...
...
{"city":"San Francisco504","city_count":1}
{"city":"San Francisco735","city_count":1}
{"city":"San Francisco533","city_count":1}
{"city":"San Francisco556","city_count":1}

The dev invoke command allows testing a SQLFlow pipeline configuration against a batch of test data, enabling quick feedback during local development before starting a SQLFlow consumer that reads from Kafka.

The core of SQLFlow is the pipeline profile. Each profile specifies:

  • Kafka configuration
  • Pipeline configuration
    • source: input configuration
    • handler: SQL transformation
    • sink: output configuration

Each instance of SQLFlow requires a pipeline profile.
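
To make that structure concrete, here is a minimal sketch of what a pipeline profile could look like. The key names and options are assumptions inferred from the description above, not SQLFlow's exact schema; the shipped examples (such as dev/config/examples/basic.agg.yml used earlier) show the real layout.

# Hypothetical pipeline profile; key names are illustrative, not SQLFlow's exact schema.
kafka:                      # broker and consumer settings (assumed keys)
  brokers:
    - localhost:9092
  group_id: sqlflow-example
pipeline:
  source:                   # input: the Kafka topic to read from
    type: kafka
    topic: topic-local-docker
  handler:                  # transformation: SQL applied to each batch of messages
    sql: |
      SELECT city, COUNT(*) AS city_count
      FROM batch
      GROUP BY city
  sink:                     # output: where results are written
    type: kafka
    topic: output-local-docker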

Consuming the Bluesky firehose

SQLFlow supports DuckDB over websockets. Executing SQL against the Bluesky firehose requires only a simple configuration file.
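
The original post shows the configuration inline; it is not reproduced here, but as a rough sketch a websocket source pointed at the firehose could look something like the following. The key names, the Jetstream endpoint, and the field names in the SQL are assumptions, so refer to the bundled Bluesky example profile for the real configuration.

pipeline:
  source:
    type: websocket                                        # assumed key for a websocket input
    uri: wss://jetstream2.us-east.bsky.network/subscribe   # public Bluesky Jetstream endpoint (assumption)
  handler:
    sql: |
      SELECT kind, COUNT(*) AS n   -- illustrative: count firehose events by kind
      FROM batch
      GROUP BY kind
  sink:
    type: console                                          # assumed sink for local inspection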

Invoke sql-flow with the Bluesky configuration.

The full pipeline profile can be viewed in the repository.

Coming soon; until then, check out:

  • Running multiple SQLFlow instances on the same file system
  • Native configuration verification

To run SQLFlow natively (outside Docker), install the Python dependencies:

pip install -r requirements.txt
pip install -r requirements.dev.txt

On macOS, confluent-kafka may need to be built against the Homebrew librdkafka headers and libraries:

C_INCLUDE_PATH=/opt/homebrew/Cellar/librdkafka/2.3.0/include LIBRARY_PATH=/opt/homebrew/Cellar/librdkafka/2.3.0/lib pip install confluent-kafka

The table below shows the performance for different test scenarios:

Name                         Throughput                Max RSS memory   Peak memory usage
Simple aggregation (memory)  45,000 messages/second    230 MB           130 MB
Simple aggregation (disk)    36,000 messages/second    256 MB           102 MB
Enrichment                   13,000 messages/second    368 MB           124 MB
CSV disk join                11,500 messages/second    312 MB           152 MB
CSV memory join              33,200 messages/second    300 MB           107 MB
In-memory tumbling window    44,000 messages/second    198 MB           96 MB

For more information on benchmarking, see the wiki.

Like SQLFlow? Using SQLFlow? Feature request? Please tell us! danny@turbolytics.io

