turbolytics/sql-flow: DuckDB for streaming data
January 3, 2025

SQLFlow provides SQL-based stream processing, powered by DuckDB. It embeds DuckDB and supports Kafka stream processing using pure SQL logic.

SQLFlow executes SQL against streaming data, such as Kafka or webhooks. Think of SQLFlow as a way to run SQL against a continuous flow of data. Output can be sent to a sink such as Kafka.

  • Streaming SQL: Execute SQL against various input streams, including Kafka and websockets (for example, the Bluesky firehose).
  • Custom serialization and encoding: Supports multiple formats, such as JSON and Parquet.
  • High throughput: Optimized to handle tens of thousands of messages per second, built on DuckDB, librdkafka, and Python.
  • Tumbling window aggregation: Execute aggregations over fixed time intervals and emit the results once each interval closes, enabling hourly or 10-minute summaries (see the sketch just below this list).
  • Static table joins: Join streaming data against static datasets, such as CSV files.
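
As a rough illustration of the tumbling-window feature, the handler sketch below counts events per city in 10-minute buckets. The YAML keys, table name, and column names are assumptions for illustration only, not SQLFlow's documented schema; the example profiles shipped in the repository show the real format.

handler:
  # Illustrative only: aggregate each 10-minute window of events per city.
  # The table name ("events") and columns are assumed for this sketch.
  sql: |
    SELECT
      time_bucket(INTERVAL '10 minutes', event_time) AS bucket,
      city,
      COUNT(*) AS city_count
    FROM events
    GROUP BY bucket, city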

Docker is the easiest way to get started.

  • Pull the sql-flow Docker image
docker pull turbolytics/sql-flow:latest
  • Verify the configuration by invoking it against test data
docker run -v $(PWD)/dev:/tmp/conf -v /tmp/sqlflow:/tmp/sqlflow turbolytics/sql-flow:latest dev invoke /tmp/conf/config/examples/basic.agg.yml /tmp/conf/fixtures/simple.json

['{"city":"New York","city_count":28672}', '{"city":"Baltimore","city_count":28672}']
  • Start Kafka locally using Docker
docker-compose -f dev/kafka-single.yml up -d
  • Publish test messages to Kafka
python3 cmd/publish-test-data.py --num-messages=10000 --topic="topic-local-docker"
  • Start the Kafka consumer from within the docker-compose container
docker exec -it kafka1 kafka-console-consumer --bootstrap-server=kafka1:9092 --topic=output-local-docker
  • Start SQLFlow, reading from the local Kafka and writing results back to Kafka
docker run -v $(PWD)/dev:/tmp/conf -v /tmp/sqlflow:/tmp/sqlflow turbolytics/sql-flow:latest run /tmp/conf/config/local.docker.yml
  • Verify output in the Kafka consumer
...
...
{"city":"San Francisco504","city_count":1}
{"city":"San Francisco735","city_count":1}
{"city":"San Francisco533","city_count":1}
{"city":"San Francisco556","city_count":1}

The dev invoke command allows testing a SQLFlow pipeline configuration against a batch of test data, enabling quick feedback during local development before starting a SQLFlow consumer that reads from Kafka.

The core of SQLFlow is the pipeline profile. Each profile specifies:

  • Kafka configuration
  • Pipeline configuration
    • source: input configuration
    • handler: SQL transformation
    • sink: output configuration

Each instance of SQLFlow requires a pipeline profile.
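
To make that structure concrete, here is a minimal sketch of what a pipeline profile could look like. The key names and options are assumptions inferred from the description above, not SQLFlow's exact schema; the shipped examples (such as dev/config/examples/basic.agg.yml used earlier) show the real layout.

# Hypothetical pipeline profile; key names are illustrative, not SQLFlow's exact schema.
kafka:                      # broker and consumer settings (assumed keys)
  brokers:
    - localhost:9092
  group_id: sqlflow-example
pipeline:
  source:                   # input: the Kafka topic to read from
    type: kafka
    topic: topic-local-docker
  handler:                  # transformation: SQL applied to each batch of messages
    sql: |
      SELECT city, COUNT(*) AS city_count
      FROM batch
      GROUP BY city
  sink:                     # output: where results are written
    type: kafka
    topic: output-local-docker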

Consuming the Bluesky firehose

SQLFlow supports DuckDB over websockets. Executing SQL against the Bluesky firehose requires only a simple configuration file.
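
The original post shows the configuration inline; it is not reproduced here, but as a rough sketch a websocket source pointed at the firehose could look something like the following. The key names, the Jetstream endpoint, and the field names in the SQL are assumptions, so refer to the bundled Bluesky example profile for the real configuration.

pipeline:
  source:
    type: websocket                                        # assumed key for a websocket input
    uri: wss://jetstream2.us-east.bsky.network/subscribe   # public Bluesky Jetstream endpoint (assumption)
  handler:
    sql: |
      SELECT kind, COUNT(*) AS n   -- illustrative: count firehose events by kind
      FROM batch
      GROUP BY kind
  sink:
    type: console                                          # assumed sink for local inspection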

Invoke sql-flow with the Bluesky configuration.

The full pipeline profile can be viewed in the repository.

Coming soon; until then, check out:

  • Running multiple SQLFlow instances on the same file system
  • Native configuration verification

To run SQLFlow natively (outside Docker), install the Python dependencies:

pip install -r requirements.txt
pip install -r requirements.dev.txt

On macOS, confluent-kafka may need to be built against the Homebrew librdkafka headers and libraries:

C_INCLUDE_PATH=/opt/homebrew/Cellar/librdkafka/2.3.0/include LIBRARY_PATH=/opt/homebrew/Cellar/librdkafka/2.3.0/lib pip install confluent-kafka

The table below shows the performance for different test scenarios:

Name                         Throughput                Max RSS memory   Peak memory usage
Simple aggregation (memory)  45,000 messages/second    230 MB           130 MB
Simple aggregation (disk)    36,000 messages/second    256 MB           102 MB
Enrichment                   13,000 messages/second    368 MB           124 MB
CSV disk join                11,500 messages/second    312 MB           152 MB
CSV memory join              33,200 messages/second    300 MB           107 MB
In-memory tumbling window    44,000 messages/second    198 MB           96 MB

For more information on benchmarking, see the wiki.

Like SQLFlow? Using SQLFlow? Feature request? Please tell us! danny@turbolytics.io

