I built a data pipeline tool in Go
December 23, 2024

Over the past few years, the data world has convinced itself that it requires many different tools to extract insights:

  • a tool for ingesting data
  • another to transform it
  • another to check its quality
  • another to orchestrate all of it
  • another for cataloging
  • another for governance

The result? Fragile, expensive, and inflexible infrastructure, and a terrible experience. Teams end up building a lot of glue to get these systems talking to each other, all while trying to keep the analytics team on board.

Is it effective? No.

Are we ready to have this conversation? I hope so.


Obsessed with impact

The engineering work behind building such a ticking machine is very satisfying: every little part does its job, and the whole thing runs like clockwork. It feels like an engineering marvel:

It’s simple: you push your code to the master branch, and the backend automatically pulls the branch, parses the DAGs, and uploads them to S3. The sync sidecar on the Airflow container automatically pulls them from S3 and updates the DAGs. When the DAG runs, the data ingestion job connects to our Airbyte deployment and triggers the Airbyte ingestion, then we set up a sensor and wait for the ingestion to complete. We then connect to dbt Cloud to run certain parts of the transformation job from the analytics team; if any failures occur, Airflow connects to our notification system to find the right teams (if they are defined in the directory), and if not, we check our AD users and try to find matching groups to notify. After the transformation is complete, we execute custom Python operators to do X and Y, and then spin up a pod in the Kubernetes cluster to run quality checks. Meanwhile, our Kafka sink uses Debezium to capture CDC data from our internal Postgres, loads it into the data lake in Parquet format, and we register those as Glue tables so they can be queried; then the sensors in the Airflow cluster keep track of these states to run SQL transformations using an internal framework, and…

Sounds ridiculous, doesn’t it? It certainly does to me, and it is a very common answer we get when we ask engineering teams what their data infrastructure looks like. The joy of building a house of cards matters more to them than the business impact. Meanwhile, the analytics team, data analysts, data scientists, and business teams are waiting for their questions to be answered, trying to understand why it takes 6 weeks to get a new chart on the sales dashboard.

I’m not sure if this is a result of ZIRP, but it’s easy to see how highly ineffective engineering teams, coupled with engineering leaders who don’t know what their teams are doing, dominate the game in organizations, while the people who use the data to create real value are ignored. They have to jump through a billion different tools trying to figure out why the dashboard isn’t updating, and wait for the central data team to respond to their ticket.

These are the data analysts on business teams, the growth hackers running marketing campaigns across 5 different platforms, the all-around data scientists trying to predict LTV. They are trying to create real impact, but their progress is severely hampered by the internal toys.

We’re building Bruin for these people: simpler data tools for impact-focused teams.


Bruin CLI and VS Code extension

Bruin CLI is an end-to-end data pipeline tool that brings data ingestion, data transformation with SQL and Python, and data quality together in a single framework.

Bruin comes with batteries included:

  • 📥 Ingest data with ingestr / Python
  • ✨ Run SQL and Python transformations on many platforms
  • 📐 Table/view materialization and incremental tables
  • 🐍 Run Python in isolated environments using uv
  • 💅 Built-in data quality checks
  • 🚀 Jinja templating to avoid repetition
  • ✅ Validate pipelines end-to-end
  • 👷 Run on your local machine, an EC2 instance, or GitHub Actions
  • 🔒 Secrets injection via environment variables
  • VS Code extension for a better developer experience
  • ⚡ Written in Golang
  • 📦 Easy to install and use

This means that with Bruin, teams can build end-to-end workflows without having to glue together a bunch of different tools. It is extensible through plain SQL and Python, while its opinionated approach guides users towards building maintainable data pipelines.
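
To make that concrete, here is a rough sketch of what a single SQL asset could look like, following the pattern of a @bruin comment block that carries the metadata (materialization, columns, quality checks) on top of the query. The asset name, platform type, and source table below are illustrative, and the exact keys may differ for your setup:

```sql
/* @bruin
name: myschema.customers        # illustrative asset name
type: duckdb.sql                # assumed platform type; use the one matching your warehouse
materialization:
  type: table
columns:
  - name: customer_id
    type: integer
    description: "Unique identifier of the customer"
    checks:
      - name: not_null
      - name: unique
  - name: lifetime_value
    type: float
    checks:
      - name: positive
@bruin */

select
    customer_id,
    sum(order_total) as lifetime_value
from raw.orders
group by 1
```

Running the pipeline then materializes the table and executes the column checks as part of the same run, so quality failures surface right next to the transformation itself.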

One of the features that comes with the Bruin CLI is our open source Visual Studio Code extension:

This extension does a few things that make it unique:

  • While everything in Bruin is code-driven, this extension adds a UI layer on top, meaning you get:
    • Visual documentation
    • Rendered queries
    • Columns and quality checks
    • Lineage
    • Validation and the ability to run backfills
    • Syntax highlighting
  • Everything happens locally, meaning no external servers or systems can access any of your data
  • The extension visualizes many configuration options, which makes it simple to run backfills, validations, and more.

This is a great example of our design principles: everything is version controlled while also providing a better experience through a thoughtful UI.

This extension is a first-class citizen of the Bruin ecosystem, and we intend to further extend its capabilities to make it the easiest platform for building data workloads.


Supported platforms

Bruin supports many cloud data platforms out of the box at launch:

  • AWS Athena
  • Databricks
  • DuckDB
  • Google BigQuery
  • Microsoft SQL Server
  • Postgres
  • Redshift
  • Snowflake
  • Synapse
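
For reference, connections to these platforms are declared once in a project-level configuration file instead of being scattered across tools. The sketch below assumes the .bruin.yml layout with a single environment; the connection names, credentials, and per-platform keys are illustrative:

```yaml
# .bruin.yml -- illustrative sketch; the available keys vary per platform
default_environment: default
environments:
  default:
    connections:
      duckdb:
        - name: duckdb-default
          path: ./data/analytics.db     # assumed option for a local DuckDB file
      snowflake:
        - name: snowflake-default
          username: MY_USER
          password: MY_PASSWORD         # secrets can also be injected via environment variables
          account: my-account
          database: ANALYTICS
          warehouse: COMPUTE_WH
```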

Over time, our list of supported platforms will grow. We’re always looking to hear feedback from the community on this, so please feel free to share your thoughts with us in our Slack community.


Bruin Cloud

We’re building Bruin for those obsessed with impact. You can go from zero to a full data pipeline within minutes, and we’re committed to making that experience even better. Using our open-source tools, you can build and run all of your data workloads locally, in GitHub Actions, in Airflow, or anywhere else.
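
As a sketch of the “run it anywhere” part, a minimal GitHub Actions workflow could install the CLI and then validate and run a pipeline on every push to main. The install step (building from source with Go) and the pipeline path are assumptions, not an official template:

```yaml
# .github/workflows/bruin.yml -- minimal sketch with an assumed install step and paths
name: run-bruin-pipeline
on:
  push:
    branches: [main]

jobs:
  run:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      # Assumption: install the CLI from source using Go.
      - uses: actions/setup-go@v5
        with:
          go-version: "1.22"
      - name: Install Bruin CLI
        run: go install github.com/bruin-data/bruin@latest

      # Validate the pipeline first, then run it.
      - name: Validate pipeline
        run: bruin validate ./pipelines/my-pipeline
      - name: Run pipeline
        run: bruin run ./pipelines/my-pipeline
```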

While we do believe the Bruin CLI has many useful deployment options across different infrastructures, we are also committed to building the best hosted experience for building and running Bruin workloads in production. That’s why we built Bruin Cloud:

Lineage view in Bruin Cloud

It has many advantages:

  • Hosted environments for ingestion, transformation, and ML workloads
  • Column-level lineage
  • Governance and cost reporting
  • Team management
  • Cross-pipeline dependencies
  • Multi-warehouse “mesh”

There’s a lot more. Please feel free to leave your email for a demo.


Share your thoughts

We’re excited to share the Bruin CLI and the VS Code extension with the world, and we love hearing from the community. We’d be grateful if you could share your thoughts on how to make Bruin better suited to your needs.

https://github.com/bruin-data/bruin

