Demystifying data fabrics – bridging the gap between data sources and workloads
January 15, 2025


The term data fabric is used throughout the technology industry, but its definition and implementation vary. I’ve seen this with different vendors: British Telecom (BT) talked about its data fabric at an analyst event last fall; meanwhile, in the storage space, NetApp is refocusing its brand on intelligent infrastructure, having previously made heavy use of the data fabric term. Application platform provider Appian has a Data Fabric product, and database provider MongoDB is also talking about data fabrics and similar ideas.

At its core, a data fabric is a unified architecture that abstracts and integrates disparate data sources into a single data layer. The principle is to create a single, synchronized layer between those sources and the workloads that need access to the data—your applications, your services and, increasingly, your AI algorithms or learning engines.

There are many reasons to want this overlay. A data fabric acts as a universal integration layer, connecting to different data sources and adding advanced capabilities that make them easier for applications, workloads and models to consume, such as providing access to these sources while keeping them in sync.
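As a rough illustration of that idea, the sketch below shows a single query interface routing requests to whichever registered source holds the data. The names used here (DataSource, DataFabric, query) are hypothetical and simply stand in for whatever integration layer a given vendor provides.

```python
# Minimal sketch of the "unified access layer" idea: workloads query one
# interface, and the fabric routes each request to whichever registered
# source actually holds the data. Illustrative only; not a vendor API.
from abc import ABC, abstractmethod
from typing import Any


class DataSource(ABC):
    """One backing store: a database, an object store, a SaaS API, etc."""

    @abstractmethod
    def query(self, entity: str, filters: dict[str, Any]) -> list[dict]:
        ...


class DataFabric:
    """Single entry point that hides which source serves each entity."""

    def __init__(self) -> None:
        self._routes: dict[str, DataSource] = {}

    def register(self, entity: str, source: DataSource) -> None:
        self._routes[entity] = source

    def query(self, entity: str, **filters: Any) -> list[dict]:
        # The workload never needs to know where 'customers' or 'orders' live.
        return self._routes[entity].query(entity, filters)
```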

So far, so good. The problem, however, is that there is a gap between the data fabric principle and its actual implementation. People use the term to mean different things. Returning to our four examples:

  • BT defines a data fabric as a network-level overlay designed to optimize data transmission over long distances.
  • NetApp’s interpretation (even using the term “intelligent data infrastructure”) emphasizes storage efficiency and centralized management.
  • Appian positions its Data Fabric product as a tool for unifying data at the application level, allowing for faster development and customization of user-centric tools.
  • MongoDB (and other structured-data providers) apply data fabric principles in the context of data management infrastructure.

How do we get past all this? One answer is to recognize that we can approach the issue from different perspectives. You can talk about a data fabric conceptually, recognizing the need to bring data sources together without going overboard. You don’t need a universal “super-fabric” that covers absolutely everything. Instead, focus on the specific data you need to manage.

If we rewind a couple of decades, we see similarities with the principles of service-oriented architecture, which sought to decouple service delivery from database systems. Back then we discussed the difference between services, processes and data. The same applies now: you can query a service or query data as a service, focusing on what’s needed for your workload. Create, read, update and delete remain the simplest data processing services!
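To make that comparison concrete, here is a minimal, assumed sketch of data exposed as a service through those four operations, with an in-memory dictionary standing in for whatever system of record sits behind it.

```python
# Sketch of "data as a service" in the CRUD sense: the workload calls service
# operations rather than touching the underlying store directly. Purely
# illustrative; the dict is a stand-in for the real backing store.
class RecordService:
    def __init__(self) -> None:
        self._store: dict[str, dict] = {}

    def create(self, key: str, record: dict) -> None:
        self._store[key] = record

    def read(self, key: str) -> dict | None:
        return self._store.get(key)

    def update(self, key: str, changes: dict) -> None:
        self._store[key] = {**self._store[key], **changes}

    def delete(self, key: str) -> None:
        self._store.pop(key, None)
```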

I’m also reminded of the origins of network acceleration, which used caching to speed up data transfers by storing versions of data locally rather than repeatedly accessing the source. Akamai has built its business on efficiently transmitting unstructured content such as music and movies over long distances.
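The underlying pattern is a simple cache-aside read: serve a local copy when one exists, and only go back to the distant source on a miss. The sketch below assumes a hypothetical fetch_from_origin callable for the slow path.

```python
# Cache-aside read, the pattern behind the network-acceleration comparison.
from typing import Callable


def cached_read(key: str, cache: dict[str, bytes],
                fetch_from_origin: Callable[[str], bytes]) -> bytes:
    if key in cache:                 # local hit: no long-distance transfer
        return cache[key]
    data = fetch_from_origin(key)    # miss: pay the transfer cost once
    cache[key] = data                # keep a local copy for next time
    return data
```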

This doesn’t mean data fabrics are reinventing the wheel. Technologically, we are in a different (cloud) world, and they bring new aspects too, not least around metadata management, provenance tracking, compliance and security. This is especially important for AI workloads, where data management, quality and provenance directly affect model performance and reliability.
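As a loose illustration of what provenance tracking can look like at the record level, the sketch below attaches source and timestamp metadata to a payload so a downstream workload can check where its data came from and when. The field names are assumptions, not any product’s schema.

```python
# Hypothetical example of provenance metadata travelling with a record.
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class GovernedRecord:
    payload: dict
    source_system: str                       # where the record originated
    retrieved_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))
    classification: str = "internal"         # e.g. used by compliance policies
```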

If you’re planning to deploy a data fabric, start by thinking about what you need the data for. Not only will this help you work out which type of data fabric might be most appropriate, it will also help you avoid the pitfall of trying to manage all the data in the world. Instead, you can prioritize the most valuable subset of data and decide which layer of data fabric best suits your needs:

  1. Network layer: for integrating data across multi-cloud, on-premises and edge environments.
  2. Infrastructure layer: if your data is centralized on a single storage provider, focus on the storage layer to serve consistent pools of data.
  3. Application layer: for bringing together disparate data sets for specific applications or platforms.

For example, in BT’s case, the company found intrinsic value in using its data fabric to consolidate data from multiple sources. This reduces duplication and helps streamline operations, making data management more efficient. It is undoubtedly a useful tool for unifying silos and improving application rationalization.

Ultimately, a data fabric is not a monolithic, one-size-fits-all solution. It is a strategic layer of thinking, backed by products and features, that you can apply where it makes the most sense to increase agility and improve data delivery. Deploying a fabric is not a set-it-and-forget-it exercise: it requires ongoing effort to scope, deploy and maintain, covering not only the software itself but also the configuration and integration of data sources.

While a data fabric can conceptually exist in multiple places, it is important not to duplicate delivery effort unnecessarily. So whether you bring data together at the network, infrastructure or application layer, the principles remain the same: apply it where it best suits your needs, and let it evolve with the data it serves.


