As much as “90% of the data in the world” was created in the last 10 years, according to Accenture. The consultancy also predicts that about 175 zettabytes (175 trillion gigabytes) of data will have been created by 2025.
Despite the gargantuan amount of data being collected, poor data quality continues to cost businesses an average of $12.9 million every year. How, then, can businesses maintain data quality while accumulating more and more information?
The answer depends on how you choose to manage your data. In the last few years, enterprises have witnessed an evolutionary trend in data architecture: from data centralization, as in the data warehouse and the data lake, to data decentralization, as seen in the data mesh. If you want to unlock the best of business intelligence, your data management approach significantly impacts your ability to make reliable data-driven decisions.
In this article, we explore the potential of data centralization and data decentralization to improve data discoverability, accessibility, interoperability, and security.
Overview of Data Decentralization
Data decentralization refers to a data management approach in which the storage, cleaning, optimization, output, and consumption of data are distributed without the need for a central repository. Data decentralization breaks out data products across different organizational departments to reduce the complexity and challenges of dealing with large amounts of data, changing schema, downtime, upgrades, and backward data compatibility.
The data mesh is an example of a data management framework that takes the data decentralization approach.
What Is a Data Mesh?
A data mesh is an enterprise data management framework defining how to manage business-domain-specific data in a manner that allows the business domains to own and operate their data. It empowers domain-specific data producers and consumers to collect, store, analyze, and manage data pipelines without the need for an intermediary data-management team.
The data mesh has its origins in distributed computing, where software components are shared among multiple computers running together as a system. In the data mesh, the ownership of data is distributed across different business domains, and each domain is responsible for creating its data products. The idea of the data mesh was first defined by Zhamak Dehghani, a technology consultant at Thoughtworks, in 2019.
The data mesh also enables easier contextualization of data to generate deeper insights while concurrently facilitating more collaboration from domain owners to create solutions tailored to specific business needs.
The architecture of the data mesh has information stored across multiple sources, and a data formation service makes the data products available as permissioned tables. The data owner may also create and expose APIs that other users can consume. The data mesh also has a data catalog that stores metadata, such as table names, columns, and user-defined tags.
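To make this concrete, here is a minimal sketch in Python of what a catalog entry for a domain-owned data product might look like. The names used here (DataProductEntry, register, the sales table and tags) are illustrative assumptions, not the API of any specific data mesh platform.

```python
from dataclasses import dataclass, field

# Illustrative sketch only -- these class and field names are hypothetical,
# not a specific data mesh product's API.
@dataclass
class DataProductEntry:
    """Catalog metadata for one domain-owned data product."""
    domain: str                # owning business domain
    table_name: str            # permissioned table exposed to consumers
    columns: list[str]         # column names recorded in the catalog
    tags: dict[str, str] = field(default_factory=dict)  # user-defined tags

catalog: list[DataProductEntry] = []

def register(entry: DataProductEntry) -> None:
    """Publish a data product to the shared catalog so other domains can discover it."""
    catalog.append(entry)

# The sales domain owns and publishes its own product...
register(DataProductEntry(
    domain="sales",
    table_name="sales.orders_daily",
    columns=["order_id", "order_date", "region", "total"],
    tags={"owner": "sales-data-team", "pii": "false"},
))

# ...and a consumer in another domain discovers it by tag, with no
# central data team in the loop.
non_pii = [e.table_name for e in catalog if e.tags.get("pii") == "false"]
print(non_pii)  # ['sales.orders_daily']
```

The design point is that the owning domain publishes and documents its own product, while consumers discover it through shared metadata rather than through a central data team.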
The key benefits of a data mesh include:
- Decentralizing data ownership and data operations so that your business domains can make relevant decisions faster
- Providing domain teams with the independence to choose the data technology stack that best meets their needs
- Delivering transparency across cross-functional teams by reducing the likelihood of isolated data teams
- Facilitating data sovereignty and data residency to ensure alignment with data governance regulations
Overview of Data Centralization
Data centralization is a function of traditional monolithic data infrastructure that handles the storage, cleaning, optimization, output, and consumption of data in a central location. While data centralization ensures that data is managed from a central source, it is also designed to make the data accessible from many different points.
Data centralization minimizes information silos, enables more collaboration, and makes it easy to see and predict the potential impact of emerging trends or proposed changes across different departments. A centralized data view also helps align data strategy with business strategy by delivering a 360-degree view of trends, insights, and predictions so that everyone in the organization can pull in the same direction.
The data warehouse and the data lake are examples of data management systems that take the data centralization approach.
What Is a Data Warehouse?
A data warehouse is a first-generation enterprise data management system that collects and manages proprietary data from different sources within a centralized platform to synthesize business intelligence.
The architecture of a data warehouse spans three tiers: the top tier is a front-end client that provides access to analysis, data mining, and reporting tools; the middle tier houses the analytics engine; and the bottom tier holds the database server.
The data warehouse follows a schema-on-write approach: the schema is defined before the data is loaded. A warehouse may contain multiple databases, each organized into a hierarchical format of files and folders.
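As a rough illustration of schema-on-write, here is a short Python sketch that uses the standard library's sqlite3 as a stand-in for a warehouse database; the orders table and its columns are invented for the example. The structure is declared up front, and rows that don't conform are rejected at load time.

```python
import sqlite3

# Schema-on-write: the table structure is declared *before* loading,
# so every incoming row must already conform to it.
conn = sqlite3.connect(":memory:")  # stand-in for a warehouse database
conn.execute("""
    CREATE TABLE orders (
        order_id   INTEGER PRIMARY KEY,
        order_date TEXT NOT NULL,
        total      REAL NOT NULL
    )
""")

# Conforming rows load cleanly...
conn.execute("INSERT INTO orders VALUES (1, '2024-01-15', 99.50)")

# ...while malformed rows are rejected at write time, which is how the
# warehouse enforces consistent format and quality up front.
try:
    conn.execute("INSERT INTO orders (order_id, order_date) VALUES (2, NULL)")
except sqlite3.IntegrityError as err:
    print(f"rejected at load time: {err}")
```

This up-front enforcement is what gives the warehouse its consistent format, quality, and accuracy, at the cost of more modeling work before any data lands.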
The key benefits of a data warehouse include:
- Consolidating data from multiple sources
- Enabling the analysis of historical data
- Ensuring consistent data format, quality, and accuracy
- Facilitating the separation of transactional databases from analytics for improved performance
However, creating data products from the data warehouse tends to become complicated, time-consuming, and potentially costly because people usually underestimate the resources required for data loading.
What Is a Data Lake?
A data lake is a centralized repository of raw, unprocessed data from various sources, collected without a definite plan for how and when it will be used. It’s a second-generation enterprise data management system focused on managing big data.
The architecture of a data lake typically manages information in the cloud, with a data lake console and a data lake CLI on the front end. On the back end, you’ll find the data lake’s RESTful API, Lambda functions, directories, a data catalog, an OpenSearch server, and more.
The data lake allows you to manage multiple data types, including relational and non-relational data, in a raw, granular format within a flat architecture. Because the data is stored in its raw state, the data lake follows a schema-on-read approach: the schema is applied at the time of analysis, so new data can be ingested quickly without upfront modeling.
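Here is the schema-on-read counterpart to the warehouse sketch above, again a hedged illustration: a local JSON-lines file stands in for lake storage, and the event fields are invented. Raw records land as-is, and structure is imposed only when someone reads them.

```python
import json
from pathlib import Path

# Raw events land in the "lake" exactly as produced -- no upfront schema.
lake = Path("events.jsonl")
lake.write_text(
    '{"user": "a1", "action": "click", "ts": 1700000000}\n'
    '{"user": "b2", "action": "purchase", "ts": 1700000042, "total": 19.99}\n'
)

# Schema-on-read: the structure is imposed at analysis time. Each reader
# projects only the fields it cares about, tolerating missing ones.
def read_events(path: Path, fields: tuple[str, ...]):
    for line in path.read_text().splitlines():
        record = json.loads(line)
        yield {f: record.get(f) for f in fields}

for row in read_events(lake, ("user", "action", "total")):
    print(row)
```

Because each reader can project just the fields it needs, the same raw data can serve machine learning, R&D, and reporting workloads without re-ingestion.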
The key benefits of the data lake include:
- Facilitating the faster development of machine learning models
- Promoting faster data movement by importing large amounts of data in real time
- Enhancing the crawling, cataloging, indexing, and security of data
- Empowering R&D teams to test hypotheses, refine assumptions, and track results
While business analysts may be able to use the data warehouse directly, the data lake requires the expertise of data scientists and data developers with specialized tools to navigate complex datasets. Without that expertise, poor data integrity and security lapses can turn the data lake into a dead data swamp.
When Is a Centralized Data Management Approach Right?
Centralized data solutions such as data lakes and data warehouses are useful in some instances:
- If your company is only just starting out with data management, and you have few business domains or a minimal dataset. This is especially relevant if you have cross-functional teams where people wear multiple hats. You might be better off having a centralized data team than having to create a data team to support every job function.
- If big data is crucial to your business operations and you have to store, prepare, and analyze a huge amount of data. Data centralization collects all of your business data in one place, so it’s easiest for the data team to clean up and prepare the data. It also enables the data team to run a unified compliance process to maintain data integrity.
- If your data management budget is low and you need affordable storage for large amounts of raw, structured, or unstructured data. Centralized data management systems help lower storage and compute costs because you can manage data on a single server or use a cloud solution in which the provider bears the overhead costs.
When Is a Data Mesh a Better Data Management Approach?
A data mesh represents a shift to decentralized data management at the operational and technological levels. If you need more efficient development of data products in your organization, a data mesh is a step in the right direction toward reducing operational costs and synthesizing in-depth business insights.
You may also want to consider using a data mesh if:
- Your teams need to gather data from disparate heterogeneous sources for instant processing. The data mesh gives departments easy, local access to the information they need.
- Your teams need self-service access to insights or reporting without having to queue their data requests with a centralized IT or data team.
- You need to combine and analyze different types of structured and unstructured data. Because the data mesh manages data in domain-specific groups, the data products your teams create carry richer context.
Is a Data Warehouse, Data Lake, or Data Mesh Right for Your Business? It Depends
The data management architecture you choose depends on your unique data needs, your available resources, and your plans for managing data in the future. Whatever you choose, the important thing is to make sure that your data platform doesn’t become a dumping ground for data. Rather, it should be an optimized system that empowers you to synthesize business intelligence efficiently.
Next steps:
- Check out this whitepaper on how to jump-start your cloud data warehouse.
- Learn more about how SnapLogic is bringing the future of the data warehouse to the present.
- Check out our whitepaper on how to build an enterprise data lake.
- Learn more about SnapLogic’s role in the enterprise data lake.
- Read how to implement enterprise automation and integrate a data lake or data warehouse.