Previously published on information-age.com.
Both the data lake and the data warehouse in the cloud have their benefits. Data lakes are unorganized pools of data without categories, but that makes them well suited to data scientists who want to analyze different kinds of data at once.
The two kinds of data storage also differ in the tools that can access them.
“In general, for the likes of Redshift, Snowflake, Azure SQL Data Warehouse, one of the most important things when you talk about a data warehouse is the accessibility to the tools that are available today and that people are familiar with,” said SnapLogic CTO Craig Stewart.
“This can be something like MicroStrategy or Tableau, or something like AWS Insights or Microsoft Power BI, all of which can talk SQL to that data store.
“That’s really what differentiates the data warehouse from the data lake. The accessibility to those tools, as well as the ability to query in that SQL form, makes it democratized so that anybody who can use SQL can use those things, whereas if you were talking about a data lake, you have a much more diverse set of capabilities, the APIs to deal with the files like Parquet, et cetera. That’s much more wide open, and generally tends to require much deeper knowledge.”
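Stewart's contrast can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the standard-library sqlite3 module stands in for a cloud data warehouse, newline-delimited JSON strings stand in for files in a data lake, and the records, table name and function names are invented for the example.

```python
import json
import sqlite3
from collections import defaultdict

# Hypothetical sales records, invented for this example.
RECORDS = [
    {"region": "EMEA", "amount": 120.0},
    {"region": "EMEA", "amount": 80.0},
    {"region": "APAC", "amount": 50.0},
]


def totals_via_sql(records):
    """Warehouse-style access: load rows into a table and ask in SQL."""
    conn = sqlite3.connect(":memory:")  # sqlite3 stands in for a warehouse
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO sales (region, amount) VALUES (:region, :amount)", records
    )
    rows = conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region"
    ).fetchall()
    conn.close()
    return dict(rows)


def totals_via_files(lines):
    """Lake-style access: parse each raw line and aggregate in code."""
    totals = defaultdict(float)
    for line in lines:
        record = json.loads(line)
        totals[record["region"]] += record["amount"]
    return dict(totals)


raw_lines = [json.dumps(record) for record in RECORDS]
sql_totals = totals_via_sql(RECORDS)
file_totals = totals_via_files(raw_lines)
print(sql_totals == file_totals)
```

Both routes reach the same totals. The difference is who can take them: the first needs only a SQL statement that any BI tool could issue, while the second needs someone who knows the file format and can write the parsing and aggregation logic.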
Differences in capabilities
The best way to store data depends on the vendor you use, according to Stewart.
“The things that Amazon and Microsoft are doing with S3 and the various different file systems that Amazon have produced, it’s interesting that the Microsoft Azure platform has now got three different file systems, which are a little confusing to users, but what they are doing is they’re iterating on the file systems to give the best functionality for what people are trying to do.
“So in the context of a data warehouse, the latest file system, Azure Data Lake Storage Gen2, is particularly suited for the data lake and the access you need to have to that from things like Spark, to get the best performance.
“The good thing about the Amazon world is that the S3 has been consistent for many years, so there hasn’t been a need for too many iterations. They’re providing some additional capabilities, updates to security and things you can iterate on, but not a wholesale change of API that the Azure environment has had. But of course on top of that, it’s a matter of what format you store it in.
“Certainly in the data warehouse world, Parquet has really taken off as the format of choice because of its compact nature, and if you’ve partitioned, the relatively fast speed that you can get out of that.”
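The speed-up from partitioning that Stewart mentions comes largely from being able to skip files that cannot match a query. The sketch below illustrates that idea only. It assumes a key=value directory layout of the kind commonly used for partitioned Parquet datasets, and it uses CSV files from the Python standard library as a stand-in, because reading real Parquet files needs a third-party library. The rows and function names are hypothetical.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical rows, invented for this example.
ROWS = [
    {"year": "2022", "region": "EMEA", "amount": "100"},
    {"year": "2023", "region": "EMEA", "amount": "150"},
    {"year": "2023", "region": "APAC", "amount": "70"},
]


def write_partitioned(root, rows, key):
    """Write one file per distinct value of `key`, in a key=value directory."""
    groups = {}
    for row in rows:
        groups.setdefault(row[key], []).append(row)
    for value, group in groups.items():
        part_dir = Path(root) / f"{key}={value}"
        part_dir.mkdir(parents=True, exist_ok=True)
        # The partition column lives in the path, so it is left out of the file.
        fields = [name for name in group[0] if name != key]
        with open(part_dir / "part-0.csv", "w", newline="") as handle:
            writer = csv.DictWriter(
                handle, fieldnames=fields, extrasaction="ignore"
            )
            writer.writeheader()
            writer.writerows(group)


def read_partition(root, key, value):
    """Read only the matching partition; other directories are never opened."""
    part_dir = Path(root) / f"{key}={value}"
    rows = []
    for path in sorted(part_dir.glob("*.csv")):
        with open(path, newline="") as handle:
            rows.extend(csv.DictReader(handle))
    return rows


with tempfile.TemporaryDirectory() as root:
    write_partitioned(root, ROWS, "year")
    partitions = sorted(path.name for path in Path(root).iterdir())
    rows_2023 = read_partition(root, "year", "2023")
    total_2023 = sum(int(row["amount"]) for row in rows_2023)

print(partitions, total_2023)
```

A query for 2023 touches one directory and ignores the rest, so the work grows with the size of the partition rather than the size of the whole dataset.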
Benefits of a cloud data warehouse
SnapLogic’s CTO went on to identify two benefits of using cloud data warehouse platforms in particular.
“Firstly, it’s a devolution of what we used to do some years ago when we were trying to offload queries from the corporate transactional systems, which is what we would call ‘query offloading’,” said Stewart. “Taking the data, putting it somewhere else in another database so you can query it without impacting the operational system.
“Now, that really has gone away. Those databases on-premise are now using the cloud data warehouses. That’s more what people are trying to do, and finding that offers them value.”
The second perk that the cloud data warehouse offers, according to Stewart, involves its scalability.
“Rather than having to build out the full scale that you might want at any point in time, which you used to have to build out, now cloud data warehouses do actually have the ability to scale on-demand,” he said.
“When I go to run my day, week or month-end reporting, and I need more oomph to apply to this, I can actually now do that for the period of the hours in which I want to do that, and the rest of the month I can scale it back to business-as-usual-type levels.
“This gives customers firstly a cost-benefit, but also we’re not burning all those fossil fuels to power this anymore. That elastic scalability of the cloud is being realized in the cloud data warehouse world more significantly than most other areas.”
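The cost side of that argument is simple arithmetic, sketched below. Every figure is an assumption made up for illustration (node counts, a flat price of 1.0 per node-hour, 10 peak hours a month); real cloud data warehouse pricing varies by vendor and is usually more complicated than a single hourly rate.

```python
HOURS_PER_MONTH = 730  # average hours in a month (8760 / 12)


def fixed_cost(peak_nodes, price_per_node_hour, hours=HOURS_PER_MONTH):
    """Cost of running at peak capacity for the whole month."""
    return peak_nodes * price_per_node_hour * hours


def elastic_cost(base_nodes, peak_nodes, peak_hours, price_per_node_hour,
                 hours=HOURS_PER_MONTH):
    """Cost of running at base capacity and scaling up only for peak hours."""
    if not 0 <= peak_hours <= hours:
        raise ValueError("peak_hours must be between 0 and hours")
    normal_hours = hours - peak_hours
    node_hours = base_nodes * normal_hours + peak_nodes * peak_hours
    return node_hours * price_per_node_hour


# Hypothetical workload: 2 nodes day to day, 8 nodes for 10 hours of
# month-end reporting, at an assumed price of 1.0 per node-hour.
always_peak = fixed_cost(peak_nodes=8, price_per_node_hour=1.0)
on_demand = elastic_cost(base_nodes=2, peak_nodes=8, peak_hours=10,
                         price_per_node_hour=1.0)
saving = always_peak - on_demand
print(always_peak, on_demand, saving)
```

Under these assumed numbers, sizing for the peak all month costs several times what scaling up for the reporting window does. The gap narrows as the peak period takes up more of the month.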
The challenges
Of course, operating a cloud data warehouse isn’t without its difficulties, and these go beyond the lower level of data diversity compared with data lakes.
One challenge Stewart addressed was the cost of moving data to the cloud.
“There is the idea that you can move everything into the cloud,” he explained. “You can, but there is a cost associated with it, without a doubt, not only in terms of moving the data to that environment but also keeping it in that environment.
“One of the benefits of a data lake versus a cloud data warehouse is that the data lake is much more about passive storage as opposed to the cloud data warehouse, where you are actively maintaining the different tables.
“When trying to get that balance right, a cloud data warehouse is going to cost you considerably more than just the basic storage, and understanding the balance of what I should put in my data warehouse versus keeping just in storage, and the benefit is in moving stuff on demand into the cloud data warehouse.”
A second challenge, according to SnapLogic’s CTO, stems from how data is queried and from communication between IT and decision-makers.
“Using the likes of Redshift Spectrum, external tables essentially so you define them in the cloud data warehouse, but when you actually query them, it actually, in the background, also makes a query on those native files, and it’s not kept directly in memory,” said Stewart.
“So understanding what the balance is across those things is an important part of the cloud data warehouse, and from SnapLogic’s perspective, how you get the data to that is the challenge that we’re addressing, making that a task which can be undertaken by the line of business rather than the IT team, and I think that’s an important thing.
“If within an organization they want to be able to be agile, get the data queryable and usable in a short space of time, they do need that capability of saying ‘We can’t wait for IT to move our data across because IT generally has a backlog, and they’re dealing with the daily running of the business, rather than the more agile processes that the line of the business is trying to get to be able to do those things, like changing product lines, changing pricing, and being able to understand those things.’”