This is the first post in a series that will feature extracts from the new whitepaper: Will the Data Lake Drown the Data Warehouse? The paper is written by Mark Madsen, founder and president of Third Nature, a consulting and advisory company specializing in analytics, information management, and the technology infrastructure required to support them. Mark is a well-known consultant and industry analyst who frequently speaks at conferences and seminars in the US and Europe and writes for a number of leading industry publications.
Learn more about SnapLogic for big data integration on our website, and be sure to check out the webinar we hosted with Mark last month, Building the Enterprise Data Lake: Important Considerations Before You Jump In.
“New opportunities in business require a new platform to process data. The data warehouse has been used to support many different query and reporting needs, but organizations want a general-purpose, multi-application, multi-user platform that supports needs other than just query and reporting: the data lake.
To date, most lake deployments have been built through manual coding and custom integration. Most of this development effort is the first stage of the work; once it is done, the useful work of building business applications can start.
Manual coding of data processing applications is common because data processing is thought of in terms of application-specific work. Unfortunately, this manual effort is a dead-end investment over the long term, because products will take over the repeatable tasks. These new products will improve over time, unlike custom code built in an enterprise, which becomes a maintenance burden as it ages.
This puts technology managers in a difficult position today. The older data warehouse environments and integration tools are good at what they do, but they can’t meet many of the new needs. The new environments are focused on data processing, but require a lot of manual work. Should one buy, build or integrate components? What should one buy or build?
The answer is to focus not on specific technologies like Hadoop but on the architecture. In particular, one should focus on how to provide the core new capability of a data lake: general-purpose data processing.”
What’s different about a Data Lake?
“The core capability of a data lake, and the source of much of its value, is the ability to process arbitrary data. This is what makes it fundamentally different from a data warehouse. The functional needs of the lake include the ability to support the following:
- Store datasets of any size
- Process and standardize data, no matter what structure or form the data takes
- Integrate disparate datasets
- Transform datasets from one form into another
- Manage data stored in and generated by the platform
- Provide a platform for exploration of data
- Provide a platform that enables complex analytic or algorithmic processing
- Support the full lifecycle of data from collection to use to archival
- Refine and deliver data as part of operational processes, from batch to near real time”
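As a concrete, deliberately simplified illustration of the "process and standardize data, no matter what structure" item on Mark's list, the Python sketch below normalizes records that arrive as either CSV or JSON into one common, typed form. It is not taken from the whitepaper or from any SnapLogic product; the file names, field names, and target schema are hypothetical.

```python
# Illustrative only: a tiny example of processing and standardizing data
# regardless of the structure it arrives in. File names, field names, and
# the target schema are hypothetical, not from the whitepaper.
import csv
import io
import json

# Raw payloads as they might land in a lake's raw zone, in two different formats.
RAW_INPUTS = [
    ("orders.csv", "order_id,customer,amount\n1001,Acme,250.00\n"),
    ("orders.json", '[{"order_id": 1002, "customer": "Globex", "amount": 99.5}]'),
]

def standardize(name, payload):
    """Normalize a raw payload (CSV or JSON) into a list of uniform dicts."""
    if name.endswith(".csv"):
        rows = list(csv.DictReader(io.StringIO(payload)))
    elif name.endswith(".json"):
        rows = json.loads(payload)
    else:
        raise ValueError(f"unsupported format: {name}")
    # Map every source into the same typed target form.
    return [
        {"order_id": int(r["order_id"]),
         "customer": r["customer"],
         "amount": float(r["amount"])}
        for r in rows
    ]

if __name__ == "__main__":
    for name, payload in RAW_INPUTS:
        for record in standardize(name, payload):
            print(record)
```

In a real lake this kind of normalization would run at scale over files in distributed storage rather than over in-memory strings, but the principle is the same: the platform accepts data in whatever form it arrives and makes it usable downstream.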
In the next post in this series, Mark will describe the new data lake requirements and architecture. Be sure to download the entire whitepaper and check out Mark’s recent webinar presentation with SnapLogic here.