As organizations grapple with how to manage ever more voluminous and varied reservoirs of big data, data lakes are increasingly viewed as a smart approach. However, while the model can deliver the flexibility and scalability lacking in traditional enterprise data management architectures, data lakes also introduce a fresh set of integration and governance challenges that can impede success.
The power and potential of data lakes
Born from the rise of the cloud and big data technologies like Hadoop, data lakes give organizations a cost-effective way to store nearly limitless amounts of structured and unstructured data from myriad sources, without regard to how that data might be used in the future. By its very nature, and through self-service business intelligence capabilities, a data lake also encourages experimentation and data exploration by a broader set of users, not just business analysts. According to a survey conducted by TDWI Research, 85 percent of respondents considered the data lake an opportunity to address the challenges they face trying to manage the data deluge with traditional relational databases. Moreover, the TDWI survey found that the data lake is being pursued for a variety of benefits and use cases, the most prominent being advanced analytics (49 percent) and data discovery (49 percent).
Despite the power and potential of the technology, organizations are likely to bump up against entirely new data management and data integration headaches if they approach data lakes without a cohesive and well-planned strategy. Traditional data integration solutions, including enterprise service bus (ESB) products, extract, transform, and load (ETL) tools, and custom code, can't manage the volume and variety of structured and unstructured data, nor can they work effectively with schema-less data storage or handle real-time data streams. With those caveats in mind, adhering to the following best practices can help ensure a smoother data lake rollout and a more effective migration and integration plan:
Embrace data governance. Yes, the data lake is flexible and unstructured, but without attention to formal governance practices, it can rapidly turn into a hard-to-navigate, impossible-to-manage data swamp. What’s crucial is to establish controls via policy-based data governance, helped along by a qualified data steward, as well as to enforce a metadata requirement, which will ensure users can find data and optimize queries. Designing for automated metadata creation is one way to ensure consistency and accuracy.
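To make the idea of automated metadata creation concrete, here is a minimal Python sketch. It assumes incoming files are UTF-8 CSV and that the catalog is a plain in-memory dictionary; the names CatalogEntry, describe_csv, and register are hypothetical and do not refer to any particular product or library.

```python
import csv
import hashlib
import io
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Hypothetical metadata record; the field names are illustrative only."""
    name: str
    source: str
    size_bytes: int
    sha256: str
    columns: list = field(default_factory=list)


def describe_csv(name: str, source: str, payload: bytes) -> CatalogEntry:
    """Derive metadata from the payload itself rather than from manual entry."""
    text = payload.decode("utf-8")  # assumption: files arrive as UTF-8 CSV
    header = next(csv.reader(io.StringIO(text)), [])  # first row, or [] if empty
    return CatalogEntry(
        name=name,
        source=source,
        size_bytes=len(payload),
        sha256=hashlib.sha256(payload).hexdigest(),
        columns=[col.strip() for col in header],
    )


def register(catalog: dict, entry: CatalogEntry) -> None:
    """Enforce the metadata requirement: refuse entries with no source or columns."""
    if not entry.source or not entry.columns:
        raise ValueError(f"{entry.name}: source and columns are required")
    catalog[entry.name] = entry


catalog = {}
entry = describe_csv("orders.csv", "crm", b"id, amount\n1,9.5\n")
register(catalog, entry)
```

Because the checksum, size, and column names are computed from the data at ingestion time, every entry is described the same way, and a file that arrives without a usable header is rejected rather than silently added to the lake.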
Build on governance with zones. Data in a data lake can be logically or physically separated by function, which can help keep the environment organized. While there are lots of approaches to this strategy, some experts suggest maintaining a zone for short-lived data before it's ingested, another for raw data such as sensor data or weblogs, and then trusted zones for data that has gone through quality routines and validation and can therefore become the source for other downstream systems.
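The zone layout described above might be sketched in Python as follows. This is an illustration under stated assumptions, not a reference design: the zone names, the "lake/<zone>/<dataset>" path layout, and the quality rule in passes_quality are all hypothetical placeholders.

```python
from enum import Enum


class Zone(Enum):
    TRANSIENT = "transient"  # short-lived data awaiting ingestion
    RAW = "raw"              # data as received, e.g. sensor data or weblogs
    TRUSTED = "trusted"      # data that passed quality routines and validation


def zone_path(zone: Zone, dataset: str) -> str:
    """Build a logical path; the 'lake/<zone>/<dataset>' layout is an assumption."""
    return f"lake/{zone.value}/{dataset}"


def passes_quality(record: dict) -> bool:
    """Placeholder rule: a record needs a non-empty id and an int or float reading."""
    return bool(record.get("id")) and type(record.get("reading")) in (int, float)


def promote(records: list) -> dict:
    """Keep every record in the raw zone and copy only valid ones to trusted."""
    placed = {Zone.RAW: list(records), Zone.TRUSTED: []}
    for record in records:
        if passes_quality(record):
            placed[Zone.TRUSTED].append(record)
    return placed


placed = promote([
    {"id": "s1", "reading": 21.5},
    {"id": "", "reading": 3},        # rejected: empty id
    {"id": "s2", "reading": "n/a"},  # rejected: reading is not a number
])
```

Keeping the unfiltered records in the raw zone while copying only validated ones to the trusted zone means downstream systems can read from the trusted zone alone, and the original data remains available if the quality rules change later.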
Evaluate more modern integration methods. Existing data integration solutions like ESBs and ETL tools can't accommodate the unique needs of a data lake, including the need to import and export data in real time and to work with unstructured data, which often changes at a rapid pace. In comparison, newer data integration approaches are purpose-built to work with large amounts of data that lack a native hierarchical structure, and many offer pre-built connectors that empower "citizen developers" to handle some of this work without relying on IT.
Staff up accordingly. It's hard enough to find qualified data warehouse experts or business intelligence analysts, but big data and the accompanying analytics requirements kick the required skill level up a notch. Given the relative newness of technologies like Hadoop, most organizations don't have trained specialists in that discipline or in other relevant competencies, such as data flow technologies like Flume and Spark. To ensure the right mix of talent is in place, IT organizations should identify high-performing individuals who can be trained in some of these emerging skill sets, as well as bring in outside contract experts when and where appropriate.
Data lakes can help organizations make good on the promises of big data analytics for unearthing insights and driving innovation. Yet the new model requires adherence to governance and new integration practices to ensure the journey is smooth, not swampy.