Modern data integration requires both reliable batch and reliable streaming computation to support essential business processes. Traditionally, in the enterprise software space, batch ETL (Extract, Transform, and Load) and streaming CEP (Complex Event Processing) were two completely different products with different means of formulating computations. Until recently, in the open source big data space, batch and streaming were also addressed separately, with, for example, MapReduce for batch and Storm for streams. Now we are seeing more data processing engines that attempt to provide models for both batch and streaming, such as Apache Spark and Apache Flink. In this series of posts, I'll explain the need for a unified programming model and an underlying hybrid data processing architecture that accommodates both batch and streaming computation for data integration. For data integration, however, this model must sit at a level that abstracts away the specific data processing engines.
At SnapLogic we have developed a hybrid visual programming model and execution platform. We have implemented our own hybrid execution engine that can execute batch and streaming computations. In addition, we can transform SnapLogic dataflow pipelines into MapReduce and Spark computations. Our model insulates users from the complexity of the underlying data processing engines. This insulation allows SnapLogic pipelines to execute on the most appropriate target, hides changes in those targets, and makes migration from one target to another straightforward. It also enables users to take advantage of new data processing engines without porting their data pipelines.
Understanding Batch Computation
A batch integration computation usually involves accessing an entire data set, such as a table or a collection, in order to transform the data or to perform an analytic query. That is, the data input to the batch computation is completely available in a database or some other storage platform such as HDFS. Data transformations include filtering data, performing data quality operations, or augmenting and enriching data with additional external data. It is important to understand that some batch computations can work on each data element (row or document) independently of all other data elements. This is true for filtering and simple transformations. However, some computations have dependencies among the data elements. This is true for analytic queries that involve aggregation, sorting, or joining data sets. Finally, some batch computations require multiple iterations over a data set. This is true for machine learning algorithms and certain types of graph computations (e.g., the PageRank algorithm).
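To make the distinction concrete, here is a minimal sketch in plain Python over a few hypothetical order records. The filter is element-independent and could run on each record in isolation, while the per-region totals depend on seeing every record in a group, which is what forces a grouping (shuffle) step in a distributed engine.

```python
# Hypothetical order records used only for illustration.
orders = [
    {"id": 1, "region": "EU", "amount": 120.0},
    {"id": 2, "region": "US", "amount": 75.5},
    {"id": 3, "region": "EU", "amount": 310.0},
]

# Element-independent: each record can be filtered on its own,
# so the work parallelizes trivially across partitions.
large_orders = [o for o in orders if o["amount"] > 100.0]

# Dependent: the aggregate for a region is not complete until every
# record in that region has been seen.
totals_by_region = {}
for o in orders:
    totals_by_region[o["region"]] = totals_by_region.get(o["region"], 0.0) + o["amount"]

print(large_orders)
print(totals_by_region)
```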
A key capability of modern data processing engines is the ability to tolerate failure during a potentially long computation. This is especially true when your computation is running on hundreds to thousands of commodity compute nodes. Google MapReduce was the first large-scale data processing engine to provide a programming model that allows a programmer to concentrate on the computation while hiding the complexities of managing fault tolerance. The MapReduce run-time engine ensures that a MapReduce computation will eventually complete in the presence of network and node failures. MapReduce fault tolerance is achieved through re-execution of Map or Reduce tasks. The MapReduce implementation works closely with the underlying reliable distributed file system, GFS in the case of Google, to provide input to the re-execution of tasks. Hadoop MapReduce relies on HDFS and works in a similar manner to Google MapReduce. User-facing tools such as Pig and Hive ultimately run computations as MapReduce jobs.
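The programming model itself is small: the user writes only a map function and a reduce function, and the engine handles partitioning, shuffling, and re-running failed tasks. The sketch below illustrates that contract with the classic word count, using a simple in-process "shuffle" to stand in for the framework so it runs on its own; it is not Hadoop code, just an illustration of the model.

```python
# Minimal, self-contained sketch of the MapReduce programming model:
# the user supplies only map_fn() and reduce_fn(); the engine (simulated
# here by an in-process shuffle) handles distribution and re-execution.
from collections import defaultdict

def map_fn(line):
    # Emit (word, 1) for every word in an input line.
    for word in line.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Sum the partial counts for one key.
    yield (word, sum(counts))

def run_wordcount(lines):
    # "Shuffle" phase: group mapped values by key, as the engine would
    # do across the cluster before invoking the reducers.
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)
    results = []
    for key, values in sorted(groups.items()):
        results.extend(reduce_fn(key, values))
    return results

print(run_wordcount(["to be or not to be", "to see or not to see"]))
```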
Like MapReduce, Spark provides a programming model and run-time engine that ensures Spark computations will complete correctly in the presence of cluster failures. The Spark model and implementation, based on resilient distributed datasets (RDDs), provide a different way to formulate distributed computations and can result in much faster execution in many cases. Spark provides a higher-level API than MapReduce, and the individual operations applied to an RDD are tracked. This ability to track the lineage of operations allows Spark to re-execute operations and re-create RDDs in the event of a node failure. Unlike MapReduce, core Spark blurs the line between batch and interactive queries. Once data has been loaded into one or more Spark RDDs in a cluster, that data can be queried quickly to discover different results. This avoids reloading the same data into memory for independent queries. That said, like MapReduce, Spark is often used as a faster batch data processing engine. Like MapReduce, Spark relies on an underlying reliable distributed file system, such as HDFS, or a reliable distributed data store like Cassandra.
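The following PySpark sketch shows the idea; it assumes a local Spark installation and a hypothetical log file at /data/events.log. The chain of transformations is recorded as lineage that Spark can replay to rebuild lost partitions, and caching the RDD lets several independent queries reuse the in-memory data instead of re-reading it from storage.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-cache-sketch")

# Build an RDD; the transformations below form the lineage Spark can
# replay to re-create lost partitions after a node failure.
events = sc.textFile("/data/events.log").filter(lambda line: line.strip() != "")
events.cache()  # keep the loaded data in cluster memory

# Two independent queries reuse the cached RDD rather than reloading the file.
error_count = events.filter(lambda line: "ERROR" in line).count()
warn_count = events.filter(lambda line: "WARN" in line).count()

print(error_count, warn_count)
sc.stop()
```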
Pure batch computations are at the core of supporting main lines of business, reporting, forecasting, and scientific computing. As such, batch-oriented distributed computing will continue to be a critical technology component in most organizations.
My next post in this series will focus on Understanding Streaming Computation, and I will conclude with the advantages of a unified batch and streaming data integration platform.