Hadoop / Big Data in Enterprise
The Big Data ecosystem came into existence to deal with the massive amount of data generated via web/online activity. Once the major components of the ecosystem matured, it didn’t take much time for enterprise organizations to use these new tools and technologies for their use cases. Today, enterprise organizations are dumping all kinds of structured, semi-structured as well as unstructured data into their data lakes. But the major challenge faced by them is how to make sense of the massive data in their data lake. Using the SnapLogic Elastic iPaaS (Integration platform as a service), which is based on the visual programming paradigm, customers can address this issue with ease. It provides a powerful web-based Designer and hundreds of prebuilt connectors (Snaps), which customers can drag and drop to build their data pipelines to cleanse and reshape the data in the required format, and they can do it at big data scale and in a big data environment.
Security
Security is another major concern when using Hadoop in the enterprise. Most organizations address it with Kerberos, the de facto standard for authentication in distributed systems. Kerberos authenticates users and services; authorization (what an authenticated user may do) is then enforced by Hadoop itself, for example through HDFS file permissions. Using the HDFS encryption features, customers can also secure their data in motion (on the wire) as well as at rest (on disk).
Kerberos
Kerberos is a widely used network authentication protocol for distributed computing environments, originally developed at MIT. Its main component is the Key Distribution Center (KDC), which consists of an Authentication Server and a Ticket Granting Server. See the resources at the bottom of this document for more detail.
Why Kerberos?
- Built on strong symmetric key cryptography.
- Neither stores passwords locally nor transmits them over the network; it uses tickets instead.
- Uses a trusted third party (the KDC) to drive authentication.
- Lightweight. Tickets are valid until their expiry time, so interaction with the KDC is minimal.
- Session oriented.
- The KDC can expire tickets, which simplifies administration.
- It is a widely implemented single sign-on mechanism.
- Widely used in the Hadoop world where it secures communication between service entities.
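To make the ticket idea in the list above concrete, here is a toy model in Python. It is only a sketch under loud assumptions: `ToyKDC` and `service_accepts` are hypothetical names, and the ticket is signed with an HMAC rather than encrypted as real Kerberos does. It only illustrates three of the points above: the service trusts a third party, the user's password never reaches the service, and a ticket stops working once it expires.

```python
import hashlib
import hmac
import json
import os
import time


class ToyKDC:
    """Toy stand-in for a KDC. Real Kerberos encrypts tickets; this only signs them."""

    def __init__(self):
        # service name -> secret key known only to the KDC and that service
        self._service_keys = {}

    def register_service(self, service):
        key = os.urandom(32)
        self._service_keys[service] = key
        return key

    def issue_ticket(self, client, service, lifetime_s, now=None):
        now = time.time() if now is None else now
        body = json.dumps(
            {"client": client, "service": service, "expires": now + lifetime_s},
            sort_keys=True,
        )
        tag = hmac.new(
            self._service_keys[service], body.encode(), hashlib.sha256
        ).hexdigest()
        return {"body": body, "tag": tag}


def service_accepts(ticket, service_key, now=None):
    """The service checks the ticket locally, without contacting the KDC."""
    now = time.time() if now is None else now
    expected = hmac.new(
        service_key, ticket["body"].encode(), hashlib.sha256
    ).hexdigest()
    if not hmac.compare_digest(expected, ticket["tag"]):
        return False  # forged or tampered ticket
    return json.loads(ticket["body"])["expires"] > now
```

Because the service validates the ticket with its own key, the client only needs to talk to the KDC again when the ticket has expired.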
How to use Kerberos and User Impersonation with SnapLogic
SnapLogic supports Kerberos and User Impersonation out-of-the-box (with just a couple of changes to your cluster and SnapLogic config files).
We assume that you are familiar with basic SnapLogic terminology and that Kerberos is already set up in your Hadoop cluster before proceeding further.
- Create a principal in the KDC and create a keytab file corresponding to this principal. Let’s say the principal name created is “snaplogic” and the keytab corresponding to this principal is “snaplogic.keytab”.
- Copy this keytab to a well-known location on all nodes in the cluster where JCC nodes will be running.
- Create users on all nodes in the cluster where pipelines will be running. You can also configure LDAP to achieve this.
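The first step above might look like the following on the KDC host. This is a sketch that assumes an MIT Kerberos KDC where `kadmin.local` is available (the post does not say which KDC implementation is in use); `keytab_commands` is a hypothetical helper that only builds the command lines and executes nothing.

```python
import shlex


def keytab_commands(principal="snaplogic", keytab="snaplogic.keytab"):
    """Return the kadmin.local invocations as argv lists (nothing is executed)."""
    return [
        # Create the principal with a random key instead of a password.
        ["kadmin.local", "-q", f"addprinc -randkey {principal}"],
        # Export the principal's key into the keytab file.
        ["kadmin.local", "-q", f"ktadd -k {keytab} {principal}"],
    ]


if __name__ == "__main__":
    for argv in keytab_commands():
        print(shlex.join(argv))
```

Review the printed commands before running them by hand as an administrator on the KDC.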
Changes to Cluster for enabling User Impersonation
This feature allows the logged-in user to run pipelines (all types) in a Hadoop cluster as a pre-configured proxy user.
The following must be added to the cluster’s core-site.xml safety valve (replace snaplogic with your principal name):
<property>
<name>hadoop.proxyuser.snaplogic.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.snaplogic.groups</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.snaplogic.users</name>
<value>*</value>
</property>
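Since the three properties differ only in their suffix, they can be generated for any principal name. The sketch below uses a hypothetical helper, `proxyuser_properties`, and keeps the wildcard `*` from the snippet above as the default. Hadoop also accepts comma-separated lists here, so the values can be narrowed to specific hosts, groups, or users.

```python
import xml.etree.ElementTree as ET


def proxyuser_properties(principal, hosts="*", groups="*", users="*"):
    """Build the hadoop.proxyuser.* property elements for one principal."""
    values = {"hosts": hosts, "groups": groups, "users": users}
    props = []
    for suffix, value in values.items():
        prop = ET.Element("property")
        ET.SubElement(prop, "name").text = f"hadoop.proxyuser.{principal}.{suffix}"
        ET.SubElement(prop, "value").text = value
        props.append(ET.tostring(prop, encoding="unicode"))
    return "\n".join(props)


if __name__ == "__main__":
    print(proxyuser_properties("snaplogic"))
```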
Changes to SnapLogic config files (Kerberos + User Impersonation)
The following should be added to the plex.properties file. The default value of jcc.proxy_user_enabled is false.
jcc.proxy_user_enabled=true
snapreduce.keytab=/snaplogic.keytab
snapreduce.principal=snaplogic/[email protected]
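A quick check can catch a missing setting before the Hadooplex is started. This is a minimal sketch, not a SnapLogic tool: `parse_properties` and `check_plex_properties` are hypothetical helpers, and the parser only handles plain `key=value` lines and `#` comments.

```python
REQUIRED = ("jcc.proxy_user_enabled", "snapreduce.keytab", "snapreduce.principal")


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def check_plex_properties(text):
    """Return a list of problems; an empty list means the three settings look right."""
    props = parse_properties(text)
    problems = [f"missing {key}" for key in REQUIRED if key not in props]
    # The flag defaults to false, so it must be set to true explicitly.
    if props.get("jcc.proxy_user_enabled", "false").lower() != "true":
        problems.append("jcc.proxy_user_enabled is not true")
    return problems
```

For example, `check_plex_properties(open("plex.properties").read())` returns an empty list when all three settings are present and impersonation is enabled.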
Starting
- Authenticate with the KDC using kinit with your SnapLogic principal and keytab:
# kinit snaplogic/[email protected] -k -t /snaplogic.keytab
- Start the Hadooplex:
# yarn jar yplex-4.0-snapreduce.jar -war jcc-4.0-mrc236-snapreduce.war -driver driver-mrc236.jar -master_conf master.properties -plex_conf plex.properties -keys keys.properties
- Start running the pipeline via Designer. The logged-in user will only be able to access the files they are authorized to access, and any new files created will be owned by that user.
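When the startup steps are scripted, it can help to confirm that a valid ticket exists before launching the Hadooplex. The sketch below assumes the MIT Kerberos client tools `kinit` and `klist` are on the PATH; `kinit_argv` and `has_valid_ticket` are hypothetical helper names, and the principal and keytab path are whatever you configured earlier.

```python
import subprocess


def kinit_argv(principal, keytab):
    """Build the keytab-based kinit command line (not executed here)."""
    return ["kinit", "-k", "-t", keytab, principal]


def has_valid_ticket():
    """True if `klist -s` exits with status 0, meaning a valid ticket cache exists."""
    try:
        return subprocess.run(["klist", "-s"]).returncode == 0
    except FileNotFoundError:
        return False  # Kerberos client tools are not installed


def ensure_ticket(principal, keytab):
    """Obtain a ticket from the keytab only if none is currently valid."""
    if not has_valid_ticket():
        subprocess.run(kinit_argv(principal, keytab), check=True)
```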
Resources:
Kerberos authentication system:
https://www.youtube.com/watch?v=KD2Q-2ToloE
http://www.roguelynn.com/words/explain-like-im-5-kerberos/
http://web.mit.edu/kerberos/krb5-latest/doc/
Kerberos and user impersonation configuration for SnapLogic:
Please visit doc.snaplogic.com for the latest documentation on the topic.