From a nascent Apache project in 2006 to a commercially supported data platform backed by two public companies, Apache Hadoop has come a long way. Initially adopted by Web 2.0 companies such as Yahoo, Facebook, LinkedIn & Twitter, Hadoop-based data platforms began becoming a major part of enterprise data infrastructure around 2011. Today, a majority of Fortune 500 companies have adopted Hadoop data lakes. The maturity of Hadoop-native SQL query engines (such as Hive & Impala) and the availability of Hadoop-based BI and machine learning platforms (such as Apache Spark) have made Hadoop data lakes more accessible to traditional enterprise data practitioners. In addition, the availability of hosted Hadoop services on all major public cloud providers has allowed anyone with a credit card to start working with Hadoop, reducing the skills gap.
Having had the fortune of developing Hadoop from day one while I was at Yahoo, I have witnessed various stages of Hadoop adoption in enterprises. Most enterprises first adopt a Hadoop-based platform for its cheap scale-out storage (HDFS), as an “active archive” or a staging data store. The second phase of Hadoop adoption is to migrate workloads that can be easily parallelized away from expensive data warehouses; the most common example is Extract-Transform-Load (ETL) workloads. The third phase is to perform advanced analytics at scale. Based on conversations with various customers and prospects over the last three years, we believe that most enterprises are somewhere between the second and third phases.
A Hadoop data lake (either on-premises or in the cloud) is where raw data from other parts of the enterprise first lands. These raw datasets may consist of web-server & application-server logs, transaction logs from various front-end databases, periodic snapshots of dimensional datasets, and events & metrics from hardware or software sensors. IT operations (DataOps) teams are responsible for ensuring that the raw data is loaded into these data lakes in a timely manner, and that data-cleansing workloads create “source of truth” datasets. This is the first stage of the ETL pipeline (sometimes called ELT, as raw data is loaded first and then transformed).
Many transformations on these source-of-truth datasets involve complex joins across multiple datasets, imposing structure on semi-structured or unstructured data (flattening nested data), and segmentation (across geographies, demographics, products, or time intervals). After these transformations, multiple analyzable datasets are created. Next, pre-written business intelligence and reporting queries are run on these newly created datasets, and the results are uploaded to different systems that serve them to business analysts.
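To make the two most common transformations concrete, here is a minimal, self-contained sketch in plain Python (standing in for an engine such as Hive or Spark); all dataset names and fields are hypothetical:

```python
# Raw, semi-structured clickstream events with a nested "geo" field
# (hypothetical data, as it might land in the data lake).
events = [
    {"user_id": 1, "geo": {"country": "US", "city": "NYC"}, "clicks": 3},
    {"user_id": 2, "geo": {"country": "DE", "city": "Berlin"}, "clicks": 5},
]

# Dimensional snapshot: user_id -> account segment.
users = {1: "enterprise", 2: "self-serve"}

# Step 1: impose structure on semi-structured data by flattening
# the nested "geo" field into top-level columns.
flat = [
    {"user_id": e["user_id"],
     "country": e["geo"]["country"],
     "clicks": e["clicks"]}
    for e in events
]

# Step 2: join against the dimension table to produce an
# analyzable dataset ready for BI and reporting queries.
analyzable = [{**row, "segment": users[row["user_id"]]} for row in flat]

print(analyzable[0])
# → {'user_id': 1, 'country': 'US', 'clicks': 3, 'segment': 'enterprise'}
```

In a real pipeline, each step would typically be a distributed job (e.g., a Hive or Spark SQL query) over datasets far too large for a single machine; the logic, however, is the same.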
For organizations that have begun to utilize advanced analytics (deep analytics, machine learning, etc.), these workloads, consisting of long-running chains of Hadoop jobs, are launched as “cron jobs”. Since most data lakes are operated & managed by IT operations, the analytics workloads are carefully vetted to make sure that they execute correctly, that the data lake has enough capacity to execute them in a timely manner, and that they do not bring down the entire cluster or other mission-critical workloads through misbehavior.
In addition to the “production data workloads” that run regularly (say, every hour or every day), there are many other workloads in any organization which are ad hoc in nature. (The Merriam-Webster dictionary defines ad hoc as “for the particular end or case at hand without consideration of wider application”.) Many data exploration, visualization, data science, and machine-learning model-building workloads are ad hoc, as these workloads do not have a set frequency, nor are they carefully tuned for performance.
As Hadoop data lakes become the single source of truth for data, various teams in the organization need access to the Hadoop data lake in order to perform these ad hoc workloads. These workloads, being experimental in nature, are typically time-bound. A team may run experiments on a subset of data from the data lake for a few weeks or months, and then either throw away their work (if the experiment is unsuccessful) or productionize it (if it is successful and needed long-term). The main challenge for Hadoop data lakes is the coexistence of mission-critical production data pipelines and ad hoc workloads on the same data lake.
In the early Hadoop deployments at Yahoo, we had to maintain several separate Hadoop clusters, some reserved for production workloads and some for ad hoc analytical use cases. However, keeping the data synchronized between the two remained a challenge (DistCp, a tool used to copy data between different Hadoop clusters, was the biggest consumer of resources among all users). In addition, because of the burstiness of the ad hoc workloads, either query latencies were unpredictable or cluster utilization was very low. This situation persists even today at many enterprises.
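For reference, the kind of DistCp invocation that dominated resource usage looks like the following (the cluster hostnames and paths here are hypothetical):

```shell
# Copy a dataset from the production cluster's NameNode to the ad hoc
# cluster's NameNode; -update skips files that are already up to date.
hadoop distcp -update \
  hdfs://prod-nn:8020/data/source-of-truth/events \
  hdfs://adhoc-nn:8020/data/source-of-truth/events
```

DistCp runs as a MapReduce job itself, which is why large recurring copies between clusters consume a significant share of cluster resources.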
Prior to the advent of Hadoop data lakes, when a data warehouse was the single source of truth for enterprises, a similar challenge was solved by creating multiple “data marts”, typically one per division within an organization, to allow ad hoc analytics on a subset of data away from the data warehouse. In this usage pattern, Ampool plays the same role as a data mart, with the Hadoop data lake as the single source of truth.
The block diagram above shows the deployment & data flow when Ampool is used as a data mart for Hadoop data lakes.
Several features in Ampool enable this usage pattern.
In addition, Ampool can be deployed on demand using Docker containers, and can be orchestrated with popular engines such as Kubernetes. (Deployment with Mesosphere DC/OS is on the roadmap.) Since the data stored in Ampool is dynamically updatable, as the underlying data in the Hadoop data lake changes, those updates can be reflected in place in the Ampool data mart.
In the near future, we are working on making it even simpler to populate the Ampool data mart. Our goal is to make it as easy as working with a source-code repository (e.g., git checkout, git pull, and git push).
You should consider Ampool as an augmentation to your Hadoop data lake, if:
If you are interested in exploring Ampool for this use case, write to us to schedule a demo.