From a nascent Apache project in 2006 to a commercially supported data platform backed by two public companies, Apache Hadoop has come a long way. Initially adopted by Web 2.0 companies such as Yahoo, Facebook, LinkedIn & Twitter, Hadoop-based data platforms became a major part of enterprise data infrastructure starting in 2011. Today, a majority of Fortune 500 companies have adopted Hadoop data lakes. The maturity of Hadoop-native SQL query engines (such as Hive & Impala) and the availability of Hadoop-based BI & machine learning platforms (such as Apache Spark) have made Hadoop data lakes more accessible to traditional enterprise data practitioners. In addition, the availability of hosted Hadoop services on all major public cloud providers has allowed anyone with a credit card to start working with Hadoop, reducing the skills gap.
Having had the fortune of developing Hadoop from day one while I was at Yahoo, I have witnessed the various stages of Hadoop adoption in enterprises. Most enterprises first adopt a Hadoop-based platform for its cheap scale-out storage (HDFS), as an “Active Archive” or a staging data store. The second phase of Hadoop adoption is to migrate workloads that can be easily parallelized away from expensive data warehouses; the most common example is Extract-Transform-Load (ETL) workloads. The third phase is to perform advanced analytics at scale. Based on conversations with various customers and prospects over the last three years, we believe that most enterprises are somewhere between the second and third phases.
The Hadoop data lake (either on-premises or in the cloud) is where raw data from other parts of the enterprise first lands. These raw datasets may consist of web-server & application-server logs, transaction logs from various front-end databases, periodic snapshots of dimensional datasets, and events & metrics from hardware or software sensors. IT operations (Data Ops) is responsible for ensuring that the raw data is loaded into these data lakes in a timely manner, and that the data cleansing workloads create “source of truth” data sets. This is the first stage of the ETL pipeline (sometimes called ELT, as raw data is loaded first, and then transformed).
Many transformations on these source-of-truth data sets involve complex joins across multiple datasets, imposing structure on semi-structured/unstructured data (flattening nested data), and segmentation (across geographies, demographics, products, or time intervals). After these transformations, multiple analyzable data sets are created. Next, pre-written business intelligence and reporting queries are run on these newly created datasets, and the results are uploaded to different systems for serving to business analysts.
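As a concrete illustration of one such transformation, flattening a nested, semi-structured record into an analyzable flat row might look like the following minimal pure-Python sketch (the record shape and field names are hypothetical, chosen only for illustration):

```python
def flatten(record, prefix=""):
    """Recursively flatten a nested dict into a single-level dict
    with dot-separated keys, e.g. {"user": {"id": 1}} -> {"user.id": 1}."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat

# Hypothetical web-server log event with nested structure
event = {
    "ts": "2017-06-01T12:00:00Z",
    "user": {"id": 42, "geo": {"country": "US"}},
    "page": "/home",
}

row = flatten(event)
# row now has flat keys: "ts", "user.id", "user.geo.country", "page"
```

In practice this step is done at scale by an engine such as Hive or Spark, but the logic is the same: nested fields become flat columns that BI queries can consume directly.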
For organizations that have begun to utilize advanced analytics (deep analytics, machine learning, etc.), these workloads, consisting of long-running chains of Hadoop jobs, are launched as “cron jobs”. Since most data lakes are operated & managed by IT operations, the analytics workloads are carefully vetted to make sure that they execute correctly, that the data lake has enough capacity to execute them in a timely manner, and that they do not misbehave and bring down the entire cluster or other mission-critical workloads.
Challenges & Deficiencies
In addition to the “production data workloads” that run regularly (say, every hour or every day), there are many other workloads in any organization which are ad hoc in nature. (The Merriam-Webster dictionary defines ad hoc as “for the particular end or case at hand without consideration of wider application”.) Many data exploration, visualization, data science, and machine-learning model-building workloads are ad hoc, as these workloads do not have a set frequency, nor are they carefully tuned for performance.
As Hadoop data lakes become the single source of truth for data, various teams in the organization need access to the Hadoop data lake in order to perform these ad hoc workloads. These workloads, being experimental in nature, are typically time-bound. A team may run experiments on a subset of data from the data lake for a few weeks or months, and then either throw away their work (if the experiment is unsuccessful), or productionize it (if it is successful and needed long term). The main challenge for Hadoop data lakes is the co-existence of mission-critical production data pipelines and ad hoc workloads on the same data lake.
In the early Hadoop deployments at Yahoo, we had to maintain several separate Hadoop clusters, some reserved for production workloads and some for ad hoc analytical use cases. However, keeping the data synchronized between the two remained a challenge (DistCp, a tool used to copy data between Hadoop clusters, was the biggest consumer of resources among all users). In addition, because of the burstiness of the ad hoc workloads, either query latencies were unpredictable or cluster utilization was very low. This situation persists even today in many enterprises.
Prior to the advent of Hadoop data lakes, when the data warehouse was the single source of truth for enterprises, a similar challenge was solved by creating multiple “data marts”, typically one per division within an organization, to allow ad hoc analytics on a subset of data away from the data warehouse. In this usage pattern, Ampool plays the same data-mart role, with the Hadoop data lake as the single source of truth.
The block diagram above shows the deployment & data flow when Ampool is used as a data mart for Hadoop data lakes.
- User specifies tables, partitions, ranges, and fields from the data lake catalog
- Ampool executes queries on the data lake to bulk load the requested data
- (Or) Ampool bulk loads the requested Hadoop files directly, enforcing schema on read
- Ampool presents the loaded data as partitioned tables of structured/semi-structured data
- Analysts use Hadoop-native query engines on Ampool to interactively query the data
- (Or) use Ampool connectors for Python/R to access the data
- Most tools & frameworks that integrate with Hadoop query engines can be used
- (Optionally) operational data is integrated in real time (e.g. slowly-changing dimension tables)
- Results are published & shared through the data lake
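The first step above (specifying tables, partitions, and fields against the data lake catalog) can be sketched in a few lines of Python. Everything here is a hypothetical, in-memory stand-in: the catalog contents, table name, and `select_for_load` helper are illustrative, not Ampool's actual API:

```python
# Hypothetical in-memory stand-in for a data-lake catalog entry
catalog = {
    "clickstream": {
        "partitions": ["dt=2017-05-30", "dt=2017-05-31", "dt=2017-06-01"],
        "fields": ["ts", "user_id", "url", "referrer", "geo"],
    }
}

def select_for_load(catalog, table, partitions, fields):
    """Validate a load request against the catalog and return a load spec
    describing what the data mart should bulk load from the data lake."""
    entry = catalog[table]
    missing = [p for p in partitions if p not in entry["partitions"]]
    if missing:
        raise ValueError(f"unknown partitions: {missing}")
    unknown = [f for f in fields if f not in entry["fields"]]
    if unknown:
        raise ValueError(f"unknown fields: {unknown}")
    return {"table": table, "partitions": partitions, "fields": fields}

# Load one day's partition, projecting only the fields the analyst needs
spec = select_for_load(catalog, "clickstream",
                       ["dt=2017-06-01"], ["ts", "user_id", "url"])
```

The point of the validation step is that only catalogued partitions and fields are pulled into the data mart, keeping the loaded working set small and well-defined.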
Several features in Ampool enable this usage pattern.
- Native integration with Hadoop-native query engines to perform parallel loads from data lakes (Hive, Spark etc.)
- Integration with Hadoop authentication & authorization (Kerberos, LDAP, Apache Sentry)
- High speed data-local connectivity for Hadoop-native query engines (Hive, Spark, Presto, EsgynDB)
- Both row-oriented & column-oriented memory & disk layouts to efficiently execute queries
- Support for Hadoop-native file formats (ORC & Parquet)
- Polyglot APIs (Java, REST, Python, R)
- Native integration with Apache Kafka & Apache Apex to rapidly ingest operational data from other data sources
- Linear scalability & automatic load balancing at high performance
- Smart bidirectional tiering to local disks, extending memory beyond working set
- Hadoop-native deployment, management & monitoring (e.g. Cloudera Manager, Apache Ambari)
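The list above mentions both row-oriented and column-oriented layouts; the trade-off is easy to see in miniature. A row layout is efficient for fetching or updating whole records, while a column layout makes scans and aggregations over a single field cheap because the other fields are never touched. A minimal pure-Python illustration (the data is made up for the example):

```python
# Row-oriented: each record is stored together
rows = [
    {"user_id": 1, "country": "US", "spend": 10.0},
    {"user_id": 2, "country": "DE", "spend": 25.5},
    {"user_id": 3, "country": "US", "spend": 7.25},
]

# Column-oriented: one contiguous sequence per field
columns = {k: [r[k] for r in rows] for k in rows[0]}

# Row layout: a point lookup touches a single record
record = rows[1]

# Column layout: aggregating one field scans a single list,
# without reading user_id or country at all
total_spend = sum(columns["spend"])
```

Real engines amplify this difference with compression and vectorized execution on columnar data (this is exactly why the ORC & Parquet formats mentioned above are columnar), while row layouts remain the natural fit for record-at-a-time updates.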
In addition, Ampool can be deployed on demand using Docker containers, and can be orchestrated with popular engines such as Kubernetes. (Deployment with Mesosphere DC/OS is on the roadmap.) Since the data stored in Ampool is dynamically updatable, as the underlying data in the Hadoop data lake changes, those updates can be reflected in place on the Ampool data mart.
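A Kubernetes deployment of this kind might look roughly like the following sketch. The image name, port, and replica count here are illustrative assumptions, not Ampool's actual published artifacts:

```yaml
# Illustrative Kubernetes Deployment for Ampool server pods;
# image name and container port are hypothetical
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ampool-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: ampool-server
  template:
    metadata:
      labels:
        app: ampool-server
    spec:
      containers:
      - name: ampool-server
        image: ampool/ampool-server:latest   # hypothetical image
        ports:
        - containerPort: 10334               # hypothetical service port
```

Scaling the data mart up or down then becomes a matter of changing `replicas`, with Kubernetes handling scheduling and restarts.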
In the near future, we will be simplifying how the Ampool data mart is populated even further. Our goal is to make it as simple as working with a source code repository (e.g. git checkout, git pull, and git push).
When to consider Ampool as a high-speed data mart for your Hadoop data lake?
You should consider Ampool as an augmentation to your Hadoop data lake, if:
- Your users are demanding access to the production Hadoop data lake for performing ad hoc analytics.
- Your users are already familiar with one of the Hadoop-native query engines or compute frameworks, and do not want to learn new query languages just to perform ad hoc analytics.
- These ad hoc analytics workloads, if successful, may need to be deployed some day on production data lakes, using the same Hadoop-native compute frameworks.
- Your business analysts are impatient, and won’t wait months for IT to carve out separate Hadoop clusters and make data available to them.
- Your users demand predictable latencies for their interactive workloads, and you do not want to spend excessive budget by over-provisioning your data lake.
- You want your ad hoc analytics users to follow the same authentication, authorization & data governance policies as your production data lake.
If you are interested in exploring Ampool for this use case, write to us to schedule a demo.