How to build an open cloud datalake with Delta Lake, Presto & Dataproc Metastore

Organizations today build data lakes to process, manage, and store large amounts of data that originate from different sources, both on-premises and in the cloud. As part of their data lake strategy, they want to use leading OSS frameworks such as Apache Spark for data processing and Presto as a query engine, along with open storage formats such as Delta Lake, for the flexibility to run anywhere and to avoid lock-in.

Traditionally, some of the major challenges with building and deploying such an architecture were:

  • Object storage was not well suited to handling mutating data, and engineering teams spent a lot of time building workarounds for this
  • Google Cloud provided the benefit of running Spark, Presto and other kinds of clusters with the Dataproc service, but such deployments lacked a central Hive Metastore service that would allow metadata to be shared across multiple clusters
  • Lack of integration and interoperability across different Open Source projects

To solve these problems, Google Cloud and the open source community now offer:

  • Native Delta Lake support in Dataproc, a managed OSS big data stack, for building a data lake on Google Cloud Storage object storage that can handle mutations
  • A managed Hive Metastore service called Dataproc Metastore which is natively integrated with Dataproc for common metadata management and discovery across different types of Dataproc clusters
  • Spark 3.0 and Delta 0.7.0 now allow Delta tables to be registered with the Hive Metastore, which provides a common metastore repository that different clusters can access
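
As a rough illustration of the last point, the following Python sketch shows one way a Spark 3.0 session might be set up so that Delta tables are registered in the Hive Metastore. The two configuration keys are the ones the Delta Lake 0.7.0 documentation describes for Spark 3.0. The helper names, the database and table names, and the bucket path are hypothetical, and the sketch assumes that pyspark and the Delta Lake jars are already available on the cluster.

```python
# Sketch only: helper names, table names, and the gs:// path are hypothetical.

# Spark settings described in the Delta Lake 0.7.0 documentation for Spark 3.0,
# so that Delta tables are resolved through the session catalog (Hive Metastore).
DELTA_SPARK_CONF = {
    "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
    "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
}


def create_table_sql(database: str, table: str, location: str) -> str:
    """Build a statement that registers existing Delta files as a metastore table."""
    return (
        f"CREATE TABLE IF NOT EXISTS {database}.{table} "
        f"USING DELTA LOCATION '{location}'"
    )


def build_session(app_name: str = "delta-metastore-sketch"):
    """Create a Hive-enabled SparkSession with the Delta settings applied."""
    # Imported lazily so the helpers above can be used without Spark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name).enableHiveSupport()
    for key, value in DELTA_SPARK_CONF.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


def register_table(spark, database: str, table: str, location: str) -> None:
    """Register the Delta table at `location` under database.table."""
    spark.sql(create_table_sql(database, table, location))
```

On a cluster attached to a shared metastore, calling `register_table(build_session(), "default", "events", "gs://example-bucket/delta/events")` would be expected to make the table visible to other clusters that use the same metastore, subject to the version requirements above.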

Architecture

Here’s what a standard Open Cloud Datalake deployment on GCP might consist of:

  1. Apache Spark running on Dataproc with native Delta Lake support
  2. Google Cloud Storage as the central data lake repository which stores data in Delta format
  3. Dataproc Metastore service acting as the central catalog that can be integrated with different Dataproc clusters
  4. Presto running on Dataproc for interactive queries
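
To make the wiring between these components more concrete, here is a minimal Python sketch that assembles the `gcloud` commands such a deployment might use: one Dataproc Metastore service, and two Dataproc clusters (one for Spark, one with the Presto optional component) that both point at it. The project, region, service, and cluster names are hypothetical placeholders. Image versions, networking, and Delta Lake setup are deliberately left out and would need to be chosen for a real deployment.

```python
import shlex

# All identifiers below are hypothetical placeholders.
PROJECT = "example-project"
REGION = "us-central1"
METASTORE = "example-metastore"


def metastore_resource(project: str, region: str, service: str) -> str:
    """Full resource name of a Dataproc Metastore service."""
    return f"projects/{project}/locations/{region}/services/{service}"


def create_metastore_cmd(service: str, region: str) -> list:
    """Command that creates the shared Dataproc Metastore service."""
    return ["gcloud", "metastore", "services", "create", service,
            f"--location={region}"]


def create_cluster_cmd(name: str, project: str, region: str, service: str,
                       optional_components: tuple = ()) -> list:
    """Command that creates a Dataproc cluster attached to the metastore."""
    cmd = ["gcloud", "dataproc", "clusters", "create", name,
           f"--region={region}",
           f"--dataproc-metastore={metastore_resource(project, region, service)}"]
    if optional_components:
        cmd.append("--optional-components=" + ",".join(optional_components))
    return cmd


def deployment_plan() -> list:
    """Commands in the order they would be run: metastore first, then clusters."""
    return [
        create_metastore_cmd(METASTORE, REGION),
        create_cluster_cmd("spark-cluster", PROJECT, REGION, METASTORE),
        create_cluster_cmd("presto-cluster", PROJECT, REGION, METASTORE,
                           optional_components=("PRESTO",)),
    ]


for command in deployment_plan():
    # Print each command rather than executing it, so the plan can be reviewed.
    print(shlex.join(command))
```

Because both clusters reference the same metastore resource name, tables registered from the Spark cluster are catalogued in one place that the Presto cluster can also read.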