Organizations today build data lakes to process, manage, and store large amounts of data that originate from different sources, both on-premises and in the cloud. As part of their data lake strategy, they want to use leading OSS frameworks such as Apache Spark for data processing and Presto as a query engine, along with open storage formats such as Delta Lake, for the flexibility to run anywhere and to avoid lock-in.
Traditionally, some of the major challenges with building and deploying such an architecture were:
- Object storage was not well suited to handling mutating data, and engineering teams spent a lot of time building workarounds for this
- Google Cloud provided the benefit of running Spark, Presto, and other kinds of clusters with the Dataproc service, but such deployments lacked a central Hive Metastore service for sharing metadata across multiple clusters
- Lack of integration and interoperability across different Open Source projects
To solve these problems, Google Cloud and the open source community now offer:
- Native Delta Lake support in Dataproc, a managed OSS big data stack, for building a data lake on Google Cloud Storage (object storage) that can handle mutations
- A managed Hive Metastore service called Dataproc Metastore which is natively integrated with Dataproc for common metadata management and discovery across different types of Dataproc clusters
- Support in Spark 3.0 and Delta 0.7.0 for registering Delta tables with the Hive Metastore, which provides a common metastore repository that different clusters can access
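
As a rough illustration of the last point, the sketch below shows one way a Delta table stored in Google Cloud Storage might be registered with the Hive Metastore from PySpark. It assumes Spark 3.0 with Delta 0.7.0 on the classpath and a cluster already pointed at a Hive Metastore; the table name, bucket path, and helper function names are hypothetical, and the location is assumed to already contain Delta data. Treat it as a starting point rather than the definitive setup.

```python
def delta_session_conf():
    """Spark settings that Delta Lake 0.7.0 documents for enabling its SQL support."""
    return {
        "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
        "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    }


def create_delta_table_sql(table, location):
    """Build the DDL that registers an existing Delta location under a table name."""
    return f"CREATE TABLE IF NOT EXISTS {table} USING DELTA LOCATION '{location}'"


def register_table(table, location):
    """Register the table in the metastore. Needs pyspark and Delta on the cluster."""
    # Imported lazily so the helpers above can be used without pyspark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName("register-delta-table").enableHiveSupport()
    for key, value in delta_session_conf().items():
        builder = builder.config(key, value)
    spark = builder.getOrCreate()
    spark.sql(create_delta_table_sql(table, location))
    return spark


if __name__ == "__main__":
    # Hypothetical table name and bucket path, for illustration only.
    print(create_delta_table_sql("events", "gs://my-bucket/delta/events"))
```

Once `register_table` has run on one cluster, other clusters that share the same metastore should be able to refer to the table by name instead of by its storage path.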
Here’s what a standard Open Cloud Datalake deployment on GCP might consist of:
- Apache Spark running on Dataproc with native Delta Lake support
- Google Cloud Storage as the central data lake repository which stores data in Delta format
- Dataproc Metastore service acting as the central catalog that can be integrated with different Dataproc clusters
- Presto running on Dataproc for interactive queries
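
The deployment above could be provisioned along the following lines. This sketch only assembles `gcloud` command lines as argument lists: one Dataproc Metastore service, then a Spark cluster and a Presto cluster attached to it. The project, region, service, and cluster names are hypothetical, and the flags shown (`--location`, `--dataproc-metastore`, `--optional-components`) should be checked against the current `gcloud` reference before use.

```python
import shlex


def metastore_resource(project, region, service):
    """Full resource name of a Dataproc Metastore service."""
    return f"projects/{project}/locations/{region}/services/{service}"


def create_metastore_cmd(region, service):
    """Command that creates the shared Dataproc Metastore service."""
    return ["gcloud", "metastore", "services", "create", service,
            f"--location={region}"]


def create_cluster_cmd(name, project, region, service, optional_components=()):
    """Command that creates a Dataproc cluster attached to the shared metastore."""
    cmd = ["gcloud", "dataproc", "clusters", "create", name,
           f"--region={region}",
           f"--dataproc-metastore={metastore_resource(project, region, service)}"]
    if optional_components:
        cmd.append("--optional-components=" + ",".join(optional_components))
    return cmd


def deployment_commands(project, region, service):
    """Metastore first, then one Spark cluster and one Presto cluster that share it."""
    return [
        create_metastore_cmd(region, service),
        create_cluster_cmd("spark-cluster", project, region, service),
        create_cluster_cmd("presto-cluster", project, region, service,
                           optional_components=("PRESTO",)),
    ]


if __name__ == "__main__":
    # Hypothetical identifiers; print the commands instead of running them.
    for cmd in deployment_commands("my-project", "us-central1", "lake-metastore"):
        print(shlex.join(cmd))
```

Because both clusters point at the same metastore resource, a table registered from the Spark cluster is visible to the Presto cluster's catalog without copying metadata between them.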