Article/BlogLatest Post

Data warehouse and Data Lake Modernization

92views

Data Warehouse and Data Lake Modernization: From Legacy On-Premise to Cloud-Native Infrastructure

 
 By Kai Wahner                            Published: 7 Oct 2022

The concepts and architectures of a data warehouse, data lake, and data streaming are complementary to solving business problems. Storing data at rest for reporting and analytics requires different capabilities and SLAs than continuously processing data in motion for real-time workloads. Many open-source frameworks, commercial products, and SaaS cloud services exist. Unfortunately, the underlying technologies are often misunderstood, overused for monolithic and inflexible architectures, and pitched for wrong use cases by vendors. Let’s explore this dilemma in this article and learn how to build a modern data stack with cloud-native technologies.

Many people talk about data warehouse modernization when they move to a cloud-native data warehouse. What does data warehouse modernization mean? Why do people move away from their on-premise data warehouse? What are the benefits?

Many projects I have seen in the wild went through the following steps:

Let’s explore these steps in more detail and understand the technology and architecture options.

Many years ago, cloud computing was a game-changer for operating infrastructure. AWS innovated by providing not just EC2 virtual machines but also storage, like AWS S3 as a service.

Cloud-native data warehouse offerings are built on the same fundamental change. Cloud providers brought their analytics cloud services, such as AWS Redshift, Azure Synapse, or GCP BigQuery. Independent vendors rolled out a cloud-native data warehouse or data lake SaaS such as Snowflake, Databricks, and many more. While each solution has its trade-offs, a few general characteristics are true for most of them:

Plenty of comparisons exist in the community, plus analyst research from Gartner, Forrester, et al. Looking at vendor information and trying out the various cloud products using free credits is crucial, too.

The more real-time, the better for most business applications. Near real-time ingestion is possible with specific tools (like AWS Kinesis or Kafka) or as part of the data fabric (the streaming data hub where a tool like Kafka plays a bigger role than just data ingestion).

The more challenging part is data integration. Most data warehouse and data lake pipelines require ETL to ingest data. As the next-generation analytics platform is crucial for making the right business decisions, the data ingestion and integration platform must also be cloud-native. Tools like Kafka provide a reliable and scalable integration layer to get all required data into the data warehouse.

In a greenfield project, the project team is lucky. Data sources run in the same cloud, using open and modern APIs, and scale as well as the cloud-native data warehouse.

Unfortunately, the reality is brownfield almost always, even if all applications run in public cloud infrastructure. Therefore, the integration and replication of data from legacy and on-premise applications is a general requirement.

Data is typically consumed from legacy databases, data warehouses, applications, and other proprietary platforms. The replication into the cloud data warehouse usually needs to be near real-time and reliable. 

A data streaming platform like Kafka is perfect for replicating data across data centers, regions, and environments because of its elastic scalability and true decoupling capabilities. Kafka enables connectivity to modern and legacy systems via connectors, proprietary APIs, programming languages, or open REST interfaces:

A common scenario in such a brownfield project is the clear separation of concerns and true decoupling between legacy on-premise and modern cloud workloads. Here, Kafka is deployed on-premise to connect to legacy applications.

Tools like MirrorMaker, Replicator, or Confluent Cluster Linking replicate events in real time into the Kafka cluster in the cloud. The Kafka brokers provide access to the incoming events.

Leave a Response