Healthcare Data Warehouse CDC Pipeline

Overview

I built a streaming change data capture (CDC) pipeline for healthcare data warehousing using Kafka, Debezium, and Flink, with PostgreSQL as the warehouse sink and Grafana/Loki for operational visibility.

Problem

  • Clinical source systems and analytics storage needed near real-time synchronization.
  • The warehouse model required consistent dimension/fact loading for healthcare workflows.
  • Batch-style SQL sync was too rigid for frequent source updates.
  • The team needed transparent monitoring for stream job behavior and failures.

Solution

  • I set up a Debezium PostgreSQL connector to capture source table changes into Kafka topics.
  • I implemented Flink SQL jobs that consume Debezium JSON events and write transformed data into warehouse tables through JDBC sinks.
  • I structured the target warehouse schema as a star-schema model for analytics (15 dimension tables and 3 fact tables).
  • I split infrastructure into Docker Compose stacks for streaming components and processing/observability services.
  • I integrated Loki log ingestion from Flink via Logback and connected Grafana for log exploration.
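As a sketch of the Debezium setup above, a minimal PostgreSQL connector registration (posted to the Kafka Connect REST API) might look like the following. The hostnames, credentials, table names, and slot name are illustrative placeholders, not the project's actual values.

```json
{
  "name": "healthcare-source-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "source-postgres",
    "database.port": "5432",
    "database.user": "debezium",
    "database.password": "********",
    "database.dbname": "clinical",
    "topic.prefix": "cdc",
    "plugin.name": "pgoutput",
    "table.include.list": "public.visit,public.payment,public.prescription",
    "slot.name": "healthcare_cdc_slot"
  }
}
```

With `topic.prefix` set to `cdc`, Debezium emits change events to topics such as `cdc.public.visit`, one per captured table; `pgoutput` is the built-in logical decoding plugin, so no server-side extension install is needed.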
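The Flink SQL jobs above can be sketched roughly as follows, assuming a hypothetical visit topic and column set; the code-class filter values stand in for the domain-specific mapping and are not the project's real codes.

```sql
-- Source: Kafka topic carrying Debezium-formatted change events
CREATE TABLE visit_cdc (
  visit_id   BIGINT,
  patient_id BIGINT,
  code_class STRING,
  visit_ts   TIMESTAMP(3)
) WITH (
  'connector' = 'kafka',
  'topic' = 'cdc.public.visit',
  'properties.bootstrap.servers' = 'kafka:9092',
  'properties.group.id' = 'flink-warehouse-loader',
  'scan.startup.mode' = 'earliest-offset',
  'format' = 'debezium-json'
);

-- Sink: warehouse fact table over JDBC (upsert via the primary key)
CREATE TABLE fact_visit (
  visit_id   BIGINT,
  patient_id BIGINT,
  code_class STRING,
  visit_ts   TIMESTAMP(3),
  PRIMARY KEY (visit_id) NOT ENFORCED
) WITH (
  'connector' = 'jdbc',
  'url' = 'jdbc:postgresql://warehouse-postgres:5432/dwh',
  'table-name' = 'fact_visit',
  'username' = 'dwh',
  'password' = '********'
);

-- Filter on code class before the sink write
INSERT INTO fact_visit
SELECT visit_id, patient_id, code_class, visit_ts
FROM visit_cdc
WHERE code_class IN ('A', 'B');
```

The `debezium-json` format lets Flink interpret inserts, updates, and deletes as a changelog stream, so the JDBC sink keeps the warehouse table in step with the source rather than append-only.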
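For the Loki integration, one common approach is the loki4j Logback appender; the snippet below is a sketch under that assumption (the project may use a different appender), with the Loki URL, labels, and log pattern as illustrative values.

```xml
<configuration>
  <appender name="LOKI" class="com.github.loki4j.logback.Loki4jAppender">
    <http>
      <url>http://loki:3100/loki/api/v1/push</url>
    </http>
    <format>
      <label>
        <!-- labels become Loki stream selectors in Grafana -->
        <pattern>app=flink,job=cdc-warehouse</pattern>
      </label>
      <message>
        <pattern>%d{HH:mm:ss.SSS} %-5level %logger{20} - %msg%n</pattern>
      </message>
    </format>
  </appender>
  <root level="INFO">
    <appender-ref ref="LOKI"/>
  </root>
</configuration>
```

Dropping a configuration like this into the Flink containers routes JobManager/TaskManager logs to Loki, where Grafana's Explore view can query them by label.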

Results

  • A working end-to-end CDC flow from the Postgres source through Kafka and Flink into the warehouse Postgres sink.
  • Streaming enrichment/filtering implemented in Flink SQL for domain-specific mappings (for example, code class filtering before sink writes).
  • Reproducible local environment for Kafka, Debezium Connect, Flink, Grafana, and Loki.
  • Analytics-ready warehouse schema with dimension/fact structure for visit, payment, and prescription domains.
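As an illustration of how the dimension/fact structure serves analytics (table and column names here are hypothetical, not the actual schema), a warehouse query might look like:

```sql
-- Hypothetical star-schema query: monthly visit counts per department
SELECT d.department_name,
       DATE_TRUNC('month', f.visit_ts) AS month,
       COUNT(*) AS visit_count
FROM fact_visit f
JOIN dim_department d ON d.department_id = f.department_id
GROUP BY d.department_name, DATE_TRUNC('month', f.visit_ts)
ORDER BY month, visit_count DESC;
```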

Stack

Kafka, Debezium, Apache Flink (SQL), PostgreSQL, Docker Compose, Grafana, Loki

Architecture

PostgreSQL (source)
-> Debezium PostgreSQL Connector
-> Kafka topics (CDC events)
-> Flink SQL jobs (transform/filter)
-> PostgreSQL (data warehouse schema)
-> Grafana/Loki for logs and observability
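A trimmed Docker Compose sketch of the streaming stack might look like the following; image tags, credentials, and network details are simplified placeholders (a real compose file needs proper Kafka listener configuration), and the Flink/Grafana/Loki services would live in the second stack described above.

```yaml
services:
  kafka:
    image: apache/kafka:3.7.0
    ports: ["9092:9092"]

  connect:
    image: debezium/connect:2.6
    environment:
      BOOTSTRAP_SERVERS: kafka:9092
      GROUP_ID: "1"
      CONFIG_STORAGE_TOPIC: connect_configs
      OFFSET_STORAGE_TOPIC: connect_offsets
      STATUS_STORAGE_TOPIC: connect_statuses
    depends_on: [kafka]
    ports: ["8083:8083"]

  source-postgres:
    image: postgres:16
    # logical WAL level is required for Debezium's pgoutput plugin
    command: ["postgres", "-c", "wal_level=logical"]
    environment:
      POSTGRES_USER: debezium
      POSTGRES_PASSWORD: debezium
      POSTGRES_DB: clinical
```

Splitting this from the processing/observability stack keeps the CDC backbone running while Flink jobs and dashboards are restarted independently.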