Healthcare Data Warehouse CDC Pipeline
Overview
I built a streaming CDC pipeline for healthcare data warehousing using Kafka, Debezium, and Flink, with PostgreSQL as the warehouse sink and Grafana/Loki for operational visibility.
Problem
- Clinical source systems and analytics storage needed near real-time synchronization.
- The warehouse model required consistent dimension/fact loading for healthcare workflows.
- Batch-style SQL sync was too rigid for frequent source updates.
- The team needed transparent monitoring for stream job behavior and failures.
Solution
- I set up a Debezium PostgreSQL connector to capture source table changes into Kafka topics.
- I implemented Flink SQL jobs that consume Debezium JSON events and write transformed data into warehouse tables through JDBC sinks.
- I structured the target warehouse schema with star-like modeling for analytics (15 dimension tables and 3 fact tables).
- I split infrastructure into Docker Compose stacks for streaming components and processing/observability services.
- I integrated Loki log ingestion from Flink via Logback and connected Grafana for log exploration.
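The Debezium connector registration above can be sketched as a Kafka Connect config. Hostnames, credentials, and table names here are illustrative, not taken from the project:

```json
{
  "name": "clinical-source-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "plugin.name": "pgoutput",
    "database.hostname": "source-postgres",
    "database.port": "5432",
    "database.user": "debezium",
    "database.password": "********",
    "database.dbname": "clinical",
    "topic.prefix": "cdc",
    "table.include.list": "public.visit,public.payment,public.prescription"
  }
}
```

A config like this is typically registered by POSTing it to the Kafka Connect REST API (port 8083 by default), after which Debezium streams each captured table into its own Kafka topic.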
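A minimal sketch of one such Flink SQL job, assuming a `visit` domain with a `code_class` column to filter on (all table and column names are illustrative):

```sql
-- Source: CDC events from Kafka in the Debezium JSON envelope
CREATE TABLE visit_src (
  visit_id   BIGINT,
  patient_id BIGINT,
  code_class STRING,
  visit_ts   TIMESTAMP(3)
) WITH (
  'connector' = 'kafka',
  'topic' = 'cdc.public.visit',
  'properties.bootstrap.servers' = 'kafka:9092',
  'scan.startup.mode' = 'earliest-offset',
  'format' = 'debezium-json'
);

-- Sink: warehouse fact table via the JDBC connector (upserts keyed on visit_id)
CREATE TABLE fact_visit (
  visit_id   BIGINT,
  patient_id BIGINT,
  code_class STRING,
  visit_ts   TIMESTAMP(3),
  PRIMARY KEY (visit_id) NOT ENFORCED
) WITH (
  'connector' = 'jdbc',
  'url' = 'jdbc:postgresql://warehouse-postgres:5432/dwh',
  'table-name' = 'fact_visit',
  'username' = 'dwh',
  'password' = '********'
);

-- Transform/filter: drop rows outside the relevant code classes before the sink write
INSERT INTO fact_visit
SELECT visit_id, patient_id, code_class, visit_ts
FROM visit_src
WHERE code_class IN ('AMB', 'IMP');
```

The `debezium-json` format interprets inserts, updates, and deletes from the CDC envelope, and the JDBC sink's primary key turns the changelog into upserts against the warehouse table.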
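The streaming stack split can be sketched in Compose form; image tags, ports, and topic names below are assumptions, not the project's actual files:

```yaml
# streaming stack (sketch): Kafka plus Debezium Connect
services:
  kafka:
    image: bitnami/kafka:3.6
    ports: ["9092:9092"]
  connect:
    image: debezium/connect:2.5
    ports: ["8083:8083"]
    environment:
      BOOTSTRAP_SERVERS: kafka:9092
      GROUP_ID: "1"
      CONFIG_STORAGE_TOPIC: connect_configs
      OFFSET_STORAGE_TOPIC: connect_offsets
      STATUS_STORAGE_TOPIC: connect_statuses
    depends_on: [kafka]
```

A second stack would hold Flink, the warehouse Postgres, Grafana, and Loki, joined to the streaming stack over a shared Docker network.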
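The Logback-to-Loki wiring can be sketched with the loki4j appender; the URL, labels, and patterns here are illustrative and assume the loki4j library is on Flink's classpath:

```xml
<configuration>
  <!-- Ships Flink log events to Loki's push API (URL and labels are assumptions) -->
  <appender name="LOKI" class="com.github.loki4j.logback.Loki4jAppender">
    <http>
      <url>http://loki:3100/loki/api/v1/push</url>
    </http>
    <format>
      <label>
        <pattern>app=flink,host=${HOSTNAME}</pattern>
      </label>
      <message>
        <pattern>%level %logger{20} %msg</pattern>
      </message>
    </format>
  </appender>
  <root level="INFO">
    <appender-ref ref="LOKI"/>
  </root>
</configuration>
```

With labels like `app=flink` attached at ingestion, Grafana's Explore view can then filter Flink job and task manager logs directly in Loki.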
Results
- A working end-to-end CDC flow from the Postgres source through Kafka and Flink to the warehouse Postgres sink.
- Streaming enrichment/filtering implemented in Flink SQL for domain-specific mappings (for example, code class filtering before sink writes).
- Reproducible local environment for Kafka, Debezium Connect, Flink, Grafana, and Loki.
- Analytics-ready warehouse schema with dimension/fact structure for visit, payment, and prescription domains.
Stack
Kafka, Debezium, Apache Flink (SQL), PostgreSQL, Docker Compose, Grafana, Loki
Architecture
PostgreSQL (source)
-> Debezium PostgreSQL Connector
-> Kafka topics (CDC events)
-> Flink SQL jobs (transform/filter)
-> PostgreSQL (data warehouse schema)
Grafana/Loki for logs and observability (attached to Flink, alongside the pipeline)