Healthcare Data Warehouse CDC Pipeline

Overview

I built a streaming change data capture (CDC) pipeline for healthcare data warehousing using Kafka, Debezium, and Flink, with PostgreSQL as the warehouse sink and Grafana/Loki for operational visibility.

Problem

  • Clinical source systems and analytics storage needed near real-time synchronization.
  • The warehouse model required consistent dimension/fact loading for healthcare workflows.
  • Batch-style SQL sync was too rigid for frequent source updates.
  • The team needed transparent monitoring for stream job behavior and failures.

Solution

  • I set up a Debezium PostgreSQL connector to capture source table changes into Kafka topics.
  • I implemented Flink SQL jobs that consume Debezium JSON events and write transformed data into warehouse tables through JDBC sinks.
  • I structured the target warehouse schema as a star-schema model for analytics (15 dimension tables and 3 fact tables).
  • I split infrastructure into Docker Compose stacks for streaming components and processing/observability services.
  • I integrated Loki log ingestion from Flink via Logback and connected Grafana for log exploration.
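As a sketch of the Debezium setup above, a minimal PostgreSQL connector registration (posted to the Kafka Connect REST API) might look like the following. The hostnames, credentials, table names, and slot name are illustrative placeholders, not the project's actual values.

```json
{
  "name": "healthcare-source-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "source-postgres",
    "database.port": "5432",
    "database.user": "debezium",
    "database.password": "********",
    "database.dbname": "clinical",
    "topic.prefix": "cdc",
    "plugin.name": "pgoutput",
    "table.include.list": "public.visit,public.payment,public.prescription",
    "slot.name": "healthcare_cdc_slot"
  }
}
```

With `topic.prefix` set to `cdc`, Debezium emits change events to topics such as `cdc.public.visit`, one per captured table; `pgoutput` is the built-in logical decoding plugin, so no server-side extension install is needed.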
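The Flink SQL jobs above can be sketched roughly as follows, assuming a hypothetical visit topic and column set; the code-class filter values stand in for the domain-specific mapping and are not the project's real codes.

```sql
-- Source: Kafka topic carrying Debezium-formatted change events
CREATE TABLE visit_cdc (
  visit_id   BIGINT,
  patient_id BIGINT,
  code_class STRING,
  visit_ts   TIMESTAMP(3)
) WITH (
  'connector' = 'kafka',
  'topic' = 'cdc.public.visit',
  'properties.bootstrap.servers' = 'kafka:9092',
  'properties.group.id' = 'flink-warehouse-loader',
  'scan.startup.mode' = 'earliest-offset',
  'format' = 'debezium-json'
);

-- Sink: warehouse fact table over JDBC (upsert via the primary key)
CREATE TABLE fact_visit (
  visit_id   BIGINT,
  patient_id BIGINT,
  code_class STRING,
  visit_ts   TIMESTAMP(3),
  PRIMARY KEY (visit_id) NOT ENFORCED
) WITH (
  'connector' = 'jdbc',
  'url' = 'jdbc:postgresql://warehouse-postgres:5432/dwh',
  'table-name' = 'fact_visit',
  'username' = 'dwh',
  'password' = '********'
);

-- Filter on code class before the sink write
INSERT INTO fact_visit
SELECT visit_id, patient_id, code_class, visit_ts
FROM visit_cdc
WHERE code_class IN ('A', 'B');
```

The `debezium-json` format lets Flink interpret inserts, updates, and deletes as a changelog stream, so the JDBC sink keeps the warehouse table in step with the source rather than append-only.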
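For the Loki integration, one common approach is the loki4j Logback appender; the snippet below is a sketch under that assumption (the project may use a different appender), with the Loki URL, labels, and log pattern as illustrative values.

```xml
<configuration>
  <appender name="LOKI" class="com.github.loki4j.logback.Loki4jAppender">
    <http>
      <url>http://loki:3100/loki/api/v1/push</url>
    </http>
    <format>
      <label>
        <!-- labels become Loki stream selectors in Grafana -->
        <pattern>app=flink,job=cdc-warehouse</pattern>
      </label>
      <message>
        <pattern>%d{HH:mm:ss.SSS} %-5level %logger{20} - %msg%n</pattern>
      </message>
    </format>
  </appender>
  <root level="INFO">
    <appender-ref ref="LOKI"/>
  </root>
</configuration>
```

Dropping a configuration like this into the Flink containers routes JobManager/TaskManager logs to Loki, where Grafana's Explore view can query them by label.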

Results

  • A working end-to-end CDC flow from the Postgres source through Kafka and Flink into the warehouse Postgres sink.
  • Streaming enrichment/filtering implemented in Flink SQL for domain-specific mappings (for example, code class filtering before sink writes).
  • Reproducible local environment for Kafka, Debezium Connect, Flink, Grafana, and Loki.
  • Analytics-ready warehouse schema with dimension/fact structure for visit, payment, and prescription domains.
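As an illustration of how the dimension/fact structure serves analytics (table and column names here are hypothetical, not the actual schema), a warehouse query might look like:

```sql
-- Hypothetical star-schema query: monthly visit counts per department
SELECT d.department_name,
       DATE_TRUNC('month', f.visit_ts) AS month,
       COUNT(*) AS visit_count
FROM fact_visit f
JOIN dim_department d ON d.department_id = f.department_id
GROUP BY d.department_name, DATE_TRUNC('month', f.visit_ts)
ORDER BY month, visit_count DESC;
```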

Stack

Kafka, Debezium, Apache Flink (SQL), PostgreSQL, Docker Compose, Grafana, Loki

Architecture

PostgreSQL (source)
-> Debezium PostgreSQL Connector
-> Kafka topics (CDC events)
-> Flink SQL jobs (transform/filter)
-> PostgreSQL (data warehouse schema)
-> Grafana/Loki for logs and observability
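A trimmed Docker Compose sketch of the streaming stack might look like the following; image tags, credentials, and network details are simplified placeholders (a real compose file needs proper Kafka listener configuration), and the Flink/Grafana/Loki services would live in the second stack described above.

```yaml
services:
  kafka:
    image: apache/kafka:3.7.0
    ports: ["9092:9092"]

  connect:
    image: debezium/connect:2.6
    environment:
      BOOTSTRAP_SERVERS: kafka:9092
      GROUP_ID: "1"
      CONFIG_STORAGE_TOPIC: connect_configs
      OFFSET_STORAGE_TOPIC: connect_offsets
      STATUS_STORAGE_TOPIC: connect_statuses
    depends_on: [kafka]
    ports: ["8083:8083"]

  source-postgres:
    image: postgres:16
    # logical WAL level is required for Debezium's pgoutput plugin
    command: ["postgres", "-c", "wal_level=logical"]
    environment:
      POSTGRES_USER: debezium
      POSTGRES_PASSWORD: debezium
      POSTGRES_DB: clinical
```

Splitting this from the processing/observability stack keeps the CDC backbone running while Flink jobs and dashboards are restarted independently.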