What is Apache Kafka?

Apache Kafka is a distributed event streaming platform originally developed at LinkedIn and donated to the Apache Software Foundation as an open-source project in 2011. Kafka was designed to reliably receive, store, and forward extremely large volumes of events in real time – with minimal latency even under heavy load spikes.

Simply put, Kafka works like a highly scalable, durable message log: producers write data to topics, and consumers read that data at their own pace. The data is retained for a configurable period of time – unlike traditional message queues, which discard messages once they have been read.
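
What this looks like in code: the following minimal producer sketch in Java appends a single record to a topic. The broker address (localhost:9092), the topic name (orders), and the payload are illustrative assumptions, not part of the text above.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class OrderProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");            // assumed broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The record is appended to the "orders" topic; consumers pick it up
            // whenever they poll - Kafka does not wait for them.
            producer.send(new ProducerRecord<>("orders", "order-42", "{\"amount\": 19.99}"));
            producer.flush();
        }
    }
}
```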

How does Kafka work technically?

Kafka is based on a log-based storage model. Each topic is divided into partitions that are written to sequentially. Each message receives an offset – a sequential position number – which allows consumers to track exactly which messages they have already processed. This model not only enables high write speeds, but also the ability to reprocess data streams retroactively (replay).
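
The following consumer sketch illustrates both ideas: each record it reads exposes its partition and offset, and calling seekToBeginning rewinds the assigned partitions so the log can be replayed. Broker address, topic name, and consumer group are again assumptions.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class OrderConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");             // assumed broker address
        props.put("group.id", "order-analytics");                     // assumed consumer group
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        props.put("auto.offset.reset", "earliest");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));

            // Rewind every assigned partition to the start to replay the log.
            consumer.poll(Duration.ofMillis(100));                    // triggers partition assignment
            consumer.seekToBeginning(consumer.assignment());

            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> record : records) {
                // Partition and offset identify the exact position in the log.
                System.out.printf("partition=%d offset=%d value=%s%n",
                        record.partition(), record.offset(), record.value());
            }
        }
    }
}
```

Because offsets are tracked per consumer group, a second consumer with a different group.id would receive the complete stream again, independently of the first.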

A Kafka cluster consists of multiple brokers that share the load and together ensure fault tolerance. Since version 2.8, Kafka can be operated without an external ZooKeeper instance (KRaft mode, production-ready since version 3.3), which significantly reduces operational complexity.

Typical use cases
  • Real-time data pipelines: Kafka connects source systems (databases, APIs, sensors) with target systems such as data warehouses or analytics platforms.
  • Event-driven architecture: Microservices communicate asynchronously via Kafka topics, without direct dependencies on one another.
  • Change Data Capture (CDC): Database changes are captured as events and propagated to downstream systems.
  • Log aggregation: Logs from distributed systems are collected centrally and made available for analysis.
  • Stream processing: In combination with Kafka Streams or Apache Flink, data streams can be transformed and enriched directly within Kafka.
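
As a sketch of the last point, a small Kafka Streams topology might read a raw topic, filter out empty values, normalize the rest, and write the result to a second topic. Topic names, the application id, and the trivial transformation are purely illustrative.

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;

public class EventEnricher {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "event-enricher");    // assumed app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker

        StreamsBuilder builder = new StreamsBuilder();
        // Read raw events, drop empty payloads, normalize, and write to a new topic.
        builder.stream("raw-events", Consumed.with(Serdes.String(), Serdes.String()))
               .filter((key, value) -> value != null && !value.isBlank())
               .mapValues(value -> value.trim().toUpperCase())
               .to("clean-events", Produced.with(Serdes.String(), Serdes.String()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```
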
Advantages over classic messaging systems

Traditional message brokers such as RabbitMQ or ActiveMQ are optimized for low latency with comparatively small message volumes. Kafka, on the other hand, is designed for throughput and persistence. While classic systems delete messages after they are read, Kafka retains them until a configurable retention limit (based on time or size) is exceeded – or indefinitely, if desired. This makes Kafka the preferred tool when data needs to be processed in parallel by multiple consumers or traced historically.
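
Retention is configured per topic. The following sketch uses the AdminClient to create a hypothetical orders topic whose messages are kept for seven days; a retention.ms of -1 would keep them indefinitely. Partition count, replication factor, and broker address are example values.

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class TopicSetup {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // Hypothetical topic: 6 partitions, replication factor 3,
            // messages retained for 7 days (604,800,000 ms).
            NewTopic topic = new NewTopic("orders", 6, (short) 3)
                    .configs(Map.of("retention.ms", "604800000"));
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```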

Kafka also scales horizontally: by adding more brokers and partitions, a cluster can process millions of events per second – usually without any changes to the application code.

Kafka in conjunction with the data warehouse

Kafka is not a replacement for a data warehouse, but rather its ideal data supplier. While tools like dbt Core handle data transformation and Data Vault provides the structural modeling paradigm, Kafka ensures that raw data arrives in the warehouse quickly, completely, and in the correct order. The combination of these three technologies today forms the foundation of many modern data architectures.

When is Kafka the right choice?

Kafka is recommended wherever:

  • high data volumes occur at short intervals,
  • multiple systems need to consume the same data stream,
  • the order of events is important (see the keyed-producer sketch after this list),
  • a replay of events is needed for debugging or reprocessing,
  • loose coupling between producers and consumers is desired.
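
To illustrate the ordering point from the list above: records that share a key are assigned to the same partition, and within a partition Kafka preserves write order. The sketch below (broker, topic, and key are assumptions) therefore delivers the two events for the same customer to consumers in exactly the order they were produced.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class OrderedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Both records carry the same key, so they land in the same partition
            // and are read in exactly this order.
            producer.send(new ProducerRecord<>("orders", "customer-4711", "order-created"));
            producer.send(new ProducerRecord<>("orders", "customer-4711", "order-paid"));
        }
    }
}
```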

For smaller use cases with low data volumes, a simpler message broker may suffice. For genuine real-time data processing in an enterprise environment, however, there is hardly a way around Kafka.