Infrastructure
Kafka Streaming Platform
Apache Kafka as a partitioned, replicated commit log. Topic / partition / offset model, consumer groups, exactly-once semantics, and the surrounding ecosystem (Streams, ksqlDB, Flink, Schema Registry).
What It Is
Apache Kafka is a distributed, partitioned, replicated commit log. A topic is an append-only log split into ordered partitions, each replicated across brokers for fault tolerance. Each record in a partition has a monotonic offset; consumers track their position by committing offsets back to Kafka. Producers choose a partition by key (hash) or round-robin; the partition is the unit of order and the unit of parallelism.
Consumers join consumer groups. Kafka assigns partitions to group members so that each partition is read by exactly one consumer in the group at a time, rebalancing on membership changes. A topic with N partitions can be read in parallel by at most N consumers in one group; extra consumers sit idle. Multiple consumer groups can read the same topic independently, each with its own offset state. This is the design that lets one event log fan out to many downstream systems without re-reading from the source.
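The assignment step can be illustrated with a sketch in the spirit of Kafka's range assignor: sort members, hand each a contiguous chunk of partitions, and give the remainder to the earliest members. This is a simplification (the real assignors work per-topic and support cooperative rebalancing), but it shows why extra consumers beyond the partition count end up with nothing.

```python
def range_assign(partitions: list[int], consumers: list[str]) -> dict[str, list[int]]:
    """Sketch of range-style partition assignment: each sorted consumer
    gets a contiguous block; earlier consumers absorb the remainder."""
    consumers = sorted(consumers)
    per, extra = divmod(len(partitions), len(consumers))
    out, start = {}, 0
    for i, member in enumerate(consumers):
        size = per + (1 if i < extra else 0)   # first `extra` members get one more
        out[member] = partitions[start:start + size]
        start += size
    return out
```

With 5 partitions and 2 consumers, one member reads 3 partitions and the other 2; with 2 partitions and 3 consumers, the third member is assigned nothing.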
Kafka separates storage from compute. Brokers do not run user code. Stream processing is delegated to Kafka Streams (JVM client library, embedded in your app), ksqlDB (SQL over streams, server-hosted), Apache Flink (separate cluster, strongest event-time semantics), and Spark Structured Streaming (micro-batch). Confluent Cloud and Amazon MSK are the common managed alternatives to self-hosting.
Delivery semantics: producers default to at-least-once (retries can duplicate records). Exactly-once requires the idempotent producer plus transactional writes, supported since Kafka 0.11 (2017) for read-process-write cycles that stay within the Kafka cluster. Exactly-once into an external sink requires the sink to participate in the transaction or to be idempotent on the consumer side.
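The transactional read side can be modeled with a toy log. This is an assumption-laden simplification of how a `read_committed` consumer behaves: transactional records land in the log immediately, but they only become visible to the consumer once their transaction commits, and aborted records are never surfaced.

```python
class TxnLog:
    """Toy model of Kafka transactional visibility: writes append
    immediately, but a read_committed reader only sees records whose
    transaction has committed. (Real brokers track this with control
    records and the last-stable-offset; this sketch just filters.)"""
    def __init__(self):
        self._log = []             # (txn_id, value), in append order
        self._committed = set()
    def send(self, txn_id: str, value):
        self._log.append((txn_id, value))
    def commit(self, txn_id: str):
        self._committed.add(txn_id)
    def read_committed(self):
        return [v for t, v in self._log if t in self._committed]
```

A transaction that is never committed (crashed producer, explicit abort) simply leaves its records invisible, which is why duplicates from producer retries do not reach a `read_committed` consumer.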
When You'd Use It
Kafka is the default for any system that needs durable, replayable event delivery between services at scale. ML use cases include: streaming features into an online feature store, ingesting click and impression logs for online learning, fanning model-prediction events to logging, monitoring, and feedback systems, and decoupling the model server from upstream producers. The replay property matters: a downstream consumer that breaks for an hour can rewind its offset and reprocess, instead of losing data.
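The replay property follows directly from the log abstraction: the broker never mutates delivered records, so recovery is just re-reading from an earlier offset. A minimal sketch of a single partition makes this concrete.

```python
class PartitionLog:
    """Minimal append-only partition. Replay is re-reading from an
    earlier offset; nothing is mutated or deleted on delivery."""
    def __init__(self):
        self.records = []
    def append(self, value) -> int:
        self.records.append(value)
        return len(self.records) - 1      # the record's offset
    def read_from(self, offset: int):
        return self.records[offset:]      # rewinding = smaller offset
```

A consumer that crashed an hour ago re-reads from its last committed offset and sees exactly the records it missed; a consumer that shipped a buggy transform can seek back to offset 0 and rebuild its state from scratch.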
The Schema Registry pattern (Confluent's open-source registry, or Apicurio) stores Avro / Protobuf / JSON Schema definitions out-of-band so producers and consumers can evolve schemas with compatibility checks, instead of embedding schemas in every message. This is essentially mandatory for a multi-team Kafka deployment; without it, a producer schema change silently breaks consumers.
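The compatibility check a registry performs can be sketched for one common mode. The function below is a simplified model of a BACKWARD check (a reader on the new schema must cope with data written under the old one), using plain dicts rather than real Avro/Protobuf resolution rules; the field-spec shape is an illustrative assumption.

```python
def backward_compatible(old_fields: dict, new_fields: dict) -> bool:
    """Sketch of a BACKWARD compatibility check: any field added in the
    new schema needs a default, or old records can't be decoded by a
    reader using the new schema. (Real registries apply the full
    Avro/Protobuf schema-resolution rules, not just this.)"""
    for name, spec in new_fields.items():
        if name not in old_fields and "default" not in spec:
            return False
    return True
```

Adding `plan` with a default of `"free"` passes; adding it as a required field fails, and the registry would reject the producer's schema registration before any consumer breaks.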
Notable Gotchas
Exactly-once is exactly-once within Kafka, not end-to-end
Kafka's exactly-once guarantees cover producer-to-broker and consumer offset commits inside Kafka transactions. If your consumer reads from Kafka and writes to S3, Postgres, or a feature store, exactly-once requires that external sink to either be idempotent on a deterministic key or to participate in a two-phase commit (Flink's two-phase commit sink, Kafka Connect's exactly-once source connectors). Most teams settle for at-least-once plus idempotent downstream writes.
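The "at-least-once plus idempotent downstream writes" pattern is small enough to sketch. The deterministic key is the record's coordinates, `(topic, partition, offset)`, so a redelivered record overwrites itself instead of duplicating; the dict stands in for any sink that supports keyed upserts.

```python
def idempotent_upsert(store: dict, topic: str, partition: int, offset: int, value) -> bool:
    """At-least-once delivery made effectively exactly-once at the sink:
    key the write on (topic, partition, offset) so a redelivery is a
    no-op overwrite, not a duplicate row. Returns True on first write."""
    key = (topic, partition, offset)
    first = key not in store
    store[key] = value
    return first
```

This works for any sink with keyed upserts (Postgres `ON CONFLICT`, a feature store's put-by-key); it does not work for append-only sinks, which is where transactional (two-phase commit) sinks earn their complexity.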
More partitions is not always better
Each partition costs file descriptors, metadata, and replication overhead on every broker. Tens of thousands of partitions on a small cluster can degrade controller performance. Throughput per partition is often the binding constraint, not partition count: if one partition handles 10 MB/s and you need 100 MB/s, ten partitions suffice. Over-partitioning early to "future-proof" is the common mistake.
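The throughput arithmetic above generalizes to a one-line sizing rule: partitions = ceil(target throughput / per-partition throughput), with modest headroom. The 1.5x headroom factor below is an illustrative assumption, not a Kafka recommendation.

```python
import math

def partitions_needed(target_mb_s: float, per_partition_mb_s: float,
                      headroom: float = 1.5) -> int:
    """Size partition count by measured throughput, not guesswork:
    ceil(target * headroom / per-partition capacity). The headroom
    factor (1.5x here) is an assumed buffer for growth and key skew."""
    return math.ceil(target_mb_s * headroom / per_partition_mb_s)
```

For the example in the text, 100 MB/s at 10 MB/s per partition needs 10 partitions (15 with the assumed 1.5x headroom), not the hundreds that "future-proofing" tends to produce.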
Last reviewed: April 18, 2026