Infrastructure
Kafka Streaming Platform
Apache Kafka as a partitioned, replicated commit log. Topic / partition / offset model, consumer groups, exactly-once semantics, and the surrounding ecosystem (Streams, ksqlDB, Flink, Schema Registry).
What It Is
Apache Kafka is a distributed, partitioned, replicated commit log. A topic is an append-only log split into ordered partitions, each replicated across brokers for fault tolerance. Each record in a partition has a monotonic offset; consumers track their position by committing offsets back to Kafka. Producers choose a partition by key (hash) or round-robin; the partition is the unit of order and the unit of parallelism.
Consumers join consumer groups. Kafka assigns partitions to group members so that each partition is read by exactly one consumer in the group at a time, rebalancing on membership changes. A topic with N partitions can be read in parallel by at most N consumers in one group; extra consumers sit idle. Multiple consumer groups can read the same topic independently, each with its own offset state. This is the design that lets one event log fan out to many downstream systems without re-reading from the source.
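The assignment step can be illustrated with a sketch in the spirit of Kafka's range assignor: sort members, hand each a contiguous chunk of partitions, and give the remainder to the earliest members. This is a simplification (the real assignors work per-topic and support cooperative rebalancing), but it shows why extra consumers beyond the partition count end up with nothing.

```python
def range_assign(partitions: list[int], consumers: list[str]) -> dict[str, list[int]]:
    """Sketch of range-style partition assignment: each sorted consumer
    gets a contiguous block; earlier consumers absorb the remainder."""
    consumers = sorted(consumers)
    per, extra = divmod(len(partitions), len(consumers))
    out, start = {}, 0
    for i, member in enumerate(consumers):
        size = per + (1 if i < extra else 0)   # first `extra` members get one more
        out[member] = partitions[start:start + size]
        start += size
    return out
```

With 5 partitions and 2 consumers, one member reads 3 partitions and the other 2; with 2 partitions and 3 consumers, the third member is assigned nothing.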
Kafka separates storage from compute. Brokers do not run user code. Stream processing is delegated to Kafka Streams (JVM client library, embedded in your app), ksqlDB (SQL over streams, server-hosted), Apache Flink (separate cluster, strongest event-time semantics), and Spark Structured Streaming (micro-batch). Confluent Cloud and Amazon MSK are the common managed alternatives to self-hosting.
Delivery semantics: producers default to at-least-once (retries can duplicate records). Exactly-once requires the idempotent producer plus transactional writes, supported since Kafka 0.11 (2017) for read-process-write cycles that stay within the Kafka cluster. Exactly-once into an external sink requires the sink to participate in the transaction or to be idempotent on the consumer side.
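The transactional read side can be modeled with a toy log. This is an assumption-laden simplification of how a `read_committed` consumer behaves: transactional records land in the log immediately, but they only become visible to the consumer once their transaction commits, and aborted records are never surfaced.

```python
class TxnLog:
    """Toy model of Kafka transactional visibility: writes append
    immediately, but a read_committed reader only sees records whose
    transaction has committed. (Real brokers track this with control
    records and the last-stable-offset; this sketch just filters.)"""
    def __init__(self):
        self._log = []             # (txn_id, value), in append order
        self._committed = set()
    def send(self, txn_id: str, value):
        self._log.append((txn_id, value))
    def commit(self, txn_id: str):
        self._committed.add(txn_id)
    def read_committed(self):
        return [v for t, v in self._log if t in self._committed]
```

A transaction that is never committed (crashed producer, explicit abort) simply leaves its records invisible, which is why duplicates from producer retries do not reach a `read_committed` consumer.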
When You'd Use It
Kafka is the default for any system that needs durable, replayable event delivery between services at scale. ML use cases include: streaming features into an online feature store, ingesting click and impression logs for online learning, fanning model-prediction events to logging, monitoring, and feedback systems, and decoupling the model server from upstream producers. The replay property matters: a downstream consumer that breaks for an hour can rewind its offset and reprocess, instead of losing data.
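The replay property follows directly from the log abstraction: the broker never mutates delivered records, so recovery is just re-reading from an earlier offset. A minimal sketch of a single partition makes this concrete.

```python
class PartitionLog:
    """Minimal append-only partition. Replay is re-reading from an
    earlier offset; nothing is mutated or deleted on delivery."""
    def __init__(self):
        self.records = []
    def append(self, value) -> int:
        self.records.append(value)
        return len(self.records) - 1      # the record's offset
    def read_from(self, offset: int):
        return self.records[offset:]      # rewinding = smaller offset
```

A consumer that crashed an hour ago re-reads from its last committed offset and sees exactly the records it missed; a consumer that shipped a buggy transform can seek back to offset 0 and rebuild its state from scratch.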
The Schema Registry pattern (Confluent's open-source registry, or Apicurio) stores Avro / Protobuf / JSON Schema definitions out-of-band so producers and consumers can evolve schemas with compatibility checks, instead of embedding schemas in every message. This is essentially mandatory for a multi-team Kafka deployment; without it, a producer schema change silently breaks consumers.
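The compatibility check a registry performs can be sketched for one common mode. The function below is a simplified model of a BACKWARD check (a reader on the new schema must cope with data written under the old one), using plain dicts rather than real Avro/Protobuf resolution rules; the field-spec shape is an illustrative assumption.

```python
def backward_compatible(old_fields: dict, new_fields: dict) -> bool:
    """Sketch of a BACKWARD compatibility check: any field added in the
    new schema needs a default, or old records can't be decoded by a
    reader using the new schema. (Real registries apply the full
    Avro/Protobuf schema-resolution rules, not just this.)"""
    for name, spec in new_fields.items():
        if name not in old_fields and "default" not in spec:
            return False
    return True
```

Adding `plan` with a default of `"free"` passes; adding it as a required field fails, and the registry would reject the producer's schema registration before any consumer breaks.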
Notable Gotchas
Exactly-once is exactly-once within Kafka, not end-to-end
Kafka's exactly-once guarantees cover producer-to-broker and consumer offset commits inside Kafka transactions. If your consumer reads from Kafka and writes to S3, Postgres, or a feature store, exactly-once requires that external sink to either be idempotent on a deterministic key or to participate in a two-phase commit (Flink's two-phase commit sink, Kafka Connect's exactly-once source connectors). Most teams settle for at-least-once plus idempotent downstream writes.
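The "at-least-once plus idempotent downstream writes" pattern is small enough to sketch. The deterministic key is the record's coordinates, `(topic, partition, offset)`, so a redelivered record overwrites itself instead of duplicating; the dict stands in for any sink that supports keyed upserts.

```python
def idempotent_upsert(store: dict, topic: str, partition: int, offset: int, value) -> bool:
    """At-least-once delivery made effectively exactly-once at the sink:
    key the write on (topic, partition, offset) so a redelivery is a
    no-op overwrite, not a duplicate row. Returns True on first write."""
    key = (topic, partition, offset)
    first = key not in store
    store[key] = value
    return first
```

This works for any sink with keyed upserts (Postgres `ON CONFLICT`, a feature store's put-by-key); it does not work for append-only sinks, which is where transactional (two-phase commit) sinks earn their complexity.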
More partitions is not always better
Each partition costs file descriptors, metadata, and replication overhead on every broker. Tens of thousands of partitions on a small cluster can degrade controller performance. Throughput per partition is often the binding constraint, not partition count: if one partition handles 10 MB/s and you need 100 MB/s, ten partitions suffice. Over-partitioning early to "future-proof" is the common mistake.
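The throughput arithmetic above generalizes to a one-line sizing rule: partitions = ceil(target throughput / per-partition throughput), with modest headroom. The 1.5x headroom factor below is an illustrative assumption, not a Kafka recommendation.

```python
import math

def partitions_needed(target_mb_s: float, per_partition_mb_s: float,
                      headroom: float = 1.5) -> int:
    """Size partition count by measured throughput, not guesswork:
    ceil(target * headroom / per-partition capacity). The headroom
    factor (1.5x here) is an assumed buffer for growth and key skew."""
    return math.ceil(target_mb_s * headroom / per_partition_mb_s)
```

For the example in the text, 100 MB/s at 10 MB/s per partition needs 10 partitions (15 with the assumed 1.5x headroom), not the hundreds that "future-proofing" tends to produce.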
Last reviewed: April 18, 2026