What Kafka Is

Apache Kafka is a distributed event streaming platform used as:

  • A message queue (async processing)
  • A stream (real-time continuous data)

Summary

Kafka is a distributed append-only log built for high throughput, scalability, and durability.

Analogy: Kafka is like a shared ledger where events are written once and readers independently replay it at their own speed.


Motivating Example (Why Kafka Exists)

World Cup live stats:

  • Goals, fouls, substitutions → events
  • Events must be ordered per match
  • Massive traffic spikes
  • Consumers must scale horizontally

Problems Kafka solves:

  • Ordering
  • Scalability
  • Fault tolerance

Core Kafka Concepts

Cluster & brokers

  • cluster = multiple brokers
  • broker = a server storing data & serving clients

Important

More brokers = more storage + more throughput.


Topics vs partitions

  • topic: logical stream of messages
  • partition: physical, ordered, immutable log

  concept     purpose
  topic       data organization
  partition   scale + ordering
  broker      hosts partitions

Important

Ordering is guaranteed only within a partition.


Producers & consumers

  • producer → writes messages
  • consumer → reads messages
  • Kafka is data-agnostic (messages are opaque bytes; schemas are the application's concern)

Message anatomy

A Kafka message contains:

  • value (required)
  • key (optional, affects partitioning)
  • timestamp (optional)
  • headers (optional metadata)
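For concreteness, those fields can be sketched as a Python dataclass. This is a toy shape for reasoning about records, not Kafka's wire format; the field names are illustrative.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Record:
    """Toy shape of a Kafka message; field names are illustrative."""
    value: bytes                                 # required payload, opaque bytes to Kafka
    key: Optional[bytes] = None                  # optional; drives partition selection
    timestamp_ms: Optional[int] = None           # optional; event or broker time
    headers: dict = field(default_factory=dict)  # optional metadata; not used for routing

msg = Record(value=b'{"event": "goal", "match": 42}', key=b"match-42")
```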

Tip

The key is the most important design decision.


Partitioning Logic (Critical)

(flow) producer → hash(key) → partition → broker

Rules:

  • Same key → same partition → ordered
  • No key → round-robin (or sticky batching in newer clients) → no ordering guarantee

Ordering example (banking events):

  • Same user performs deposit(100) and then withdraw(50).
  • These events must be processed in the same order to keep balance correct.
  • Use userId as the partition key so both events go to the same partition.
  • Kafka guarantees order within a partition, so per-user balance updates remain consistent.
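The flow above can be sketched in a few lines, using a toy deterministic hash in place of Kafka's murmur2:

```python
NUM_PARTITIONS = 6

def partition_for(key: str) -> int:
    # hash(key) % partitions: the same key always maps to the same partition
    return sum(key.encode()) % NUM_PARTITIONS   # toy hash, deterministic

deposit = ("user-17", "deposit(100)")
withdraw = ("user-17", "withdraw(50)")

# Both events for user-17 land on one partition, so their order is preserved.
assert partition_for(deposit[0]) == partition_for(withdraw[0])
```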

Summary

Partitioning controls ordering, parallelism, and hotspots.


Append-Only Log Model

Each partition:

  • Append-only
  • Immutable
  • Sequential writes
  • Offset-based reads

Benefits:

  • Fast disk IO
  • Easy replication
  • Simple recovery
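A toy partition log makes the model concrete: appends go to the end and return an offset, and reads address records by offset:

```python
class PartitionLog:
    """Toy append-only log: writes go to the end, reads are by offset."""
    def __init__(self):
        self._entries = []

    def append(self, value) -> int:
        self._entries.append(value)      # sequential write, never updated in place
        return len(self._entries) - 1    # offset of the new record

    def read(self, offset: int):
        return self._entries[offset]     # offset-based read

log = PartitionLog()
o0 = log.append("deposit(100)")
o1 = log.append("withdraw(50)")
```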

Important

Kafka is fast because it never updates data, only appends.


Offsets & Consumption

  • Every message has an offset
  • Consumers track progress via offsets
  • Offsets are committed, not messages acknowledged
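A minimal sketch of offset-based progress: the consumer commits where it is, and after a restart it resumes from the last committed offset.

```python
messages = ["m0", "m1", "m2", "m3"]   # one partition's log
committed = 0                         # next offset to read (last commit + 1)

def consume_one():
    global committed
    msg = messages[committed]
    # ... process msg ...
    committed += 1                    # commit AFTER processing (at-least-once)
    return msg

consume_one()                         # processes "m0"
consume_one()                         # processes "m1"
# Simulated crash + restart: Kafka only knows the committed offset,
# so consumption resumes at offset 2, not at the beginning.
resumed_from = committed
```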

Important

Kafka tracks where you are, not what you processed.


Replication & Durability

Leader–follower model

  • Each partition has:
    • 1 leader
    • N−1 followers
  • Leader handles reads & writes
  • Followers replicate passively

Summary

Replication protects against broker failure, not bad data.


Acknowledgements (acks)

  • acks=all → leader + all replicas confirm

Trade-off:

  • Durability ↑
  • Latency ↑
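As a hedged sketch, this trade-off maps to producer configuration like the following, using librdkafka-style keys (the format accepted by clients such as confluent-kafka); the broker address is a placeholder:

```python
# Durability-first producer settings (sketch; keys follow librdkafka naming).
producer_config = {
    "bootstrap.servers": "broker1:9092",  # placeholder address
    "acks": "all",                # leader + all in-sync replicas must confirm
    "enable.idempotence": True,   # safe retries: broker de-duplicates resends
}
```

Pairing acks=all with idempotence is the usual durability-first combination; expect higher produce latency in exchange.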

Replication factor

  • Typical production value = 3 (the broker default is 1)
  • Survives broker failure
  • Costs disk space

  factor   durability   storage
  1        low          low
  3        high         high

Consumer model

Kafka uses pull-based consumption:

  • Consumers poll at their own pace
  • Prevents slow consumers from being overwhelmed
  • Enables batching

Analogy: consumers sip from a river; Kafka never force-feeds them.


Consumer Groups

  • Group = multiple consumers sharing work
  • One partition → one consumer per group
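The one-partition-one-consumer rule can be sketched as a toy round-robin assignment (Kafka's real assignors, e.g. range or cooperative-sticky, are more sophisticated):

```python
def assign(partitions: int, consumers: list) -> dict:
    """Spread partitions across group members; no partition is shared."""
    out = {c: [] for c in consumers}
    for p in range(partitions):
        out[consumers[p % len(consumers)]].append(p)  # round-robin spread
    return out

assignment = assign(6, ["c1", "c2", "c3"])
# Each partition appears exactly once across the whole group.
```

Re-running assign with a member removed models a rebalance: the orphaned partitions are redistributed across the survivors.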

Important

Same message can be read by multiple consumer groups.


Consumer Failure Handling

Offset recovery

  • Offsets committed after processing
  • Restart → resume from last committed offset

Rebalancing

  • Partitions redistributed across alive consumers

Warning

Commit offsets after critical work, or you risk data loss.


Kafka: Queue vs Stream

  message queue                stream
  async tasks                  continuous processing
  usually one consumer group   many consumer groups
  pull + ack                   replayable log

Summary

Kafka itself is the same — consumer behavior changes.


When to Use Kafka

Use Kafka as a queue when:

  • Async processing (YouTube transcoding)
  • Ordering required (Ticketmaster queue)
  • Producer & consumer must scale independently

Use Kafka as a stream when:

  • Real-time analytics (ad clicks)
  • Fan-out to many consumers (Facebook Live comments)

Scalability Basics

Message size

  • Recommended < 1MB
  • Kafka ≠ blob storage

Correct pattern:

  • Blobs → S3
  • Kafka → pointer (URL)

Warning

Storing large blobs in Kafka is an anti-pattern.


Broker capacity (rough)

  • ~1TB storage
  • ~1M msgs/sec (hardware dependent)

If your load fits within these limits, a detailed scaling discussion may not be needed.


Scaling Strategies

Add brokers

  • Increases capacity
  • Requires enough partitions

Important

Under-partitioned topics cannot use new brokers.


Partitioning strategy (most important)

  • Partition = hash(key) % partitions
  • Good key → even distribution
  • Bad key → hot partitions

Handling Hot Partitions

Example: viral ad in click aggregator

Solutions:

  • No key → random distribution, no ordering
  • Salting → adId + random
  • Compound key → adId:userId or adId:region
  • Backpressure → temporarily slow producers when lag and retries rise
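A toy simulation of the salting option (the hash, bucket count, and key names are illustrative):

```python
import random

random.seed(0)        # deterministic for the example
NUM_PARTITIONS = 8

def partition_for(key: str) -> int:
    return sum(key.encode()) % NUM_PARTITIONS   # toy hash standing in for murmur2

def salted_key(ad_id: str, buckets: int = 4) -> str:
    # Append a small random suffix so one hot adId fans out across
    # several partitions (sacrificing strict per-adId ordering).
    return f"{ad_id}:{random.randrange(buckets)}"

# Unsalted, every "ad-viral" event lands on a single partition:
assert len({partition_for("ad-viral") for _ in range(1000)}) == 1

# Salted, the same hot key is spread over multiple partitions:
hits = {partition_for(salted_key("ad-viral")) for _ in range(1000)}
```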

Summary

Hot partitions are solved by raising the effective key cardinality: spread one hot key across multiple partitions.


Retries & Errors

Producer retries

  • Automatic retries
  • Enable idempotence to avoid duplicates

Example (payments):

  • Risky event design: increaseBalanceBy(+1 INR)
  • Safer event design: setBalanceTo(50 INR) with a unique operation ID

If producer sends the message, times out, and retries:

  • increaseBalanceBy(+1 INR) may be applied twice (wrong balance)
  • setBalanceTo(50 INR) applied twice still ends at 50 INR (same final state)

This is why idempotency matters: retries are expected, duplicates are possible, and business state must stay correct.
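The timeout-and-retry scenario can be simulated in a few lines (balances and amounts are illustrative):

```python
# Why idempotent event design survives duplicate delivery from producer retries.
balance = 49

def increase_balance_by(amount):   # relative update: risky under retries
    global balance
    balance += amount

def set_balance_to(target):        # absolute update: idempotent under retries
    global balance
    balance = target

increase_balance_by(1)
increase_balance_by(1)             # duplicate from a retry
after_relative = balance           # 51: the duplicate double-counted

balance = 49
set_balance_to(50)
set_balance_to(50)                 # duplicate from a retry
after_absolute = balance           # 50: the duplicate was harmless
```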

Important

Idempotent producers are mandatory with retries.


Consumer retries

Kafka has no built-in consumer retry.

Pattern: main topic → retry topic → DLQ

Important behavior:

  • Messages do not go to DLQ immediately on first failure.
  • They are retried up to a configured max retries count (often with delay/backoff).
  • Only after retries are exhausted are they moved to the DLQ.
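A toy model of this behavior, with the topic hops collapsed into an in-process loop (the real pattern republishes to a retry topic, often with backoff):

```python
def handle(msg, process, max_retries=3):
    """Process msg; retry on failure, park in the DLQ only after retries run out."""
    dlq = []
    attempts = 0
    while attempts <= max_retries:
        try:
            process(msg)
            return "ok", attempts, dlq
        except Exception:
            attempts += 1       # in Kafka terms: republish to the retry topic
    dlq.append(msg)             # retries exhausted → dead-letter queue
    return "dead-lettered", attempts, dlq

def always_fails(_msg):
    raise RuntimeError("downstream unavailable")

status, attempts, dlq = handle("payment-123", always_fails)
```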

Tip

If retries are core, SQS may be simpler than Kafka.


Performance Optimizations

  • Batching → fewer network calls
  • Compression → less bandwidth
  • Partition key → biggest performance lever

Summary

Partitioning beats hardware for Kafka performance.


Retention Policies

Messages deleted by:

  • retention.ms (default 7 days)
  • retention.bytes

Important

Messages are not deleted after consumption.


Key takeaways

  • “Kafka is always available, sometimes consistent”
  • “Ordering is guaranteed per partition”
  • “Offsets track progress, not acknowledgements”
  • “Partitioning is the core scaling decision”

Final Takeaway

Summary

Kafka is a distributed, append-only log optimized for ordered, durable, high-throughput event processing. Focus on partitioning, hot partitions, durability trade-offs, and consumer behavior.