What Kafka Is
Apache Kafka is a distributed event streaming platform used as:
- a message queue (async processing)
- a stream (real-time continuous data)
Summary
Kafka is a distributed append-only log built for high throughput, scalability, and durability.
Analogy: Kafka is like a shared ledger where events are written once and readers independently replay it at their own speed.
Motivating Example (Why Kafka Exists)
World Cup live stats:
- goals, fouls, substitutions → events
- events must be ordered per match
- massive traffic spikes
- consumers must scale horizontally
Problems Kafka solves:
- ordering
- scalability
- fault tolerance
Core Kafka Concepts
Cluster & brokers
- cluster = multiple brokers
- broker = a server storing data & serving clients
Important
More brokers = more storage + more throughput.
Topics vs partitions
- topic: logical stream of messages
- partition: physical, ordered, immutable log
| concept | purpose |
|---|---|
| topic | data organization |
| partition | scale + ordering |
| broker | hosts partitions |
Important
Ordering is guaranteed only within a partition.
Producers & consumers
- producer → writes messages
- consumer → reads messages
- Kafka is data-agnostic (messages are just bytes; schemas are the application's concern)
Message anatomy
A Kafka message contains:
- value (required)
- key (optional, affects partitioning)
- timestamp (optional)
- headers (metadata)
Tip
The key is the most important design decision.
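The anatomy above can be pictured as a plain struct. A minimal Python sketch (field names mirror the Kafka record format; this is not a real client API, and the payload contents are made up for illustration):

```python
import time

# A Kafka record sketched as a plain dict (illustrative, not a client API).
record = {
    "value": b'{"event": "goal", "match": 42}',   # required payload (bytes)
    "key": b"match-42",                           # optional; drives partition assignment
    "timestamp": int(time.time() * 1000),         # optional; epoch milliseconds
    "headers": [("source", b"stats-feed")],       # optional metadata pairs
}

assert record["value"] is not None  # only the value is mandatory
```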
Partitioning Logic (Critical)
Flow: producer → hash(key) → partition → broker
Rules:
- same key → same partition → ordered
- no key → round-robin distribution → no ordering guarantee
Summary
Partitioning controls ordering, parallelism, and hotspots.
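The two rules fall out of the hash step. A toy partitioner (real Kafka clients use murmur2; Python's built-in `hash` stands in here, stable within one process):

```python
def partition_for(key: bytes, num_partitions: int) -> int:
    """Illustrative stand-in for Kafka's murmur2-based default partitioner."""
    return hash(key) % num_partitions  # same key -> same partition, always

# Same key lands on the same partition every time, preserving per-key order.
p1 = partition_for(b"match-42", 6)
p2 = partition_for(b"match-42", 6)
assert p1 == p2
```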
Append-Only Log Model
Each partition:
- append-only
- immutable
- sequential writes
- offset-based reads
Benefits:
- fast disk IO
- easy replication
- simple recovery
Important
Kafka is fast because it never updates data, only appends.
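The four properties above fit in a few lines. A toy partition (in-memory list instead of disk segments, purely to show the append/offset mechanics):

```python
class Partition:
    """Toy append-only log: records are appended at the tail, never updated."""

    def __init__(self):
        self._log = []

    def append(self, record) -> int:
        self._log.append(record)     # sequential write at the tail
        return len(self._log) - 1    # offset of the new record

    def read(self, offset: int):
        return self._log[offset]     # offset-based read; data is immutable

log = Partition()
assert log.append("goal") == 0
assert log.append("foul") == 1
assert log.read(0) == "goal"
```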
Offsets & Consumption
- every message has an offset
- consumers track progress via offsets
- offsets are committed, not messages acknowledged
Important
Kafka tracks where you are, not what you processed.
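That distinction — a stored bookmark, not per-message acks — can be sketched as (group and partition names are made up):

```python
committed = {("analytics", 0): 0}  # (group, partition) -> next offset to read

def poll(log, group, partition):
    """Deliver the next record and advance this group's committed offset.

    Kafka stores only this bookmark; whether the record was actually
    *processed* is entirely the consumer's business.
    """
    offset = committed[(group, partition)]
    if offset >= len(log):
        return None                          # caught up with the tail
    committed[(group, partition)] = offset + 1
    return log[offset]

log = ["goal", "foul"]
assert poll(log, "analytics", 0) == "goal"
assert poll(log, "analytics", 0) == "foul"
assert poll(log, "analytics", 0) is None
```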
Replication & Durability
Leader–follower model
- each partition has:
- 1 leader
- N−1 followers
- leader handles reads & writes
- followers replicate passively
Summary
Replication protects against broker failure, not bad data.
Acknowledgements (acks)
- acks=0 → no confirmation (fire-and-forget)
- acks=1 → leader confirms
- acks=all → leader + all in-sync replicas confirm
Trade-off:
- durability ↑
- latency ↑
Replication factor
- common production setting = 3 (Kafka's own default is 1)
- survives broker failure
- costs disk space
| factor | durability | storage |
|---|---|---|
| 1 | low | low |
| 3 | high | high |
Consumer model
Kafka uses pull-based consumption:
- consumers poll at their own pace
- prevents slow consumers from being overwhelmed
- enables batching
Analogy: consumers sip from a river, Kafka never force-feeds.
Consumer Groups
- group = multiple consumers sharing work
- each partition → at most one consumer within a group
Important
Same message can be read by multiple consumer groups.
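Both rules in one sketch: a toy round-robin assignment (real Kafka uses pluggable assignors such as range or sticky; group names are made up):

```python
def assign(partitions, consumers):
    """Toy assignment: each partition goes to exactly one consumer in the
    group; a consumer may own several partitions."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# Each group gets its own full assignment: within a group no partition is
# shared, while across groups every message is delivered again, because
# each group tracks its own offsets.
groups = {g: assign([0, 1, 2, 3], ["a", "b"]) for g in ("billing", "analytics")}
assert groups["billing"] == {"a": [0, 2], "b": [1, 3]}
assert groups["billing"] == groups["analytics"]
```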
Consumer Failure Handling
Offset recovery
- offsets committed after processing
- restart → resume from last committed offset
Rebalancing
- partitions redistributed across alive consumers
Warning
Commit offsets after critical work, or you risk data loss.
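The warning describes the classic at-least-once loop; a minimal sketch:

```python
def consume(log, start, process, commit):
    """At-least-once loop: commit the offset only after processing succeeds.

    Committing *before* processing ("at-most-once") loses the record if
    the consumer crashes between the commit and the work.
    """
    offset = start
    for record in log[start:]:
        process(record)        # do the critical work first...
        offset += 1
        commit(offset)         # ...then record progress
    return offset

seen, committed = [], []
consume(["a", "b"], 0, seen.append, committed.append)
assert seen == ["a", "b"] and committed == [1, 2]
```

The flip side of commit-after-processing is possible duplicates on crash, which is why consumers should be idempotent.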
Kafka: Queue vs Stream
| message queue | stream |
|---|---|
| async tasks | continuous processing |
| usually one consumer group | many consumer groups |
| pull + ack | replayable log |
Summary
Kafka itself is the same — consumer behavior changes.
When to Use Kafka
Use Kafka as a queue when:
- async processing (YouTube transcoding)
- ordering required (Ticketmaster queue)
- producer & consumer must scale independently
Use Kafka as a stream when:
- real-time analytics (ad clicks)
- fan-out to many consumers (Facebook Live comments)
Scalability Basics
Message size
- recommended < 1MB
- Kafka ≠ blob storage
Correct pattern:
- blobs → S3
- Kafka → pointer (URL)
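The pointer pattern in miniature (bucket name and URL are hypothetical):

```python
# Anti-pattern: putting the video bytes themselves on the topic.
# Correct pattern: store the blob elsewhere, stream only a pointer to it.
event = {
    "key": b"video-123",
    "value": b'{"uploaded": "s3://my-bucket/videos/123.mp4"}',  # hypothetical URL
}
assert len(event["value"]) < 1_000_000  # well under the ~1MB guideline
```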
Warning
Storing large blobs in Kafka is an anti-pattern.
Broker capacity (rough)
- ~1TB storage
- ~1M msgs/sec (hardware dependent)
If below this → scaling discussion may not be needed.
Scaling Strategies
Add brokers
- increases capacity
- requires enough partitions
Important
Under-partitioned topics cannot use new brokers.
Partitioning strategy (most important)
- partition = hash(key) % partitions
- good key → even distribution
- bad key → hot partitions
Handling Hot Partitions
Example: viral ad in click aggregator
Solutions:
- no key → random distribution, no ordering
- salting → adId + random
- compound key → adId:userId or adId:region
- backpressure → slow producer
Summary
Hot partitions are solved by raising key cardinality: one hot key is spread across many partitions.
Retries & Errors
Producer retries
- automatic retries
- enable idempotence to avoid duplicates
Important
Idempotent producers are mandatory with retries.
Consumer retries
Kafka has no built-in consumer retry.
Pattern: main topic → retry topic → DLQ
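The retry-topic pattern is an application-level convention, not a Kafka feature; one way to sketch the routing decision (topic names and attempt counting are illustrative):

```python
def handle(record, process, max_attempts=3):
    """Route a failing record: main topic -> retry topic -> DLQ.

    In practice the consumer re-publishes to the retry topic with the
    attempt count carried in a header; a dict field stands in here.
    """
    attempts = record.get("attempts", 0)
    try:
        process(record)
        return "done"
    except Exception:
        record["attempts"] = attempts + 1
        if record["attempts"] >= max_attempts:
            return "dlq"          # park for manual inspection
        return "retry-topic"      # re-publish for another try

rec = {"value": "bad"}
def boom(_):
    raise ValueError("simulated processing failure")

assert handle(rec, boom) == "retry-topic"
assert handle(rec, boom) == "retry-topic"
assert handle(rec, boom) == "dlq"
```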
Tip
If retries are core, SQS may be simpler than Kafka.
Performance Optimizations
- batching → fewer network calls
- compression → less bandwidth
- partition key → biggest performance lever
Summary
Partitioning beats hardware for Kafka performance.
Retention Policies
Messages deleted by:
- retention.ms (default 7 days)
- retention.bytes
Important
Messages are not deleted after consumption.
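Both knobs are topic-level configs; a sketch of the defaults (keys match Kafka's topic config names):

```python
# Topic-level retention settings (names match Kafka's topic configs).
retention_config = {
    "retention.ms": 7 * 24 * 60 * 60 * 1000,  # default: 7 days
    "retention.bytes": -1,                    # -1 = no size limit (the default)
}
assert retention_config["retention.ms"] == 604_800_000
```

Whichever limit is hit first triggers deletion; consumption plays no part in it.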
Key takeaways
- “Kafka is always available, sometimes consistent”
- “Ordering is guaranteed per partition”
- “Offsets track progress, not acknowledgements”
- “Partitioning is the core scaling decision”
Final Takeaway
Summary
Kafka is a distributed, append-only log optimized for ordered, durable, high-throughput event processing. Focus on partitioning, hot partitions, durability trade-offs, and consumer behavior.