The modern digital world operates on an unprecedented volume of data. Every click, transaction, sensor reading, and user interaction is an event, and the systems designed to process them must handle these streams reliably, in real time, and at massive scale.
The Scale Challenge
- • LinkedIn, Kafka’s creator, processes over 7 trillion messages per day.
- • Uber handles trillions of messages and multiple petabytes of data daily, from calculating ETAs to driver-rider matching.
- • FinTech companies handle over a billion events every day for transaction processing and real-time monitoring.
In this deep dive, we will move past the basic definitions of Kafka and explore the specific architectural patterns and configuration tuning strategies that allow it to manage billions of events per day.
The Three Pillars of Kafka Scalability
Kafka’s remarkable scaling ability is rooted in a design philosophy that treats data not as records in a database, but as an immutable, distributed commit log. This simple, yet powerful, concept allows it to sustain immense throughput.
1. Topics and Partitions
Topics are the logical streams of data. A partition is the core unit of parallelism and storage in Kafka. Each partition is an ordered, append-only sequence of records.
Scaling Kafka begins with the partition count. If a topic has N partitions, the data can be distributed and processed by up to N separate machines (brokers and consumers) in parallel.
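As a minimal sketch, a topic with a chosen partition count can be created programmatically with the Java AdminClient. The topic name (events), partition count, replication factor, and broker address below are illustrative assumptions, not values from a real deployment:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.Collections;
import java.util.Properties;

public class CreateTopicExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Assumed bootstrap address; replace with your cluster's brokers.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // 12 partitions allow up to 12 consumer instances (in one group)
            // to read the topic in parallel; replication factor 3 keeps a
            // copy of each partition on three brokers.
            NewTopic topic = new NewTopic("events", 12, (short) 3);
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}
```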
2. Kafka Brokers and Clusters
A large-scale Kafka environment is built on a distributed cluster of servers.
- • Kafka Broker: A single Kafka server. It manages the partitions and handles all network traffic (producing, consuming, and replicating data).
- • Kafka Cluster: A collection of brokers working together. Scaling out horizontally simply means adding more brokers to the cluster.
Replication
To prevent data loss and ensure high availability, each partition is typically replicated across multiple brokers. This is defined by the Replication Factor.
- • One broker holds the Leader copy of a partition, which handles all reads and writes.
- • The other brokers hold Follower copies, which replicate data by fetching from the leader. Followers that stay caught up form the in-sync replica set (ISR) and can take over as leader if the current leader fails (a small sketch for inspecting this layout follows).
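To see this layout in practice, the AdminClient can describe a topic and print, for every partition, which broker leads it and which replicas are in sync. The topic name and broker address are again assumptions:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;

import java.util.Collections;
import java.util.Properties;

public class DescribeReplicationExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address

        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription description = admin
                    .describeTopics(Collections.singletonList("events"))
                    .all().get().get("events");

            // For each partition: the broker that leads it, all replica
            // brokers, and the subset that is currently in sync (the ISR).
            for (TopicPartitionInfo partition : description.partitions()) {
                System.out.printf("partition %d leader=%d replicas=%s isr=%s%n",
                        partition.partition(), partition.leader().id(),
                        partition.replicas(), partition.isr());
            }
        }
    }
}
```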
3. Consumer Groups
While partitions enable parallel writing (producer side), Consumer Groups enable parallel reading. Key Rule: Within a consumer group, each partition is assigned to exactly one consumer instance, which guarantees that every message is consumed by only one member of the group.
The maximum parallel throughput for a consumer group is therefore fundamentally capped by the number of partitions in the topic.
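As a sketch of the read side, the consumer below joins a hypothetical group called event-processors. Starting several copies of this same program is all it takes to spread the topic's partitions across them (the topic name and broker address are assumptions):

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class GroupConsumerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address
        // Every instance started with the same group.id shares the topic's
        // partitions; each partition is read by exactly one of them.
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "event-processors");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("events"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```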
Scaling Strategies: Tuning for Billions
While the above architecture provides the foundation, achieving truly massive scale - handling billions of events - requires careful tuning of client and broker configurations. This is where engineers differentiate an average Kafka setup from a world-class high-throughput engine.
Topic Design: The Partition Count Paradox
- • The Problem: Too few partitions limit your maximum parallel processing rate. Too many partitions introduce significant metadata management overhead for the brokers, especially during cluster restarts or consumer rebalancing.
- • The Strategy: Start with a rational estimate based on your expected throughput and processing speed. A common rule of thumb is to size the partition count so that, at peak load, no single partition receives more data than one consumer instance can process (see the worked example after this list).
- • Key Consideration: Aim for a partition count that is a multiple of the number of brokers for even distribution.
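As a worked example of this rule of thumb, with purely illustrative numbers: suppose you expect a peak of 500,000 events per second and a single consumer instance can comfortably process about 10,000 events per second. You then need at least 500,000 / 10,000 = 50 partitions; on a hypothetical 6-broker cluster you might round up to 54 so the count is a multiple of the broker count and there is headroom for growth. Measure your own per-consumer throughput before committing to a number.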
Producer Tuning: Maximizing Ingestion Throughput
The Producer API is designed to maximize the rate at which data is ingested by grouping messages into batches. Let's walk through the configurations that control this (a sample producer setup follows the list):
- • batch.size - The maximum size (in bytes) of a single per-partition batch; the producer fills batches up to this size before sending them to a broker.
- • linger.ms - The time (in milliseconds) the producer waits for more records to accumulate into a batch before sending it, even if batch.size isn't reached.
- • compression.type - The type of compression applied to the batches (e.g., lz4, zstd).
- • acks - The durability level for the message. acks=1 acknowledges once the leader has the message; acks=all waits for every in-sync replica.
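Put together, a throughput-oriented producer might be configured like the sketch below. The broker address, topic name, and the specific batch, linger, and compression values are illustrative assumptions to be tuned against your own workload:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class ThroughputProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // Batch up to 64 KB per partition and wait up to 10 ms for a batch to
        // fill, trading a little latency for much higher throughput.
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 65536);
        props.put(ProducerConfig.LINGER_MS_CONFIG, 10);
        // Compress whole batches; lz4 and zstd are common high-throughput choices.
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");
        // acks=all waits for all in-sync replicas, the safer choice at scale.
        props.put(ProducerConfig.ACKS_CONFIG, "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("events", "user-42", "{\"action\":\"click\"}"));
        }
    }
}
```

Larger batches and a small linger window mean fewer, bigger requests per broker, which is usually the single biggest producer-side win at high volume.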
Consumer Tuning: Optimizing Processing Speed
On the read side, the goal is to fetch data efficiently and avoid unnecessary delays. The key configurations (a sample consumer setup follows the list):
- • fetch.max.bytes - The maximum amount of data the consumer will attempt to fetch from a single broker in a single request.
- • max.poll.records - The maximum number of records returned in a single call to poll().
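A minimal consumer-side sketch, with the same assumed broker address and group name as before; the 50 MB and 2,000-record figures are illustrative and should match how fast your processing loop can drain a batch between polls:

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.util.Properties;

public class FetchTunedConsumerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "event-processors");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        // Allow up to 50 MB per fetch response and up to 2,000 records per poll().
        props.put(ConsumerConfig.FETCH_MAX_BYTES_CONFIG, 50 * 1024 * 1024);
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 2000);

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        // ... subscribe and poll as in the earlier consumer group example ...
    }
}
```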
Broker and Hardware Optimization
Kafka’s performance relies heavily on how the operating system handles disk and memory.
- • Sequential I/O: Kafka writes data sequentially to the end of the commit log file. This is fast even on traditional HDDs, but modern high-IOPS SSDs are strongly recommended at billions-of-events scale.
- • OS Page Cache: Kafka brokers are designed to lean heavily on the operating system's page cache. Give broker machines generous RAM, keep the JVM heap modest, and let the OS use the rest for caching; most reads are then served straight from memory.
- • Network: At billions of events per day, the network often becomes the biggest bottleneck. Ensure brokers are connected over high-bandwidth, low-latency links; managed cloud offerings typically take care of this for you.
Common Pitfalls to Avoid
A high-performing Kafka cluster can still be crippled by poor client design. Here are the most common scaling pitfalls.
- • The Hot Partition Problem - Imbalanced key usage causes data skew, overloading a single partition and its assigned consumer (a mitigation sketch follows this list).
- • Consumer Rebalancing Storms - Consumers that repeatedly time out (for example, because the poll loop takes longer than max.poll.interval.ms) trigger continuous rebalances that stall the entire group.
- • Downstream Bottleneck - Kafka is fast, but the target system often cannot keep up with the consumer’s fetch rate, leading to accumulated lag.
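One common (though not the only) mitigation for the hot partition problem is to "salt" a known hot key on the producer side so its traffic spreads across several partitions. The topic name, hot key, and salt range below are hypothetical, and note that salting sacrifices strict ordering for that key:

```java
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.concurrent.ThreadLocalRandom;

public class KeySaltingExample {
    // Splits traffic for a known hot key across 8 sub-keys so it no longer
    // lands on a single partition. Consumers must be able to re-group the
    // salted keys (e.g. by stripping the suffix) if per-key logic matters.
    static ProducerRecord<String, String> recordFor(String key, String value) {
        if ("big-tenant".equals(key)) {                 // hypothetical hot key
            int salt = ThreadLocalRandom.current().nextInt(8);
            return new ProducerRecord<>("events", key + "-" + salt, value);
        }
        return new ProducerRecord<>("events", key, value);
    }
}
```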