Paper Summary: Kafka

Reading time ~8 minutes

Overview

Let’s review the different sections of the original Kafka paper. Kafka has now been in development for many years, so a few things have changed since then, which I’ll briefly mention along the way.


TL;DR and Key Features

Kafka is a high-performance, real-time, scalable, distributed, and fault-tolerant publish-subscribe (pub-sub) system.

Caveat: It wasn’t fault-tolerant at the time of the paper, but is now.
The key features that the paper describes:

  • A pull-based model, to allow consumers to process data at their own pace.
  • Ability to scale publishers and consumers horizontally.
  • Design decisions that were unconventional at the time, geared towards high performance (e.g., no application-level caching).
  • One of the first systems focused on real-time streaming workloads.

Basic Concepts and a Bird's Eye View

  • Message Payload: The raw bytes of the actual message - generally serialized/deserialized using Avro (or another format) and sent in batches.
  • Broker: A node in the cluster, hosting several topic-partitions.
  • Partition: The basic unit of parallelism - each can be thought of as an append-only log file. While the original paper didn’t have this, Kafka now has the concept of a leader partition that is replicated to follower partitions; note again that replication is a more recent addition. By distributing partitions across brokers, Kafka allows horizontal scaling of message queueing.
  • Topic: A group of partitions representing a single topic/feed of interest.
  • Producer: The message publisher/source - when submitting messages to a topic, the producer round-robins them between leader partitions to aid scaling.
  • Consumer: Subscribes to one or more topic-partitions and reads messages from them.
  • Consumer Group: A group of consumers that together represent one entity interested in a given topic. The group size varies depending on the number of consumers needed to keep up with the topic. A consumer group allows horizontal scaling of message dequeuing. (A minimal producer/consumer sketch follows this list.)
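To make these roles concrete, here is a minimal produce/consume sketch using the modern Java client (which postdates the paper). The broker address, the topic name "page-views", and the group name "analytics" are illustrative assumptions, not anything from the paper.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class PubSubSketch {
    public static void main(String[] args) {
        // Producer: publishes messages to a topic. With no key set,
        // the client spreads records across the topic's partitions.
        Properties pp = new Properties();
        pp.put("bootstrap.servers", "localhost:9092"); // hypothetical broker address
        pp.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        pp.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(pp)) {
            producer.send(new ProducerRecord<>("page-views", "hello"));
        }

        // Consumer: joins the consumer group "analytics"; the partitions
        // of "page-views" are divided among the group's members.
        Properties cp = new Properties();
        cp.put("bootstrap.servers", "localhost:9092");
        cp.put("group.id", "analytics"); // the consumer group described above
        cp.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        cp.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cp)) {
            consumer.subscribe(List.of("page-views"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> r : records)
                System.out.printf("partition=%d offset=%d value=%s%n",
                        r.partition(), r.offset(), r.value());
        }
    }
}
```

Note how the consumer addresses messages by (partition, offset) rather than by a per-message id, which matters for the storage design discussed next.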

Performance Driven Decisions

  1. Segment Files: As mentioned earlier, each partition can be thought of as an append-only, immutable log. In practice, once a given log file (called a segment file) reaches a preconfigured size (generally 1 GB), a new segment file is created, and after a preconfigured retention interval of a few days each segment file is automatically deleted. Entries in a segment file carry no explicit message id; instead, each message is addressed by an incrementing offset, and each segment file is named after the offset of its first message. By preventing deletions and other mutating operations, the design is simplified and made amenable to data-parallelism.
  2. In-memory Index: A sorted, in-memory index over the currently live segment files, with entries resembling log-offset -> (file, position in file). Only one entry per N messages is stored in the index (in the paper’s diagram, one per segment file). A simple sparse index keeps memory use low while avoiding wasted compute on repeated random lookups/seeks. (A toy lookup sketch follows this list.)
  3. Batched Pull Requests: Batching messages allows peak-bandwidth use; pulling ensures that consumers receive messages at a rate they can handle.
  4. Page Cache Reliance: Since the OS already caches file data in its page cache, messages are not cached again in an application-level layer. This prevents double buffering and keeps recently written messages hot for real-time reads and writes.
  5. Zero-Copy Transfer: The sendfile() API avoids copying data through user space - it transfers bytes directly from a file (i.e., the page cache) to a socket. (See the zero-copy sketch after this list.)
  6. Stateless Brokers: Clients are responsible for committing/storing the logical offsets they have already consumed. Newer versions of Kafka allow clients to store this information in a special __consumer_offsets topic. Keeping brokers independent of consumer state makes them simple and, again, amenable to scaling; the stored offsets also make it easy to rewind to older messages.
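As a toy illustration of the sparse index from item 2, here is a sketch built on a sorted map: a lookup finds the greatest indexed offset at or below the target and returns the file position from which to scan forward. The SparseOffsetIndex class and FilePosition record are hypothetical stand-ins; the real broker keeps a separate index per segment file rather than one global map.

```java
import java.util.TreeMap;

// Toy model of the sparse in-memory index: only one in N offsets is
// indexed. A lookup finds the nearest indexed entry at or below the
// target offset, then the caller scans forward from that position.
class SparseOffsetIndex {
    record FilePosition(String segmentFile, long byteOffset) {}

    // Sorted map: logical offset -> (segment file, position within file).
    private final TreeMap<Long, FilePosition> index = new TreeMap<>();

    void add(long offset, FilePosition pos) {
        index.put(offset, pos);
    }

    // Where to start a linear scan for `target`: the greatest indexed
    // offset <= target, mirroring the sparse-index lookup in the paper.
    FilePosition searchStart(long target) {
        var entry = index.floorEntry(target);
        if (entry == null)
            throw new IllegalArgumentException("offset precedes start of log");
        return entry.getValue();
    }
}
```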
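And for item 5, a sketch of what zero-copy transfer looks like from Java, where FileChannel.transferTo is typically backed by sendfile() on Linux. The segment path and byte range here are assumptions for illustration.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Zero-copy send of a slice of a segment file to a consumer's socket.
// transferTo moves bytes from the page cache to the socket without
// staging them in user-space buffers.
class ZeroCopySend {
    static void sendSegment(Path segmentFile, SocketChannel socket,
                            long startPos, long count) throws IOException {
        try (FileChannel ch = FileChannel.open(segmentFile, StandardOpenOption.READ)) {
            long sent = 0;
            while (sent < count) { // transferTo may send fewer bytes than requested
                sent += ch.transferTo(startPos + sent, count - sent, socket);
            }
        }
    }
}
```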

Distributed Coordination

  • Partitions are the units of parallelism.
  • Producers can scale up by parallelizing message transfer across partitions.
  • Consumers can scale up by forming consumer groups that read from multiple partitions of a topic at the same time. Note that there are no ordering guarantees as soon as a topic has more than one partition.
  • Zookeeper is used to track producer/consumer membership and to trigger any needed rebalance of partitions across consumers; the consumers themselves actually perform the rebalance. (In modern Kafka, group coordination has moved from Zookeeper into the brokers; a sketch of how a consumer observes rebalances follows.)
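In the paper-era design, consumers coordinate the rebalance themselves through Zookeeper. The modern Java client instead surfaces rebalances via the ConsumerRebalanceListener callback; here is a hedged sketch of what that looks like, reusing the hypothetical topic name from the earlier sketch.

```java
import java.util.Collection;
import java.util.List;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

// When group membership changes, partitions are revoked from some
// members and assigned to others; the listener lets the consumer
// react at both points.
class RebalanceAware {
    static void subscribeWithListener(KafkaConsumer<String, String> consumer) {
        consumer.subscribe(List.of("page-views"), new ConsumerRebalanceListener() {
            @Override
            public void onPartitionsRevoked(Collection<TopicPartition> parts) {
                // Commit progress / flush state before losing these partitions.
                System.out.println("revoked: " + parts);
            }

            @Override
            public void onPartitionsAssigned(Collection<TopicPartition> parts) {
                // Optionally seek to stored offsets for the new partitions.
                System.out.println("assigned: " + parts);
            }
        });
    }
}
```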

Reliability Guarantees

  • Guarantees at-least-once delivery, i.e., a message will be delivered to and processed by the consumer at least once. This is achieved by having the consumer commit the last processed offset to the __consumer_offsets topic only after it has actually processed the message. (See the sketch after this list.)
  • A CRC is stored per message to detect corruption.
  • Leader-follower replication of partitions allows recovery in case of failure.
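Here is a minimal sketch of the commit-after-processing loop that yields at-least-once semantics with the modern Java client: auto-commit is disabled, and offsets are committed only once the records have been handled. Broker address, topic, and group names are assumptions carried over from the earlier sketches.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

// At-least-once consumption: if the consumer crashes between processing
// and committing, the uncommitted batch is redelivered on restart,
// hence "at least once" rather than "exactly once".
class AtLeastOnceLoop {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "analytics");
        props.put("enable.auto.commit", "false"); // we commit manually below
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("page-views"));
            while (true) {
                var records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> r : records) {
                    process(r); // do the real work first...
                }
                consumer.commitSync(); // ...then record progress in __consumer_offsets
            }
        }
    }

    static void process(ConsumerRecord<String, String> r) { /* application logic */ }
}
```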

Case Study: LinkedIn

The frontend captures usage logs that are dumped into corresponding topics via a load balancer. Real-time services can directly subscribe to and process these messages. A separate set of embedded consumers pulls messages into another set of brokers and dumps them into Hadoop or other distributed storage for offline analysis.


Benchmarks

  1. The producer outperforms competitors because batched, serialized messages and a simple index ensure maximum throughput. ActiveMQ, for example, required significant CPU for B-tree index lookups.
  2. The consumer outperforms competitors because of the same efficient message storage, the sendfile API, and reduced disk usage.

Note: A number of images were taken from the paper itself.
That’s all, thanks for reading :) - You can post a comment or reach out to me at saatvikshah1994@gmail.com.
