Kafka, Samza, and the Unix Philosophy of Distributed Data
This paper is closely related to the "Realtime Data Processing at Facebook" paper I reviewed in my previous post. As I mentioned there, Kafka does basically the same thing as Facebook's Scribe, and Samza is a stream processing system on top of Kafka.
This paper is very easy to read. It is delightful in its simplicity. It summarizes the design of Apache Kafka and Apache Samza and compares their design principles to the design philosophy of Unix, in particular, Unix pipes.
Who says plumbing can't be sexy? (Seriously, don't Google this.) So without further ado, I present to you the Mike Rowe of distributed systems.
Motivation/Applications
I had talked about the motivation and applications of stream processing in the Facebook post. The application domain is basically building web services that adapt to your behavior and personalize on the fly, including Facebook, Quora, LinkedIn, Twitter, Youtube, Amazon, etc. These web services take in your most recent actions (likes, clicks, tweets), analyze them on the fly, merge them with previous analytics on larger data, and adapt to your recent activity as part of a feedback loop.

In theory you can achieve this personalization goal with a batch workflow system, like MapReduce, which provides system scalability, organizational scalability (that of the engineering/development team's efforts), operational robustness, multi-consumer support, loose coupling, data provenance, and friendliness to experimentation. However, batch processing adds large delays. Stream processing systems preserve all the good scalability features of batch workflow systems, and add timeliness as well.
I am using shortened descriptions from the paper for the next sections.
Apache Kafka
Kafka provides a publish-subscribe messaging service. Producer (publisher) clients write messages to a named topic, and consumer (subscriber) clients read the messages in a topic. A topic is divided into partitions, and messages within a partition are totally ordered. There is no ordering guarantee across different partitions. The purpose of partitioning is to provide horizontal scalability: different partitions can reside on different machines, and no coordination across partitions is required.
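To make the pub-sub model concrete, here is a minimal sketch of a producer and a consumer using the Kafka Java client. The broker address, topic name, keys, and consumer group are my own illustrative assumptions, not anything prescribed by the paper.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class PubSubSketch {
    public static void main(String[] args) {
        // Producer: messages with the same key hash to the same partition, so
        // they stay totally ordered relative to each other. There is no
        // ordering guarantee across partitions.
        Properties prod = new Properties();
        prod.put("bootstrap.servers", "localhost:9092");  // hypothetical broker
        prod.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        prod.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(prod)) {
            producer.send(new ProducerRecord<>("clicks", "user-42", "viewed /home"));
            producer.send(new ProducerRecord<>("clicks", "user-42", "viewed /search"));
        }

        // Consumer: subscribes to the topic and reads messages partition by partition.
        Properties cons = new Properties();
        cons.put("bootstrap.servers", "localhost:9092");
        cons.put("group.id", "click-readers");  // hypothetical consumer group
        cons.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        cons.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cons)) {
            consumer.subscribe(Collections.singletonList("clicks"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> r : records) {
                System.out.printf("partition=%d offset=%d key=%s value=%s%n",
                        r.partition(), r.offset(), r.key(), r.value());
            }
        }
    }
}
```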
Each partition is replicated across multiple Kafka broker nodes to tolerate node failures. One of a partition's replicas is chosen as leader, and the leader handles all reads and writes of messages in that partition. Writes are serialized by the leader and synchronously replicated to a configurable number of replicas. On leader failure, one of the in-sync replicas is chosen as the new leader.
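How eagerly a write is acknowledged relative to that replication is a producer-side choice. A tiny, assumption-laden fragment extending the producer properties from the sketch above:

```java
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");  // hypothetical broker address
// acks=all: the partition leader acknowledges a write only after the full set
// of in-sync replicas has replicated it, matching the synchronous replication
// described above.
props.put("acks", "all");
```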
The throughput of a single topic-partition is limited by the computing resources of a single broker node -- the bottleneck is usually either its NIC bandwidth or the sequential write throughput of the broker's disks. When adding nodes to a Kafka cluster, some partitions can be reassigned to the new nodes, without changing the number of partitions in a topic. This rebalancing technique allows the cluster's computing resources to be increased or decreased without affecting partitioning semantics.
Apache Samza
A Samza job consists of a Kafka consumer, an event loop that calls application code to process incoming messages, and a Kafka producer that sends output messages back to Kafka. Unlike many other stream-processing frameworks, Samza does not implement its own network protocol for transporting messages from one operator to another.

Figure 3 illustrates the use of partitions in the word-count example: by using the word as the message key, the SplitWords job ensures that all occurrences of the same word are routed to the same partition of the words topic.
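In the spirit of that example, a SplitWords-style task using Samza's classic low-level StreamTask API might look roughly like this; the system and topic names are my assumptions, and the exact code in the paper's figures may differ.

```java
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;

public class SplitWordsTask implements StreamTask {
    // Output stream "words" on the "kafka" system; names are illustrative.
    private static final SystemStream WORDS = new SystemStream("kafka", "words");

    @Override
    public void process(IncomingMessageEnvelope envelope,
                        MessageCollector collector,
                        TaskCoordinator coordinator) {
        String line = (String) envelope.getMessage();
        for (String word : line.split("\\s+")) {
            // Emitting the word as the message key routes every occurrence of
            // the same word to the same partition of the "words" topic.
            collector.send(new OutgoingMessageEnvelope(WORDS, word, 1));
        }
    }
}
```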
Samza implements durable state through the KeyValueStore abstraction, exemplified in Figure 2. Samza uses the RocksDB embedded key-value store, which provides low-latency, high-throughput access to data on local disk. To make the embedded store durable in the face of disk and node failures, every write to the store is also sent to a dedicated topic-partition in Kafka (the changelog), as illustrated in Figure 4. When recovering after a failure, a job can rebuild its store contents by replaying its partition of the changelog from the beginning. Rebuilding a store from the log is only necessary if the RocksDB database is lost or corrupted. While publishing the changelog to Kafka for durability seems wasteful, it can also be a useful feature for applications: other stream processing jobs can consume the changelog topic like any other stream, and use it to perform further computations.
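A rough sketch of a counting task backed by the KeyValueStore, in the spirit of Figure 2; the store name is a hypothetical one that would have to be declared in the job config, where Samza wires it to RocksDB and a Kafka changelog topic.

```java
import org.apache.samza.config.Config;
import org.apache.samza.storage.kv.KeyValueStore;
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.task.InitableTask;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskContext;
import org.apache.samza.task.TaskCoordinator;

public class CountWordsTask implements StreamTask, InitableTask {
    private KeyValueStore<String, Integer> counts;

    @Override
    @SuppressWarnings("unchecked")
    public void init(Config config, TaskContext context) {
        // "word-counts" is a hypothetical store name from the job config.
        // Every put() below is also written to the store's Kafka changelog,
        // so the task can rebuild the RocksDB store after a failure.
        counts = (KeyValueStore<String, Integer>) context.getStore("word-counts");
    }

    @Override
    public void process(IncomingMessageEnvelope envelope,
                        MessageCollector collector,
                        TaskCoordinator coordinator) {
        String word = (String) envelope.getKey();
        Integer current = counts.get(word);
        counts.put(word, current == null ? 1 : current + 1);
    }
}
```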
One characteristic form of stateful processing is a join of two or more input streams, most commonly an equi-join on a key (e.g. user ID). One type of join is a window join, in which messages from input streams A and B are matched if they have the same key and occur within some time interval delta-t of one another. Alternatively, a stream may be joined against tabular data: for example, user clickstream events could be joined with user profile data, producing a stream of clickstream events with embedded information about the user. When joining with a table, the authors recommend making the table data available in the form of a log-compacted stream through Kafka. Processing tasks can consume this stream to build an in-process replica of a database table partition, using the same approach as the recovery of durable local state, and then query it with low latency. It seems wasteful to me, but it looks like the authors do not feel worried about straining Kafka, and are comfortable with using Kafka as a workhorse.
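A hedged sketch of that stream-table join pattern: one input is a log-compacted user-profiles topic whose messages are applied to a local KeyValueStore replica of the table, the other is a click stream that is enriched by a low-latency lookup against that replica. All topic, store, and field names here are assumptions for illustration.

```java
import org.apache.samza.config.Config;
import org.apache.samza.storage.kv.KeyValueStore;
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.InitableTask;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskContext;
import org.apache.samza.task.TaskCoordinator;

public class EnrichClicksTask implements StreamTask, InitableTask {
    private static final SystemStream ENRICHED = new SystemStream("kafka", "enriched-clicks");
    private KeyValueStore<String, String> profiles;  // local replica of the profile table partition

    @Override
    @SuppressWarnings("unchecked")
    public void init(Config config, TaskContext context) {
        profiles = (KeyValueStore<String, String>) context.getStore("user-profiles");
    }

    @Override
    public void process(IncomingMessageEnvelope envelope,
                        MessageCollector collector,
                        TaskCoordinator coordinator) {
        String stream = envelope.getSystemStreamPartition().getStream();
        String userId = (String) envelope.getKey();

        if ("user-profiles".equals(stream)) {
            // Table side: apply the log-compacted changelog to the local replica.
            profiles.put(userId, (String) envelope.getMessage());
        } else {
            // Stream side: join each click against the local replica and emit it.
            String click = (String) envelope.getMessage();
            collector.send(new OutgoingMessageEnvelope(ENRICHED, userId, click + " | " + profiles.get(userId)));
        }
    }
}
```

For this to work, both input topics must be partitioned by user ID, so that the same task sees the profile updates and the clicks for a given set of users.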
Even though the intermediate state between two Samza stream processing operators is always materialized to disk, Samza is able to provide good performance: a simple stream processing job can process over 1 million messages per second on one machine, and saturate a gigabit Ethernet NIC.
Discussion
The paper includes a nice discussion section as well.

- Since the only access methods supported by a log are an appending write and a sequential read from a given offset, Kafka avoids the complexity of implementing random-access indexes. By doing less work, Kafka is able to provide much better performance than systems with richer access methods. Kafka's focus on the log abstraction is reminiscent of the Unix philosophy: "Make each program do one thing well. To do a new job, build afresh rather than complicate old programs by adding new features."
- If Kafka is like a streaming version of HDFS, then Samza is like a streaming version of MapReduce. The pipeline is loosely coupled, since a job does not know the identity of the jobs upstream or downstream from it, only the topic names. This principle again evokes a Unix maxim: "Expect the output of every program to become the input to another, as yet unknown, program."
- There are some key differences between Kafka topics and Unix pipes: a topic can have any number of consumers that do not interfere with each other, it tolerates failure of producers, consumers or brokers, and a topic is a named entity that can be used for tracing data provenance. Kafka topics deliberately do not provide backpressure: the on-disk log acts as an almost-unbounded buffer of messages.
- The log-oriented model of Kafka and Samza is fundamentally built on the idea of composing heterogeneous systems through the uniform interface of a replicated, partitioned log. Individual systems for data storage and processing are encouraged to do one thing well, and to use logs as input and output. Even though Kafka's logs are not the same as Unix pipes, they encourage composability, and therefore Unix-style thinking.
Related links
Further reading on this is Jay Kreps' excellent blog post on logs.

Apache BookKeeper and Hedwig are good alternatives to Kafka.
These days, there is also DistributedLog.