
Month: November 2020

Pulsar – The New Kafka?

I’ve been following Apache Pulsar for a while now, reading whatever articles I come by. I thought I should write a short summary of my thoughts about this “new”, exciting, distributed message streaming platform.

Kafka

I have gained experience working with Apache Kafka for some time, and it does have a lot in common with Pulsar. It has been around a couple more years, and I think it is a good message streaming platform. Earlier this year I wrote a protocol definition of the Kafka protocol together with an in-memory test framework, including an example of how you can test applications that use, for example, Confluent’s .NET Kafka client. You can read all about it here.

Pulsar vs Kafka

Last week I came across a couple of articles by Confluent (managed cloud-native platform for Kafka) and StreamNative (managed cloud-native platform for Pulsar) comparing the two platforms’ performance, which I think are really interesting (though maybe a bit biased).

Spoiler alert! Both are best! 😉

You can study them here and here.

Performance

What I found really interesting is the high throughput and the almost constant low latency Pulsar seems to achieve regardless of the number of partitions used, while still guaranteeing really good durability (consistency and availability). If you know about the CAP theorem, you would almost think this is too good to be true!

Different Storage Strategies

Kafka stores messages in partitions, while Pulsar has another storage level called segments that make up a partition. This might partly explain why Pulsar can perform better in some scenarios.

Credit: Splunk

In Kafka, as a partition grows, more data needs to be copied when brokers leave or join the cluster. In Pulsar, segments are created in smaller, fixed sizes. An important observation here is that partitions can have an infinite number of segments spread out over the Apache BookKeeper storage nodes, also known as Bookies. Therefore, partitions in Pulsar can hold an infinite number of messages, while in Kafka, a partition is bound by the hardware it is stored on. Replicating a partition will become slower and slower in Kafka, but in Pulsar it could theoretically stay almost constant thanks to the possibility of scaling storage infinitely.
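To make the segment idea concrete, here is a minimal Python toy model (not real Pulsar or BookKeeper code; the segment size is an arbitrary illustration value) of a partition that rolls over into fixed-size segments, so storage grows by adding segments rather than by growing one monolithic log:

```python
SEGMENT_SIZE = 3  # messages per segment; tiny on purpose, for illustration

class SegmentedPartition:
    """Toy model: a partition stored as a list of fixed-size segments."""

    def __init__(self):
        self.segments = [[]]  # each segment is a list of messages

    def append(self, message):
        if len(self.segments[-1]) >= SEGMENT_SIZE:
            self.segments.append([])  # roll over to a fresh segment
        self.segments[-1].append(message)

    def read(self, offset):
        # Locate the segment that holds this offset, then index into it.
        return self.segments[offset // SEGMENT_SIZE][offset % SEGMENT_SIZE]

p = SegmentedPartition()
for i in range(7):
    p.append(f"msg-{i}")

print(len(p.segments))  # 7 messages in segments of 3 -> 3 segments
print(p.read(4))        # prints "msg-4"
```

In a real deployment, each of these segments could live on a different Bookie, which is why a single partition is not bound to one machine’s disk.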

However, a Pulsar broker may need to talk to many storage nodes to serve a single partition, while a Kafka broker has the whole partition stored directly on its own disk. This could leave Pulsar brokers handling a lot of connections to the storage nodes.

Consumer Group / Subscription Rebalancing

Kafka has made some improvements to its partition rebalancing strategy lately, like Static Membership and Incremental Cooperative Rebalancing, which have sped up consumer group rebalancing.
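Both features are opt-in on the client side. As a reference, a minimal configuration sketch for Confluent’s Python Kafka client (the broker address, group id and instance id below are placeholder values):

```python
# Consumer configuration enabling both rebalancing improvements,
# in confluent-kafka (librdkafka) property style.
consumer_config = {
    "bootstrap.servers": "localhost:9092",     # placeholder broker address
    "group.id": "my-consumer-group",
    # Static Membership (KIP-345): a stable instance id lets a consumer
    # restart without immediately triggering a group rebalance.
    "group.instance.id": "consumer-1",
    # Incremental Cooperative Rebalancing (KIP-429): only the partitions
    # that actually move are revoked, instead of stopping the world.
    "partition.assignment.strategy": "cooperative-sticky",
}
```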

In Pulsar, consumers use hash ranges to share the load when consuming from a shared subscription. The broker handles the rebalancing, and as soon as a consumer leaves or joins, no messages are delivered to any consumer until the hash ranges have been redistributed, either by the broker when using auto hash distribution, or by the consumers when using sticky hash ranges. This might cause downtime and latency due to the stop-the-world strategy used when redistributing the hashes. This was the case in Kafka as well before it was mitigated by Incremental Cooperative Rebalancing.
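The hash-range idea can be sketched in a few lines of Python. This is a toy model, not Pulsar’s actual assignment algorithm: it splits a 0–65535 hash space (the range Pulsar uses for key-based subscriptions) evenly among the current consumers and routes each key to the consumer owning its hash:

```python
import zlib

HASH_RANGE = 65536  # hash space 0..65535, split among consumers

def assign_ranges(consumers):
    """Split the hash space into one contiguous [lo, hi) range per consumer."""
    step = HASH_RANGE // len(consumers)
    return {
        c: (i * step, HASH_RANGE if i == len(consumers) - 1 else (i + 1) * step)
        for i, c in enumerate(consumers)
    }

def route(key, ranges):
    """Pick the consumer whose range contains the key's hash."""
    h = zlib.crc32(key.encode()) % HASH_RANGE
    for consumer, (lo, hi) in ranges.items():
        if lo <= h < hi:
            return consumer

ranges = assign_ranges(["c1", "c2", "c3"])
owner = route("order-42", ranges)

# When a consumer leaves, ALL ranges are recomputed in one step --
# the stop-the-world moment described above: nothing is delivered
# until the new ranges are in place.
ranges = assign_ranges(["c1", "c2"])
new_owner = route("order-42", ranges)
```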

Broker Rebalancing

Kafka does not have any automated broker rebalancing built in. Instead, users depend on external tools like LinkedIn’s Cruise Control. Since Kafka stores the topic partition data and replicas directly on the brokers, this data needs to be copied and rebalanced when a new broker is added.

Pulsar’s architecture, on the other hand, separates computation (brokers) from storage (partitions), which enables almost instant rebalancing since brokers can simply switch which storage nodes they read from or write to. Apache ZooKeeper is in charge of monitoring broker health and initiates recovery when a broker is deemed lost, causing other brokers to take over ownership of the lost broker’s topics. Any split-brain scenarios between the brokers are handled by BookKeeper through a fencing mechanism, which allows only one broker at a time to write to a topic’s ledgers.
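The fencing idea can be illustrated with a tiny epoch-based toy model. This is not BookKeeper’s real protocol, just the general shape of the mechanism: the storage side remembers the highest epoch that has claimed the ledger, and appends from any older writer are rejected, so a deposed broker cannot keep writing after a takeover:

```python
# Toy model of fencing (not BookKeeper's actual API): appends carry the
# writer's epoch; the ledger rejects writes from any epoch older than
# the latest one that fenced it.

class Ledger:
    def __init__(self):
        self.epoch = 0
        self.entries = []

    def fence(self, new_epoch):
        # A broker taking over ownership bumps the epoch before writing.
        self.epoch = max(self.epoch, new_epoch)

    def append(self, entry, writer_epoch):
        if writer_epoch < self.epoch:
            return False  # stale writer: fenced out
        self.entries.append(entry)
        return True

ledger = Ledger()
assert ledger.append("a", writer_epoch=0)      # original owner writes fine

ledger.fence(new_epoch=1)                      # new broker takes ownership
assert not ledger.append("b", writer_epoch=0)  # old owner is now rejected
assert ledger.append("c", writer_epoch=1)      # new owner succeeds
```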


Credits: Jack Vanlightly

Metadata Storage

Both Pulsar and Kafka use Apache ZooKeeper as their metadata management storage. The Kafka team announced a while back that they are dropping ZooKeeper in favor of bringing metadata into Kafka itself, for reduced complexity, simpler replication, better scalability, and faster bootstrap. It would be interesting to know whether there has been a similar discussion around ZooKeeper within the Pulsar project, and what performance gains scrapping ZooKeeper might give Kafka.

As a side note, ZooKeeper is a central part of BookKeeper as well, so dropping it altogether would probably prove very difficult. If ZooKeeper goes down, Pulsar goes down.

Conclusion

Comparing two seemingly similar distributed platforms can be complex. Even with similar test setups, tweaking a few knobs can change the results a lot. I think this quote from one of StreamNative’s articles sums it up pretty well:

– “Ultimately, no benchmark can replace testing done on your own hardware with your own workloads. We encourage you to evaluate additional variables and scenarios and to test using your own setups and environments.”

Understanding the impact infrastructure and architecture have on systems and applications, along with observability and operations, has become more important than ever.
