
Pulsar – The New Kafka?

I’ve been following Apache Pulsar for a while now, reading whatever articles I come across. I thought I should write a short summary of my thoughts on this “new”, exciting, distributed message streaming platform.

Kafka

I have gained experience working with Apache Kafka for some time, and it has a lot in common with Pulsar. It has been around for a couple more years, and I think it is a good message streaming platform. Earlier this year I wrote a protocol definition of the Kafka protocol together with an in-memory test framework, including an example of how you can test applications that use, for example, Confluent’s .NET Kafka client. You can read all about it here.

Pulsar vs Kafka

Last week I came across a couple of articles by Confluent (managed cloud-native platform for Kafka) and StreamNative (managed cloud-native platform for Pulsar) comparing the two platforms’ performance, which I think are really interesting (though maybe a bit biased).

Spoiler alert! Both are best! 😉

You can study them here and here.

Performance

What I found really interesting is the high throughput and the almost constant low latency Pulsar seems to achieve regardless of the number of partitions used, while still being able to guarantee really good durability (consistency and availability). If you know about the CAP theorem, you would almost think this is too good to be true!

Different Storage Strategies

Kafka stores messages in partitions, while Pulsar adds another storage layer called segments that make up a partition. This might partly explain why Pulsar can perform better in some scenarios.

[Figure: Kafka’s partition-based vs. Pulsar’s segment-based storage. Credit: Splunk]

In Kafka, as a partition grows, more data needs to be copied when brokers leave or join the cluster. In Pulsar, segments can be formed in smaller, static sizes. An important observation here is that a partition can have an infinite number of segments spread out over the Apache BookKeeper storage nodes, also known as bookies. Therefore, partitions in Pulsar can hold an infinite number of messages, while partitions in Kafka are bound by the hardware they are stored on. Replicating a partition becomes slower and slower in Kafka, but in Pulsar it could theoretically stay almost constant thanks to the possibility of scaling storage infinitely.

However, brokers in Pulsar potentially need to talk to many storage nodes to serve a partition, while a Kafka broker has the whole partition stored directly on its own disk. This could cause Pulsar brokers to end up handling a lot of connections to the storage nodes.
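To make the storage difference concrete, here is a minimal sketch (my own illustration, not actual Kafka or Pulsar code) of the two models:

```csharp
using System.Collections.Generic;

// Illustration only - not actual Kafka or Pulsar code.
// Kafka: a partition is one contiguous log that lives (and is
// replicated) as a whole on a broker's disk.
class KafkaPartition
{
    public List<byte[]> Log { get; } = new();

    public void Append(byte[] message) => Log.Add(message);
}

// Pulsar: a partition is a chain of fixed-size segments (ledgers),
// where each sealed segment can be placed on any set of BookKeeper
// nodes (bookies), independently of the others.
class PulsarPartition
{
    private const int SegmentSize = 4; // tiny, for illustration

    public List<List<byte[]>> Segments { get; } = new() { new() };

    public void Append(byte[] message)
    {
        var current = Segments[^1];
        if (current.Count == SegmentSize)
        {
            // Seal the full segment and start a new one. A sealed
            // segment never changes again, so it can be copied or
            // offloaded without touching the rest of the partition.
            current = new List<byte[]>();
            Segments.Add(current);
        }
        current.Add(message);
    }
}
```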

Consumer Group / Subscription Rebalancing

Kafka has made some improvements to its partition rebalancing strategy lately, like Static Membership and Incremental Cooperative Rebalancing, which have sped up the consumer group rebalancing act.

In Pulsar, consumers use hash ranges to share the load when consuming from a shared subscription. The broker handles the rebalancing, and as soon as a consumer leaves or joins, no messages are delivered to any consumer until the hash ranges have been redistributed, either by the broker when using auto hash distribution, or by the consumers when using sticky hash ranges. This might cause downtime and latency due to the stop-the-world strategy used when redistributing the hashes. This was the case in Kafka as well before it was mitigated by Incremental Cooperative Rebalancing.
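As a rough sketch of the idea (my own simplification, not Pulsar’s implementation), evenly splitting a hash space across the current consumers could look like this:

```csharp
using System.Collections.Generic;

// Illustration of even hash-range distribution across consumers.
// Pulsar's actual implementation differs; this only shows the principle.
static Dictionary<string, (int Start, int End)> DistributeRanges(
    IReadOnlyList<string> consumers, int hashSpace = 65536)
{
    var ranges = new Dictionary<string, (int, int)>();
    var size = hashSpace / consumers.Count;
    for (var i = 0; i < consumers.Count; i++)
    {
        var start = i * size;
        // The last consumer takes any remainder of the hash space.
        var end = i == consumers.Count - 1 ? hashSpace - 1 : start + size - 1;
        ranges[consumers[i]] = (start, end);
    }
    return ranges;
}

// c1: 0-32767, c2: 32768-65535
var before = DistributeRanges(new[] { "c1", "c2" });
// All three ranges change at once, so delivery pauses while they
// are redistributed - the stop-the-world moment described above.
var after = DistributeRanges(new[] { "c1", "c2", "c3" });
```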

Broker Rebalancing

Kafka does not have any automated broker rebalancing built in. Instead, users depend on tools like LinkedIn’s Cruise Control. Since Kafka stores topic partition data and replicas directly on the brokers, this data needs to be copied and rebalanced when adding a new broker.

Pulsar’s architecture, on the other hand, which separates computation (brokers) from storage (bookies), enables almost instant rebalancing, since brokers can just switch which storage nodes they read from and write to. Apache ZooKeeper is in charge of monitoring broker health and initiates recovery when a broker is deemed lost, causing other brokers to take over ownership of the lost broker’s topics. Any split-brain scenarios between brokers are handled by BookKeeper through a fencing mechanism, allowing only one broker at a time to write to a topic’s ledgers.

[Figure: Pulsar’s separation of brokers and BookKeeper storage. Credit: Jack Vanlightly]

Metadata Storage

Both Pulsar and Kafka use Apache ZooKeeper as their metadata store. The Kafka team announced a while back that they are dropping ZooKeeper in favor of bringing metadata into Kafka itself, for less complexity, better scalability, and more efficient replication and bootstrapping. It would be interesting to know if there has been a similar discussion around ZooKeeper within the Pulsar project, and what potential performance gains scrapping ZooKeeper might give Kafka.

As a side note, ZooKeeper is a central part of BookKeeper as well, so dropping it altogether would probably prove very difficult. If ZooKeeper goes down, Pulsar goes down.

Conclusion

Comparing two seemingly similar distributed platforms can be complex. Even with similar test setups, the results can differ a lot when just a few knobs are tweaked. I think this quote from one of StreamNative’s articles sums it up pretty well:

– “Ultimately, no benchmark can replace testing done on your own hardware with your own workloads. We encourage you to evaluate additional variables and scenarios and to test using your own setups and environments.”

Understanding the impact infrastructure and architecture have on systems and applications, together with knowledge of observability and operations, has become more important than ever.


Integration Testing with Kafka

Does the heading sound familiar? Two years ago I released an integration testing framework for the AMQP protocol.

Now it’s time for something similar. Say hello to Kafka.TestFramework!

Kafka

Kafka was originally developed by LinkedIn and open-sourced in 2011. It is a distributed streaming platform that uses pub/sub to distribute messages to topics in the form of a partitioned log. It is built to scale while maintaining performance no matter how much data you throw at it.

The Protocol

Kafka’s protocol is built on a request/response model. The client sends a request, and the broker sends back a response. Versioning is built into the messages, where each and every property is defined with a version range. This makes versioning quite easy to handle, and it is possible to build one platform on top of all versions. Property types are defined by a mix of simple data types and variable-length zig-zag encoded types borrowed from Google Protocol Buffers.
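The zig-zag encoding itself is simple. Here is a sketch in C# (following Protocol Buffers’ scheme, which Kafka reuses for its variable-length types):

```csharp
using System.Collections.Generic;

// Zig-zag maps signed integers to unsigned ones so that small magnitudes,
// positive or negative, become small varints: 0, -1, 1, -2, 2 -> 0, 1, 2, 3, 4.
static uint ZigZagEncode(int value) => (uint)((value << 1) ^ (value >> 31));

static int ZigZagDecode(uint value) => (int)(value >> 1) ^ -(int)(value & 1);

// The encoded value is then written as a varint: seven bits per byte,
// with the high bit set as long as more bytes follow.
static IEnumerable<byte> ToVarInt(uint value)
{
    while (value >= 0x80)
    {
        yield return (byte)(value | 0x80);
        value >>= 7;
    }
    yield return (byte)value;
}
```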

All message definitions can be found on GitHub, and the protocol is explained in detail in the Kafka protocol guide. Both of these resources make up the core of the Kafka.Protocol project.

Kafka.Protocol

The testing framework is built upon protocol definitions. Kafka.Protocol auto-generates all protocol messages, primitives and serializers into typed C# classes. The library makes it possible to write client and server implementations of the Kafka protocol using only .NET. This makes it easier to debug and understand the protocol from a C# point of view and removes any dependencies on libraries like librdkafka, with their interop performance implications.

Kafka.TestFramework

Kafka.TestFramework aims to make it possible to test integrations with Kafka clients. It acts as a server and exposes an API that makes it possible to subscribe to requests and send responses to clients. It can run in isolation in-memory or hook up to a TCP socket, and it is based on the protocol definitions found in Kafka.Protocol.

Both Kafka.TestFramework and Kafka.Protocol are built on top of System.IO.Pipelines, a high-performance streaming library that greatly simplifies handling of data buffers and memory management. I’m not going to go into details on how it works; there are already a lot of good articles about the library. Btw, did I say it’s FAST? 🙂

To see some examples of how the framework can be utilized, check out the tests, where it communicates with Confluent’s .NET client for Kafka!
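For reference, the kind of client code such tests exercise looks roughly like this (using Confluent’s .NET client; the broker address and topic name are placeholders):

```csharp
using System;
using System.Threading.Tasks;
using Confluent.Kafka;

// Placeholder values - in a test, the test framework stands in for the broker.
var config = new ProducerConfig { BootstrapServers = "localhost:9092" };

using var producer = new ProducerBuilder<Null, string>(config).Build();

var result = await producer.ProduceAsync(
    "my-topic",
    new Message<Null, string> { Value = "hello" });

Console.WriteLine($"Delivered to {result.TopicPartitionOffset}");
```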

The Future

The Kafka protocol is quite complex, especially when combined with the distributed nature of Kafka. Many of its processes can be hard to understand and might be perceived as too low-level. These processes could probably be packaged in a simpler way, making it possible to focus on the high-level messages of the business domain.

Yet, at the same time, it is powerful to be able to simulate different cluster behaviours that might be hard to replicate with high-level clients. Being able to get both would be nice.

Test it out, let me know what you think!


Distributed Tracing with OpenTracing and Kafka

OpenTracing has become a standardized way to implement distributed tracing within a microservices architecture. It has been adopted by many libraries in all kinds of programming languages. I was missing a .NET library for creating distributed traces between Kafka clients, so I decided to write one!

OpenTracing.Confluent.Kafka

OpenTracing.Confluent.Kafka is a library built against .NET Standard 1.3. With the help of some extension methods and decorators for the Confluent.Kafka consumer and producer, it makes distributed tracing between the producer and the consumer quite simple.
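Without going into the library’s exact API, the pattern the decorators wrap is roughly this: inject the active span’s context into the Kafka message headers when producing, and extract it again when consuming. A minimal sketch using the OpenTracing and Confluent.Kafka packages (the helper names are mine):

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text;
using Confluent.Kafka;
using OpenTracing;
using OpenTracing.Propagation;

// Sketch of the propagation pattern the decorators wrap - not the library's API.
static void InjectSpanContext(ITracer tracer, ISpan span, Headers headers)
{
    // Serialize the span context to key/value pairs and copy them
    // into the outgoing Kafka message headers.
    var carrier = new Dictionary<string, string>();
    tracer.Inject(span.Context, BuiltinFormats.TextMap, new TextMapInjectAdapter(carrier));
    foreach (var pair in carrier)
        headers.Add(pair.Key, Encoding.UTF8.GetBytes(pair.Value));
}

static ISpanContext ExtractSpanContext(ITracer tracer, Headers headers)
{
    // Read the headers back into key/value pairs and let the tracer
    // reconstruct the parent span context on the consuming side.
    var carrier = headers.ToDictionary(
        header => header.Key,
        header => Encoding.UTF8.GetString(header.GetValueBytes()));
    return tracer.Extract(BuiltinFormats.TextMap, new TextMapExtractAdapter(carrier));
}
```

On the consuming side, the extracted context is then used as the parent when starting the consumer’s span (for example via `tracer.BuildSpan(...).AsChildOf(...)`), which is what links the two spans into one trace.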

Example

There is a simple console app that serves as an example of how to use the decorators to create a trace with two spans, one for the producing side and one for the consuming side. It’s built on .NET Core, so you can run it wherever you like. It integrates with Kafka and Jaeger, so to set up these systems I have provided a simple script and some Docker and Kubernetes YAML files for a one-click setup. If you don’t run under Windows, you need to set everything up manually; the script should be able to act as an indication of what needs to be done.

The result when running the app should look something like the image below.

Happy tracing!
