

DynamoDB as an Event Store

I’ve been wondering how well Amazon DynamoDB would fit an event store implementation. There were two different designs I wanted to explore; both are described in this article, and I implemented one of them. The source code is available on GitHub, including a NuGet package feed on nuget.org.

Event Store Basics

An event store is a storage concept that stores events in chronological order. These events can describe business-critical changes within a domain aggregate. Besides storing the events of the aggregate, the aggregate state can be stored as a snapshot to avoid reading all events each time the aggregate needs to be rebuilt. This can boost performance both in terms of latency and the amount of data that needs to be transferred.
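As a loose sketch in C#, the moving parts could look something like this (all type names here are invented for illustration, not taken from the actual implementation):

// Events describe business-critical changes within an aggregate.
public interface IEvent { }

public sealed record AccountOpened(string Owner) : IEvent;
public sealed record MoneyDeposited(decimal Amount) : IEvent;

// A snapshot caches the state rebuilt from the events, tagged with the
// version (number of events applied) so a reader knows where to resume.
public sealed record Snapshot<TState>(TState State, long Version);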

DynamoDB

DynamoDB is a large-scale distributed NoSQL database that can handle millions of requests per second. It stores structured documents grouped into partitions, where items sharing a partition key are kept in sorted order. This has obvious similarities to how an event stream for an aggregate is laid out. Seems promising!

Another neat feature is change streaming, through either DynamoDB Streams or Kinesis Data Streams for DynamoDB, both of which can stream changes in a table to various other AWS services and clients. No need to implement an outbox and integrate with a separate message broker. Add point-in-time recovery and it is possible to stream the whole event store at any time!

Separated Snapshot and Events

Let’s start with the first design that uses composite keys to store snapshots and events grouped by aggregate.

The events are grouped into commits to create an atomic unit with consistency guarantees without using transactions. Commits use a monotonically increasing id as sort key, while the snapshot uses zero. Since sort keys determine the order in which items are stored, the commits are ordered chronologically with the snapshot leading, meaning the commits can be fetched with a range query while the snapshot can be fetched separately. It makes little sense to fetch them all at once, even though that would also be possible. The snapshot includes the sort key of the last commit it represents, in order to know where to start querying for events that have not yet been applied to the snapshot.
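As a rough sketch using the low-level AWS SDK for .NET (the table, key, and attribute names are my assumptions, not necessarily those of the actual implementation), reading an aggregate could look like this:

using Amazon.DynamoDBv2;
using Amazon.DynamoDBv2.Model;

var client = new AmazonDynamoDBClient();

// 1. Fetch the snapshot, which lives at sort key 0.
var snapshot = await client.GetItemAsync(new GetItemRequest
{
    TableName = "EventStore",
    Key = new Dictionary<string, AttributeValue>
    {
        ["PK"] = new AttributeValue { S = "order-42" }, // aggregate id
        ["SK"] = new AttributeValue { N = "0" },        // snapshot slot
    },
});

// 2. Range query for the commits the snapshot does not yet cover.
var version = snapshot.Item.TryGetValue("version", out var v) ? v.N : "0";
var commits = await client.QueryAsync(new QueryRequest
{
    TableName = "EventStore",
    KeyConditionExpression = "PK = :pk AND SK > :after",
    ExpressionAttributeValues = new Dictionary<string, AttributeValue>
    {
        [":pk"] = new AttributeValue { S = "order-42" },
        [":after"] = new AttributeValue { N = version },
    },
});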

Do note that DynamoDB has an item size limit of 400 KB, which should be more than enough to represent both commits and snapshots, but as both are stored as opaque binary data they can be compressed. Besides lowering the size of the items and the round-trip latency, this can also lower read and write cost.
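Gzipping the opaque payload before writing it is straightforward; a generic sketch, not the library’s actual serialization code:

using System.IO;
using System.IO.Compression;

static byte[] Compress(byte[] payload)
{
    using var output = new MemoryStream();
    // Compress the commit/snapshot payload before storing it as a binary
    // attribute; smaller items mean cheaper reads and writes.
    using (var gzip = new GZipStream(output, CompressionLevel.Optimal))
    {
        gzip.Write(payload, 0, payload.Length);
    }
    return output.ToArray();
}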

When storing a commit the sort key is monotonically increased. This is predictable and can therefore be used in a condition that provides both optimistic concurrency and idempotency, preventing inconsistency when multiple competing writers store events at the same time or a network error occurs during a write operation.

"ConditionExpression": "attribute_not_exists(PK)"

A snapshot can be stored once it becomes smaller than the accumulated non-snapshotted commits, to save cost and lower latency. As snapshots are detached from the event stream, it doesn’t matter whether storing one succeeds or fails after a commit; if it fails, the write operation can be re-run at any time. Updating a snapshot retains optimistic concurrency and idempotency guarantees by only writing if the version the new snapshot points to is higher than that of the currently stored snapshot, or if the version attribute is missing altogether, which means no snapshot exists.

"ConditionExpression": "attribute_not_exists(version) OR version < :version"

More about conditional writes can be found in the DynamoDB documentation.

This was the solution I chose to implement!

Interleaving Snapshots and Events

This was an alternative I wanted to try out, interleaving snapshots and events in one continuous stream.

The idea was to require only a single request to fetch both the snapshot and the trailing, non-snapshotted events, lowering the number of round trips to DynamoDB and increasing possible throughput. Reading commits and the latest snapshot would be done by reading in reverse chronological order until a snapshot is found.
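A single reversed query could then fetch the latest snapshot and everything after it in one round trip. A sketch, again with assumed names, and with a page size that is really just a guess, as discussed below:

// Read the partition newest-first and stop at the first snapshot item.
var page = await client.QueryAsync(new QueryRequest
{
    TableName = "EventStore",
    KeyConditionExpression = "PK = :pk",
    ExpressionAttributeValues = new Dictionary<string, AttributeValue>
    {
        [":pk"] = new AttributeValue { S = "order-42" },
    },
    ScanIndexForward = false, // reversed chronological order
    Limit = 11,               // snapshot interval + 1, if one even exists
});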

This, however, presents a problem. If a snapshot is stored after, say, every 10th commit, 11 items have to be queried to avoid multiple round trips, even though the first item could be a snapshot, making the 10 other items redundant. Furthermore, there are no guarantees about when a snapshot gets written, hence there is no way to know up front exactly how many items to read to reach the latest snapshot.

Another problem is that all snapshots have to be read when reading the whole event stream.

Conclusion

DynamoDB turns out to be quite a good candidate for the job as a persistence engine for an event store: it supports the design of an ordered list of events including snapshots, and it has the capabilities to stream the events to other services. Replaying a single aggregate can be done with a simple range query, and its schemaless design makes it easy to store both commits and snapshots in the same table. Its distributed nature enables almost limitless scalability, and the fact that it is a managed service makes operating it a breeze.

Very nice indeed!


Microservices – It’s All About Tradeoffs

Everybody has probably heard the much-hyped word “Microservices”; the architecture that solves just about everything, as computers aren’t getting any faster (kind of) and the need for scaling and distribution grows more important as more and more people use the internet.

Don’t get me wrong, I love microservices! However, it is important to know that as with most stuff in the world, everything has tradeoffs.

Lessons to be Learned

I’ve been developing systems using microservices for quite a few years now, and there are a lot of lessons learnt (and still lessons to be learned). I saw this presentation by Matt Ranney from Uber last year, where he talks about the almost ridiculous number of services Uber has, and gives insight into all the problems that come with communicating between all these independent and loosely coupled services. If you have ever developed asynchronous applications, you probably know what kind of complexity they can generate and how hard it can be to understand how everything fits together. With microservices, this can be even harder.

The World of Computer Systems is Changing

I recognize many of the insights he shares from my own experience of building microservices. I recently did some development using akka.net and had similar insights, but on a whole new level: microservices within microservices. I won’t jump off to that now; maybe I’ll share those thoughts on another occasion. However, microservice architectures are getting more and more important today. One reason is the stagnation of hardware speed, with CPUs changing from the traditional single core, where the clock frequency increased between models, to today’s designs where the cores are multiplied without the frequency increasing. Another is that it gives you freedom as a developer when facing huge applications and organisations. And then there is this thing called zero downtime. You might have heard of it. Everything has to work all the time.

Synchronisation

While I do tend to agree with most of what Matt says, I don’t agree with the “blocked by other teams” statement. If you get “blocked” as a team, you are doing something wrong, especially if you are supposed to be microservice oriented.

Blocked tends to mean you need some sort of synchronisation of information, and that you cannot continue before you have it. While synchronisation between systems must occur at some point, it does not mean that you cannot develop and release code that cannot be fully utilized until other systems have been developed and deployed. Remember agile, moving fast, autonomy, and everything having to work all the time? The synchronisation part is all about the mutual understanding of the contract between the services. When you have that, it’s just a matter of using different techniques to do parallel development without ever being blocked, like feature toggling. There needs to be a “we are finished, let’s integrate” moment at some point, but it’s not a blockage per se. It’s all about continuation, and it is even more important with microservices, as integration gets more complex with more services and functionality being developed in parallel.

Context

Systems developed as microservices also face the problem of process context, or the start-and-end problem. As a user doing something, you usually see that action from a bubble perspective: you update a piece of information and expect to see a result from that action. But with microservice-based systems there might be a lot of things going on at the same time during that action, in many systems. The action does not necessarily have the cause you think, and the context gets chopped into smaller pieces as it now spans multiple systems.

This leads to the distribution problem. How do you visualize and explain what’s happening when a lot of things are happening at roughly the same time in different places? People tend to rely on synchronisation to explain things, but synchronisation is really hard to achieve if you do not have all the context at the same time and place, which is next to impossible when it comes to parallelism, distribution, and scaling; asynchronicity is something you often get, and want to have, with microservices. You might want to rethink the way you perceive the systems you are working with. Do you really need synchronisation, and why? It’s probably not an easy thing to just move away from, as it is deeply rooted in many people’s minds as a way to simplify the actions happening. But things are seldom synchronous in the world, and as computers and systems become more distributed, it will make less and less sense keeping it that way.

REST

I also think the overhyped use of the word REST might add to the problem. REST implies that model state is important, but many microservices are not built around the concept of states and models; they continuously change things. Transitions are really hard to represent as states. I’m not talking about what something transitioned TO, but how to visualize the transition or the causation, the relationship between cause and effect. Sometimes functionality is best represented as functions, not as the states before and after. The days of RPC services are back. Why not represent an API as a stream of events? Stateless APIs! It’s all about watching things happen. Using commands and events can be a powerful thing.
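Loosely sketched, such an API deals in commands and events rather than resource state (all names here are invented):

// A command expresses intent; an event records what actually happened.
public sealed record ChargeCard(Guid OrderId, decimal Amount);
public sealed record CardCharged(Guid OrderId, decimal Amount);

// Consumers watch things happen instead of polling a resource for state.
public interface IEventStream
{
    IAsyncEnumerable<object> Subscribe(CancellationToken cancellationToken);
}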

Anyhow, microservices are great, but they might get you into situations where you feel confused as you try to apply things that used to work great but no longer seem to fit. By rethinking the way you perceive computer systems you will soon find new ways, and that might give you great new possibilities. Dare to try new angles, and remember that it’s all about tradeoffs. Simulating the world exactly as it is might not be feasible with a computer system, but still, treating it as a bunch of state models within a single process might not be the right way either.
