@DanLebrero.

software, simply

Kafka, GDPR and Event Sourcing

Proof of concept for compliance with GDPR on an Event Sourcing architecture built with Apache Kafka.

Image attribution: Modified from "He Wasn't This Angry" by Allison Mickel is licensed under CC BY-NC-SA

You probably already know that the EU has approved this nice piece of legislation called GDPR (General Data Protection Regulation) that gives us back some control over our personal data.

From a technical point of view, if you have bought into Event Sourcing and Kafka, GDPR’s “right to erasure” (a.k.a. forget everything that you know about me) is of special interest, as it is at odds with the idea of an immutable event log that does not forget anything.

To handle GDPR in an event sourced architecture, here are the most interesting options:

  1. Removing data from projections might be good enough. A suggestion from Michiel Rook’s blog is that it may be enough to remove the data from the projections/read models, with no need to touch the data in the event store. If this option is within the law, the “right to erasure” becomes just another event that projections need to handle. A perfect fit for Event Sourcing.

  2. Deleting/updating Kafka messages: Ben Stopford reminds us that in Kafka you can “delete” and “update” messages if you are using a compacted topic. To comply with the “right to erasure”, we need to find all the events for a user and, for each one, send a new message with the same key (the event id) and a null (or updated) payload.

    The main concern with this approach is that the event store is no longer immutable, so it will be very tempting to use the same loophole in other non-GDPR situations.

  3. Encryption: Another suggestion from Michiel’s blog is to encrypt all the messages for a particular user with a key, and when the user wants to exercise their “right to erasure”, we just need to forget the encryption key.

    The issue with this approach is the key management. In Michiel’s words: “storing, finding and retrieving the right encryption key … becomes especially interesting at scale”. And because it is interesting, let’s dive into a possible solution.
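Before diving in, the core idea of option 3 can be sketched in a few lines. This is only an illustration, not the implementation behind this post: the class and method names are made up, the keys live in an in-memory map, and AES-GCM is an assumed choice of cipher.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

// Hypothetical sketch: one AES key per user, dropped on "forget me".
class CryptoShredding {
    private static final int IV_BYTES = 12;   // usual nonce size for GCM
    private static final int TAG_BITS = 128;  // authentication tag length

    // Assumption: an in-memory map stands in for the real key store.
    private final Map<String, SecretKey> keysByUser = new HashMap<>();
    private final SecureRandom random = new SecureRandom();

    // Encrypts the payload with the user's key, creating the key on first use.
    byte[] encrypt(String userId, String payload) {
        try {
            SecretKey key = keysByUser.computeIfAbsent(userId, id -> newKey());
            byte[] iv = new byte[IV_BYTES];
            random.nextBytes(iv);
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, iv));
            byte[] ciphertext = cipher.doFinal(payload.getBytes(StandardCharsets.UTF_8));
            // Store the nonce in front of the ciphertext so decrypt can find it.
            return ByteBuffer.allocate(iv.length + ciphertext.length)
                    .put(iv).put(ciphertext).array();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns the plaintext, or empty if the user's key has been forgotten.
    Optional<String> decrypt(String userId, byte[] message) {
        SecretKey key = keysByUser.get(userId);
        if (key == null) {
            return Optional.empty();
        }
        try {
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.DECRYPT_MODE, key,
                    new GCMParameterSpec(TAG_BITS, message, 0, IV_BYTES));
            byte[] plain = cipher.doFinal(message, IV_BYTES, message.length - IV_BYTES);
            return Optional.of(new String(plain, StandardCharsets.UTF_8));
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // "Right to erasure": without the key, the stored messages stay unreadable.
    void forget(String userId) {
        keysByUser.remove(userId);
    }

    private static SecretKey newKey() {
        try {
            KeyGenerator generator = KeyGenerator.getInstance("AES");
            generator.init(256);
            return generator.generateKey();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

After forget("user-1"), decrypt returns an empty Optional for every message of that user, even though the encrypted bytes themselves are never touched. The hard part, as Michiel says, is where that map of keys lives once you have more than one machine.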

Highly available, highly scalable RESTful KeyManagement service

Synchronous HTTPS? Seriously?

The Kafka way

Assuming that you are already storing your data in Kafka, and given that Kafka is able to handle data at scale, why not use Kafka itself to store and retrieve the encryption keys?

Let’s start with a picture of what our architecture could look like:

Kafka GDPR encryption architecture

Your Event Producer is your regular service that pushes unencrypted data to some To-Encrypt topic.

To comply with GDPR, this topic will have a reasonably short time-based retention policy, so that Kafka deletes the data after that time. Remember, though, that the retention period should be longer than the expected downtime of the Encryptor service: if the Encryptor service is down for longer, Kafka may delete the data before it is encrypted and safely stored in the Encrypted-Data topic.
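That trade-off can be written down as a small check. This is a sketch under assumptions: the class name, the doubling factor and the maxPlaintextAge parameter are all hypothetical, while cleanup.policy and retention.ms are regular Kafka topic-level settings.

```java
import java.time.Duration;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper that sizes the retention of the To-Encrypt topic.
class ToEncryptTopicConfig {

    // worstCaseEncryptorDowntime: longest outage of the Encryptor we plan for.
    // maxPlaintextAge: longest time we accept keeping unencrypted data around.
    static Map<String, String> forDowntime(Duration worstCaseEncryptorDowntime,
                                           Duration maxPlaintextAge) {
        // Arbitrary headroom: retain for twice the worst outage we expect.
        Duration retention = worstCaseEncryptorDowntime.multipliedBy(2);
        if (retention.compareTo(maxPlaintextAge) > 0) {
            throw new IllegalArgumentException(
                    "retention " + retention + " exceeds the limit " + maxPlaintextAge);
        }
        Map<String, String> config = new LinkedHashMap<>();
        config.put("cleanup.policy", "delete");  // time-based deletion, no compaction
        config.put("retention.ms", Long.toString(retention.toMillis()));
        return config;
    }
}
```

For example, planning for a 6 hour outage gives a retention.ms of 43200000 (12 hours). Keep in mind that Kafka deletes whole log segments, so a record can survive somewhat longer than retention.ms.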

The Encryptor service will take care of encrypting any message and generating new encryption keys for new users. It leverages Kafka Streams state management to keep a local copy of the encryption keys for the partitions that each instance owns, so that looking up an encryption key will be at most a disk seek.

This application also has to react to the user exercising their right to be forgotten by deleting the local copy of the encryption key from its state, and by deleting the encryption key from the Encryption-Keys topic.

The Encrypted-Data topic will be where the events are stored forever, with no retention policies. This is your event log.

The Encryption-Keys topic will be a compacted topic. When it is time to forget the user, the Encryptor service will just send a tombstone to overwrite the user’s encryption key, so it is lost forever and nobody will be able to decrypt their data again.
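If compaction and tombstones are new to you, this toy model shows the semantics we rely on. It is not Kafka code: the class is hypothetical, keys are plain strings, and compaction is simulated in memory.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of a compacted topic keyed by user id.
class CompactedTopicSketch {

    // One record in the log; a null value is a tombstone.
    static final class Record {
        final String key;
        final String value;

        Record(String key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    private final List<Record> log = new ArrayList<>();

    // Appends a record; the log itself is only ever appended to.
    void send(String key, String value) {
        log.add(new Record(key, value));
    }

    // What survives compaction: the latest value per key, tombstoned keys gone.
    Map<String, String> compact() {
        Map<String, String> latest = new LinkedHashMap<>();
        for (Record record : log) {
            if (record.value == null) {
                latest.remove(record.key);
            } else {
                latest.put(record.key, record.value);
            }
        }
        return latest;
    }
}
```

Sending ("user-1", null) after ("user-1", "key-A") leaves no entry for user-1 once the log is compacted. In a real cluster the topic would be created with cleanup.policy=compact, and compaction runs in the background, so the old key can stay on disk for a while after the tombstone is sent.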

To decrypt the data, the Event Consumer will basically need to do a join of the Encrypted-Data topic with the Encryption-Keys topic. Again, we will rely on Kafka Streams state management to keep a local copy of the encryption keys.

Similar to the Encryptor, the Event Consumer will need to react appropriately when the user requests to be forgotten, both by deleting the local encryption key and any other state associated with that user.
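Stripped of Kafka Streams, the consumer side could look roughly like the sketch below. Everything here is an assumption made for illustration: the class, the two callbacks (one per topic), the string-typed keys and the pluggable decrypt function are hypothetical and not part of any Kafka API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;

// Hypothetical consumer that joins encrypted events with a local key table.
class EventConsumerSketch {
    // Local copy of the Encryption-Keys topic: user id -> key.
    private final Map<String, String> localKeys = new HashMap<>();
    // Any other per-user state; here, just the decrypted events.
    private final Map<String, List<String>> stateByUser = new HashMap<>();
    // (key, ciphertext) -> plaintext; the actual cipher is left to the caller.
    private final BiFunction<String, String, String> decrypt;

    EventConsumerSketch(BiFunction<String, String, String> decrypt) {
        this.decrypt = decrypt;
    }

    // Called for each record of Encryption-Keys; a null key is a tombstone.
    void onKey(String userId, String key) {
        if (key == null) {
            // "Forget me": drop the key and everything derived from the user's data.
            localKeys.remove(userId);
            stateByUser.remove(userId);
        } else {
            localKeys.put(userId, key);
        }
    }

    // Called for each record of Encrypted-Data; returns false if it was skipped.
    boolean onEvent(String userId, String ciphertext) {
        String key = localKeys.get(userId);
        if (key == null) {
            return false; // user forgotten, or key not seen yet
        }
        stateByUser.computeIfAbsent(userId, id -> new ArrayList<>())
                .add(decrypt.apply(key, ciphertext));
        return true;
    }

    // Decrypted events currently held for the user (empty once forgotten).
    List<String> eventsFor(String userId) {
        return stateByUser.getOrDefault(userId, new ArrayList<>());
    }
}
```

Note that the sketch cannot tell a forgotten user from a key that has not arrived yet. A real consumer has to decide what to do in that second case.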

This architecture looks fabulous from this ivory tower.

ivory tower
Image attribution: The Ivory Tower by Peter Bartels.

Implementation details

If you want to get your hands dirty, the implementation details are here.

Conclusions

In summary, we comply with GDPR because our To-Encrypt topic has a short time-based retention policy, our encryption keys are in a compacted topic, and our event log is encrypted with a per-user encryption key.

Also, our applications have to handle a new “forget me” event type and erase any PII that they may store.

As we saw, the implementation is not rocket science, but it raises some more challenges:

  1. Do we encrypt the whole message or just a subset? If it is just a subset, how do we handle schemas? If we encrypt the whole message, forgetting the key loses all the data, even the non-PII parts.
  2. Can we reuse the same encryptor for multiple topics? If so, the topics must be co-partitioned. If not, we will need to separate the key generation from the encryptors, so that the encryption keys can be repartitioned.
  3. Even if the decryption is transparent to the consumer, it still needs to handle the “forget me” special case.
  4. You will need to choose an encryption algorithm that is fast enough and secure enough. Can you afford an additional 1 or 10 milliseconds of processing time for each message? In theory, if the consumer is up to date, it can always consume directly from the To-Encrypt topic.
  5. A comment on Michiel’s blog points out that forgetting the key is not enough. Every few years, we also need to update the encryption algorithms, which means we need to encrypt everything again.

So it seems possible to use encryption to handle event sourcing data in Kafka, but is it better than the other options? It is certainly worse than removing data from projections, if that is an option at all. But is it better than just using a compacted topic to store the event log, as Ben Stopford suggests?

Well, how much do you value immutability? That much?!?! That little?!?!


