[Cover photo: Lauterbrunnen]

If you use Kafka to connect your microservices, sooner or later you start thinking about how to make sure that your data (stored in some database) and your events (the ones you publish to Kafka to tell everyone else about your data changes) tell the same story.

Recently I was talking to someone who was facing exactly this problem. Instead of going through all the different alternatives (this article is a nice reference), I'll try to recap the key points, based on my experience.

  • Use the transactional outbox pattern (there's a minimal SQL sketch after this list). There are many different ways to integrate a non-transactional resource (like Kafka) into a transaction, but I wouldn't go that way because …

  • It's not easy to get strict ordering out of the outbox table in a performant way. Think of an ORDER BY criterion that reflects transaction commit order (one that returns the rows in the same order their INSERTs show up in the transaction log). I couldn't come up with one, short of inspecting the transaction log itself. So …

  • Don't extract your events by polling with SQL (e.g. the Kafka Connect JDBC source connector); the polling sketch after this list shows how incremental queries can silently skip rows.

  • Prefer the transaction log (or its equivalent). If your database is supported, you can use an out-of-the-box CDC tool like Debezium. On MySQL (or MariaDB) you can even leverage the BLACKHOLE storage engine (see the DDL sketch below) so that you don't need to take care of cleaning up your outbox table.

  • Don't fight duplicates. Idempotency and partial ordering are your friends! (There's an idempotent-consumer sketch after this list.)

  • Since events are extracted through the transaction log, you can't rely on a complex relational model with related tables: CDC hands you row-level changes one table at a time, so it has to be just one table (per use case), with each event self-contained in a single row. That's why you should keep your business model and your outbox tables separate.

  • How should you store your data in the outbox table? Although there's only one table, you may need to store complex data with 1-n relations (think of an order with many order lines), and the relational model (with just one table) is not well suited for that. Abuse the relational model a bit and encode the whole payload in one column (JSON, Avro, etc.). Better still, store the event in exactly the same (binary-compatible) format you'll push to Kafka, so that your CDC layer doesn't have to do any transformation (the last sketch below shows one way to do it).
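
To make the first point concrete, here's a minimal sketch of the pattern in MySQL-flavoured SQL. The `orders` and `outbox` tables, their columns, and the payload are all made up for the example; the only thing that matters is that the business write and the event write share one transaction.

```sql
-- Hypothetical schema: `orders` holds the business state, `outbox` holds
-- the event to publish. Both inserts commit or roll back together, so the
-- event can never describe a change that didn't actually happen.
START TRANSACTION;

INSERT INTO orders (id, customer_id, total)
VALUES (42, 7, 99.90);

INSERT INTO outbox (event_id, aggregate_id, event_type, payload)
VALUES (UUID(), '42', 'OrderPlaced',
        '{"orderId": 42, "customerId": 7, "total": 99.90}');

COMMIT;
```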
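
This is roughly the kind of incremental polling a JDBC source connector would run against that table (an auto-increment `id` column is assumed for the example), and it's where the ordering problem bites:

```sql
-- Naive incremental polling: fetch everything newer than the last id seen.
SELECT event_id, aggregate_id, event_type, payload
FROM outbox
WHERE id > ?          -- last auto-increment id already published
ORDER BY id;

-- Why this breaks: auto-increment values are assigned when the INSERT runs,
-- not when the transaction commits. If one transaction grabs id = 100 but
-- commits late, while another grabs id = 101 and commits early, a poller
-- that has already seen 101 will never go back and pick up 100.
```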
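
If you're on MySQL or MariaDB and reading the binlog (e.g. with Debezium), the outbox table itself can be declared on the BLACKHOLE storage engine, something along these lines (the column layout is again just an illustration):

```sql
-- The engine discards the rows, but the INSERTs still land in the binary
-- log, which is exactly where a binlog-based CDC connector reads from
-- (inserts are logged even with row-based logging; an outbox is
-- insert-only anyway). The table stays empty, so there's nothing to purge.
CREATE TABLE outbox (
  event_id     CHAR(36)    NOT NULL,
  aggregate_id VARCHAR(64) NOT NULL,
  event_type   VARCHAR(64) NOT NULL,
  payload      BLOB        NOT NULL
) ENGINE = BLACKHOLE;
```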
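
On the consuming side, "don't fight duplicates" usually boils down to something like this sketch, assuming each event carries a unique id and the consumer has its own database:

```sql
-- Remember which events were already applied. A redelivered event hits the
-- primary key, INSERT IGNORE affects zero rows, and the consumer skips the
-- business update instead of applying it twice.
CREATE TABLE processed_events (
  event_id     CHAR(36)  NOT NULL PRIMARY KEY,
  processed_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
);

START TRANSACTION;

INSERT IGNORE INTO processed_events (event_id)
VALUES (?);   -- the event id taken from the Kafka record

-- ...apply the actual state change here, but only if the insert above
-- reported one affected row (i.e. this is the first delivery)...

COMMIT;
```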
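
Finally, for the "one table, one payload column" point: the whole aggregate (the order and its lines) is flattened into a single document and written as-is. The JSON here is purely illustrative; serialized Avro bytes in a BLOB column follow the same shape and can match what goes to Kafka byte for byte.

```sql
-- The complete aggregate travels in one opaque column, so the CDC layer
-- can forward it to Kafka without any transformation.
INSERT INTO outbox (event_id, aggregate_id, event_type, payload)
VALUES (
  UUID(), '42', 'OrderPlaced',
  '{"orderId": 42,
    "lines": [{"sku": "A-1", "qty": 2},
              {"sku": "B-7", "qty": 1}]}'
);
```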