Hacking Kafka Connect

This is the second time we have faced the same problem: processing Avro messages with Kafka Connect while honoring a schema already set up in Schema Registry.

It looks like the typical interview problem you’ll never find in real life. But you know, there it is, sitting on the sofa.

The point of this note (if any) is not to provide an answer but to think about why we solved the problem differently the second time.


Kafka Connect is a generic framework for processing Kafka messages: any transformation, any input/output format. You could even reuse the same transformation across format changes.

You could, because the transformation does not depend on the format. To achieve this, Kafka Connect has a conversion layer (similar to the serializers/deserializers of Kafka Streams). Simplifying a bit, this layer transforms the message that comes from Kafka into a neutral format, and it is on this neutral format that the transformation is defined. If the format of the messages changes, you only have to change the conversion layer.
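To make this concrete, here is a minimal sketch of a transformation packaged as a Single Message Transform; the class and field names are made up for illustration. Because it only sees the neutral representation (a Connect Struct), the same code runs regardless of whether the topic stores JSON or Avro; only the configured converter changes.

```java
import java.util.Map;
import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.connect.connector.ConnectRecord;
import org.apache.kafka.connect.data.Struct;
import org.apache.kafka.connect.transforms.Transformation;

// Illustrative SMT: it only sees the neutral format (Schema + Struct),
// never the bytes on the wire, so it is independent of the message format.
public class UppercaseName<R extends ConnectRecord<R>> implements Transformation<R> {

    @Override
    public R apply(R record) {
        // (a production SMT would usually copy the Struct instead of mutating it)
        Struct value = (Struct) record.value();
        value.put("name", value.getString("name").toUpperCase());
        return record.newRecord(record.topic(), record.kafkaPartition(),
                record.keySchema(), record.key(),
                record.valueSchema(), value, record.timestamp());
    }

    @Override
    public ConfigDef config() { return new ConfigDef(); }

    @Override
    public void configure(Map<String, ?> configs) { }

    @Override
    public void close() { }
}
```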

This neutral format can carry a schema (e.g. Avro). In that case, converters handle both the message and the schema conversion.

Converters are also used in the other direction, transforming the neutral format into the specific format used to store messages in Kafka (e.g. JSON or Avro).
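The shape of this layer is the Converter interface. The skeleton below (not a working converter, and the class name is made up) is only meant to show the two directions: toConnectData turns Kafka bytes into the neutral format, and fromConnectData turns the neutral format back into bytes.

```java
import java.util.Map;
import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.data.SchemaAndValue;
import org.apache.kafka.connect.storage.Converter;

// Skeleton of a converter: one method per direction of the conversion layer.
public class MyFormatConverter implements Converter {

    @Override
    public void configure(Map<String, ?> configs, boolean isKey) { }

    // Kafka bytes -> neutral format (schema + value) that transformations work on.
    @Override
    public SchemaAndValue toConnectData(String topic, byte[] value) {
        // parse 'value' in the specific format and build a Connect Schema/Struct
        return SchemaAndValue.NULL; // placeholder
    }

    // Neutral format -> bytes in the format used to store messages in Kafka.
    @Override
    public byte[] fromConnectData(String topic, Schema schema, Object value) {
        // serialize the Connect value back into the specific format
        return new byte[0]; // placeholder
    }
}
```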

In our case both the input and output formats were Avro, both Avro schemas were the same (no schema transformation involved), and they were already set up in Schema Registry. Our process (for reasons that are difficult both to explain and to understand) has no rights to write to Schema Registry.

Kafka Connect includes a suite of converters for typical formats (JSON, Protobuf, Avro, etc.), and it is usually not necessary to create new ones.

When you translate messages and schemas generically between two rich formats (e.g. Avro and Kafka Connect’s neutral format), there are always mismatches (or limitations). The out-of-the-box Avro converters handle most cases, but they were not designed to honor the original schema even when the schema does not change (why would they?). Imagine an identity transformation that reads Avro messages on the input and pushes Avro messages to the output. You won’t lose any information, but the input and output messages (and schemas) are not going to be 100% equal (except in some trivial cases).
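A quick way to see this for a given schema is to roundtrip it through the neutral format and compare. The sketch below assumes Confluent’s kafka-connect-avro-data library (AvroData) is on the classpath; the example schema is made up, and how closely the result matches the original depends on which Avro features the schema uses.

```java
import io.confluent.connect.avro.AvroData;
import org.apache.avro.Schema;

// Roundtrip an Avro schema through Connect's neutral format and compare
// the result with the original.
public class SchemaRoundTrip {
    public static void main(String[] args) {
        Schema original = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"namespace\":\"demo\","
          + "\"fields\":[{\"name\":\"name\",\"type\":\"string\"}]}");

        AvroData avroData = new AvroData(100); // 100 = schema cache size
        org.apache.kafka.connect.data.Schema neutral = avroData.toConnectSchema(original);
        Schema roundTripped = avroData.fromConnectSchema(neutral);

        System.out.println(roundTripped.toString(true));
        System.out.println("identical? " + original.equals(roundTripped));
    }
}
```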

This conversion layer was designed so that transformations can be defined without taking input and output formats into account, and it does that well. It was not designed to handle corner cases like this one.


The first time, we solved it by writing a new Avro converter (based on Confluent’s). The Avro converter is not trivial, so we opted for “vendoring” it. It worked well, but we soon found a drawback: in later versions Confluent significantly refactored their converter, and that caused ours to stop working. How is that possible? Our converter depends on several libraries that we did not bundle, because they were the same ones used by Confluent’s converter and were already available. The refactoring removed some of the libraries we depended on.

The second time we ran into this problem we were aware of the robustness limitations of the previous solution.

On this occasion, we did without the Avro converters entirely. The transformation itself took charge of parsing the Avro message and serializing it again, using the standard serializers/deserializers. This solution loses format generality (the transformation only works on Avro messages), but it has the advantage of being schema-generic (it is robust to changes in the Avro schema) and more efficient (the conversion to/from the neutral format is not necessary).
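A minimal sketch of what this can look like, assuming the transformation is packaged as an SMT, the worker is configured with ByteArrayConverter (so the converter layer passes the raw Avro payload through untouched), and the serde is told not to register schemas. The class name and the way configuration is passed are illustrative, not our actual code.

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.connect.connector.ConnectRecord;
import org.apache.kafka.connect.transforms.Transformation;
import io.confluent.kafka.serializers.KafkaAvroDeserializer;
import io.confluent.kafka.serializers.KafkaAvroSerializer;

// Illustrative SMT that handles Avro itself: with ByteArrayConverter configured,
// record.value() is the raw Avro payload, and the standard Confluent serde does
// the parsing/serialization against the schema already registered in Schema Registry.
public class AvroAwareTransform<R extends ConnectRecord<R>> implements Transformation<R> {

    private KafkaAvroDeserializer deserializer;
    private KafkaAvroSerializer serializer;

    @Override
    public void configure(Map<String, ?> configs) {
        Map<String, Object> serdeConfig = new HashMap<>();
        serdeConfig.put("schema.registry.url", configs.get("schema.registry.url"));
        serdeConfig.put("auto.register.schemas", false); // we have no write rights
        serdeConfig.put("use.latest.version", true);     // honor the registered schema
        deserializer = new KafkaAvroDeserializer();
        deserializer.configure(serdeConfig, false);
        serializer = new KafkaAvroSerializer();
        serializer.configure(serdeConfig, false);
    }

    @Override
    public R apply(R record) {
        GenericRecord avro =
                (GenericRecord) deserializer.deserialize(record.topic(), (byte[]) record.value());
        // ... apply the actual business transformation directly on the Avro record ...
        byte[] out = serializer.serialize(record.topic(), avro);
        return record.newRecord(record.topic(), record.kafkaPartition(),
                record.keySchema(), record.key(),
                record.valueSchema(), out, record.timestamp());
    }

    @Override
    public ConfigDef config() { return new ConfigDef(); }

    @Override
    public void close() {
        deserializer.close();
        serializer.close();
    }
}
```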

It’s a bit unconventional, but at the moment I prefer it to the more “by the book” solution.