
Highlights of the Kafka Summit London 2024

I actually felt a little nervous on the way to London. At Kafka Summit 2024 I would not only be meeting many old colleagues and friends, but also some true data streaming celebrities - Jay Kreps, co-founder of Confluent, for example, who would be giving his keynote there. To start off the event, this year's highlights of the Apache Kafka community were shared. It's what makes Kafka so unique: users who see themselves as one big family. No surprise, then, that in addition to the current Apache Kafka 3.7.0 release from February and the roadmap for Kafka 4.0, the 1000th Kafka Improvement Proposal (KIP) was celebrated - proof that the community is alive and keeps growing.

A keynote uniting estates

Finally, Jay Kreps, the data streaming mastermind, and his team presented the Confluent Data Streaming Vision. They started by tackling a well-known problem: Data products have been around for quite some time, but until now they have mainly been used in the analytical and not in the operational estate - and they have never been linked. The result: instead of a "data mesh", the data usually ends up as a "data mess". Confluent's vision is therefore to merge the operational and analytical estates into universal data products, with Apache Kafka data streams as an open standard at the heart of this unification. In this context, Apache Flink acts as the standard processing engine for both batch and real-time use cases. And this is also relevant for non-technical readers - who I hope I haven't scared off yet: Thanks to this interaction, data can be used in different applications as a so-called data product without having to implement it separately in each case. This saves time and money - as long as the data exchange works reliably and quickly.

Here, the high-performance Apache Iceberg table format has established itself as the go-to solution. Iceberg allows data to be exchanged efficiently between the operational and analytical estates. Together with metadata from the schema registry, data in Iceberg format can be made available natively via Confluent Tableflow for further analysis with analytical engines. In addition to providing a stream catalog and stream lineage, the new data portal - sort of a "Google Maps for data" - also includes a module for data governance featuring corresponding rules and approval processes.

Some readers of this text are probably wondering why I'm making such a big fuss about this - after all, it's just a bunch of tools for data streaming. Let's take a step back and look at the big picture: this combination of Apache Kafka for data streaming, Apache Flink for stream processing and Apache Iceberg as a table format provides, together with the Data Portal, the technological basis for the "Universal Data Product" - the holy grail of data streaming.
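For the technically inclined: the same Kafka-Flink-Iceberg combination can already be wired up with the open-source stack today. The sketch below shows what such a pipeline can look like in Flink SQL (wrapped in Java) - it is not Confluent Tableflow, and the topic, table names and connection settings are purely illustrative.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

/**
 * Minimal sketch: continuously mirror a Kafka topic into an Iceberg table with
 * Flink SQL. Requires the flink-connector-kafka and iceberg-flink-runtime
 * dependencies; checkpointing must be enabled so Iceberg commits actually happen.
 */
public class KafkaToIcebergSketch {

    public static void main(String[] args) {
        TableEnvironment env = TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Operational estate: the Kafka topic, exposed as a dynamic table.
        env.executeSql(
            "CREATE TABLE orders_stream (" +
            "  order_id STRING, amount DOUBLE, ts TIMESTAMP(3)" +
            ") WITH (" +
            "  'connector' = 'kafka'," +
            "  'topic' = 'orders'," +
            "  'properties.bootstrap.servers' = 'localhost:9092'," +
            "  'scan.startup.mode' = 'earliest-offset'," +
            "  'format' = 'json')");

        // Analytical estate: an Iceberg catalog backed by a warehouse location.
        env.executeSql(
            "CREATE CATALOG lakehouse WITH (" +
            "  'type' = 'iceberg'," +
            "  'catalog-type' = 'hadoop'," +
            "  'warehouse' = 'file:///tmp/warehouse')");
        env.executeSql("CREATE DATABASE IF NOT EXISTS lakehouse.analytics");
        env.executeSql(
            "CREATE TABLE IF NOT EXISTS lakehouse.analytics.orders_history (" +
            "  order_id STRING, amount DOUBLE, ts TIMESTAMP(3))");

        // The continuous query that keeps the Iceberg table in sync with the stream.
        env.executeSql(
            "INSERT INTO lakehouse.analytics.orders_history " +
            "SELECT order_id, amount, ts FROM orders_stream");
    }
}
```

Once the data sits in Iceberg, any analytical engine that understands the format can query it without touching the operational cluster.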

The passive-aggressive dishwasher event

Still somewhat blown away by the keynote and the opportunities it opened up, I moved on to the sessions. Sadly, many of them took place at the same time, so I couldn't attend all of them - but with over 100 sessions, that would have been a little too much anyway. I started with Tim van Baarsen from ING, who presented the company's best practices for Avro schema management and showed how ING ensures backward and forward compatibility. After that, I went on to Cedric Schaller. Using an example based on TWINT, he described where errors typically occur within a distributed architecture - and how such issues can be addressed. Particularly interesting was the pattern for handling "poison pill events" with a dead letter queue, including holding back subsequent events with the same business key to ensure that they are still processed in the correct order.
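For those who would like to see this pattern in code: the following is a minimal sketch of the idea as I understood it, not Schaller's actual implementation. The topic name, the plain Kafka clients and the in-memory key set are my own illustrative assumptions - in a real system the quarantined keys would have to be stored durably (for example in a compacted topic).

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.HashSet;
import java.util.Set;

/**
 * Sketch of a dead-letter-queue handler that quarantines a business key as soon
 * as one of its events turns out to be a poison pill. Later events with the same
 * key are diverted to the DLQ as well, so per-key ordering is preserved when the
 * parked events are eventually repaired and replayed.
 */
public class PoisonPillHandler {

    private final KafkaProducer<String, byte[]> dlqProducer;
    private final String dlqTopic; // e.g. "payments.dlq" (illustrative)

    // Business keys that currently have at least one event parked in the DLQ.
    // Kept in memory here for brevity; production code needs durable state.
    private final Set<String> quarantinedKeys = new HashSet<>();

    public PoisonPillHandler(KafkaProducer<String, byte[]> dlqProducer, String dlqTopic) {
        this.dlqProducer = dlqProducer;
        this.dlqTopic = dlqTopic;
    }

    public void handle(ConsumerRecord<String, byte[]> record) {
        // An earlier event with this business key already failed: divert this one
        // too, otherwise it would overtake the parked event and break ordering.
        if (quarantinedKeys.contains(record.key())) {
            sendToDlq(record);
            return;
        }
        try {
            process(record);
        } catch (Exception poisonPill) {
            quarantinedKeys.add(record.key());
            sendToDlq(record);
        }
    }

    private void sendToDlq(ConsumerRecord<String, byte[]> record) {
        dlqProducer.send(new ProducerRecord<>(dlqTopic, record.key(), record.value()));
    }

    private void process(ConsumerRecord<String, byte[]> record) {
        // Deserialization and business logic live here.
    }
}
```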

During Oscar Dudycz's speech, I had a few good laughs: In his extremely amusing presentation, he introduced a whole series of "Event Modeling Anti-Patterns". For example, there was the "clickbait event", which is an exciting event but without any real content, or the "passive-aggressive event", which Dudycz illustrated with an everyday example: the phrase "Honey, the dishwasher has beeped and I think it's finished" is actually a "command event" and therefore means "Honey, why haven't you emptied the dishwasher yet". I wonder whether my wife also thinks this is funny and whether I should point it out to her the next time such an event occurs. Fortunately, Tom Bentley from Red Hat distracted me from such thoughts by presenting various ways of protecting Kafka environments from internal attacks. "Encryption-at-rest" plays an important role here. The open source project kroxylicious.io offers a transparent proxy including vault support for key management.

Intelligent chat bots

The next presentation by HelloFresh was something for true adventurers who sometimes feel like renaming topics in production. Md Nahidul Kibria explained how this can be done with the MirrorMaker 2 Connector without losing offsets or reference to consumer groups. Here is the link to the tutorial to avoid future kamikaze actions - save it now, thank me later.
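For illustration only: when MirrorMaker 2 has to change topic names during such a migration, one typical building block is a custom ReplicationPolicy. The sketch below shows the general idea - it is not the setup from the talk, the rename mapping is hard-coded purely for illustration, and offset and consumer group translation are left out here.

```java
import org.apache.kafka.connect.mirror.DefaultReplicationPolicy;

import java.util.Map;

/**
 * Illustrative ReplicationPolicy that gives selected topics a new name on the
 * target cluster instead of the usual "sourceAlias.topic" prefix. A real policy
 * would read the mapping from configuration and also adjust the inverse lookups
 * (topicSource / upstreamTopic) so that checkpoints still resolve correctly.
 */
public class RenamingReplicationPolicy extends DefaultReplicationPolicy {

    // Old name on the source cluster -> new name on the target cluster (illustrative).
    private static final Map<String, String> RENAMES = Map.of("orders-v1", "orders");

    @Override
    public String formatRemoteTopic(String sourceClusterAlias, String topic) {
        // Renamed topics are replicated 1:1 under their new name; everything else
        // falls back to the default "alias.topic" naming scheme.
        return RENAMES.getOrDefault(topic, super.formatRemoteTopic(sourceClusterAlias, topic));
    }
}
```

The policy is then plugged into the MirrorMaker 2 configuration via the replication.policy.class setting.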

Of course, the conference also included presentations on data streaming and artificial intelligence (AI). The combination of the two approaches is particularly fascinating, as the data obtained with data streaming provides the basis for machine learning and GenAI use cases. Airy has developed a platform designed as a chat bot dialog which not only supports retrieval-augmented generation (RAG) with a large language model (LLM) to access unstructured data via vector databases, but can also combine that unstructured data with structured data from various sources. The Apache Flink SQL statements required for this are generated automatically by the LLM to answer business questions and present the results neatly in chat form. In this way, the chat bot becomes a kind of AI - and knows the answer to (almost) every question.

Apache Flink: Star of the event

Despite all these exciting presentations and the data streaming celebrities in town, the real star of this year's event was clearly Apache Flink. Accordingly, numerous workshops were dedicated to the stream processing engine, covering the full spectrum: from beginner sessions and training courses to best practices and deep dives, there was something for everyone. I opted for the session "Flink 2.0: Navigating the Future of Unified Stream and Batch Processing" by Martijn Visser, Senior Product Manager at Confluent, and got exciting insights into the further development and roadmap of Flink 2.0. For example, it will become possible to mix batch and real-time processing in the same query, and time travel will also be supported. Furthermore, a new storage architecture is planned in which S3 acts as primary storage and local disks serve only as a cache. Last but not least, there are plans to standardize the APIs - all of which will make Flink even more scalable and easier to use.

Tired but happy, I headed home after two exciting days. Despite my exhaustion, my mind was still filled with impressions. For me, this year's Kafka Summit impressively demonstrated how the Kafka data streaming ecosystem continues to grow. The fact that the analytical and operational estates are now merging with Apache Flink and Apache Iceberg to enable Universal Data Products really fascinated me. During my journey home, I slowly dozed off and dreamt of unified data estates in which data products act as the basis of the data mesh and thus put an end to the data mess once and for all. When I arrived in Zurich, I woke up and was delighted to realize that this wasn't just a dream - it would soon become reality.
