
Highlights of the Kafka Summit London 2024

I actually felt a little nervous on the way to London. At Kafka Summit 2024 I would not only be meeting many old colleagues and friends, but also some true data streaming celebrities - Jay Kreps, co-founder of Confluent, for example, who would be giving his keynote there. To start off the event, this year's highlights of the Apache Kafka community were shared. It's what makes Kafka so unique: users who see themselves as one big family. No surprise, then, that in addition to the current Apache Kafka 3.7.0 release from February and the roadmap for Kafka 4.0, the 1000th Kafka Improvement Proposal (KIP) was celebrated - proof that the community is alive and keeps growing.

A keynote uniting estates

Finally, Jay Kreps, the data streaming mastermind, and his team presented the Confluent Data Streaming Vision. They started by tackling a well-known problem: Data products have been around for quite some time, but until now they have mainly been used in the analytical and not in the operational estate - and they have never been linked. The result: instead of a "data mesh", the data usually ends up as a "data mess". Confluent's vision is therefore to merge the operational and analytical estates into universal data products, with Apache Kafka data streams as an open standard at the heart of this unification. In this context, Apache Flink acts as the standard processing engine for both batch and real-time use cases. And this is also relevant for non-technical readers - who I hope I haven't scared off yet: Thanks to this interaction, data can be used in different applications as a so-called data product without having to implement it separately in each case. This saves time and money - as long as the data exchange works reliably and quickly.

Here, the high-performance Apache Iceberg table format has established itself as the go-to solution. Iceberg allows data to be exchanged efficiently between the operational and analytical estates. Together with metadata from the schema registry, data in Iceberg format can be made available natively via Confluent Tableflow for further analysis with analytical engines. In addition to providing a stream catalog and stream lineage, the new data portal - sort of a "Google Maps for data" - also includes a module for data governance featuring corresponding rules and approval processes.

Some readers of this text are probably wondering why I'm making such a big fuss about this - after all, it's just a bunch of tools for data streaming. Let's take a step back and look at the big picture: this combination of Apache Kafka for data streaming, Apache Flink for stream processing and Apache Iceberg as a table format provides, together with the Data Portal, the technological basis for the "Universal Data Product" - the holy grail of data streaming.
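For the technically inclined: the same Kafka-Flink-Iceberg combination can already be wired up with the open-source stack today. The sketch below shows what such a pipeline can look like in Flink SQL (wrapped in Java) - it is not Confluent Tableflow, and the topic, table names and connection settings are purely illustrative.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

/**
 * Minimal sketch: continuously mirror a Kafka topic into an Iceberg table with
 * Flink SQL. Requires the flink-connector-kafka and iceberg-flink-runtime
 * dependencies; checkpointing must be enabled so Iceberg commits actually happen.
 */
public class KafkaToIcebergSketch {

    public static void main(String[] args) {
        TableEnvironment env = TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Operational estate: the Kafka topic, exposed as a dynamic table.
        env.executeSql(
            "CREATE TABLE orders_stream (" +
            "  order_id STRING, amount DOUBLE, ts TIMESTAMP(3)" +
            ") WITH (" +
            "  'connector' = 'kafka'," +
            "  'topic' = 'orders'," +
            "  'properties.bootstrap.servers' = 'localhost:9092'," +
            "  'scan.startup.mode' = 'earliest-offset'," +
            "  'format' = 'json')");

        // Analytical estate: an Iceberg catalog backed by a warehouse location.
        env.executeSql(
            "CREATE CATALOG lakehouse WITH (" +
            "  'type' = 'iceberg'," +
            "  'catalog-type' = 'hadoop'," +
            "  'warehouse' = 'file:///tmp/warehouse')");
        env.executeSql("CREATE DATABASE IF NOT EXISTS lakehouse.analytics");
        env.executeSql(
            "CREATE TABLE IF NOT EXISTS lakehouse.analytics.orders_history (" +
            "  order_id STRING, amount DOUBLE, ts TIMESTAMP(3))");

        // The continuous query that keeps the Iceberg table in sync with the stream.
        env.executeSql(
            "INSERT INTO lakehouse.analytics.orders_history " +
            "SELECT order_id, amount, ts FROM orders_stream");
    }
}
```

Once the data sits in Iceberg, any analytical engine that understands the format can query it without touching the operational cluster.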

The passive-aggressive dishwasher event

Still somewhat blown away by the keynote and the opportunities it opened up, I moved on to the sessions. Sadly, many of them took place at the same time, so I couldn't attend all of them - but with over 100 sessions, that would have been a little too much anyway. I started with Tim van Baarsen from ING, who presented the company's best practices for Avro schema management and showed how ING ensures backward and forward compatibility. After that, I went on to Cedric Schaller. Using an example based on TWINT, he described where errors typically occur within a distributed architecture - and how such issues can be addressed. Particularly interesting was the pattern for handling "poison pill events" with a dead letter queue, including holding back subsequent events with the same business key to ensure that they are still processed in the correct order.
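For those who would like to see this pattern in code: the following is a minimal sketch of the idea as I understood it, not Schaller's actual implementation. The topic name, the plain Kafka clients and the in-memory key set are my own illustrative assumptions - in a real system the quarantined keys would have to be stored durably (for example in a compacted topic).

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.HashSet;
import java.util.Set;

/**
 * Sketch of a dead-letter-queue handler that quarantines a business key as soon
 * as one of its events turns out to be a poison pill. Later events with the same
 * key are diverted to the DLQ as well, so per-key ordering is preserved when the
 * parked events are eventually repaired and replayed.
 */
public class PoisonPillHandler {

    private final KafkaProducer<String, byte[]> dlqProducer;
    private final String dlqTopic; // e.g. "payments.dlq" (illustrative)

    // Business keys that currently have at least one event parked in the DLQ.
    // Kept in memory here for brevity; production code needs durable state.
    private final Set<String> quarantinedKeys = new HashSet<>();

    public PoisonPillHandler(KafkaProducer<String, byte[]> dlqProducer, String dlqTopic) {
        this.dlqProducer = dlqProducer;
        this.dlqTopic = dlqTopic;
    }

    public void handle(ConsumerRecord<String, byte[]> record) {
        // An earlier event with this business key already failed: divert this one
        // too, otherwise it would overtake the parked event and break ordering.
        if (quarantinedKeys.contains(record.key())) {
            sendToDlq(record);
            return;
        }
        try {
            process(record);
        } catch (Exception poisonPill) {
            quarantinedKeys.add(record.key());
            sendToDlq(record);
        }
    }

    private void sendToDlq(ConsumerRecord<String, byte[]> record) {
        dlqProducer.send(new ProducerRecord<>(dlqTopic, record.key(), record.value()));
    }

    private void process(ConsumerRecord<String, byte[]> record) {
        // Deserialization and business logic live here.
    }
}
```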

During Oscar Dudycz's speech, I had a few good laughs: In his extremely amusing presentation, he introduced a whole series of "Event Modeling Anti-Patterns". For example, there was the "clickbait event", which is an exciting event but without any real content, or the "passive-aggressive event", which Dudycz illustrated with an everyday example: the phrase "Honey, the dishwasher has beeped and I think it's finished" is actually a "command event" and therefore means "Honey, why haven't you emptied the dishwasher yet". I wonder whether my wife also thinks this is funny and whether I should point it out to her the next time such an event occurs. Fortunately, Tom Bentley from Red Hat distracted me from such thoughts by presenting various ways of protecting Kafka environments from internal attacks. "Encryption-at-rest" plays an important role here. The open source project kroxylicious.io offers a transparent proxy including vault support for key management.

Intelligent chat bots

The next presentation by HelloFresh was something for true adventurers who sometimes feel like renaming topics in production. Md Nahidul Kibria explained how this can be done with the MirrorMaker 2 Connector without losing offsets or reference to consumer groups. Here is the link to the tutorial to avoid future kamikaze actions - save it now, thank me later.
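For illustration only: when MirrorMaker 2 has to change topic names during such a migration, one typical building block is a custom ReplicationPolicy. The sketch below shows the general idea - it is not the setup from the talk, the rename mapping is hard-coded purely for illustration, and offset and consumer group translation are left out here.

```java
import org.apache.kafka.connect.mirror.DefaultReplicationPolicy;

import java.util.Map;

/**
 * Illustrative ReplicationPolicy that gives selected topics a new name on the
 * target cluster instead of the usual "sourceAlias.topic" prefix. A real policy
 * would read the mapping from configuration and also adjust the inverse lookups
 * (topicSource / upstreamTopic) so that checkpoints still resolve correctly.
 */
public class RenamingReplicationPolicy extends DefaultReplicationPolicy {

    // Old name on the source cluster -> new name on the target cluster (illustrative).
    private static final Map<String, String> RENAMES = Map.of("orders-v1", "orders");

    @Override
    public String formatRemoteTopic(String sourceClusterAlias, String topic) {
        // Renamed topics are replicated 1:1 under their new name; everything else
        // falls back to the default "alias.topic" naming scheme.
        return RENAMES.getOrDefault(topic, super.formatRemoteTopic(sourceClusterAlias, topic));
    }
}
```

The policy is then plugged into the MirrorMaker 2 configuration via the replication.policy.class setting.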

Of course, the conference also included presentations on data streaming and artificial intelligence (AI). The combination of the two approaches is particularly fascinating, as the data obtained with data streaming provides the basis for machine learning and GenAI use cases. Airy has developed a platform designed as a chat bot dialog which not only supports retrieval-augmented generation (RAG) with a large language model (LLM) to access unstructured data via vector databases, but can also combine that unstructured data with structured data from various sources. The Apache Flink SQL statements required for this are generated automatically by the LLM to answer business questions and present the results neatly in chat form. In this way, the chat bot becomes a kind of AI - and knows the answer to (almost) every question.

Apache Flink: Star of the event

Despite all these exciting presentations and the data streaming celebrities in town, the real star of this year's event was clearly Apache Flink. Accordingly, numerous workshops were dedicated to the stream processing engine, covering the full spectrum: from beginner sessions and training courses to best practices and deep dives, there was something for everyone. I opted for the session "Flink 2.0: Navigating the Future of Unified Stream and Batch Processing" by Martijn Visser, Senior Product Manager at Confluent, and got exciting insights into the further development and roadmap of Flink 2.0. For example, it will become possible to mix batch and real-time processing in the same query, and time travel will also be supported. Furthermore, a new storage architecture is planned in which S3 acts as primary storage and local disks serve only as a cache. Last but not least, there are plans to standardize the APIs - all of which will make Flink even more scalable and easier to use.

Tired but happy, I headed home after two exciting days. Despite my exhaustion, my mind was still filled with impressions. For me, this year's Kafka Summit impressively demonstrated how the Kafka data streaming ecosystem continues to grow. The fact that the analytical and operational estates are now merging with Apache Flink and Apache Iceberg to enable Universal Data Products really fascinated me. During my journey home, I slowly dozed off and dreamt of unified data estates in which data products act as the basis of the data mesh and thus put an end to the data mess once and for all. When I arrived in Zurich, I woke up and was delighted to realize that this wasn't just a dream - it would soon become reality.
