Conversations with 200+ people at Kafka Summit

I spent 2 days at Kafka Summit 2024 and spoke with over 200 people. They were loooong days.

But it's an amazing way to completely surround yourself with people who are actually doing the job, and that's such a fantastic opportunity to learn.

There are plenty of posts out there that summarize the talks, recap the keynote and hype the vendor releases, so I'm not going to talk about any of that. This post is just going to cover the conversations I had with practitioners who attended the event. Given that was over 200 conversations, this is going to be my personal distillation of what we spoke about.

Of course, there's some bias here: it's a Kafka event, so it's generally folks with streaming on their minds; I have my own biases that will influence the natural path conversations take; and I was there representing my company. I believe the topics are still useful with these biases in mind.

The four main themes that I found myself discussing were:

- Operational Analytics
- Event Driven Architectures
- Apache Flink
- Automation and Data as Code/Config (DaC)

Operational Analytics

I can't count the number of conversations I had around this one.

So-called "operational" systems tend to be the backends of a business's applications, often described with a combination of terms like relational, transactional, CRUD, ACID or documents - think databases like Postgres or MongoDB. Generally, they need to support a lot of key-based single-row READs, as well as single-row DELETEs and UPDATEs.

"Analytical" systems, on the other hand, power a company's internal reporting and business intelligence. Here, the name of the game is generally "Can I stick Tableau on it and do crazy JOINs over a few TBs of data?". These systems run some pretty heavy queries that scan a huge amount of historical data to try and answer a question.

However, many attendees are starting to see that operational systems, or the applications that use them, now need analytics. Most often, this takes the form of user-facing analytics, where the output of analytical queries is served back to users within the application. This appears to be causing quite the headache, as there is typically a large divide between the "operational" and "analytical" teams or systems, and existing "operational" systems cannot handle the analytical workload.

Simply put, the operational team can't just start running massive analytical queries over their Postgres database without tanking the service, and building a solution on top of the internal Snowflake is too slow and cumbersome.
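To make that divide concrete, here's a minimal sketch of the two query shapes fighting over the same Postgres instance. The table, columns and connection details are made up for illustration:

```python
import psycopg2  # assuming a Postgres backend; schema and values are illustrative

conn = psycopg2.connect("dbname=app user=app")
cur = conn.cursor()

# Operational: key-based, single-row read. Index-backed, cheap, runs constantly.
cur.execute("SELECT status FROM orders WHERE order_id = %s", (12345,))
print(cur.fetchone())

# Analytical: scan-and-aggregate over months of history. Run this alongside the
# query above on the same instance and you'll quickly see why the two workloads
# don't want to share a database.
cur.execute("""
    SELECT customer_id,
           date_trunc('month', created_at) AS month,
           sum(total) AS revenue
    FROM orders
    WHERE created_at > now() - interval '12 months'
    GROUP BY 1, 2
    ORDER BY revenue DESC
""")
top_customers = cur.fetchall()
```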

"Operational analytics" seemed to be the most popular term for this idea, which makes sense.

Event Driven Architectures

Event Driven Architectures (EDA) aren't a new subject for a Kafka conference; I'd say EDA is probably one of the longest-living themes (though definitely behind "Kafka is a pain to manage"). In the past, I've found that EDA is usually something that smaller, newer businesses are very keen on, but it remains largely unexplored by larger enterprises. It felt a bit different this year, with many of these conversations occurring with engineers from global giants.

The conversations were largely the same as they always have been: folks are tired of pushing data into silos, then building a bunch of glue around it to work out if they need to take action, and then more glue to actually take the action. The challenge here is that much of this "glue" requires domain knowledge of the systems on either side: you need to know where the data lives and how it's stored, write logic against that model to determine if action is required, and then translate this into whatever the destination system needs. Because of the inter-domain nature of this glue, ownership of the glue itself is often unclear. It's also typically quite brittle, as systems on either side may change, and there is often a lack of communication to notify other teams of changes that may break things.

Instead, new data should be treated as events that represent something happening and pushed onto a central bus, and downstream systems can subscribe to the stream(s) of events that they care about. This means that any downstream system or team has a single, known and consistent integration point when it needs access to data. This makes it much easier for each domain to self-serve access to data, keep full ownership of its systems, and make changes without breaking other consumers.
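As a rough sketch of what that integration point looks like from the producing side (I'm using the kafka-python client here, and the topic name and payload are invented), the owning domain just publishes a fact about what happened:

```python
import json
from kafka import KafkaProducer  # kafka-python; any Kafka client works the same way

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Publish a fact about what happened, not instructions for any particular consumer.
producer.send("orders.events", {
    "type": "order_created",
    "order_id": 12345,
    "customer_id": 678,
    "total": 42.50,
})
producer.flush()
```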

Of course, you can also become "event driven": rather than polling every 5 minutes and working out all of the actions you should have taken, you can start to adopt an "always-on" pattern, where events trigger individual actions as they arrive. This is probably what jumps into most people's minds when they think of the benefits or purpose of EDA… but what I took away from all of these conversations is that the structural benefit of decoupling event ingestion and distribution from the myriad of downstream systems is probably the more immediate win for a lot of teams.
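On the consuming side, here's a hedged sketch of that "always-on" pattern: instead of a scheduled job that polls and reconciles every 5 minutes, a long-running consumer reacts to each event as it lands. The consumer group, payload and flag_for_review action are all hypothetical:

```python
import json
from kafka import KafkaConsumer  # kafka-python, matching the producer sketch above


def flag_for_review(order_id: int) -> None:
    # Stand-in for whatever action the downstream team actually takes.
    print(f"flagging order {order_id} for review")


consumer = KafkaConsumer(
    "orders.events",
    bootstrap_servers="localhost:9092",
    group_id="fraud-checks",  # each downstream team subscribes with its own group
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

# Always-on: each event triggers an individual action as it arrives, rather than
# a batch of catch-up work every few minutes.
for message in consumer:
    event = message.value
    if event.get("type") == "order_created" and event.get("total", 0) > 1000:
        flag_for_review(event["order_id"])
```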

Apache Flink

Which brings us to Apache Flink, the hot topic for Confluent and many other vendors at Kafka Summit this year.

Most people, even those already deep into it, will agree that Flink is a pretty complex system. However, it brings some really nice things with it. It has been built to be 'streaming first', making it a fantastic choice when working with streaming data. And while it excels at streaming, Flink's design is flexible enough to work with batch data as well, meaning you can reuse knowledge across both paradigms. On top of that, it has a growing ecosystem of input and output integrations, a SQL abstraction called 'Flink SQL', and has been tried and tested at huge scale. (If you consider the EDA pattern described above, you might see how Flink positions quite nicely between a central streaming bus and anything else to the right hand side of it.)
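To give a flavour of that SQL abstraction, here's a small PyFlink sketch that reads an event stream from Kafka with Flink SQL and aggregates it into hourly revenue. The schema, topic and addresses are invented, and it assumes the Flink Kafka connector and JSON format jars are available on the classpath:

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

# Streaming-mode table environment; the same SQL also runs in batch mode.
t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Declare the Kafka topic as a table (illustrative schema and connection details).
t_env.execute_sql("""
    CREATE TABLE orders (
        order_id BIGINT,
        customer_id BIGINT,
        total DOUBLE,
        ts TIMESTAMP(3),
        WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'orders.events',
        'properties.bootstrap.servers' = 'localhost:9092',
        'format' = 'json'
    )
""")

# Continuous hourly aggregation over the stream.
t_env.execute_sql("""
    SELECT customer_id,
           TUMBLE_START(ts, INTERVAL '1' HOUR) AS window_start,
           SUM(total) AS revenue
    FROM orders
    GROUP BY customer_id, TUMBLE(ts, INTERVAL '1' HOUR)
""").print()
```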

The core purpose of Flink was to be a stream processing engine, but as is often the case, it seems the market is finding a different, and perhaps more widely applicable, fit for it. Its flexibility in working with both streaming and batch makes it attractive in the often-varied environments where ETL tooling becomes painful. The ability to write code is attractive for handling complex scenarios, while the comprehensive SQL abstraction makes it adoptable by teams who don't have the resources to be effective with, or simply don't need, that level of control. And the huge range of integrations makes it a no-brainer. Perhaps what we'll see is that teams initially adopt Flink to solve ETL, and then ramp into stream processing use cases in the future.

Automation and Data as Code/Config (DaC)

"Do you have a Terraform provider?"

When I'm repping Tinybird (a real-time data platform) at an event, I get asked this question a lot. People want to know if they can deploy and build with the platform using Terraform.

To be honest, I'm still surprised at the level of adoption Terraform has seen in data teams, but it seems quite well entrenched now. Though I do wonder if that will change with IBM's acquisition of HashiCorp. Do people really want to be beholden to Ol' Big Blue?

I've seen this idea of 'Data as Code' (or 'as Config') deployed to spectacular effect. It's a pretty simple concept: every part of a Data Platform is defined in files - some folks call it Code, others Config; generally it's the same idea - including the platform itself, schemas, integrations, queries, jobs and all artifacts of actual use cases.

Perhaps the most striking benefit is the impact it has on how people work.

If you are in a data team, or working with one, you'll probably be familiar with the pains around collaboration - who performs what work, who owns it, who reviews it, who supports it, etc. Some teams have gone for clear centralization, where the data team is the gatekeeper to everything "data", while others have gone the 'data mesh' route and federated as much as possible into domain teams.

Both approaches have their benefits, but in reality, neither has perfectly solved every pain.

This (mostly technical) change improves both the centralized and decentralized ways of working, and it also opens a nice path to "controlled decentralization".

There are significant benefits to a data team being able to make well-informed decisions about data infrastructure, and centralizing knowledge and experience makes it easier to support and appropriately resource work. But we're all aware of the discussion around 'domain expertise' and the push for data teams to be 'closer to the business'. This makes sense, but we've been saying it for over a decade and it isn't happening.

I think it's pretty unrealistic to expect that data teams will become self-sufficient in business domain knowledge. Businesses are too different, even within a single industry, let alone across industries (and most data engineers don't stay in the same industry their whole career).

Instead, this DaC model allows us to centralize knowledge while federating responsibilities. The responsibility for defining and owning the data platform is given to the data team, who are best placed for it. However, that knowledge is centralized in a repository that is open to all. Similarly, the specifics of building use cases are federated to the teams who understand them best, but they also push that knowledge to the central repository. The repository now becomes a place for those two teams to collaborate. The business teams can work at their own pace to build functionality, while the data team can put guardrails in place that allow for most 'overhead' work (e.g. deploying) to be automated.
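As an illustration of those guardrails (this is a sketch, not any particular platform's tooling; the repo layout and the platform-cli command are stand-ins for whatever you actually use), the data team could own a small script in the shared repository that validates and deploys every definition file when a change is merged:

```python
import pathlib
import subprocess

# Hypothetical layout: each domain team commits its schemas, queries and jobs as
# files under its own directory; the data team owns this script and its checks.
REPO_ROOT = pathlib.Path(".")
DEFINITION_SUFFIXES = {".sql", ".yaml"}


def validate(path: pathlib.Path) -> None:
    # The guardrails live here: linting, naming conventions, cost checks, etc.
    if path.stat().st_size == 0:
        raise ValueError(f"{path} is empty")


def deploy(path: pathlib.Path) -> None:
    # 'platform-cli deploy' stands in for your platform's real CLI or API call.
    subprocess.run(["platform-cli", "deploy", str(path)], check=True)


for definition in sorted(REPO_ROOT.rglob("*")):
    if definition.suffix in DEFINITION_SUFFIXES:
        validate(definition)
        deploy(definition)
```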