The (lacking) future of ETL
ETL provides no real value; it’s just really expensive glue. So-called “Zero-ETL” has the potential to free up a lot of $$$ that could be put towards something actually useful.
That said, we are a long, long way off from this becoming a reality. ETL is going to be around for a long while yet.
I think we’ll see stepped adoption: smart data vendors will begin to build “ETL” into their products, just as a value-add feature, simplifying their users’ stacks and enabling easier adoption & consumption of their tool.
Slower/larger vendors will either buy the ETL vendors in the market, or create a managed service from FOSS tools. They’ll package it up to heavily incentivise it over anything else.
The ETL vendors that don’t get bought will struggle to convince people that they need to part with their cash for something that should just be a feature and doesn’t provide any value of its own.
Adjacent to this, we’ll see more adoption of common backends, storage layers and protocols (plus a bunch of new ones, some good, some just cashing in on the hype train): common storage layers, data formats and table structures (Apache Iceberg, Delta Lake), and common interfaces (DuckDB, Apache Arrow).
The developers of these systems will be freed from re-building yet another data format, ser/de layer or network IPC, and can focus on building the bits that actually “do something different” or “do something useful”.
The developers using these systems will be freed from needing to learn yet another ETL tool, building pipelines, maintaining more infra…and can focus on making their data valuable.
The business will be freed from the ever-increasing spending bloat of ETL, and can instead invest that $$$ in properly utilising, or adopting new, tools that let them do new things.
Again, we’re a long way off from this being a reality, and there’s going to be a lot of marketing noise from vendors trying to convince you that they already have Zero-ETL (they don’t).
But it’s an exciting vision of the future.
DuckDB, Apache Arrow and Apache Iceberg are going to be core to the future of data tooling.
The storage layer is ripe for innovation. Each cloud vendor has a blob storage service, and they’re all pretty old, with limitations (particularly around speed) that mean they aren’t keeping up with everything else. There needs to be innovation here, ideally with a standardised API.