Brokers Aren't Databases
by Simon Crosby, on Jun 28, 2021 8:15:00 AM
The rise of event streaming as a new class of enterprise data that demands continuous analysis is uncontroversial. What’s puzzling is the approaches being taken by the event streaming community to storage of event data and the semantics they seek to achieve for event processing applications.
In a recent post, Apache Kafka community leader Guozhang Wang celebrates the (laudable) achievement of the Kafka community on delivering strong consistency guarantees for applications that process streaming events and deliver their outputs back to the broker. Implicit in this approach is the idea that application developers can treat the broker and its partitions like another database, and that database semantics are important for event streaming applications. (See this post on why coherence is a preferable goal for continuous intelligence applications).
You can imagine the appeal: a database that also is a broker for streaming data (Redis already does it) but it's a bad idea.
There is a large class of applications for which the gold standard of exactly-once semantics is crucial. These are applications that process events whose consequences are critically important - like withdrawals from my bank account. Perhaps real-world events such as systems that scream “fire!” exactly once should be included too.
Superb distributed databases already exist that offer exactly-once semantics and they avoid the complexity of forcing the broker and its arrival-time ordered logs of events in topic partitions, behave like a database engine.
Brokers are buffers between data sources in the real world and applications that consume those events asynchronously. That's hard enough. Using the broker as the output destination for new events that result from application processing on events forces the broker infrastructure to assume an additional role of being a loss-free, “forever” data store, in addition to a time-series event manager.
Using a broker to store all events for all time is also a bad idea. What matters is to process real-world events into durable insights, fast.
Do you need to store all events for all time? I hear unthinking engineers say “yes” and then I show them a Swim application that analyzes, learns and predicts from 4PB per day - delivering insights continuously in milliseconds per event - yet more than any organization could ever store. In future every organization will be inundated with streams of data from products, customers, infrastructure and suppliers. Unless you can turn the data quickly into durable insights, you’ll be stuck with huge storage costs and not much intelligence.
Making the broker a transactional database imposes substantial complexity and cost:
- Using a broker and its log-based event partitions as the state store for an application (independent of the database semantics) results in yet another round-trip to the broker and its disk-based storage infrastructure. The result is delay - and an application that runs a million times slower than the CPU and memory.
- It omits the most important class of applications that process events to enable automated responses. These applications need to analyze events, learn and predict immediately, to stay in sync with the real world. Delay is the enemy, and exactly-once semantics cause delay. They need to react instantly, accurately and in sync with the real-world.
Events from real-world data sources that are treated as ACID transactions impose significant processing consequences on the broker that ultimately limit throughput and increase latency. Continuous intelligence applications need to continuously analyze, learn, and predict “on-the-fly” and are driven by event streams that can’t be stopped, so insisting on exactly-once semantics impacts throughput, and will cause the application to drift in time relative to the real world. And if you think that all you need is more hardware or a bigger cloud, NOPE. Applications that respond to the real world to drive automation have to stay in-sync, and every disk access or network hop takes millions of CPU cycles - maybe tens of milliseconds. That’s too slow.
You can’t solve problems in the domain of continuous intelligence with a broker that has to keep up with the real world and enforce exactly-once processing semantics. You can solve these problems using actors (Akka streams, Swim) that continuously and concurrently compute.
Swim is intuitive, easy to use and blazingly fast. A distributed Swim application automatically builds a graph directly from streaming data: Each leaf is a concurrent actor (called a Web Agent) that continuously and statefully analyzes data from a single source. Non-leaf vertices are concurrent Web Agents that continuously analyze the states of data sources to which they are linked. Links represent dynamic, complex relationships that are found between data sources (eg: “correlated”, “near” or “predicted to be”).
A Swim application is a distributed, in-memory DAG in which each vertex concurrently and statefully evolves as events flow and actor states change. Distributed analysis is made possible by the WARP cache coherency protocol that delivers strong consistency without the complexity and delay inherent in database protocols. Continuous intelligence applications built on Swim analyze, learn, and predict continuously while remaining in sync - coherent - with the real-world, benefiting from a million-fold speedup for event processing while supporting continuously updated, distributed application views.