Coherence vs Consistency for Continuous Intelligence Applications
by Simon Crosby, on May 6, 2021 9:30:00 AM
Applications that deliver continuous intelligence from streaming data must analyze a boundless stream of updates that can’t be stored or delayed, even during a network partition. The “fire and forget” nature of real-world events means updates could be lost, causing apps to fall out of sync and making automation risky.
There is a mismatch between our desire to support widely distributed applications and data sources, and their need to react instantly, accurately, and in sync with the real world. “Store then analyze” architectures, centered on a database, are a poor fit: data flows are boundless, and the value of events is ephemeral. The states of data sources and the relationships between them demand constant re-evaluation, but accessing disk or a networked database is too slow.
Events from real-world data sources shouldn’t be treated as ACID transactions: transactions are heavyweight, limit throughput, and increase latency. Eventually Consistent (EC) databases improve performance but sacrifice safety, making automation risky. And, as Pat Helland says, “Eventual consistency means nothing both now AND later. It does, however, confuse the heck out of a lot of people”. Strong Eventual Consistency (SEC) allows some distributed updates to occur concurrently. But applications that analyze, learn, and predict “on the fly” are driven by event streams that can’t be stopped, so partitioned replicas will surely fall out of sync. We want distributed applications that respond in real time and are robust to partitioning. As superbly articulated by Ben Stopford, we want applications that not only “know about the past” but can also project into the future, helping humans and automated systems anticipate change and respond fast.
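To make the SEC idea concrete, here is a minimal sketch (plain Python, nothing to do with Swim or any database) of a grow-only counter, a textbook SEC data type: replicas accept concurrent updates independently, diverge while partitioned, and converge once they can merge, because element-wise max is commutative, associative, and idempotent.

```python
# Minimal grow-only counter (G-Counter), a textbook example of Strong
# Eventual Consistency. Each replica increments only its own slot; merge
# takes the element-wise max, so replicas converge once they exchange
# state -- but a reader during a partition may see a stale total.

class GCounter:
    def __init__(self, replica_id, num_replicas):
        self.replica_id = replica_id
        self.counts = [0] * num_replicas

    def increment(self, amount=1):
        self.counts[self.replica_id] += amount

    def value(self):
        return sum(self.counts)

    def merge(self, other):
        # Element-wise max is commutative, associative, and idempotent,
        # so merges applied in any order yield the same converged state.
        self.counts = [max(a, b) for a, b in zip(self.counts, other.counts)]

# Two replicas take concurrent updates while partitioned...
a, b = GCounter(0, 2), GCounter(1, 2)
a.increment(3)
b.increment(5)
assert a.value() == 3 and b.value() == 5   # divergent during the partition

# ...and converge once the partition heals.
a.merge(b)
b.merge(a)
assert a.value() == b.value() == 8
```

Note what this buys and what it doesn’t: convergence is guaranteed *eventually*, but a continuously evaluated application reading either replica mid-partition acts on out-of-sync state — exactly the risk the article is concerned with.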
Fortunately, we can have our cake and eat it too by simplifying the system architecture. The result is applications that we call Coherent (a stream-centric “C” for CAP): they are distributed yet highly available, and “do the right thing” when partitioned. They offer strong safety and consistency in a local context (analyzing events close to data sources, where accuracy and real-time reactions are key), while offering SEC under partitioning. To achieve this, we eliminate transactions. Instead, remote materialized views are continuously re-evaluated using cached state managed by a cache coherence protocol that minimizes update delays to ½ RTT. Under partitioning, remote views that rely on unavailable state are made aware of the partition and can choose an appropriate strategy.
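The shape of such a view can be sketched in a few lines. This is an illustrative toy, not the actual coherence protocol: sources push updates one way into a local cache (hence roughly ½ RTT rather than a request/response round trip), the view is re-evaluated on every change, and when an input becomes unreachable the view is told and can choose a strategy — here, serving the last-known result flagged as stale. All names (`CachedState`, `MaterializedView`) are hypothetical.

```python
# Illustrative sketch (not Swim's protocol): a remote materialized view
# re-evaluated from locally cached state. Each cache entry tracks whether
# its link to the authoritative source is live; under partition the view
# knows which inputs are stale and picks a strategy.

class CachedState:
    def __init__(self, value=None):
        self.value = value
        self.coherent = True   # link to the source is live

class MaterializedView:
    def __init__(self, inputs, reducer):
        self.inputs = inputs      # name -> CachedState
        self.reducer = reducer    # recomputes the view from cached values

    def on_update(self, name, value):
        # A source pushed a new value (one-way, ~1/2 RTT): cache it and
        # re-evaluate the view immediately.
        entry = self.inputs[name]
        entry.value, entry.coherent = value, True
        return self.evaluate()

    def on_partition(self, name):
        # The coherence layer reports this source as unreachable.
        self.inputs[name].coherent = False
        return self.evaluate()

    def evaluate(self):
        stale = [n for n, e in self.inputs.items() if not e.coherent]
        value = self.reducer({n: e.value for n, e in self.inputs.items()})
        # Strategy choice: serve the last-known result, but flag staleness
        # so downstream automation can decide whether to act on it.
        return {"value": value, "stale_inputs": stale}

view = MaterializedView(
    {"s1": CachedState(0), "s2": CachedState(0)},
    reducer=lambda vals: sum(v for v in vals.values() if v is not None),
)
print(view.on_update("s1", 10))    # {'value': 10, 'stale_inputs': []}
print(view.on_partition("s2"))     # {'value': 10, 'stale_inputs': ['s2']}
```

The design choice worth noticing: instead of blocking on a transaction, the view always answers, and partition-awareness is surfaced as data the application can reason about.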
Swim is intuitive and easy to use, meets our architectural goals, and is blazingly fast. A distributed Swim application automatically builds a graph directly from streaming data: each leaf is a concurrent actor (called a Web Agent) that continuously and statefully analyzes data from a single source. Non-leaf vertices are concurrent Web Agents that continuously analyze the states of the data sources to which they are linked. Links represent dynamic, complex relationships found between data sources (e.g., “correlated”, “near”, or “predicted to be”).
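The graph shape described above can be sketched as follows. This is plain Python, not the Swim Web Agent API: one leaf agent per data source holds that source’s state, and a non-leaf agent links to leaves and re-evaluates a relationship (a hypothetical “near” predicate over scalar positions) every time a linked leaf’s state changes.

```python
# Sketch of the agent graph (not the Swim API): leaf agents hold per-source
# state; a non-leaf agent links to leaves and re-evaluates a relationship
# on every state change, rather than querying a database.

class LeafAgent:
    """Statefully analyzes events from a single data source."""
    def __init__(self, source_id):
        self.source_id = source_id
        self.position = None
        self.subscribers = []

    def on_event(self, position):
        self.position = position
        for agent in self.subscribers:       # push state change along links
            agent.on_link_update(self)

class ProximityAgent:
    """Non-leaf vertex: continuously evaluates 'near' across linked leaves."""
    def __init__(self, leaves, threshold):
        self.leaves = leaves
        self.threshold = threshold
        self.near_pairs = set()
        for leaf in leaves:
            leaf.subscribers.append(self)    # a 'link' in the graph

    def on_link_update(self, _changed):
        self.near_pairs = {
            (a.source_id, b.source_id)
            for i, a in enumerate(self.leaves)
            for b in self.leaves[i + 1:]
            if a.position is not None and b.position is not None
            and abs(a.position - b.position) <= self.threshold
        }

v1, v2 = LeafAgent("v1"), LeafAgent("v2")
prox = ProximityAgent([v1, v2], threshold=5)
v1.on_event(0)
v2.on_event(3)
assert prox.near_pairs == {("v1", "v2")}   # relationship holds
v2.on_event(100)
assert prox.near_pairs == set()            # re-evaluated on the next event
```

The point of the sketch is the dataflow: the relationship is recomputed in memory as events arrive, so the non-leaf vertex is always coherent with the latest states of its linked sources.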
A Swim application is a distributed, in-memory graph in which each vertex concurrently and statefully evolves as events flow and actor states change. Distributed analysis is made possible by the WARP cache coherence protocol, which delivers strong consistency without the complexity and delay inherent in database protocols. Continuous intelligence applications built on Swim analyze, learn, and predict continuously while remaining coherent with the real world, benefiting from a million-fold speedup in event processing while supporting continuously updated, distributed application views.