Brokers Don't Run Applications...
by Simon Crosby, on Jun 14, 2021 9:00:00 AM
Streaming data contains events that are updates to the states of data sources. When choosing an architecture to process it, the role of an event broker, like Apache Kafka or Apache Pulsar, is crucial - but necessarily limited. The problem is that all your nicely separated event streams get mixed up by the partitions of the broker cluster - a bit like the M&Ms picture above. Brokers are great at unifying access to streaming data, but applications then have to pick it apart to make sense of it. Applications are always stateful, and stateful analysis of streaming data, on-the-fly, simplifies a number of challenging problems
- Data is noisy – Many real-world systems are noisy and repetitive; and large numbers of data sources add up to a huge amount of data. If raw data is delivered as a stream of pubs from the edge, the transport cost can be huge. Similarly, storing data at the pub/sub broker can become expensive. This is nontrivial in the case of Kafka because the commit log needs to be replicated and can exceed the capacity of a server. It is important, therefore to clean data feeds at the earliest possible opportunity. The problem is exacerbated if one source is verbose, and another critical component is terse.
- State matters, not data – Streaming environments never stop producing data but analysis is dependent on the meaning of those events - the state changes that the data represents. Even de-duplicating repetitive updates requires a stateful processing model. This means that the stream processor must manage the state of every object in the data stream. The state must be persisted for resilience. For large numbers of objects, potentially many server instances could be required to perform stateful updates. This immediately forces the state store into a separate database so that many instances can work in parallel on the same state store. This complexity is a strong negative when compared to the basic promise of pub/sub – namely a simple producer-consumer pattern.
- Using brokers to manage storage is problematic – It is tempting to use the event partitions as durable storage for data. However, at best the result is a time-series of raw data which is difficult to search or process. Using brokers to manage tiered archival storage is a bad idea because it detracts from their key purpose – timely delivery of new events.
- Brokers manage the flow of event data. Event data must be transported to the broker cluster, and then the streaming application needs to sift through it to understand the relevant state changes
- Swim uses an “analyze then store” architecture in which concurrent, stateful web-agents act as digital twins of real-world entities. Web-agents are resident in-memory and distributed over fabric of compute resources. Each statefully consumes raw data from the real world, and the set of web agents is a mirror of the real world.
- In most pub/sub architectures the application developer must explicitly manage data-to-state conversion, state storage, and (perhaps separately) analysis. This is complex, difficult to maintain, and understand.
- By contrast, Swim applications are easy to build: The developer provides a simple object relationship model between the “things” in the data. When raw streaming data from some new entity in the real-world arrives, Swim creates a concurrent actor - a web agent / smart digital twin – for the entity. The web agent concurrently processes raw data from that source and always maintains its current state. Web agents link to each other to express real-world relationships like containment and proximity; but linking also allows web agents to see each other’s current state, in real-time. In this way Continuum builds the model of the real-world from data, and simultaneously translates data to state changes – simplifying analysis
- Remember: pub/sub systems don’t analyze data
- Kafka/Pulsar use centralized (but clustered) architectures for event handling. This is separate from stream processing and analysis. An application performs analysis. Brokers don't run applications.
- Rightly so, their support for windowing, triggering and other time / event aligned processing is nonexistent.
Swim is an application platform that builds models of the real-world from streaming dat and transforms events into state transitions in real-time. Web agents are concurrent and can compute at any time - processing is not triggered by data arrival. In particular, analysis can occur on a different time scale from data delivery / data-to-state conversion.
Swim can consume data from any broker (though it doesn't need one), publish to a broker or a database, and operates as a stream processor that builds a model of web agents on the fly from streaming updates from real-world sources. It supports rich contextual data-to-state conversion and analysis, before saving insights in a data lake or delivering them to other apps.