Data is Meh – State & Dynamic Relationships Count
by Simon Crosby, on Jun 9, 2021 9:15:00 AM
Streaming events are updates about the states of things: applications, devices & infrastructure. When choosing an architecture to process them, the role of an event broker, like Apache Kafka or Pulsar, is limited. Brokers, like databases, don't run applications. The transformation of events into a meaningful application semantic is always stateful, and the key is what you do with the data. But
- Data is noisy – Many real-world sources are noisy and repetitive; and large numbers of data sources add up to a huge amount of data. If raw data is delivered as a stream of pubs from edge devices, the cost of transporting raw data to the broker and then to analysis can be huge. Storing data at a broker and managing replication in a broker cluster is expensive and complex.
It is important to clean data feeds at the earliest possible opportunity. Cleaning is not simply data reduction. By analyzing, learning and predicting on the fly we can achieve a massive reduction in volumes.
In the largest Swim deployment we are aware of, over 4PB (that’s 4x1015B) of streaming data per day are used to analyze, learn and predict connection qualities for mobile devices, to optimize a carrier’s network. Transporting and storing this volume of data to a cloud is simply unthinkable. Instead, learning on-the-fly, close to data sources allows us to achieve a factor of 105 reduction both by reduction of data volumes and by predicting the future states of over 150M devices.
- State matters, not raw data – Streaming sources never stop – but analysis is dependent on the meaning of those events, or the state changes that the data represents. Even for simple “streaming ETL or STL” de-duplication of repetitive events requires a stateful processing model. This means that the event processor must manage the state of every object in the data stream. The state should of course be persisted for resilience, but since storage is slow, analysis, learning and prediction can proceed at memory speed concurrently with writes to disk.
Beyond simple transformations, analysis, learning and prediction on-the-fly let representations of state to be very powerful. As an example, a truck might be “predicted to be within” range of an inspector. The volume of data required to make the prediction might be huge, but the state required to represent it is tiny – just a few bits.
Databases can't analyze data as it is produced, and this is exacerbated in large distributed systems where concurrency, security, locking and consistency are managed by the database. While newer databases are able to scale well, they are always an additional network hop away, which introduces additional per-event latency of a factor of 106 longer than an in-memory update of a model that manages concurrency through powerful extensions, such as the Actor paradigm.
- Using brokers to manage storage is problematic – It is tempting to use event partitions as durable storage. However, at best the result is a time-series of raw data which is difficult to search or process other than through the broker. Using brokers to manage tiered archival storage is a bad idea because it detracts from their key purpose – timely delivery of new events.
- Data transfer
- Kafka/Pulsar (and Streaming) deliver raw data. Data must be transported to the broker cluster and then the streaming application needs to sift through the data to understand the relevant state changes in real world infrastructure before
- Swim uses an “analyze then store” architecture in which concurrent, stateful web-agents act as digital twins of real-world entities. Web-agents are resident in-memory and distributed over fabric of compute resources. Each statefully consumes raw data from the real world, and the set of web agents is a mirror of the real world.
- Data-to-state conversion
- In pub/sub streaming the application developer must explicitly manage data-to-state conversion, state storage, and (perhaps separately) analysis. This is complex, difficult to maintain, and understand.
- Swim applications are easy to build: The developer provides a simple object relationship model between the “things” in the data. When raw streaming data from some new entity in the real-world arrives, Swim creates a concurrent web agent – a digital twin – for the entity. The web agent concurrently processes raw data from that source and always maintains its current state. Web agents fluidly link to each other to express real-world relationships like containment and proximity; but linking also allows web agents to see each other’s current state, in real-time. In this way Swim builds the model of the real-world from data, and simultaneously translates data to state changes – simplifying analysis
- Pub/sub systems don’t analyze data
- Kafka/Pulsar use centralized (clustered) architectures for all event handling. This is separate from stream processing and analysis. A separate application performs analysis, perhaps using Apache Spark. The pub/sub system itself does not support analysis and has no analytical or machine learning algorithms built in.
- Swim web agents analyze, learn and predict using a rich set of analytical capabilities provided by the platform.
- Pub/sub systems are not processing platforms so their support for windowing, triggering and other time / event aligned processing is weak. Processing that is not aligned with arrival of events is messy.
- Swim transforms data into state transitions in real-time. Web agents are concurrent and can compute at any time and processing is not triggered by data arrival. Analysis can occur on a different time scale from data delivery / data-to-state conversion.
It’s important to note that Swim can consume data from Kafka/Pulsar, publish to them, or operate as a stream processor that builds a model of web agents on the fly from streaming updates. Swim provides the data-to-state conversion and analysis of data, before saving insights in a data lake or delivering them to other apps.