Forget Data - It's State That Matters

by Simon Crosby, on Aug 6, 2021 9:06:33 AM

Streaming data contains events that are updates to the states of applications, devices or infrastructure. When choosing an architecture to process events, the role of the broker, such as Apache Kafka or Pulsar, is crucial - it has to scale and meet application performance needs - but it's necessarily limited to the data domain.   Even using a stream processing capability such as Kafka streams that triggers computation for events that match rules, leaves an enormous amount of complexity for the developer to manage - and that's all about understanding the state of the system.  Here's why you care about the state of the system and not its raw data:

  1. Data is often noisy – Many real-world sources are noisy and repetitive; and large numbers of data sources add up to a huge amount of data. If raw data is delivered as a stream of pubs from edge devices, the cost of transporting raw data to the broker and then to analysis can be huge. Storing data at a broker and managing replication in a broker cluster is both expensive and complex.

    It is important to clean data at the earliest possible opportunity. Cleaning is not simply data reduction by throwing away repetitive raw data. By analyzing, learning and predicting on the fly you can achieve a massive (like 5 orders of magnitude) reduction in volumes.  Clouds charge for ingress data, but they flay you alive for egress.

    In the largest Swim deployment we are aware of, over 4PB (that’s 4x1015) of streaming data per day are ingested at a mobile provider's edge (MEC) to enable them to analyze, learn and predict connection qualities for mobile devices, and to optimize customer connection quality. Storing this volume of data is simply unthinkable. Instead, learning on-the-fly, close to data sources allows Swim to achieve a factor of 105 reduction both by reduction of data volumes and by predicting the future states of 150M devices.

  2. Applications care about state: Streaming environments never stop producing data –  but responses depend on the meaning of those events, or the state changes that the data represents.  This means that the event processor must manage the state of every object in the data stream. The state should of course be persisted for resilience, but since storage is slow, analysis, learning and prediction can proceed at memory speed concurrently with writes to disk. Beyond simple transformations, analysis, learning and prediction on-the-fly permit representations of state to be very powerful. As an example, a truck might be “predicted to be within” range of an inspector. The volume of data required to make the prediction might be huge, but the state required to represent it is tiny – just a few bits.

    Databases don't analyze data to find dynamic associations and relationships, and this is exacerbated in large distributed systems where concurrency, security, locking and consistency are managed by the database. While newer databases are able to scale well, they are always an additional network hop away, which introduces additional per-event latency of a factor of 106 longer than an in-memory update of a model that manages concurrency through powerful extensions, such as the Actor paradigm.

  3. Using brokers to manage storage is problematic – It is tempting to use event partitions as durable storage. However, at best the result is a time-series of raw data which is difficult to search or process other than through the broker. Using brokers to manage tiered archival storage is a bad idea because it detracts from their key purpose – timely delivery of new events. 

To summarize:

  1. Data transfer:
    Brokers operate on raw data. This means that data must be transported to the broker cluster and then the streaming application needs to sift through the data to understand the relevant state changes in real world infrastructure before analysis.  Vital, but it's the starter, not the meal.

    Swim uses an “analyze then store” architecture in which concurrent, stateful web-agents act as digital twins of real-world entities. Web-agents are resident in-memory and distributed over fabric of compute resources. Each statefully consumes raw data from the real world, and the set of web agents is a mirror of the real world.   

  2. Data-to-state conversion
    In pub/sub streaming the application developer must explicitly manage data-to-state conversion, state storage, and (perhaps separately) analysis. This is complex, difficult to maintain, and understand.

    Swim changes that.  Swim applications are easy to build: The developer provides a simple object relationship model between the “things” in the data. When raw streaming data from some new entity in the real-world arrives, Swim creates a concurrent web agent – a digital twin – for the entity. The web agent concurrently processes raw data from that source and always maintains its current state. Web agents link to each other to express real-world relationships like containment and proximity; but linking also allows web agents to see each other’s current state, in real-time. In this way Swim builds the model of the real-world from data, and simultaneously translates data to state changes – simplifying analysis

  3. Brokers don’t analyze data
    Most brokers use  centralized (but clustered) architectures for all event handling. This is separate from stream processing and analysis. A separate application performs analysis. Brokers don't support analysis and have  no analytical or machine learning algorithms built in.  Swim web agents analyze, learn and predict using a rich set of analytical capabilities provided by the platform or other analytical frameworks like Spark, Flink and so on.

  4. Time
    Brokers are not processing platforms so their support for windowing, triggering and other time / event aligned processing is weak. Processing that is not aligned with arrival of events is messy.

    Swim Web Agents are actors that  transform data into state in real-time. Web Agents are concurrent and can compute at any time - processing is not triggered by data arrival. Analysis can occur on a different time scale from data delivery / data-to-state conversion.

 

 

Topics:Actor Paradigmapache kafka

More...

Subscribe