Let Streaming Events Build Your Dataflow Pipeline - On-the-fly

by Simon Crosby, on May 26, 2021 9:00:00 AM

Brokers don’t run applications - they just act as a buffer between the real world and an application that analyzes them. An event stream processor (in Apache Kafka/Pulsar terms) or dataflow pipeline is an application that, given the event schema, analyzes the stream continuously to derive insights. 

Such applications are necessarily stateful: They continuously modify a system model and deliver insights that result from analyzing the effects of state changes on the part of data sources in context of the previous state of the system.  Any useful analysis requires knowledge of the past, and may span many events or data sources.  Building a dataflow pipeline to analyze events is what building a stream processing application is all about.  

Different toolsets have a huge effect on the ease of application development and the quality of insights that a stream processor can deliver.  The actor model (Swim, Akka) has substantial advantages over a DevOps-y Micro-Service-y approach using, say, Apache Flink or even a streaming SQL-based approach for event stream analysis because 

  • The actor runtime makes it easy to build a stateful dataflow pipeline directly from the event stream by dynamically creating and interconnecting actors to process events.  In contrast, a predefined set of entity relationships (a Schema) cannot track fluid relationships 
  • Relationships between data sources can be easily modeled in the actor paradigm by enabling related actors to communicate:  In Akka through message passing and in Swim through dynamic links.
  • Understanding the real world always involves finding the joint state changes of multiple data sources, often in time/space (eg: they may be correlated, or geospatially “near”). Actors that are dynamically related (as changes occur) can be identified and wired up by the runtime to build a system model, on the fly
  • Deep insights always rely on finding and evaluating the effects of joint changes: Traffic prediction using real-time learning over the states of all sensors at an intersection and at all neighboring intersections enhances accuracy.  When a sensor changes state this affects both predictions for its own intersection and all neighboring intersections.

Applications that rely on logical, mathematical or learned (it's just different math) relationships between data sources need a stateful model of the system, and it’s common to find a database of some sort at the heart of every dataflow pipeline.  Update latencies are dependent on the database round-trip time, which is inevitably a million times slower than the CPU and memory, but there’s an even more tricky challenge: If events drive database updates, what triggers and runs the resulting analysis?  To be clear, it’s simple to update a database row for a source when the app receives an event reporting a state change, but the computation of the effects of the change on the system - the resulting dependencies - is impossible, because a database doesn’t store those.  Time - and behavior over time - play a critical role in understanding any dynamic system, and databases (including time-series databases) don’t help with analysis over time.

The types of applications that can be built with and executed by any stream processing platform are limited by how well the platform manages streams, computational state, and time.  We also need to consider the computational richness the platform supports - the kinds of transformations that you can do - and the contextual awareness during analysis, which dramatically affects performance for some kinds of computation.