Using Swim with Apache Kafka
by Simon Crosby, on May 11, 2021 9:15:00 AM
Event Streams are the Future of Big Data
(Tune in to my presentation to the Apache Kafka EU Summit on Wednesday 12 May, to learn how to build a Swim app in 10 minutes!)
Organizations want to derive continuous intelligence from their data streams for competitive advantage. Streams of events from their “always on” infrastructure, products and applications hold the clues they need, but most fail to find them in time. It’s not getting easier: Over 2M “smart” devices are connected per hour, and there are a billion mobile devices in our hands. Most data is only useful for a short time, but streams never stop.
How can users get continuous intelligence to inform and automate decisions in real-time? The answer relies on two key components - at the infrastructure and application layers.
Apache Kafka is a scalable event streaming platform that tames and unifies access to disparate streams. It enables any number of sources publish events about topics to a broker, and any number of applications to subscribe to topics of interest. Subscribers receive events in a topic ordered by arrival time. The simplicity of pub/sub, combined with Kafka’s robustness, allow a single infrastructure platform to support a diverse set of applications. Pub/sub is a unifier – receiving, storing, and buffering asynchronous events with consistent semantics - that frees application teams from the burden of event stream management.
Applications need a runtime platform that ensures they:
- Always have an answer (from the latest data): Algorithms must analyze, train and predict continuously - without supervision - on boundless event data. Analysis must be driven by arriving data, not by queries.
- Continuously and statefully analyze events:
- Each new event must be analyzed in real-time
- Analysis must be continuous since data streams are boundless.
- Insights should be continuous and form a real-time stream fed back to the broker
- Analyze events in an application context:
- Fluid relationships between data sources - like containment or geospatial proximity, and computed relationships like correlation - are critical for applications that reason about the joint meaning of events from many sources
- Applications need to be able to automatically build and continuously modify a model of the real-world – directly from the streaming data itself – because the real-world changes continuously
Our goal is to make it easy for application teams to process event streams in real-time, and in the context of an automatically created application layer model in which analysis, learning and prediction are done on-the-fly.
Stream Processing Application Requirements
Brokers don’t execute applications - they act as a buffer between the real world and applications. An event stream processor understands the semantics of events and performs analysis of the stream. It may deliver data to a machine learning pipeline, or be used to modify a relational or graph database: Ultimately the reasoning about the meaning of an event or set of events requires a stateful system model that captures the states of all entities that the application needs to analyze. Unfortunately, pub/sub introduces a few challenges:
- The app cannot control the arrival rate of events, so if it is too slow, it will fall behind the real world and results will be useless;
- How many topics does an app need? If more than one source adds to a topic, the app may have to process unimportant events before finding one that’s important, making it tricky to bound response times. But, if there’s only one source per topic the broker limit on the number of topics may get in the way.
Application layer insights that rely on logical or mathematical relationships between data sources and their states, are not the domain of the broker. Applications need a stateful model, which inevitably requires a database of some sort and makes response times dependent on the database round-trip time (RTT). It also poses another challenge: If the data-driven part of the app is tasked with updating the database, what triggers and runs analysis?
SwimOS: Continuous Intelligence from Event Streams
SwimOS is an Apache 2.0 licensed stream processor (in Apache Kafka terms) that delivers continuous intelligence from streaming event data. SwimOS
- Makes it easy to develop applications that statefully process streaming data on-the-fly to deliver high resolution insights for visualization, applications and persistence, analyzing, learning and predicting on-the-fly using an “analyze, act, and then store” architecture,
- Auto-scales application infrastructure based on event rates and computational complexity to meet real-time processing needs,
- Is fully integrated into the DevOps application lifecycle,
- Delivers unimaginable performance for real-time analysis, learning and prediction leveraging in-memory stateful processing and powerful stream-centric analysis that is optimized for real-time processing.
SwimOS uses an “analyze, act, and then store” architecture: This architecture reduces data volume, and delivers continuous analysis and prediction, while offering an application speedup of over 95% on less than 10% of the hardware of other stream processors. Insights are continually available – enabling a real time response.
The resulting graph is a bit like a “LinkedIn for things”: actors that are digital twins of real-world assets dynamically inter-link to form a graph, based on real-world relationships discovered in the data. These relationships may be specified by the developer (eg: an intersection contains lights, loops and pedestrian buttons). But data builds the graph: as data sources report their status, they are linked into the graph. Linked agents see each other’s state changes in real-time. Web agents continuously analyze, learn and predict from their own states and the states of agents they are linked to, and stream granular insights to UIs, brokers and enterprise applications.
SwimOS benefits from in-memory, stateful, concurrent computation that yields several orders of magnitude performance improvement over other stream processors. Streaming implementations of key analytical, learning and prediction algorithms are included in SwimOS, but it is also easy to interface to applications on Spark, where it replaces Spark Streaming mini-batches with an in-memory, stateful graph of Web Agents that can directly deliver RDDs to Spark, simplifying applications considerably.