The Real World is Stateful

by Simon Crosby, on Feb 26, 2019 10:49:31 AM

The web and IaaS clouds only work because of REST and stateless computing, plus a couple of other infrastructure abstractions: load balancing and a database of some sort. The idea is simple: a load balancer at the service URI picks up a request and forwards it to a server that has capacity to serve it. (The server could be running a “serverless” stack, another layer of indirection that permits any worker to dynamically bind to and run your application code to serve the request.)



"Serverless is cool and all, but the real world is stateful," says Swim CTO, Simon Crosby.

Necessarily the computational model is stateless – there is no way to preserve program state between invocations of the same code - because the value of the abstraction arises from late binding – “for each new request, find a server that can do the work, find the service code, run it”. Moreover, for an application at scale, many servers might simultaneously be executing the same service – and synchronizing execution state between them would be infeasible. Any state the application needs to keep across requests must be saved in a database. Scaling database services to deal with web-scale applications has been a major focus of the cloud community and vendors for a decade. I recall a conversation with Amazon CTO Werner Vogels in 2009 where he likened their efforts to “… swapping out a Cessna for a Boeing in mid-flight, without losing any passengers”. By any metric these efforts have been profoundly successful.
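To make the pattern concrete, here is a minimal sketch of a stateless handler. Everything in it is an assumption made for illustration: the `database` dict stands in for a real external database, and `handle_request` and the sensor readings are hypothetical names, not part of any particular framework.

```python
from typing import Dict

# Stand-in for an external database: a plain dict keyed by entity id.
# In a real deployment each read and write below would be a network round trip.
database: Dict[str, dict] = {}


def handle_request(entity_id: str, reading: float) -> dict:
    """Stateless handler: nothing survives in the worker between calls."""
    # 1. Load the entity's state from the database.
    row = database.get(entity_id, {"count": 0, "total": 0.0})
    # 2. Compute the new state from the old state and the incoming event.
    updated = {"count": row["count"] + 1, "total": row["total"] + reading}
    # 3. Save the state back before returning, so the next worker can find it.
    database[entity_id] = updated
    return {"entity": entity_id, "mean": updated["total"] / updated["count"]}


# Any worker can serve either call, because all state lives in the database.
handle_request("sensor-1", 10.0)
result = handle_request("sensor-1", 20.0)
```

Every event pays for a read and a write against shared storage, which is exactly the cost that grows with the number of devices.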

But will these abstractions continue to work in the world of the near future, where everything has a CPU, is connected, and produces a never-ending stream of data?  ARM now licenses 20BN CPU cores per year.  Can we scale database technology by several more orders of magnitude?  And even if we could, will it be acceptable for a device to experience a 50-100 ms delay per event due to RESTful services (serverless / app and database)?  Can we continue to retrofit a batch-centric analytical framework onto boundless data sources? I fear we’ve hit some limits.  There is good news though: for problems that are distributed, and that need real-time insights from streaming data, easy development and deployment, and affordable execution, Swim offers an answer.

Why Swim is Different

At its core, Swim makes a very simple architectural change: Stateful processing.  Instead of storing the state of each thing (entity) in a row of a centralized database, Swim adopts a simple, well-established pattern: For each thing found in the data, Swim creates an active digital twin (we call them web-agents, but I’ll stick with twin for now) close to the data source, that statefully processes each event.  The digital twin changes state at the same rate as the real-world thing – using local computation at local CPU & memory speed.  In this way Swim automatically builds a distributed digital twin model that is always a mirror of the real world. 
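The twin-per-entity idea can be sketched in a few lines of plain Python. This is not Swim's API: the `Twin` class, the `dispatch` function, and the traffic-light ids are hypothetical names chosen for illustration, and a single in-process dict stands in for what would really be a distributed set of agents.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Twin:
    """Hypothetical in-memory twin of one real-world thing."""
    entity_id: str
    count: int = 0
    total: float = 0.0
    last: float = 0.0

    def on_event(self, reading: float) -> None:
        # State is updated in place, in memory, with no database round trip.
        self.count += 1
        self.total += reading
        self.last = reading

    @property
    def mean(self) -> float:
        return self.total / self.count if self.count else 0.0


twins: Dict[str, Twin] = {}


def dispatch(entity_id: str, reading: float) -> Twin:
    """Route an event to its twin, creating the twin on first sight."""
    twin = twins.get(entity_id)
    if twin is None:
        twin = twins[entity_id] = Twin(entity_id)
    twin.on_event(reading)
    return twin


# The model is built by the data: twins appear as new entity ids show up.
for entity_id, reading in [("light-7", 3.0), ("light-9", 8.0), ("light-7", 5.0)]:
    dispatch(entity_id, reading)
```

Here the state for `light-7` stays resident between events, so processing an event costs a dictionary lookup and a few additions rather than a read and write against shared storage.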

Each digital twin replaces a row for the equivalent entity in the database, and when the twin processes data from its real-world sibling, it modifies its own state - effectively updating some columns.  Rather than a single database, with locking, scaling, resilience, replication, and distribution challenges, Swim thus adopts a distributed, stateful model of digital twins, each of which processes its own data, lock-free and at memory speed. The model is built by the data, as it is generated.  The states of the distributed digital twins are effectively the same, in aggregate, as the database that they replace.  Except that now they can each independently change state, compute and communicate with each other – at memory speed on local CPUs, across a distributed, real-time model of the evolving real-world environment.
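One way to picture the row-and-column analogy, and twins talking to each other, is the sketch below. It assumes a hypothetical `EntityTwin` whose `state` dict plays the role of a row's columns, and a hypothetical `RegionTwin` that other twins push updates to. Direct method calls stand in for network links, and none of these names come from Swim itself.

```python
from typing import Callable, Dict, List


class EntityTwin:
    """Hypothetical twin whose state dict plays the role of one table row."""

    def __init__(self, entity_id: str) -> None:
        self.entity_id = entity_id
        self.state: Dict[str, object] = {}
        self.listeners: List[Callable[[str, dict], None]] = []

    def link(self, listener: Callable[[str, dict], None]) -> None:
        self.listeners.append(listener)

    def on_event(self, **columns: object) -> None:
        # "Updating some columns": only the fields in the event change.
        self.state.update(columns)
        # Push the new state to every linked twin.
        for listener in self.listeners:
            listener(self.entity_id, dict(self.state))


class RegionTwin:
    """Hypothetical aggregating twin fed by the entity twins linked to it."""

    def __init__(self) -> None:
        self.latest: Dict[str, dict] = {}

    def on_update(self, entity_id: str, state: dict) -> None:
        self.latest[entity_id] = state

    def snapshot(self) -> List[dict]:
        # In aggregate, the twins' states look like the rows of a table.
        return [{"id": eid, **state} for eid, state in sorted(self.latest.items())]


region = RegionTwin()
bus1, bus2 = EntityTwin("bus-1"), EntityTwin("bus-2")
bus1.link(region.on_update)
bus2.link(region.on_update)

bus1.on_event(speed=28)
bus2.on_event(speed=12)
bus1.on_event(speed=30, stop="A")
rows = region.snapshot()
```

The `rows` list is what a query against the equivalent table would return, but it is assembled from state the twins already hold.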

Of course, you can save the data if you want – but there’s no good reason to do so on the “hot path”.  Data storage for later use can occur asynchronously.  Instead, each digital twin can perform streaming analysis, share data, and deliver streamed updates to other digital twins, applications and users - with the latency of a single network hop.  
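Keeping storage off the hot path might look something like the following sketch. It is built on assumptions: `store` is a list standing in for durable storage, a background thread drains a queue of pending writes, and subscribers are in-process callbacks standing in for streamed updates over the network. The `Twin` class and the `pump-3` id are hypothetical.

```python
import queue
import threading
from typing import Callable, List, Tuple

store: List[Tuple[str, float]] = []   # stand-in for durable storage
pending: queue.Queue = queue.Queue()  # writes waiting to be persisted


def writer() -> None:
    """Background thread: persists events without blocking the twins."""
    while True:
        item = pending.get()
        if item is None:  # sentinel value: stop the writer
            break
        store.append(item)


class Twin:
    """Hypothetical twin that analyzes in memory and persists asynchronously."""

    def __init__(self, entity_id: str) -> None:
        self.entity_id = entity_id
        self.peak = float("-inf")
        self.subscribers: List[Callable[[str, float], None]] = []

    def on_event(self, reading: float) -> None:
        # Hot path: streaming analysis on in-memory state.
        self.peak = max(self.peak, reading)
        # Hot path: push the result to subscribers straight away.
        for notify in self.subscribers:
            notify(self.entity_id, self.peak)
        # Off the hot path: enqueue the raw event and return immediately.
        pending.put((self.entity_id, reading))


thread = threading.Thread(target=writer, daemon=True)
thread.start()

updates: List[Tuple[str, float]] = []
twin = Twin("pump-3")
twin.subscribers.append(lambda eid, peak: updates.append((eid, peak)))

for reading in (1.5, 4.0, 2.5):
    twin.on_event(reading)

pending.put(None)  # ask the writer to finish
thread.join()      # wait here only so the example can inspect the store
```

Subscribers see each new peak as soon as the event is processed, while the write to `store` happens whenever the background thread gets to it.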

Learn More

Swim is open source software that offers everything you need to build massively scalable, distributed, real-time edge applications.  You can get Swim here.
