A look at how to build a scalable edge analytics solution that can accommodate a variety of IoT business cases and technical requirements and how to find the middle-ground between customisation and repeatability.
by Andrei Ciobotar, Analytics Director
Most IoT deployments can be summarized as a rather basic system that pulls data from devices, harmonizes and analyzes the data in the cloud or at the edge, and presents the resulting information to stakeholders in an efficient, simple-to-digest manner.
The potential variety of use cases seems endless at first glance: connected coffee machines, air preheaters, elevators and many more consumer-grade or industrial-grade devices and groups of devices. Ultimately, all deployments can still be condensed to the same edge – cloud – insights architecture.
However, if you zoom into each of these components, the complexity quickly becomes daunting. Devices are almost never identical, protocols come in a variety of flavors, and on-site constraints differ from project to project and facility to facility. Most importantly, the business-centric outcomes that relayr provides to its customers (and also defines together with them as part of the journey) make it essential to focus on pulling the right data, KPIs and metrics. It becomes challenging to imagine one product that could rule them all and accommodate all of these requirements.
At relayr, we believe in developing tailored solutions that deliver insights to power the right outcomes under the right constraints, for the right devices, but also in creating repeatability to scale our business. Repeatability also forces engineers to focus on coherent deployment pipelines and makes our software behavior more predictable and simpler to reason about. Instead of developing so-called “one-off” products, we focus our efforts on flexible, composable components with well-defined interfaces that allow us to quickly stitch together solutions tailored to each project.
The challenge with data
Not all data is created equal. Unfortunately, in a cloud-centric environment, what matters and what doesn’t only becomes apparent once our payloads reach the cloud and go through expensive data transfer, processing and analysis pipelines. Device connectivity is not always guaranteed, bandwidth is still a scarce resource and scaling a cloud up to accommodate high throughputs for a large number of devices is an expensive exercise.
Mitigating the above requires us to deliver gateway software at the edge that brings intelligence closer to the data. This intelligence can range from a simple “if-this-then-that” engine, which allows us to formulate rules that trigger behaviour based on incoming device data, to more complex set-ups that involve audio analysis, pattern matching, ongoing calibration of unsupervised machine learning algorithms and even inference and training for deeper supervised models, if the hardware allows it.
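The simple end of that spectrum can be sketched in a few lines. This is a minimal, illustrative “if-this-then-that” engine; the rule shape, field names and thresholds are assumptions for the example, not relayr’s actual API:

```python
# Minimal sketch of an "if-this-then-that" edge rule engine.
# Rule/payload field names (temperature, vibration_rms) are illustrative.
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Rule:
    condition: Callable[[Dict], bool]  # predicate over one device payload
    action: Callable[[Dict], str]      # behaviour triggered on a match


def evaluate(rules: List[Rule], payload: Dict) -> List[str]:
    """Run every rule against one incoming payload, collect triggered actions."""
    return [r.action(payload) for r in rules if r.condition(payload)]


rules = [
    Rule(lambda p: p.get("temperature", 0) > 90,
         lambda p: f"alert: overheating at {p['temperature']} C"),
    Rule(lambda p: p.get("vibration_rms", 0) > 4.0,
         lambda p: "alert: excessive vibration"),
]

print(evaluate(rules, {"temperature": 95, "vibration_rms": 1.2}))
```

Because the rules run on the gateway, a payload that triggers nothing never has to cross the network at all.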
As alluded to earlier, the large variety of data available at the edge, as well as the multitude of insights we may need to derive from it to solve the business problem, require us to break our edge analytics product apart into a building-block collection of functionality groups that we can compose into tailored setups.
What do an air pre-heater and a coffee machine have in common? Not a lot, but they both generate time-series data that we can try and analyze with the same software.
We’ve designed a solution centred around so-called “functionality blocks”. What this ultimately boils down to is taking higher-level data operations such as an audio pattern-matching engine, vibration analysis or anomaly detection, and breaking them down into lower-level reusable functionality blocks with well-defined input and output interfaces. These functionality blocks fall into the following categories:
- a data transformation: bucketing into predefined ranges, applying a filter over a rolling window, applying a mathematical operation over a rolling window, a fast Fourier transform (FFT – moving from the time/space domain to the frequency domain), statistical quantities etc.
- a data flow element: delaying data (e.g. only emitting the payload to the next block after n time units), or holding data until a condition is satisfied (e.g. a buffer has reached its desired size or a data point of a specific type has arrived)
- a correlator element: correlation or arithmetic operations against time series pairs
- a machine learning element: e.g. a neural network that is trained either on-edge or in the cloud and then shipped to the edge via OTA.
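To make the categories concrete, here is a sketch of one transformation block and one data flow element sharing a uniform `process` interface. The interface, class names and composition loop are illustrative assumptions, not relayr’s actual code:

```python
# Sketch of two reusable functionality blocks behind one interface:
# each block maps an input value to zero or more output values.
from collections import deque
from typing import List


class RollingMean:
    """Data transformation: mean over a rolling window of fixed size."""
    def __init__(self, window: int):
        self.buf = deque(maxlen=window)

    def process(self, value: float) -> List[float]:
        self.buf.append(value)
        return [sum(self.buf) / len(self.buf)]


class HoldUntilFull:
    """Data flow element: emit nothing until the buffer reaches its size."""
    def __init__(self, size: int):
        self.size, self.buf = size, []

    def process(self, value: float) -> List[float]:
        self.buf.append(value)
        if len(self.buf) < self.size:
            return []
        out, self.buf = self.buf, []
        return out


# Blocks compose freely because each only sees the values fed to it:
pipeline = [HoldUntilFull(3), RollingMean(2)]
outputs = []
for x in [1.0, 2.0, 3.0]:
    values = [x]
    for block in pipeline:
        values = [y for v in values for y in block.process(v)]
    outputs.extend(values)
print(outputs)  # the held batch is released, then smoothed pairwise
```

Note that neither block knows what the numbers mean: a temperature series and a vibration series flow through the same code unchanged.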
These blocks, in turn, need to satisfy several fundamental constraints:
- The data operations performed internally need to be expressed flexibly enough to allow us to “re-compose” the blocks depending on the project. In software architecture terms, this refers to a setup with low coupling and high cohesion. In more practical terms, this means that each data transformation module knows nothing about the inner workings of the other transformation modules and that our blocks are self-contained, independent and each has a single well-defined purpose. Blocks that are re-used in different setups and compositions across projects bring challenges with them: cascade effects when deploying code changes are especially difficult for our engineers to deal with.
- New data operations that arise as part of a customer engagement need to be translatable to other, potentially different, solutions if needed.
- Data transformations should not care about the specifics of the incoming data. Temperature, humidity and pressure are all time series, and we don’t care about their semantics.
And ultimately, our final software deliverable has traits of its own that it needs to satisfy:
- Its performance needs to scale. What this ultimately boils down to is the ability to run on low-IPC CPUs as a common denominator (e.g. Cortex-A53) while leveraging the increasing parallelism options we see on modern CPUs.
- It needs to be simple and cheap to update. The word “cheap” encompasses both the on-gateway software complexity of deploying updates as well as the network I/O required to move updates from our cloud to the edge through an over-the-air (OTA) system, with the cloud acting as a software distribution repository.
- We need to support incremental and guided change as a first principle across multiple dimensions in our software. This boils down to embracing the principles of evolutionary architecture even for low footprint edge software.
As an exercise of imagination, if we were to map our edge intelligence solution to a large-scale cloud solution, our data blocks would translate to microservices and different customer instances of our edge product would translate to tailored systems consisting of the same set or subset of microservices. Developing artifacts for the fitness functions of the blocks and their compositions is an exercise much akin to developing integration and system tests for a cloud system.
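In that spirit, a fitness function for a block is just an automated check that an architectural property still holds as the code evolves. A minimal sketch, where the stand-in block and the two properties are illustrative assumptions:

```python
# Hedged sketch of "fitness functions" in the evolutionary-architecture
# sense: executable checks over architectural properties of a block.
def scaled_magnitude(values):
    """Toy stateless transformation standing in for a real block."""
    return [abs(v) * 2 for v in values]


def fitness_stateless(block, sample):
    """Property: a pure transformation gives identical output on replay."""
    return block(list(sample)) == block(list(sample))


def fitness_length_preserving(block, sample):
    """Property: output length matches input length (interface contract)."""
    return len(block(list(sample))) == len(sample)


sample = [1.0, -2.0, 3.0]
print(all(f(scaled_magnitude, sample)
          for f in (fitness_stateless, fitness_length_preserving)))
```

Run against every block in CI, such checks play the role that integration and system tests play for a cloud system.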
We embrace infrastructure as code for our cloud services and a similar pattern for composing our edge blocks: JSON files. These JSON files describe high-level characteristics of the software such as data source and data sink, parallelism but also a directed acyclic graph (DAG). The DAG is a popular construct in graph theory and boils down to a finite number of edges and vertices with no cycles; in our case, the vertices consist of the functionality blocks described above and the edges describe the data flow between components as well as the individual configuration options for the blocks.
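A configuration in this style might look as follows; the field names (`source`, `sink`, `blocks`, `edges`) and block names are invented for illustration. Since the graph must stay acyclic, the loader can validate it with a standard topological sort (Kahn’s algorithm):

```python
# Illustrative JSON composition file plus a cycle check on its DAG.
# Field and block names are assumptions, not relayr's actual schema.
import json

config = json.loads("""
{
  "source": "mqtt://gateway/ingress",
  "sink": "cloud-uplink",
  "parallelism": 2,
  "blocks": {
    "fft":    {"type": "transform", "params": {"window": 256}},
    "detect": {"type": "ml",        "params": {"model": "anomaly_v1"}},
    "buffer": {"type": "flow",      "params": {"size": 64}}
  },
  "edges": [["buffer", "fft"], ["fft", "detect"]]
}
""")


def topo_order(blocks, edges):
    """Kahn's algorithm: a valid execution order, or an error on a cycle."""
    indegree = {b: 0 for b in blocks}
    for src, dst in edges:
        indegree[dst] += 1
    ready = [b for b, d in indegree.items() if d == 0]
    order = []
    while ready:
        node = ready.pop()
        order.append(node)
        for src, dst in edges:
            if src == node:
                indegree[dst] -= 1
                if indegree[dst] == 0:
                    ready.append(dst)
    if len(order) != len(blocks):
        raise ValueError("configuration graph contains a cycle")
    return order


print(topo_order(config["blocks"], config["edges"]))
```

Rejecting cyclic configurations at load time means a bad OTA config fails fast on the gateway instead of wedging the data flow.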
Some edge architectures are not very pretty to look at, while others almost have an artistic grace to them, despite their complexity.
JSON files are shipped to our edge gateway devices together with the software binary; we can update our edge deployments by only shipping lightweight configuration files to the gateways through our over-the-air (OTA) update mechanism and restarting the software, allowing it to re-wire itself for the new data flow. This translates to significant savings on the data budget and bandwidth fronts and allows our engineers and data scientists to deploy fixes to our edge devices quickly and cheaply. Fundamentally different data flows can also be quickly added to the same gateway through this same process if, for example, new devices are connected or the device model changes and new measurements start being pulled. The expensive operations are reduced to pushing new binaries down to the devices if new blocks need to be added.
Lastly, having a hard split between software configuration and the software artifacts themselves allows us to develop new blocks as part of our product development efforts and define configurations as part of our customer engagement work. Delivering a solution becomes less a question of development effort and more one of defining the right software behavior and a recipe for composing our software.
As part of our data processing pipeline, our system allows machine learning modules to be embedded into the architecture. These machine learning modules can be as simple as regressors or as complex as neural networks, albeit shallower than what we typically see in a compute-rich cloud environment. These models accomplish four tasks:
- Anomaly Detector inference
- Anomaly Detector online training using batches in an unsupervised fashion
- Predictive System inference
- Intermediary step in a more complex data transformation e.g. a regressor used as part of a template matching pipeline
By defining machine learning modules as data processing blocks, we can easily design complex data pre-processing pipelines and feed the resulting features into models which can, themselves, take arbitrary positions in the pipeline and can run in a serial or stacked manner.
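As a concrete sketch of the unsupervised case, here is an anomaly-detection block that calibrates itself online using Welford’s running statistics and flags points far from the mean. The class, its `process` interface and the 3-sigma threshold are illustrative assumptions:

```python
# Sketch of an unsupervised ML block: an online z-score anomaly detector.
# It scores each point against statistics seen so far, then updates them
# (Welford's algorithm), so calibration is continuous and training-free.
import math


class ZScoreAnomalyBlock:
    def __init__(self, threshold: float = 3.0):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0
        self.threshold = threshold

    def process(self, value: float) -> bool:
        anomalous = False
        if self.n > 1:
            std = math.sqrt(self.m2 / (self.n - 1))  # sample std so far
            if std > 0 and abs(value - self.mean) / std > self.threshold:
                anomalous = True
        # Welford update of running mean and sum of squared deviations
        self.n += 1
        delta = value - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (value - self.mean)
        return anomalous


block = ZScoreAnomalyBlock()
stream = [10.0, 10.1, 9.9, 10.0, 10.2, 9.8, 50.0]
flags = [block.process(x) for x in stream]
print(flags)  # only the final outlier is flagged
```

Because it exposes the same block interface as any transformation, such a detector can sit anywhere in the DAG, downstream of an FFT or a rolling filter, for instance.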
At relayr, we primarily rely on the TensorFlow backend for our deep learning models, so we can focus on developing and refining the actual models rather than on the nitty-gritty details of data manipulation. TensorFlow’s runtime interprets and executes so-called computation graphs, which are essentially its own representation of programs; in addition to data manipulation logic, these programs may read or write files, perform network I/O and even spawn additional processes under the TensorFlow process. Mapping variables to tensor values is done through checkpoint files. These graphs and checkpoints are inherently portable and allow us to create systems that can be trained in a cloud environment and deployed to the edge for inference. Through our OTA mechanism, we already have infrastructure in place to ship JSON files and binaries to our gateways; this same mechanism can be leveraged to either:
- Ship model graphs to the gateway. The network weights will be initialised to random values in this case and we rely on online learning to tune the network for incoming data.
- Ship model graphs and checkpoints. This scenario makes sense if the network has been trained offline by a data scientist or as part of our cloud infrastructure. In this case, the network is initialised with pre-set weights and is already usable. Further refinements can be accomplished through online learning as well.
We’ve looked at how to build a scalable edge analytics solution that can accommodate a variety of IoT business cases and technical requirements and how to find the middle-ground between customisation and repeatability. We’ve combined proven software engineering principles, simple mathematical constructs and cutting-edge data analysis tools to develop composable re-usable blocks and shape them into fully-fledged low-footprint edge analytics software at record speed. Lastly, we’ve scratched the surface of how deep learning fits into the above solutions and will take a deeper look at shipping models and transitioning them between cloud and edge in a future piece.
relayr has spent close to three years developing this system and we are excited to see it grow and continue to mature, as we address unexplored projects and the challenges we see on the horizon. 5G, AI Chiplets and other technologies that are exiting the incubation phase are certainly on our radar and will drive future designs and architectures. We fundamentally believe that edge will be a central component in the future technology landscape for connected assets and are working hard towards bringing a slice of that future to the IoT solutions of today.