As Jane Austen would say, it is a truth universally acknowledged that a machine learning model is only as good as the data it receives. This applies both to the data used for training it and to the data fed into it at inference time. The former case is self-evident; here we shall delve into the latter.

Say you have developed, trained, and deployed into production a model that achieves good results on the right metrics: you have only done half of the work. You must continuously monitor the model's performance to make sure it stays in top-notch condition, since plenty of troubles might creep in. The most common example is drift, both data and concept drift, which causes model performance to degrade over time, but other hurdles can pop up as well.


What is Data Integrity?

An aspect that is often overlooked is the quality of the data sent as input to a model at inference time: if this data is corrupted or otherwise of poor quality, even the best model will output poor predictions.
Helicon, Radicalbit's flagship MLOps platform, has a specific solution for checking data integrity and keeping a deployed model in spick-and-span condition.

Let us talk about Data Integrity: what exactly it is, and how Helicon can help you keep it monitored at all times.

Given an array of features sent as input to the model, several aspects have to be checked to ensure its integrity, including:

  • data type;
  • range and categories: no overstepping of bounds;
  • schema evolution;
  • rate of nulls, i.e. number of missing values;
  • outliers.

To better understand these conditions, consider the following example: a chemical plant fitted with plenty of sensors sends a stream of real-time readings to a model. The features sent as input to the model are:

  • temperature in degrees Celsius (variable of type double),
  • pressure (variable of type double),
  • number of pellets of fuel fed to the boiler (variable of type integer),
  • level of risk of the chemical reaction in progress (‘green’, ‘orange’, ‘red’).

The output of the model is how long the production process will take, in hours.
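For concreteness, one such inference-time record might look as follows; this is purely an illustrative sketch, with field names and Python typing of our own choosing, not Helicon's actual payload format.

    from typing import TypedDict, Literal

    class SensorReading(TypedDict):
        """One inference-time record from the plant's sensors (illustrative)."""
        temperature: float                              # degrees Celsius
        pressure: float                                 # e.g. in bar
        fuel_pellets: int                               # pellets fed to the boiler
        risk_level: Literal["green", "orange", "red"]   # risk of the reaction

    # A well-formed reading:
    reading: SensorReading = {
        "temperature": 20.0,
        "pressure": 1.8,
        "fuel_pellets": 3,
        "risk_level": "green",
    }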

Let us consider all the aforementioned aspects of Data Integrity.

Data type

If the thermometer reads a value of 20, but for some reason it is sent to the model as ‘20’ (viz. as a string), the data type is incorrect and the model is not able to use this information. An alert must be triggered to inform users about the data type problem in this specific record.
Likewise, if the value for the number of pellets is not an integer (e.g. 3.5), an alert with this specific information is triggered and sent to the users, who can then take appropriate action to fix the problem.
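Helicon performs such checks out of the box; purely to illustrate the idea, a per-record type check might look like the following minimal sketch (the field names and logic are our own assumptions, not Helicon's implementation).

    EXPECTED_TYPES = {
        "temperature": float,
        "pressure": float,
        "fuel_pellets": int,
        "risk_level": str,
    }

    def check_types(record: dict) -> list[str]:
        """Return a human-readable violation for each mistyped field."""
        violations = []
        for field, expected in EXPECTED_TYPES.items():
            value = record.get(field)
            # bool is a subclass of int in Python, so reject it explicitly
            if isinstance(value, bool) or not isinstance(value, expected):
                violations.append(
                    f"{field}: expected {expected.__name__}, "
                    f"got {type(value).__name__} ({value!r})"
                )
        return violations

    # Both the string '20' and the non-integer pellet count are flagged:
    print(check_types({"temperature": "20", "pressure": 1.8,
                       "fuel_pellets": 3.5, "risk_level": "green"}))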

Range and categories: no overstepping of bounds

Imagine the thermometer sends a value of -300: this is clearly an incorrect reading, since it is below absolute zero (-273.15 °C). Likewise, if the level of risk has the value ‘cherry’, this is an incorrect reading as well, since the only admissible values are ‘green’, ‘orange’ and ‘red’.

In Helicon one can set up rules defining ranges for numerical variables and a list of admissible values for categorical ones: if a violation occurs, an alert is triggered, with all relevant information sent to the chosen alert channel.
These rules can of course be updated later, for instance to modify a range.
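To give a flavour of such rules, here is a minimal sketch of range and category checks; the rule format below is hypothetical and does not reflect Helicon's actual rule language.

    # Hypothetical rules: numeric ranges and admissible categories.
    NUMERIC_RANGES = {
        "temperature": (-273.15, 2000.0),   # cannot go below absolute zero
        "pressure": (0.0, 500.0),
        "fuel_pellets": (0, 10_000),
    }
    CATEGORIES = {"risk_level": {"green", "orange", "red"}}

    def check_bounds(record: dict) -> list[str]:
        """Return range/category violations for one record."""
        violations = []
        for field, (lo, hi) in NUMERIC_RANGES.items():
            if field in record and not lo <= record[field] <= hi:
                violations.append(f"{field}={record[field]} outside [{lo}, {hi}]")
        for field, allowed in CATEGORIES.items():
            if field in record and record[field] not in allowed:
                violations.append(f"{field}={record[field]!r} not in {sorted(allowed)}")
        return violations

    # The -300 reading and the unknown 'cherry' category both trigger alerts:
    print(check_bounds({"temperature": -300.0, "risk_level": "cherry"}))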

Schema evolution

Helicon offers a fully fledged solution for carrying out both data pre-processing (before model inference) and post-processing (after model inference), providing a complete AI data pipeline. The data are encoded with an Avro schema, which can evolve if new fields (which must be optional, i.e. nullable) are added after its creation. Now, imagine that the chemical plant adds new sensors reading the humidity level, so the schema evolves. It is clearly crucial to receive an alert in order to update the pre- and post-processing pipelines and, if desired, to create a new version of the ML model trained on the new variable as well, once enough training data has been collected.
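As an illustration, the evolved schema might gain the humidity field as a nullable union with a null default, which is exactly what keeps the evolution backward compatible (schemas shown here as Python dicts; the record and field names are our own assumptions).

    # Original record schema (simplified) and its evolution.
    SCHEMA_V1 = {
        "type": "record",
        "name": "PlantReading",
        "fields": [
            {"name": "temperature", "type": "double"},
            {"name": "pressure", "type": "double"},
            {"name": "fuel_pellets", "type": "int"},
            {"name": "risk_level", "type": "string"},
        ],
    }

    SCHEMA_V2 = {
        **SCHEMA_V1,
        "fields": SCHEMA_V1["fields"] + [
            # The new field must be optional: a nullable union with a null default.
            {"name": "humidity", "type": ["null", "double"], "default": None},
        ],
    }

    # A simple field diff that could drive a schema-evolution alert:
    added = ({f["name"] for f in SCHEMA_V2["fields"]}
             - {f["name"] for f in SCHEMA_V1["fields"]})
    print(f"new fields detected: {added}")  # -> {'humidity'}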

Rate of nulls

So far we have dealt with data integrity checks on a single record, but more complex verifications can be carried out as well.
Helicon users can set up rules on the rate of null values in a given data window: for instance, if the last 100 readings contain more than 10 null values, an alert is triggered and all information about the nulls is sent to the chosen alert channel. This is especially important because an increase in the rate of null values can signal a problem with the sensors or the communication lines, so it is paramount to fix it as soon as possible.
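A minimal sketch of such a windowed check, using the same numbers as the example above (the implementation is our own illustration, not Helicon's):

    from collections import deque

    class NullRateMonitor:
        """Alert when nulls in the last `window` readings exceed `max_nulls`."""

        def __init__(self, window: int = 100, max_nulls: int = 10):
            self.values = deque(maxlen=window)
            self.max_nulls = max_nulls

        def observe(self, value) -> bool:
            """Record one reading; return True if an alert should fire."""
            self.values.append(value)
            null_count = sum(v is None for v in self.values)
            return null_count > self.max_nulls

    # Demo with a small window: the third null among the last 10 readings fires.
    monitor = NullRateMonitor(window=10, max_nulls=2)
    for value in [1.0, None, 2.5, None, 3.1, None]:
        if monitor.observe(value):
            print("alert: null rate above threshold in current window")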

Outliers

An outlier is a data point that differs significantly from the others, i.e. it lies far from the regions where the training data is most concentrated.
The crucial difference between outliers and the violations described in «Range and categories: no overstepping of bounds» is that an outlier is a data point that can conceivably be received, although with a very low likelihood, whereas the latter are data points whose values are surely wrong.
Detecting outliers is therefore important because, while they might be incorrect values, they might also be novel data, i.e. unusual but correct observations, in which case they may indicate that anomalies are creeping into the data.
Helicon includes sophisticated techniques and algorithms to detect outliers and trigger alerts when they are spotted, helping you keep full control of all incoming data.
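The exact techniques Helicon uses are beyond the scope of this post; as a deliberately simple illustration of the idea, a z-score detector flags points that lie far from where the training data is concentrated.

    import statistics

    class ZScoreDetector:
        """Flag readings far from where the training data is concentrated."""

        def __init__(self, training_values: list[float], threshold: float = 3.0):
            self.mean = statistics.fmean(training_values)
            self.stdev = statistics.stdev(training_values)
            self.threshold = threshold

        def is_outlier(self, value: float) -> bool:
            return abs(value - self.mean) / self.stdev > self.threshold

    # Training temperatures cluster around 20 °C; 35 °C is conceivable but unusual.
    detector = ZScoreDetector([20.1, 19.8, 20.4, 21.0, 19.5, 20.2, 20.7, 19.9])
    print(detector.is_outlier(35.0))  # True: a candidate outlier worth an alert
    print(detector.is_outlier(20.3))  # False: an ordinary reading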

If you are interested in knowing more about Helicon, its top-notch monitoring solution, and how it can help you with real-time monitoring of data integrity, do not hesitate to reach out.
