Technology Insights

Confident ML in Production with Generative Data Validation

As a last line of defence in production, seasoned ML practitioners employ data validation to avoid confusing model outputs. At Boltzbit, we develop and advocate a novel ML software development paradigm. One practical merit of this paradigm is that data validation comes almost for free, resulting in simpler and more confident ML productization.

Using ML with scrutiny 

Undoubtedly, AI or ML can fulfil many functions beyond traditional software, e.g. image recognition, item recommendation and even task planning. In real-world industries, however, employing ML is never as simple and straightforward as in ML textbooks (i.e. users feed in input data and get outputs from ML models). The beauty of ML-model-based systems is that we don't have to define any rules about how they function. Instead, a system's behaviour is entirely determined by how its models are trained. In other words, the training data, the ML model code and the training configurations together act like a black box that produces the final model.

Directly exposing an ML model (e.g. via REST APIs) to users or downstream processes is risky, since it might produce confusing and disappointing outputs whenever a test input is not a plausible sample from the empirical distribution of the training data. Therefore, data validation* and fallback mechanisms are always added to ML systems, as sketched below.

(* The scope of data validation in this blog does not cover data schema checking.)
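To make this concrete, a typical serving path places a validation gate in front of the model. The following is a minimal, hypothetical Python sketch; the `model`, `is_valid` and `fallback` callables are illustrative placeholders, not a specific API:

```python
# Hypothetical serving wrapper: validate inputs before the model sees them,
# and fall back to a safe path otherwise (illustrative sketch only).
def serve(x, model, is_valid, fallback):
    if is_valid(x):                # data validation gate
        return model.predict(x)   # normal ML path
    return fallback(x)            # e.g. a rule-based answer or a human hand-off
```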

Data Validation: status quo

Although the importance of data validation is well recognized in industry, there is at the moment no standard data validation method. Usually, different validations are developed and employed in practice depending on the nature of the task. A simple and popular approach is Training-Serving Skew Detection: a mean and a standard deviation are computed for each feature at training time, and a test sample is considered good if every feature lies within its mean +/- (a small multiple of) its standard deviation. This approach is widely adopted due to its simplicity, but its cons are also quite obvious. It ignores higher-order statistics within the data, e.g. pairwise correlations between features. For example, a test sample should be filtered out if two of its features are incompatible with each other, even when each lies within its safe range on its own; per-feature checks let many such samples through as false positives. Second-order statistics checking is sometimes added to compensate for this deficiency, but estimating and checking a full covariance matrix becomes expensive and impractical when the number of features gets large (which is often the case in practice).
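A minimal sketch of this per-feature check, assuming NumPy; the placeholder data and the tolerance of k=3 standard deviations are purely illustrative:

```python
import numpy as np

# Training time: record per-feature mean and standard deviation.
train_data = np.random.randn(1000, 8)   # placeholder training set
mean = train_data.mean(axis=0)
std = train_data.std(axis=0)

def is_valid(x, k=3.0):
    """Accept a test sample only if every feature lies within mean +/- k*std."""
    return bool(np.all(np.abs(x - mean) <= k * std))

# Serving time: check each incoming sample before it reaches the model.
print(is_valid(np.zeros(8)))        # True: near the training mean
print(is_valid(np.full(8, 10.0)))   # False: far outside the per-feature ranges
```

Note that a sample like (x1 = tall height, x2 = infant age) would pass this check as long as each feature is individually in range, which is exactly the correlation blind spot described above.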

At the same time, another notable approach used in industry is Reconstructive Data Validation. Usually an AutoEncoder is fitted to the data by minimizing a reconstruction loss. More concretely, if a test sample follows the empirical distribution of the training data, it can be encoded to an appropriate bottleneck representation, which is then decoded back into a faithful reconstruction; out-of-distribution samples tend to reconstruct poorly. This approach is advantageous since the model must learn richer statistics of the data in order to produce good reconstructions. However, an AutoEncoder is trained with a purely reconstructive objective rather than modelling the data distribution, and it involves both an encoder and a decoder. Therefore, without careful regularization or data augmentation (e.g. a denoising AutoEncoder), AutoEncoder-based data validation can sometimes be over-sensitive, which leads to false negatives.
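A minimal sketch of reconstruction-based validation, written here with PyTorch; the architecture, training loop and 99th-percentile threshold are illustrative assumptions, not a prescribed recipe:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AutoEncoder(nn.Module):
    def __init__(self, n_features=8, bottleneck=2):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, 16), nn.ReLU(),
                                     nn.Linear(16, bottleneck))
        self.decoder = nn.Sequential(nn.Linear(bottleneck, 16), nn.ReLU(),
                                     nn.Linear(16, n_features))

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = AutoEncoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
train_data = torch.randn(1000, 8)   # placeholder training set

# Fit the AutoEncoder by minimizing the reconstruction loss.
for _ in range(200):
    optimizer.zero_grad()
    loss = F.mse_loss(model(train_data), train_data)
    loss.backward()
    optimizer.step()

# Calibrate a threshold from the reconstruction errors on the training data.
with torch.no_grad():
    errors = ((model(train_data) - train_data) ** 2).mean(dim=1)
    threshold = torch.quantile(errors, 0.99).item()

def is_valid(x):
    """Accept a sample if its reconstruction error stays below the threshold."""
    with torch.no_grad():
        error = ((model(x) - x) ** 2).mean().item()
    return error < threshold
```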

Our Insights & Solutions: Generative Data Validation

At Boltzbit, we exploit deep generative models for better MLOps. More on our new ML development philosophy can be found in our other blog. In a nutshell, deep generative models are trained on data without being tied to any specific task. As for data validation, deep generative models fulfil it effortlessly as a built-in functionality. As illustrated in the following diagram, a Deep Latent Variable Model (DLVM) can be trained to fit the empirical distribution of the training data by generating the data with high likelihood. Since the DLVM explicitly learns the data distribution, it naturally captures the data's statistical characteristics, and the likelihood it assigns to a test sample serves directly as a validation score.
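To illustrate the idea (this is a generic sketch, not Boltzbit's actual model), the example below trains a small variational autoencoder, one common form of DLVM, and uses its ELBO, a lower bound on log p(x), as the validation score. The architecture, the Gaussian likelihood (with constant terms dropped) and the 1st-percentile threshold are all assumptions made for the example:

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    """A small Deep Latent Variable Model (a VAE) with a standard normal prior."""
    def __init__(self, n_features=8, latent=2):
        super().__init__()
        self.enc = nn.Linear(n_features, 16)
        self.mu = nn.Linear(16, latent)
        self.logvar = nn.Linear(16, latent)
        self.dec = nn.Sequential(nn.Linear(latent, 16), nn.ReLU(),
                                 nn.Linear(16, n_features))

    def elbo(self, x):
        """Per-sample evidence lower bound on log p(x) (constant terms dropped)."""
        h = torch.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization
        recon = self.dec(z)
        log_px_z = -0.5 * ((x - recon) ** 2).sum(dim=1)       # Gaussian likelihood
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=1)
        return log_px_z - kl

model = VAE()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
train_data = torch.randn(1000, 8)   # placeholder training set

# Train by maximizing the likelihood bound on the training data.
for _ in range(200):
    optimizer.zero_grad()
    loss = -model.elbo(train_data).mean()
    loss.backward()
    optimizer.step()

# A sample is valid if the model assigns it a likelihood comparable to
# what it assigns to the training data itself.
with torch.no_grad():
    threshold = torch.quantile(model.elbo(train_data), 0.01).item()

def is_valid(x):
    with torch.no_grad():
        return model.elbo(x.unsqueeze(0)).item() > threshold
```

Because the same trained model serves both generation and validation, no separate validator has to be built or maintained; this is what we mean by validation coming for free.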

In conclusion, our novel ML development paradigm and generative data validation bring both theoretical and practical strengths. How this synergy works and is deployed in practice is illustrated in the following diagram. Once a model has been learned on the training data, productionization and deployment become lighter-weight and more reliable.