As a final safeguard in production, seasoned ML practitioners employ data validation to avoid confusing model outputs. At Boltzbit, we develop and advocate a novel ML software development paradigm. One practical merit of this paradigm is that data validation comes almost for free, resulting in simpler and more confident ML productization.
One can never be too cautious when deploying ML models. Data validation is an essential step in data science and ML practice: it checks how likely a model is to behave reasonably on a given input. An ML model's output can be essentially random if the test input is unlike anything seen during training, which happens often in reality. Directly exposing a model API to users is therefore avoided in real-world industry settings. Data validation is an advantageous, defensive measure that intercepts and checks data before it is fed into the model.
In existing ML development workflows, data validation is nontrivial because it requires extra engineering infrastructure or additional models. At Boltzbit, on top of our proprietary methodology for ML software development, we introduce a more reliable and maintainable data validation approach: validation is a built-in function that requires no extra effort in the pipeline.
Undoubtedly, AI and ML can fulfil many functions beyond traditional software, e.g. image recognition, item recommendation and even task planning. In real-world industry, however, deploying ML is never as simple and straightforward as ML textbooks suggest (i.e. users feed in input data and get outputs from the model). The beauty of ML-model-based systems is that we don't have to define any rules about how they should function. Instead, a system's behavior is entirely determined by how its model is trained. In other words, training data, model code and training configuration together act like a black box that produces the final model.
Directly exposing an ML model (e.g. via REST APIs) to users or downstream processes is risky, since it may produce confusing and disappointing outputs if the test input is not a sample from the empirical distribution of the training data. Therefore, data validation* and fallback mechanisms are always added to ML systems.
(* the scope of data validation in this blog does not cover data schema checking)
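To make the gating idea concrete, here is a minimal sketch of how a validator and a fallback can wrap a model before it is exposed. The function and variable names (`serve`, `FALLBACK_RESPONSE`) are illustrative assumptions, not part of any specific framework:

```python
# A fixed fallback returned whenever the input fails validation,
# instead of letting the model produce an arbitrary output.
FALLBACK_RESPONSE = {"status": "rejected", "reason": "input failed data validation"}

def serve(sample, model, is_valid):
    """Gate the model behind a validation check; fall back on rejection."""
    if not is_valid(sample):
        return FALLBACK_RESPONSE
    return {"status": "ok", "prediction": model(sample)}

# Toy usage: the model doubles its input, the validator only accepts x < 10.
ok = serve(5, lambda x: x * 2, lambda x: x < 10)
rejected = serve(50, lambda x: x * 2, lambda x: x < 10)
```

The downstream consumer only ever sees either a prediction or an explicit rejection, never an unvalidated model output.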
"While a great deal of machine learning research has focused on improving the accuracy and efficiency of training and inference algorithms, there is less attention in the equally important problem of monitoring the quality of data fed to machine learning. The importance of this problem is hard to dispute: errors in the input data can nullify any benefits on speed and accuracy for training and inference. "
- Eric Breck, et al., Data Validation for Machine Learning.
Although the importance of data validation is well recognized in industry, there is at the moment no standard data validation method; in practice, different validations are developed depending on the nature of the task. A simple and popular approach is Training-Serving Skew Detection: a mean and standard deviation are computed for each feature at training time, and a test sample is considered good if every feature falls within its mean +/- a few standard deviations. This approach is widely adopted for its simplicity, but its drawbacks are equally obvious. It ignores higher-order statistics within the data, e.g. pairwise feature correlations: a sample should be filtered out when two of its features are incompatible, even if each is individually in a safe range, so many invalid samples pass the check. Sometimes second-order statistics are checked as well to compensate, but this becomes expensive and impractical as the number of features grows (which is often the case in practice).
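The per-feature range check above can be sketched in a few lines of NumPy. This is a minimal illustration (the helper names and the `k = 3` band are our assumptions), and it deliberately shows the weakness discussed above: each feature is checked independently, so correlations are ignored:

```python
import numpy as np

def fit_feature_ranges(train_data, k=3.0):
    """Compute a per-feature acceptance band: mean +/- k standard deviations."""
    mean = train_data.mean(axis=0)
    std = train_data.std(axis=0)
    return mean - k * std, mean + k * std

def validate(sample, low, high):
    """A sample passes only if every feature lies inside its band."""
    return bool(np.all((sample >= low) & (sample <= high)))

# Toy usage: fit bands on standard-normal training data.
train = np.random.default_rng(0).normal(size=(1000, 4))
low, high = fit_feature_ranges(train, k=3.0)
```

A sample like `np.zeros(4)` passes, while one with any feature far outside its band is rejected; but a sample whose features are individually in range yet jointly implausible would still pass.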
Another notable approach used in industry is Reconstructive Data Validation. Usually an AutoEncoder is trained to fit the data by minimizing a reconstruction loss. More concretely, if a test sample follows the training data's empirical distribution, it can be encoded into an appropriate bottleneck representation, which is then decoded into a faithful reconstruction. This approach is advantageous because the model must learn richer statistics of the data in order to reconstruct it well. However, the AutoEncoder is discriminatively trained and involves both an encoder and a decoder. Therefore, without careful regularization or data augmentation (e.g. a denoising AutoEncoder), AutoEncoder-based data validation can be oversensitive, leading to false negatives.
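The reconstruction idea can be sketched with a linear autoencoder (equivalent to PCA), used here as a simple stand-in for a trained neural AutoEncoder; the function names and the 99th-percentile threshold are our assumptions for illustration:

```python
import numpy as np

def reconstruction_error(data, mean, components):
    """Encode to the bottleneck, decode back, and measure the squared error."""
    centered = np.atleast_2d(data) - mean
    codes = centered @ components.T      # encode to bottleneck representation
    recon = codes @ components           # decode back to input space
    return np.sum((centered - recon) ** 2, axis=1)

def fit_linear_autoencoder(train_data, n_components=2):
    """Fit a linear autoencoder via SVD; threshold at the 99th percentile
    of training reconstruction errors."""
    mean = train_data.mean(axis=0)
    _, _, vt = np.linalg.svd(train_data - mean, full_matrices=False)
    components = vt[:n_components]
    errors = reconstruction_error(train_data, mean, components)
    return mean, components, np.percentile(errors, 99)

def validate(sample, mean, components, threshold):
    """A sample passes if it can be reconstructed with low error."""
    return bool(reconstruction_error(sample, mean, components)[0] <= threshold)

# Toy usage: training data living near a 2-D subspace of a 5-D space.
rng = np.random.default_rng(0)
latent = rng.normal(size=(1000, 2))
weights = rng.normal(size=(2, 5))
train = latent @ weights + 0.01 * rng.normal(size=(1000, 5))
mu, comps, thr = fit_linear_autoencoder(train, n_components=2)
```

A sample on the learned manifold reconstructs well and passes; a sample far off the manifold has a large reconstruction error and is rejected.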
“We have observed production machine learning systems at Google with training- serving skew that negatively impacts performance. The best solution is to explicitly monitor it so that system and data changes don’t introduce skew unnoticed.”
- Martin Zinkevich, Rules of Machine Learning: Best Practices for ML Engineering
At Boltzbit, we exploit deep generative models for better MLOps; more on our ML development philosophy can be found in our other blog posts. In a nutshell, deep generative models are trained on data without being tied to any specific task. As for data validation, deep generative models fulfil it effortlessly as a built-in functionality. As illustrated in the following diagram, a Deep Latent Variable Model (DLVM) can be trained to fit the training data's empirical distribution by generating the data with high likelihood. Since the DLVM explicitly learns the data distribution, it naturally captures the data's statistical characteristics.
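The principle behind likelihood-based validation can be sketched with a simple density model. Here a full-covariance Gaussian stands in for the DLVM (whose architecture is not specified in this post), and the 1st-percentile threshold is our illustrative assumption; the idea carries over to any model that assigns a likelihood to each input:

```python
import numpy as np

def fit_gaussian(train_data):
    """Fit a full-covariance Gaussian as a stand-in density model."""
    mean = train_data.mean(axis=0)
    cov = np.cov(train_data, rowvar=False) + 1e-6 * np.eye(train_data.shape[1])
    return mean, cov

def log_likelihood(sample, mean, cov):
    """Log-density of the fitted Gaussian at `sample`."""
    d = mean.shape[0]
    diff = sample - mean
    _, logdet = np.linalg.slogdet(cov)
    quad = diff @ np.linalg.inv(cov) @ diff
    return -0.5 * (d * np.log(2 * np.pi) + logdet + quad)

def validate(sample, mean, cov, threshold):
    """Accept a sample only if the model assigns it sufficient likelihood."""
    return bool(log_likelihood(sample, mean, cov) >= threshold)

# Toy usage: set the threshold so 99% of training samples pass.
rng = np.random.default_rng(0)
train = rng.normal(size=(1000, 3))
mu, cov = fit_gaussian(train)
lls = np.array([log_likelihood(x, mu, cov) for x in train])
threshold = np.percentile(lls, 1)
```

Because the model scores the joint distribution rather than each feature in isolation, inputs that are individually in range but jointly implausible receive low likelihood and are rejected.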
In conclusion, our novel ML development paradigm and generative data validation bring both theoretical and practical strengths. How this synergy works and is deployed in practice is illustrated in the following diagram. Once a model is learned on the training data, productionization and deployment become lighter-weight and more reliable.