As a guiding credo of software development, "Agility" helps teams deliver software faster and at higher quality. The nature of existing ML development, however, hinders its agility. At Boltzbit, we are developing a novel paradigm for ML software development. The paradigm reshapes and shortens the ML software development lifecycle, pushing it towards being truly agile.
We are fortunate to be witnessing progress in AI at an almost daily pace. Nevertheless, AI development tools and platforms unfortunately lag far behind. At the moment, many impressive AI results come from academic institutions or labs, which suggests there is still a long way to go before they reach production. Undoubtedly, as has happened in many other fields, AI has run into a bottleneck: the lack of effective development methodologies and processes.
Boltzbit is dedicated to accelerating and automating the development of AI software solutions. We integrate state-of-the-art AI research with modern software engineering practices, arriving at a paradigm better suited to fast-changing AI and business innovation.
By now, there should be no doubt that AI and Machine Learning are disrupting the software development industry, offering much wider functionality (much of it far beyond what traditional software can do) through a completely different approach. Notably, code is no longer the backbone of developing an ML or AI application. Instead, preparing high-quality data and managing the execution and evaluation of experiments are more critical to the final deliverables. With the latest deep learning libraries and frameworks, e.g. TensorFlow or PyTorch, most models can be built in fewer than 100 lines of code. On one hand, it genuinely amazes people to see a sophisticated face-recognition application built on a short code snippet. On the other hand, it raises questions about how to formulate appropriate processes that ensure the effectiveness, quality and efficiency of this novel approach to software development.
It is natural for people to apply an existing knowledge framework to new alternatives. However, after many attempts to fit ML workflows into software engineering conventions, it has become clear that either some traditional software engineering concepts no longer apply, or some emerging components are not covered. This is not surprising if you look at the following diagram of how the ML engineering development lifecycle differs from traditional software engineering. “Data” and “Model” are two new components in ML engineering, and more components in the pipeline arguably mean more effort is needed to take care of them.
“It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light …”
- Charles Dickens, A Tale of Two Cities
Agility is a well-recognized and widely adopted principle in the software industry. It essentially emphasizes early, fast and frequent software delivery. In particular, continuous integration and continuous delivery (CI/CD) are two key practices which bring the doctrine into real-world use. In a nutshell, CI/CD is an engineering approach in which software is developed in short cycles and can be reliably and safely released at any time.
The orchestration of CI/CD is now very mature for traditional software engineering, and many tools are available to minimize manual effort. However, new challenges emerge when people attempt to introduce CI/CD into an ML engineering workflow:
More complicated tests. Tests (particularly integration tests) are the very reason that software can be released reliably. In ML engineering, the scope of tests is wider, since an AI system's behaviour is no longer defined by code alone, and more careful tests are necessary to avoid surprising outcomes. The following is an incomplete list of extra tests:
Model performance test
Model fairness test
Model latency test
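These extra tests can be wired into a CI pipeline just like ordinary unit tests, so that a build fails when the model regresses. Below is a minimal sketch, assuming a hypothetical `predict` function standing in for a trained model, a tiny held-out evaluation set, and placeholder thresholds that a real team would agree on:

```python
import time

# Hypothetical stand-in for a trained model's predict function.
def predict(x):
    return 1 if x >= 0.5 else 0

# A tiny held-out evaluation set of (input, expected label) pairs.
EVAL_SET = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1)]

def test_model_performance(min_accuracy=0.75):
    """Fail the build if accuracy drops below the agreed threshold."""
    correct = sum(predict(x) == y for x, y in EVAL_SET)
    accuracy = correct / len(EVAL_SET)
    assert accuracy >= min_accuracy, f"accuracy {accuracy:.2f} below {min_accuracy}"

def test_model_latency(max_ms=50.0):
    """Fail the build if a single prediction is too slow to serve."""
    start = time.perf_counter()
    predict(0.5)
    elapsed_ms = (time.perf_counter() - start) * 1000
    assert elapsed_ms <= max_ms, f"latency {elapsed_ms:.1f}ms exceeds {max_ms}ms"
```

A fairness test would follow the same pattern, comparing a metric such as accuracy across demographic slices of the evaluation set and failing when the gap exceeds a chosen bound.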
Cross-team collaboration. Delivering an AI/ML software solution usually involves effort and collaboration across different teams. For example, a small change made by the data team in data collection or transformation can cause a new model to fail. Inspecting and spotting the bug or flaw is not as straightforward as it is for regular software.
Long data preparation and model training time. Agility is fulfilled by fast and frequent software delivery: an agile team quickly designs and writes code to implement new requirements or fix bugs, followed by automated tests and (if all tests pass) deployment. In the ML lifecycle, however, data preparation and model training are two bottlenecks in the pipeline. Changes to data preparation code or model code can often only be tested after hours or even days.
Modeling & research challenges. In traditional software, whenever there is a new task, the development team creates a ticket and a team member picks it up when they have bandwidth. In the ML world, however, modeling and research barriers introduce extra challenge and uncertainty. In fact, adding a simple feature to ML software may require iterative rounds of new data preparation, model redesign and experiments before being finalized. The process can take days.
“Machine Learning applications are becoming popular in our industry, however the process for developing, deploying, and continuously improving them is more complex compared to more traditional software.”
- Danilo Sato, et al., martinfowler.com
To make full use of CI/CD in ML engineering, new development methodologies and platforms are needed. At Boltzbit, we believe that the existing practices of AI software development are rather slow and hold back the pace of business and innovation. We are therefore introducing a new paradigm of AI software development based on Deep Generative Models. As illustrated in the following diagram, data preparation and model training are moved out of the iterative pipeline, which significantly reduces the overhead of each iteration. As a result, the development of new tasks or requirements proceeds in the same way as in traditional software.
As shown in the following diagram, the magic happens because no predefined tasks are tied to model training. Instead, the goal of training is to understand different aspects of the data: the model strives to understand the training dataset by generating a similar "shadow" dataset. Different downstream tasks can then be easily fulfilled by running inference on the generative model.
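The idea of "train once, serve many tasks by inference" can be illustrated with a deliberately tiny example. The sketch below is only an analogy, not Boltzbit's actual method: it fits a trivial per-class Gaussian generative model once, then serves two hypothetical downstream tasks (generating a "shadow" dataset, and classifying new points) without any retraining:

```python
import math
import random
import statistics

# Toy training data: one numeric feature with a class label.
DATA = [(1.0, "a"), (1.2, "a"), (0.8, "a"), (3.0, "b"), (3.3, "b"), (2.9, "b")]

def fit(data):
    """Training happens once: fit a per-class Gaussian (mean, stdev)."""
    model = {}
    for label in {y for _, y in data}:
        xs = [x for x, y in data if y == label]
        model[label] = (statistics.mean(xs), statistics.stdev(xs))
    return model

def generate(model, n=5, seed=0):
    """Downstream task 1: sample a 'shadow' dataset from the model."""
    rng = random.Random(seed)
    shadow = []
    for _ in range(n):
        label = rng.choice(sorted(model))
        mu, sigma = model[label]
        shadow.append((rng.gauss(mu, sigma), label))
    return shadow

def classify(model, x):
    """Downstream task 2: label x by highest Gaussian density, no retraining."""
    def neg_log_density(label):
        mu, sigma = model[label]
        return ((x - mu) / sigma) ** 2 / 2 + math.log(sigma)
    return min(model, key=neg_log_density)
```

In the same spirit, a deep generative model trained to capture the data distribution can answer new downstream queries through inference alone, which is what keeps data preparation and training out of the iterative loop.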
Besides agility and CI/CD, there are other open issues around ML engineering. Some of them are being addressed, while others remain hidden. We will cover more related topics in follow-up posts, so stay tuned to our blog.
“Machine learning offers a fantastically powerful toolkit for building useful complex prediction systems quickly … it is dangerous to think of these quick wins as coming for free. … it is common to incur massive ongoing maintenance costs in real-world ML systems. These include boundary erosion, entanglement, hidden feedback loops, undeclared consumers, data dependencies, configuration issues, changes in the external world, and a variety of system-level anti-patterns.”
- D. Sculley, et al., Hidden Technical Debt in Machine Learning Systems