As a guiding credo of software development, "Agility" helps teams deliver software faster and at higher quality. The nature of existing ML development, however, hinders its agility. At Boltzbit, we are developing a novel paradigm for ML software development. The paradigm reshapes and shortens the ML software development lifecycle, pushing it towards being truly agile.
We are fortunate to be witnessing progress in AI at an almost daily pace. Nevertheless, AI development tools and platforms unfortunately lag far behind. At the moment, many impressive AI results come from academic institutions or labs, which suggests there is still a long way to go before they reach production. Undoubtedly, as has happened in many other fields, AI has run into a bottleneck: the lack of effective development methodologies and processes.
Boltzbit is dedicated to accelerating and automating the development of AI software solutions. We integrate state-of-the-art AI research with modern software engineering practices, arriving at a paradigm better suited to fast-changing AI and business innovation.
By now, there should be no doubt that AI and Machine Learning are disrupting the software development industry, offering much wider functionality (much of it far beyond what traditional software can do) through a completely different approach. Notably, code is no longer the backbone of developing an ML or AI application. Instead, preparing high-quality data and managing the execution and evaluation of experiments are more critical to the final deliverables. With the latest deep learning libraries and frameworks, e.g. TensorFlow or PyTorch, most models can be built in fewer than 100 lines of code. On one hand, it genuinely amazes people to see a sophisticated face-recognition application built on a short code snippet. On the other hand, it raises questions about how to formulate appropriate processes that ensure the effectiveness, quality and efficiency of this novel approach to software development.
It is natural for people to apply an existing knowledge framework to new alternatives. However, after many attempts to fit ML workflows into software engineering conventions, it has become clear that either some traditional software engineering concepts no longer apply, or some emerging components are not covered. This is not surprising if you look at the following diagram of how the ML engineering development lifecycle differs from traditional software engineering. “Data” and “Model” are two new components in ML engineering, and more components in the pipeline arguably mean more effort is needed to take care of them.
“It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light …”
- Charles Dickens, A Tale of Two Cities
Agility is a well-recognized and widely adopted principle in the software industry. It essentially emphasizes early, fast and frequent software delivery. In particular, continuous integration and continuous delivery (CI/CD) are two key practices which bring the doctrine into real-world use. In a nutshell, CI/CD is an engineering approach in which software is developed in short cycles and can be reliably and safely released at any time.
The orchestration of CI/CD is now very mature for traditional software engineering, and many tools are available to minimize manual effort. However, new challenges emerge when people attempt to introduce CI/CD into an ML engineering workflow:
More complicated tests. Tests (particularly integration tests) are the very reason that software can be released reliably. In ML engineering, the scope of tests is wider, since an AI system's behaviour is no longer defined by code alone, and more careful tests are necessary to avoid surprising outcomes. The following is an incomplete list of extra tests:
Model performance test
Model fairness test
Model latency test
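These extra tests can be wired into a CI pipeline just like ordinary unit tests, so that a build fails when the model regresses. Below is a minimal sketch, assuming a hypothetical `predict` function standing in for a trained model, a tiny held-out evaluation set, and placeholder thresholds that a real team would agree on:

```python
import time

# Hypothetical stand-in for a trained model's predict function.
def predict(x):
    return 1 if x >= 0.5 else 0

# A tiny held-out evaluation set of (input, expected label) pairs.
EVAL_SET = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1)]

def test_model_performance(min_accuracy=0.75):
    """Fail the build if accuracy drops below the agreed threshold."""
    correct = sum(predict(x) == y for x, y in EVAL_SET)
    accuracy = correct / len(EVAL_SET)
    assert accuracy >= min_accuracy, f"accuracy {accuracy:.2f} below {min_accuracy}"

def test_model_latency(max_ms=50.0):
    """Fail the build if a single prediction is too slow to serve."""
    start = time.perf_counter()
    predict(0.5)
    elapsed_ms = (time.perf_counter() - start) * 1000
    assert elapsed_ms <= max_ms, f"latency {elapsed_ms:.1f}ms exceeds {max_ms}ms"
```

A fairness test would follow the same pattern, comparing a metric such as accuracy across demographic slices of the evaluation set and failing when the gap exceeds a chosen bound.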
Cross-team collaboration. Delivering an AI/ML software solution usually involves effort and collaboration across different teams. For example, a small change made by the data team in data collection or transformation can cause a new model to fail. Inspecting and spotting the bug or flaw is not as straightforward as it is for regular software.
Long data preparation and model training time. Agility is fulfilled by fast and frequent software delivery: an agile team quickly designs and writes code to implement new requirements or fix bugs, followed by automated tests and (if all tests pass) deployment. In the ML lifecycle, however, data preparation and model training are two bottlenecks in the pipeline. Changes to data preparation code or model code can often only be tested after hours or even days.
Modeling & research challenges. In traditional software, whenever there is a new task, the development team creates a ticket and a team member picks it up when they have bandwidth. In the ML world, however, modeling and research barriers introduce extra challenge and uncertainty. In fact, adding a simple feature to ML software may require iterative rounds of new data preparation, model redesign and experiments before being finalized. The process can take days.
“Machine Learning applications are becoming popular in our industry, however the process for developing, deploying, and continuously improving them is more complex compared to more traditional software.”
- Danilo Sato, et al., martinfowler.com
To make full use of CI/CD in ML engineering, new development methodologies and platforms are needed. At Boltzbit, we believe that the existing practices of AI software development are rather slow and hold back the pace of business and innovation. We are therefore introducing a new paradigm of AI software development based on Deep Generative Models. As illustrated in the following diagram, data preparation and model training are moved out of the iterative pipeline, which significantly reduces the overhead of each iteration. As a result, the development of new tasks or requirements proceeds in the same way as in traditional software.
As shown in the following diagram, the magic happens because no predefined tasks are tied to model training. Instead, the goal of training is to understand different aspects of the data: the model strives to understand the training dataset by generating a similar "shadow" dataset. Different downstream tasks can then be easily fulfilled by running inference on the generative model.
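The idea of "train once, serve many tasks by inference" can be illustrated with a deliberately tiny example. The sketch below is only an analogy, not Boltzbit's actual method: it fits a trivial per-class Gaussian generative model once, then serves two hypothetical downstream tasks (generating a "shadow" dataset, and classifying new points) without any retraining:

```python
import math
import random
import statistics

# Toy training data: one numeric feature with a class label.
DATA = [(1.0, "a"), (1.2, "a"), (0.8, "a"), (3.0, "b"), (3.3, "b"), (2.9, "b")]

def fit(data):
    """Training happens once: fit a per-class Gaussian (mean, stdev)."""
    model = {}
    for label in {y for _, y in data}:
        xs = [x for x, y in data if y == label]
        model[label] = (statistics.mean(xs), statistics.stdev(xs))
    return model

def generate(model, n=5, seed=0):
    """Downstream task 1: sample a 'shadow' dataset from the model."""
    rng = random.Random(seed)
    shadow = []
    for _ in range(n):
        label = rng.choice(sorted(model))
        mu, sigma = model[label]
        shadow.append((rng.gauss(mu, sigma), label))
    return shadow

def classify(model, x):
    """Downstream task 2: label x by highest Gaussian density, no retraining."""
    def neg_log_density(label):
        mu, sigma = model[label]
        return ((x - mu) / sigma) ** 2 / 2 + math.log(sigma)
    return min(model, key=neg_log_density)
```

In the same spirit, a deep generative model trained to capture the data distribution can answer new downstream queries through inference alone, which is what keeps data preparation and training out of the iterative loop.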
Besides agility and CI/CD, there are other open issues around ML engineering. Some of them are being addressed, while others remain hidden. We will cover more related topics in follow-up posts, so stay tuned to our blog.
“Machine learning offers a fantastically powerful toolkit for building useful complex prediction systems quickly … it is dangerous to think of these quick wins as coming for free. … it is common to incur massive ongoing maintenance costs in real-world ML systems. These include boundary erosion, entanglement, hidden feedback loops, undeclared consumers, data dependencies, configuration issues, changes in the external world, and a variety of system-level anti-patterns.”
- D. Sculley, et al., Hidden Technical Debt in Machine Learning Systems