Reducing technical debt for ML platforms
Deploying machine learning models at scale is a time-consuming process that involves many stages of simulation and stress testing. Continuous testing is needed to ensure that ML models perform as anticipated in production, in particular by monitoring for data and model drift. What if the data scientists want to put their latest model enhancements to the test in a simulated near-production environment?
This calls for a workflow that builds such an environment as soon as a new model prototype is pushed. The workflow must also be designed so that it requires little manual intervention or involvement from the SRE team.
At Episource, we have developed a CI/CD pipeline that lets data scientists host their models as APIs on demand in a production-like environment. We use AWS ECS to deploy our containers. The pipeline rolls out a test environment hosting the Model API as soon as the engineer pushes code to GitHub. The data scientist can then run as many simulations as they want before agreeing on the efficacy of the latest work. This also makes it straightforward to promote the new ML model to production at the click of a button. This talk will go over how we developed a scalable simulation pipeline for our data scientists while adhering to the mantra: ship faster, ship consistent code, and ship fearlessly. Enabling production-like test environments requires stateless resource provisioning; if that provisioning is not automated, the test environments may drift from production in subtle but significant ways.
Participants can expect to learn about the following during this talk:
Design parameters for ML deployment pipelines
Automation using GitHub Actions
Terraform usage for CI/CD jobs
Scalability: How do we ensure that our experiments are not competing for resources?
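As an illustration of the scalability question, one common way to keep experiments from competing for resources is to give every branch its own deterministically named environment, so each test deployment provisions separate resources. The sketch below is only an assumption about how such naming could work and is not taken from the Episource pipeline; the function `environment_name`, the `model-api` prefix, and the 32-character limit are all hypothetical.

```python
import hashlib
import re

# Assumed length budget for this sketch; it is not an AWS or Terraform constant.
MAX_NAME_LENGTH = 32


def environment_name(branch: str, prefix: str = "model-api") -> str:
    """Derive a deterministic, per-branch environment name (hypothetical scheme)."""
    # Keep only lowercase letters, digits, and hyphens so the name is safe
    # to reuse as a resource or workspace identifier.
    slug = re.sub(r"[^a-z0-9]+", "-", branch.lower()).strip("-")
    # A short hash of the raw branch name keeps names distinct even when two
    # branches collapse to the same slug or are truncated identically.
    digest = hashlib.sha1(branch.encode("utf-8")).hexdigest()[:6]
    # Reserve room for the prefix, the digest, and the two joining hyphens.
    room = MAX_NAME_LENGTH - len(prefix) - len(digest) - 2
    slug = slug[:room].strip("-")
    return "-".join(part for part in (prefix, slug, digest) if part)


if __name__ == "__main__":
    print(environment_name("feature/Drift-Monitor_v2"))
```

A CI job could pass a name like this to the provisioning step so that every pushed prototype lands in its own isolated environment, and the same branch always maps back to the same environment.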