CapFlow: A scalable ML framework for training and serving CRM machine learning model operations
In Capillary Technologies, we have around 30 different models to serve different use cases like recommendations, personalization, insight, decision-making, and several other retail CRM predictions. We have hundreds of customers and terabytes of data that we consume to train and serve these models through online and offline inference. To scale to such a level and cater to continuous training and delivery of results, we have built our own MLOps framework and call it CapFlow.
1) The following key points are tackled in our CapFlow Framework:
2) Scaling to a large volume of data processing.
3) Efficiently scaling the training and deployment process to handle a large number of customers.
4) Deploy techniques to monitor model performance.
5) Deploy techniques to automate the model building process.
6) Plugging new models to the suit efficiently and validate across multiple customers.
7) Continuous delivery of model results.
8) Flexibility to use various models like Tensorflow/PyTorch-based Deep learning models, decision tree-based boosting models, statistical/rule-based models, and ensemble models.
9) Should be able to work with diverse types of data like, structured, unstructured, image and audio.
Following is the description of the different components of the pipeline as shown in the figure above:
1) Data Access Platform:
Data access platform (DAP) is a spark based data reading and preprocessing platform.
DAP is developed to handle different data sources (S3, CSV, SQL, No-SQL)
Exposes uniform structured table/view API irrespective of the data source to the subsequent layers. This platform creates a uniform structure or view for data processing or training to work.
2) Data Validation Framework:
Capillary has data from various types of clients like F&B, Fashion, Footwear, etc. Before we proceed with data preprocessing and feature generation, we have designed a Data Validation Framework module where we perform validation checks on our clients’ data. A separate pipeline for Data Validation Framework is designed using Apache Airflow, and these validation checks are executed on the Databricks cluster. In Data Validation Framework, we have two levels of inspections,
a) Generic Data Validation
b) Model-specific Data Validation.
In Generic Data Validation:
We plot the Recency distribution of users, Latency distribution of users, Repeat Customer Distribution, and other Key Performance Indicators such as average_spent_per_visit, total_bill_amount, line_item_count, etc.
In Model-specific Data Validation:
For e.g., some models employ embeddings generated (using universal sentence encoder or Glove embeddings) from user/item textual description. To train such models, we need enough item description data, and to determine it, we have designed attribute fill rate KPI, which validates words in the item description.
Once both Generic and Model-specific Data Validation checks are compliant, we proceed with
training and inference.
3) Data Cleaning and Processing:
Data preprocessing and feature generation are two of the most crucial steps in any end-to-end ML model. We have developed a separate data-preprocessing module in spark on scala for data preprocessing and feature generation for all of our ML models.
Data processing module communicates with the DAP SDK to accept appropriate data, in this phase it mostly reads structured data produced by the DAP. We have incorporated various feature generation techniques such as recursive feature generation, RFM features, User Item frequency matrix and many other features which are important for the model building.
Spark on scala offers concurrency support, which is the key in parallelizing the processing of large data sets. And several of Spark’s high-performance data frameworks are written in Scala. We run our heavy data preprocessing and feature generation on Databricks clusters. We create and deploy the Jar of our data preprocessing module on Databricks clusters using the developed data generation methods. Databricks supports autoscaling, which enables clusters to resize automatically based on workloads, making it cost-efficient.
5) Model Building:
Once the business problem is identified, we need to collect relevant data points which can be used for feature engineering. Feature selection can be a recurring process where we choose the most important features and train over them. Once the important features are selected, cross-validation and auto-tuning ensures that the training process is robust and scales to multiple customers without much intervention. The algorithm used for the model building depends on the type of target we are trying to predict. The trained model is stored in the s3 path, and the path is written to the database. This model is then used for inferencing, where we load the pre-trained model and predict for given data points. We have built training libraries based on the business level model objective, and the library supports both deep learning and machine learning capabilities.
6) Feature and Model Storage:
Once the training process is over, we deploy the model automatically based on some thresholds and rules and the version changes to the new model built. The feature columns selected during the training process are also stored for reference during inference. Some of the non-model-specific parameters are also stored for future reference.
7) Batch Inferencing:
In our case, the inference is a process where we generate results or predictions by inferring the already trained model. In model training steps, we generate optimum features and parameters. We infer this model to generate results from these models and write to the appropriate destination for further consumption by other services. And we generally retrain a process until we see a dip in the performance of the model.
Simplified ML Pipeline with tech stack:
All the above modules are pipelined through Airflow on Kubernetes. We leverage Databricks Jobs and cluster APIs to manage the Databricks jobs from airflow DAG. A training and inference scheduling framework takes care of load management and submits airflow jobs based on the currently running active DAGs. Since we are running airflow on Kubernetes, we have plans to leverage Kubernetes executors for airflow to scale to multiple DAGs horizontally.