Cap-auto-feature: A scalable feature store for training CRM machine learning models
Feature engineering is at the heart of modelling, especially for tabular datasets. When modelling, it is often a good idea to layer historical data on top of contextual data; this makes the data richer and more robust for all kinds of machine learning problems, and centralized feature engineering lets us achieve it. A good feature impacts the results of a model significantly: it can encode domain knowledge, represent the data with interpretable features for training, and materially improve model results. At Capillary Technologies, we have built a framework called Cap-auto-feature to manage and incrementally calculate features, and in the process we have significantly reduced feature calculation time.
In retail predictive model training and inference, feature calculation is an integral part of the pipeline. Currently, many feature calculation implementations in a machine learning pipeline are localized to a single model. Looking across models, one sees many common features, such as recency, latency, frequency, and last purchases. Having different versions of the same feature in different places can lead to inconsistent results and increase the margin of error. Cap-auto-feature solves this with a centralized framework in which each feature is predefined. The idea is to have a single framework at hand, both for creating new models and for use by existing models. Moreover, this approach lets us go a step further and apply incremental feature generation: the module performs the initial feature calculation, and subsequent training and inference runs update the features incrementally. Incremental feature calculation uses mathematical and statistical identities to fold new data into existing feature values, saving a great deal of time on every run after the first.
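To make the incremental idea concrete, here is a minimal sketch (not the actual Cap-auto-feature implementation) of updating a running mean, a common aggregate feature, from only the new rows. The function name and state layout are assumptions for illustration:

```python
def update_mean(prev_mean, prev_count, new_values):
    """Fold new observations into a running mean without
    rescanning historical data: only (mean, count) is kept
    as state between feature-calculation runs."""
    new_count = prev_count + len(new_values)
    if new_count == 0:
        return 0.0, 0
    new_sum = prev_mean * prev_count + sum(new_values)
    return new_sum / new_count, new_count

# Initial run: one full pass over the historical transactions.
mean, count = update_mean(0.0, 0, [100.0, 200.0, 300.0])  # → (200.0, 3)
# Subsequent run: only the transactions that arrived since.
mean, count = update_mean(mean, count, [400.0])           # → (250.0, 4)
```

The same trick generalizes to counts, sums, min/max, and (with a little more state) variances, which is why incremental calculation covers so many of the aggregate features a CRM model needs.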
Cap-auto-feature has three main components: (1) table-definition, (2) feature-definition, and (3) feature-calculation. The table-definition contains the set of all tables used for feature calculation. It also stores the relationships between tables, which come in handy when a feature depends on multiple tables. The tables are Spark DataFrames that are broadcast or repartitioned according to need and table size, which improves the computation time of each feature calculation.
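A table-definition can be pictured as a small registry that holds each source table plus the join keys relating pairs of tables, so multi-table features know how to combine their inputs. The class and method names below are hypothetical; in production the values would be Spark DataFrames rather than plain lists:

```python
class TableDefinition:
    """Registry of source tables and the relationships between them."""

    def __init__(self):
        self.tables = {}     # table name -> dataframe (Spark in production)
        self.relations = {}  # (left_table, right_table) -> join key

    def register(self, name, df):
        self.tables[name] = df

    def relate(self, left, right, on):
        self.relations[(left, right)] = on

    def join_key(self, left, right):
        # A multi-table feature looks up how its inputs connect.
        return self.relations[(left, right)]

registry = TableDefinition()
registry.register("transactions", [{"customer_id": 1, "amount": 120.0}])
registry.register("customers", [{"customer_id": 1, "segment": "gold"}])
registry.relate("transactions", "customers", on="customer_id")
```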
The feature-definition is where all our important features are defined. A feature can be added to it when needed and then becomes available for computation. By its nature, a feature can be a groupByAggregate feature, a transformFeature, a LagFeature, and so on. In addition to its type, a feature-definition includes the feature's table dependencies, any dependencies on other features, the group-by columns, any filter criteria, and the actual operation to apply during computation. All of this information drives its calculation.
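The fields listed above can be sketched as a single record per feature. The exact field names here are assumptions, but each one maps to a piece of metadata the section describes:

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class FeatureDefinition:
    """Illustrative shape of one feature-definition entry."""
    name: str
    feature_type: str                   # e.g. "groupByAggregate", "transformFeature", "LagFeature"
    tables: List[str]                   # table dependencies from the table-definition
    depends_on: List[str] = field(default_factory=list)  # other features required first
    group_by: List[str] = field(default_factory=list)    # group-by columns
    row_filter: Optional[Callable] = None  # filter criteria applied before the operation
    op: Optional[Callable] = None          # the actual computation

# A recurring CRM feature: number of purchases per customer.
frequency = FeatureDefinition(
    name="purchase_frequency",
    feature_type="groupByAggregate",
    tables=["transactions"],
    group_by=["customer_id"],
    op=len,  # count of transaction rows in each group
)
```

Because the definition is data rather than code scattered across model repositories, every model that asks for `purchase_frequency` gets the same computation.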
The feature-calculation comprises several steps. It starts by resolving dependencies, i.e. adding any features that the user-requested target features depend on. It then builds a dependency graph over the features and uses that graph to select a batch for processing; a batch can contain features of any type. The batch is grouped, and each group is passed on for calculation. Once a batch is computed, all of its features are merged, the next batch is chosen, and so on. When no batches remain, all the features are merged to produce the final feature set.
Such a centralized feature store not only speeds up modelling, since data scientists no longer have to worry about feature calculation, but also provides transparency in feature development and a central record of all features used by a team or an organization. It removes code duplication and reduces the complexity and time of the feature calculation process.