Submissions for MLOps November edition
On ML workflows, tools, automation and running ML in production
Kartikeya Sharma
Level: Beginner
Timing: 15 min
The insights generated by ML models need to be consumed by other systems for ML-powered software to succeed. The data needs to be easily accessible to developers. Here I discuss how user-content interaction data flows across such systems at Mad Street Den, where we provide personalization services.
Here are some key aspects of taking ML workflows to production:
Who did what, on what, and when.
When a user visits a website, they usually click on the listed items they are interested in. For example, in e-commerce retail, users will typically follow a journey of viewing a product, adding it to the cart and then buying it. Each user has their own unique journey. All of this data is recorded along with the timing of the events; this makes up the user interaction data.
Offline data: (CSVs, JSON)
Offline data mostly comes from legacy systems, which are usually interfaced via files. These files need not share a standard schema or file format.
Online data: (JavaScript pixel, Queues)
Data arrives via a JavaScript pixel embedded in the website and is then propagated to our systems via queues. The schemas and formats are defined according to an agreement or contract.
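For illustration, a raw event emitted by the pixel might look like the sketch below; the field names here are hypothetical, since the actual payload is whatever the contract defines.

```python
# Hypothetical raw event as emitted by the JavaScript pixel and pushed
# onto a queue. Field names and format vary per contract.
raw_event = {
    "uid": "u-42",                 # user identifier
    "sku": "shoe-123",             # product (item) identifier
    "action": "add_to_cart",       # interaction type
    "ts": "2021-11-02T10:15:30Z",  # event time (ISO 8601, UTC)
}
```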
Standardize Data:
To build models that are configurable and portable, the data needs some structure.
Consistent data at the start of the process makes things easier for every system downstream; maintaining numerous ETL pipelines for varying formats and schemas becomes an overhead in the production environment.
User-Content interaction data will always have these four values:
USER ID: User identifier to track the journey across events.
ITEM ID: Items listed on the website in question.
INTERACTION TYPE: The type of action a user performs on an item. Example: viewing/buying in the case of e-commerce, or click/like/share in the case of social platforms.
TIMESTAMP: The timestamp of the event.
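As a minimal sketch of the standardization step (the mapping and helper below are illustrative, not our production code), each source's raw events can be mapped onto this four-field schema:

```python
from datetime import datetime, timezone

# Hypothetical mapping from one source's raw field names to the standard
# schema; in practice this comes from the per-source contract or config.
FIELD_MAP = {"uid": "user_id", "sku": "item_id",
             "action": "interaction_type", "ts": "timestamp"}

REQUIRED = {"user_id", "item_id", "interaction_type", "timestamp"}

def standardize(raw_event: dict) -> dict:
    """Map a raw event onto the standard user-content interaction schema."""
    event = {FIELD_MAP[k]: v for k, v in raw_event.items() if k in FIELD_MAP}
    missing = REQUIRED - event.keys()
    if missing:
        raise ValueError(f"event is missing required fields: {missing}")
    # Normalize the timestamp to UTC epoch seconds so all sources agree.
    ts = datetime.fromisoformat(event["timestamp"].replace("Z", "+00:00"))
    event["timestamp"] = int(ts.astimezone(timezone.utc).timestamp())
    return event
```

Downstream systems then see a single schema, regardless of whether an event came from an offline file or the online pixel.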
Batch:
Batch is used here in the sense that changes in the data don't change the models built on that data too much, too quickly.
Use case: Show the user products they may be interested in, based on the products they have interacted with (“You May Also Like”).
Model:
Idea: If a lot of users follow the same journey across products, then there is some similarity between those products. ALS + bucketed LSH; item factors don't change much with a few new events (see the sketch below).
Architecture: How data flows across the different systems, and why each of those systems is needed.
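A minimal PySpark sketch of the idea above, assuming Spark 3.1+ and an `interactions` DataFrame already in the standardized schema with numeric IDs and an interaction weight; the hyperparameters and threshold are illustrative:

```python
from pyspark.ml.recommendation import ALS
from pyspark.ml.feature import BucketedRandomProjectionLSH
from pyspark.ml.functions import array_to_vector
from pyspark.sql.functions import col

# interactions: DataFrame[user_id: int, item_id: int, weight: float],
# where weight encodes the interaction type (e.g. view=1, cart=3, buy=5).
als = ALS(userCol="user_id", itemCol="item_id", ratingCol="weight",
          implicitPrefs=True, rank=64)
model = als.fit(interactions)

# Item factors are stable under a few new events, so this runs as a batch job.
items = model.itemFactors.withColumn("features", array_to_vector(col("features")))

lsh = BucketedRandomProjectionLSH(inputCol="features", outputCol="hashes",
                                  bucketLength=2.0, numHashTables=5)
similar = (lsh.fit(items)
              .approxSimilarityJoin(items, items, threshold=1.0, distCol="dist")
              .filter("datasetA.id != datasetB.id"))  # "You May Also Like" candidates
```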
Real-time:
Frequent changes in the data need to be captured, and hence the models have to be refreshed and served at a much higher pace to utilize the information present in the data.
Use case 1: Show products ranked based on popularity.
Use case 2: Show the user the most popular products.
Model:
Idea: Find abnormal surges in the volume of interactions for a product. Forecasting + a trending scorer; counts do change abruptly within a few time windows for trending products.
Architecture: How data flows across the different systems, and why each of those systems is needed.
In case we need to process data with low latency, we can use structured streaming; a sketch follows below.
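For instance, with Spark Structured Streaming (the broker address, topic name and window sizes below are illustrative, and the Kafka source needs the spark-sql-kafka package on the classpath), windowed interaction counts per item can be computed and compared against a forecast:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, window
from pyspark.sql.types import StringType, StructType, TimestampType

spark = SparkSession.builder.appName("trending-counts").getOrCreate()

schema = (StructType()
          .add("user_id", StringType())
          .add("item_id", StringType())
          .add("interaction_type", StringType())
          .add("timestamp", TimestampType()))

# Read standardized events off the queue (Kafka, as an example).
events = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "interactions")
          .load()
          .select(from_json(col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

# Interaction counts per item over sliding windows; an abnormal surge in
# these counts relative to the forecast marks the item as trending.
counts = (events
          .withWatermark("timestamp", "10 minutes")
          .groupBy(window(col("timestamp"), "15 minutes", "5 minutes"),
                   col("item_id"))
          .count())

query = counts.writeStream.outputMode("update").format("console").start()
```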
Slides: https://docs.google.com/presentation/d/14CuBAg_jTt7Pe_yKVaBP7TsONYZtUp7sj9ygXgGHjy4/edit?usp=sharing