Submissions for MLOps November edition
On ML workflows, tools, automation and running ML in production
Kartikeya Sharma
Level: Beginner
Timing: 15 min
The insights generated by ML models need to be consumed by other systems for ML-powered software to succeed. The data needs to be easily accessible to developers. Here I discuss how user-content interaction data flows across such systems at Mad Street Den, where we provide personalization services.
Here are some key aspects of taking ML workflows to production:
Who did what, on what, and when.
When a user visits a website, they usually click on the listed items they are interested in. For example, in e-commerce retail, users will typically follow a journey of viewing a product, adding it to the cart and then buying it. Each user has their own unique journey. All of this data is recorded along with the timing of the events; this makes up the user interaction data.
Offline data: (CSVs, JSON)
Offline data mostly comes from legacy systems, which are usually interfaced via files. These files need not share a standard schema or file format.
Online data: (JavaScript pixel, Queues)
Data arrives via a JavaScript pixel embedded in the website and is then propagated to our systems via queues. The schemas and formats are defined according to an agreement or contract.
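For illustration, a raw event emitted by the pixel might look like the sketch below; the field names here are hypothetical, since the actual payload is whatever the contract defines.

```python
# Hypothetical raw event as emitted by the JavaScript pixel and pushed
# onto a queue. Field names and format vary per contract.
raw_event = {
    "uid": "u-42",                 # user identifier
    "sku": "shoe-123",             # product (item) identifier
    "action": "add_to_cart",       # interaction type
    "ts": "2021-11-02T10:15:30Z",  # event time (ISO 8601, UTC)
}
```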
Standardize Data:
To build models that are configurable and portable, the data needs some structure.
Consistent data at the start of the process makes things easier for every system downstream; maintaining numerous ETL pipelines for varying formats and schemas becomes an overhead in the production environment.
User-Content interaction data will always have these four values:
USER ID: User identifier to track the journey across events.
ITEM ID: Items listed on the website in question.
INTERACTION TYPE: The type of action a user performs on an item. Example: viewing/buying in the case of e-commerce, or click/like/share in the case of social platforms.
TIMESTAMP: The timestamp of the event.
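As a minimal sketch of the standardization step (the mapping and helper below are illustrative, not our production code), each source's raw events can be mapped onto this four-field schema:

```python
from datetime import datetime, timezone

# Hypothetical mapping from one source's raw field names to the standard
# schema; in practice this comes from the per-source contract or config.
FIELD_MAP = {"uid": "user_id", "sku": "item_id",
             "action": "interaction_type", "ts": "timestamp"}

REQUIRED = {"user_id", "item_id", "interaction_type", "timestamp"}

def standardize(raw_event: dict) -> dict:
    """Map a raw event onto the standard user-content interaction schema."""
    event = {FIELD_MAP[k]: v for k, v in raw_event.items() if k in FIELD_MAP}
    missing = REQUIRED - event.keys()
    if missing:
        raise ValueError(f"event is missing required fields: {missing}")
    # Normalize the timestamp to UTC epoch seconds so all sources agree.
    ts = datetime.fromisoformat(event["timestamp"].replace("Z", "+00:00"))
    event["timestamp"] = int(ts.astimezone(timezone.utc).timestamp())
    return event
```

Downstream systems then see a single schema, regardless of whether an event came from an offline file or the online pixel.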
Batch:
Batch is used here in the sense that changes in the data don't change the models built on that data too much, too quickly.
Use case: Show the user products they may be interested in, based on the products they have interacted with (“You May Also Like”).
Model:
Idea: If a lot of users follow the same journey across products, then there is some similarity between those products. ALS + bucketed LSH; item factors don't change much with a few new events (see the sketch below).
Architecture: How data flows across the different systems, and why each of those systems is needed.
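A minimal PySpark sketch of the idea above, assuming Spark 3.1+ and an `interactions` DataFrame already in the standardized schema with numeric IDs and an interaction weight; the hyperparameters and threshold are illustrative:

```python
from pyspark.ml.recommendation import ALS
from pyspark.ml.feature import BucketedRandomProjectionLSH
from pyspark.ml.functions import array_to_vector
from pyspark.sql.functions import col

# interactions: DataFrame[user_id: int, item_id: int, weight: float],
# where weight encodes the interaction type (e.g. view=1, cart=3, buy=5).
als = ALS(userCol="user_id", itemCol="item_id", ratingCol="weight",
          implicitPrefs=True, rank=64)
model = als.fit(interactions)

# Item factors are stable under a few new events, so this runs as a batch job.
items = model.itemFactors.withColumn("features", array_to_vector(col("features")))

lsh = BucketedRandomProjectionLSH(inputCol="features", outputCol="hashes",
                                  bucketLength=2.0, numHashTables=5)
similar = (lsh.fit(items)
              .approxSimilarityJoin(items, items, threshold=1.0, distCol="dist")
              .filter("datasetA.id != datasetB.id"))  # "You May Also Like" candidates
```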
Real-time:
Frequent changes in the data need to be captured, and hence the models have to be refreshed and served at a much higher pace to utilize the information present in the data.
Use case 1: Show products ranked based on popularity.
Use case 2: Show the user the most popular products.
Model:
Idea: Find abnormal surges in the volume of interactions for a product. Forecasting + a trending scorer; counts do change abruptly within a few time windows for trending products.
Architecture: How data flows across the different systems, and why each of those systems is needed.
In case we need to process data with low latency, we can use structured streaming; a sketch follows below.
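For instance, with Spark Structured Streaming (the broker address, topic name and window sizes below are illustrative, and the Kafka source needs the spark-sql-kafka package on the classpath), windowed interaction counts per item can be computed and compared against a forecast:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, window
from pyspark.sql.types import StringType, StructType, TimestampType

spark = SparkSession.builder.appName("trending-counts").getOrCreate()

schema = (StructType()
          .add("user_id", StringType())
          .add("item_id", StringType())
          .add("interaction_type", StringType())
          .add("timestamp", TimestampType()))

# Read standardized events off the queue (Kafka, as an example).
events = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "interactions")
          .load()
          .select(from_json(col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

# Interaction counts per item over sliding windows; an abnormal surge in
# these counts relative to the forecast marks the item as trending.
counts = (events
          .withWatermark("timestamp", "10 minutes")
          .groupBy(window(col("timestamp"), "15 minutes", "5 minutes"),
                   col("item_id"))
          .count())

query = counts.writeStream.outputMode("update").format("console").start()
```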
Slides: https://docs.google.com/presentation/d/14CuBAg_jTt7Pe_yKVaBP7TsONYZtUp7sj9ygXgGHjy4/edit?usp=sharing