Finding needles in high-dimensional haystacks: Product Matching in Retail
Session type: Short talk of 20 mins
Matching identical and similar products is a problem fundamental to the online retail industry, with applications spanning price optimization, recommending similar or substitute products to customers, understanding gaps in product assortments, and counterfeit product detection.
Given that there are no standard product identifiers and that catalog data is often noisy, incomplete, and nonstandard, product matching is a challenging problem at scale. In this talk we will define the problem of product matching and discuss what makes it hard. We will then discuss our approaches to addressing it.
We use an ensemble of text- and image-based approaches: content-based image retrieval (using a novel hashing technique that we developed), CNNs, language-model-based word embeddings (BERT and Transformer), and techniques from classical machine learning.
We have built an automated pipeline that adapts based on the category of products it is handling.
- The merger of text & image signals
- Importance & use of RNNs & CNNs
- Handling matching at scale: volume & variety (multiple product categories)
- The feedback loop for an effective & robust system
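
To make the idea of merging text and image signals concrete, here is a minimal sketch in Python. It is not the pipeline described in this talk: the token-overlap text score, the 64-bit integer image hashes, the function names, and the `text_weight` parameter are all assumptions chosen for illustration. A real system would likely replace them with learned embeddings and the hashing technique mentioned above.

```python
def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two product titles."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0  # two empty titles carry no evidence of a match
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def hash_similarity(h1: int, h2: int, bits: int = 64) -> float:
    """Similarity of two image hashes as 1 minus the normalized Hamming distance.

    The hashes are assumed to be precomputed `bits`-wide integers
    (hypothetical; any perceptual-style hash would do).
    """
    hamming = bin(h1 ^ h2).count("1")
    return 1.0 - hamming / bits


def match_score(title_a: str, title_b: str,
                hash_a: int, hash_b: int,
                text_weight: float = 0.5) -> float:
    """Weighted blend of the text and image signals (illustrative only)."""
    text = jaccard(title_a, title_b)
    image = hash_similarity(hash_a, hash_b)
    return text_weight * text + (1.0 - text_weight) * image
```

In a sketch like this, the weight could be tuned per product category, since titles tend to be more informative for some categories and images for others.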
Byom Kesh Jha, Data Scientist – Semantics, DataWeave
Byom designs and develops predictive modelling technologies in multiple domains, especially retail and education. He is extensively involved in the training & deployment of machine-learning models. His expertise spans diverse NLP techniques: sequence learners (NERs, classifiers), building knowledge bases, deep learning, product aspect extraction, user-generated content analysis, and more.
Websites to Datasets
As a provider of Competitive Intelligence as a Service to eCommerce businesses and consumer brands, DataWeave aggregates and analyses product catalog data from eCommerce websites each day at massive scale. Once aggregated, this data is fed into a complex process of extraction, transformation, machine learning, and analyses. These operations are performed on a consistent basis to provide our custo…