The Fifth Elephant round the year submissions for 2019

Submit a talk on data, data science, analytics, business intelligence, data engineering and ML engineering


Finding needles in high dimensional haystacks: Product Matching in Retail

Submitted by Aayushi Pathak (@09aayushi) (proposing) on Wednesday, 1 May 2019

This is a proposal requesting someone to speak on this topic. If you’d like to speak, leave a comment.


Session type: Short talk of 20 mins


Matching the same and similar products is a problem fundamental to the online retail industry, with applications spanning price optimization, recommending similar or substitute products to customers, understanding gaps in product assortments, and counterfeit product detection.
Because there are no standard product identifiers and catalog data is often noisy, incomplete and nonstandard, product matching is a challenging problem at scale. In this talk we will define the problem of product matching and discuss what makes it hard. We will then discuss our approaches to addressing it.
We use an ensemble of text- and image-based approaches: content-based image retrieval (using a novel hashing technique that we developed), CNNs, language-model-based word embeddings (BERT and Transformer), and techniques from classical machine learning.
We have built an automated pipeline that adapts based on the category of products it is handling.
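
To make the idea of combining text and image signals per category concrete, here is a minimal sketch. It is not the pipeline described in this proposal: the token-overlap text score, the 64-bit image hash, the `Product` fields and the per-category weights are all hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Product:
    title: str
    image_hash: int  # hypothetical 64-bit perceptual hash of the product image
    category: str


# Hypothetical per-category weights: (text weight, image weight).
CATEGORY_WEIGHTS = {
    "fashion": (0.3, 0.7),
    "electronics": (0.8, 0.2),
}
DEFAULT_WEIGHTS = (0.5, 0.5)


def text_similarity(a: str, b: str) -> float:
    """Jaccard overlap of lowercased, whitespace-split title tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def image_similarity(h1: int, h2: int, bits: int = 64) -> float:
    """One minus the normalised Hamming distance between two hashes."""
    return 1.0 - bin(h1 ^ h2).count("1") / bits


def match_score(p: Product, q: Product) -> float:
    """Weighted blend of both signals, with weights chosen by p's category."""
    w_text, w_image = CATEGORY_WEIGHTS.get(p.category, DEFAULT_WEIGHTS)
    return (w_text * text_similarity(p.title, q.title)
            + w_image * image_similarity(p.image_hash, q.image_hash))


a = Product("Acme Phone X 64GB Black", 0xFFFF0000FFFF0000, "electronics")
b = Product("acme phone x 64gb black", 0xFFFF0000FFFF0001, "electronics")
score = match_score(a, b)
```

In this toy setup a pair would be treated as a match when `score` clears a threshold tuned per category. A production system would replace both similarity functions with learned models.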


  • The merger of text & image signals
  • Importance & use of RNNs & CNNs
  • Handling matching at scale: volume & variety (multiple product categories)
  • The feedback loop for an effective & robust system

Speaker bio

Byom Kesh Jha, Data Scientist – Semantics, DataWeave
Byom designs and develops predictive modelling technologies in multiple domains, especially in retail and education. He is extensively involved in the training & deployment of machine-learning models. His expertise lies in diverse NLP techniques, sequence learners - NERs, classifiers, building knowledge bases, deep learning, product aspect extraction, user-generated content analysis, and more.




  • Abhishek Balaji (@booleanbalaji) Reviewer a month ago

    Hello Aayushi/Byom Kesh,

    Thank you for submitting a proposal. To proceed with evaluation, we need to see detailed slides for your proposal. The link you’ve added needs to be edited to make it publicly accessible. Your slides must cover the following:

    • Problem statement/context, which the audience can relate to and understand. The problem statement has to be a problem (based on this context) that can be generalized for all.
    • What were the tools/options available in the market to solve this problem? How did you evaluate alternatives, and what metrics did you use for the evaluation?
    • Why did you pick the option that you did?
    • Explain what the situation was before the solution you picked/built and how it changed after you implemented it. Show before-after scenario comparisons & metrics.
    • What compromises/trade-offs did you have to make in this process?
    • What is the one takeaway that you want participants to go back with at the end of this talk? What is it that participants should learn/be cautious about when solving similar problems?
    • Is the tool free/open-source? If not, what can the audience take away from the talk?

    We need to see the updated slides on or before 21 May in order to close the decision on your proposal. If we do not receive an update by 21 May, we’ll move the proposal for consideration at a future event.

    • Aayushi Pathak (@09aayushi) Proposer a month ago

      Hi Abhishek,

      Thank you for your message. I have changed the sharing settings and the deck is now publicly accessible.

      We will be re-structuring the deck based on your feedback and will surely send it by 21st May.

      Also, I would like to follow up on the other proposal I’d submitted, “Building a large-scale Data as a Service (DaaS) platform to consistently deliver high-quality datasets”. Do you want us to re-structure it in the same manner and re-submit it?


      • Abhishek Balaji (@booleanbalaji) Reviewer a month ago

        Sure, I’m adding comments separately on the other proposal as well.

        • Aayushi Pathak (@09aayushi) Proposer a month ago

          Great! Thanks. :)

  • Abhishek Balaji (@booleanbalaji) Reviewer a month ago (edited a month ago)

    In addition, please elaborate on how this proposal improves on other talks we’ve received on product matching (needle in haystack problems) -

  • Aayushi Pathak (@09aayushi) Proposer a month ago

    Also, the above link can’t be reached.

    • Abhishek Balaji (@booleanbalaji) Reviewer a month ago

      Fixed it.

  • Aayushi Pathak (@09aayushi) Proposer 27 days ago

    @Abhishek, the slide has been updated. Kindly review.

