The Fifth Elephant 2019

Gathering of 1000+ practitioners from the data ecosystem

Similarity Search for Product Matching @ Semantics3

Submitted by abishekk92 (@abishekk92) on Apr 15, 2019

Session type: Lecture Session type: Full talk of 40 mins Status: Confirmed & Scheduled


One of the major offerings of Semantics3 is our universal product data catalog, gathered through large-scale indexing of the public web. Within each catalog, duplicate entries for the same product across multiple retailers need to be merged or removed. In this talk, we will go through the technical challenges of such a large-scale “product matching” system, where millions of products are often compared against millions of others (leading to trillions of pairwise comparisons). We will discuss both traditional and state-of-the-art approaches to this task.


Introduction [~ 5 mins]

This section will present an overview of the problem and the use cases that motivate it, and set the tone for the rest of the presentation.

Topics Covered:

  • Product Matching: What is it and Why is it important?
  • Similarity Search for Product Matching: What is it and how does it speed up matching?
  • Example Case for Similarity Search: Sample product document and sample query document to explain the following sections.
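
A back-of-the-envelope sketch of where the speed-up comes from: if a similarity search returns a short list of candidates per query, the matcher only scores those candidates instead of every pair. The catalog sizes and the shortlist size `k` below are hypothetical; the proposal only states "millions against millions".

```python
def pairwise_comparisons(num_queries: int, num_candidates: int) -> int:
    """Comparisons needed when every query is compared with every candidate."""
    return num_queries * num_candidates


def shortlisted_comparisons(num_queries: int, k: int) -> int:
    """Comparisons left when a similarity search returns k candidates per query."""
    return num_queries * k


# Hypothetical sizes, chosen only to illustrate the order of magnitude.
brute_force = pairwise_comparisons(2_000_000, 3_000_000)
shortlisted = shortlisted_comparisons(2_000_000, 100)

print(f"all pairs:   {brute_force:,}")
print(f"shortlisted: {shortlisted:,}")
print(f"reduction:   {brute_force // shortlisted:,}x")
```

The reduction only helps if the shortlist still contains the true match, which is why the quality of the search step matters as much as its speed.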

Traditional Text Search Approaches [~ 5 mins]

This section will cover our initial attempt at the similarity search problem using traditional text-based methods, largely leveraging Elasticsearch.

Topics Covered:

  • Overview of how we set up the problem
  • Bottlenecks we hit and available tuning options
  • Examples of real queries
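
To make the text-based setup concrete, here is a minimal sketch of token-based retrieval over an inverted index with a simple TF-IDF score. It is not Elasticsearch and does not reproduce its scoring; the class, the product titles, and the query are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase whitespace tokenizer (a deliberate simplification)."""
    return text.lower().split()


class InvertedIndex:
    def __init__(self):
        self.postings = defaultdict(dict)  # token -> {doc_id: term frequency}
        self.num_docs = 0

    def add(self, doc_id, text):
        for token, tf in Counter(tokenize(text)).items():
            self.postings[token][doc_id] = tf
        self.num_docs += 1

    def search(self, query, k=3):
        scores = defaultdict(float)
        for token in set(tokenize(query)):
            docs = self.postings.get(token, {})
            if not docs:
                continue  # token never seen at index time
            idf = math.log(1 + self.num_docs / len(docs))
            for doc_id, tf in docs.items():
                scores[doc_id] += tf * idf
        # Highest score first; ties broken by doc_id for determinism.
        return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


index = InvertedIndex()
index.add("a", "apple iphone 8 64gb space gray")
index.add("b", "apple iphone 8 case gray")
index.add("c", "samsung galaxy s9 64gb black")

results = index.search("iphone 8 64gb grey unlocked")
print(results)
```

In this toy data the query token "grey" never matches the indexed "gray", and the phone case ranks second because it shares tokens with the phone. These are well-known limitations of exact token matching in general, shown here only as an illustration and not as the specific failures the talk covers.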

Lessons from Traditional Text Search Approaches [~ 5 mins]

This section will cover some of the key insights we gleaned from traditional text approaches and how we needed to reframe the problem.

Topics Covered:

  • The nature of our data/problem and why Elasticsearch wasn’t a good fit.
  • Need for indexing multi-modal data
  • Examples of failed cases
  • Search is only as good as the document’s representation.

Representation Learning [~ 10 mins]

This section will cover how we reframed this as a representation learning problem: the different network architectures we tried, how we adapted them to our needs, what worked and what didn’t, and the challenges we faced along the way.

Topics Covered:

  • How we reframed the problem
  • Different network architectures we tried and their results.
  • Examples of success cases which had failed previously.
  • Infrastructure and scaling challenges

Infrastructure Challenges [~ 5 mins]

Solving the representation problem didn’t necessarily solve the similarity search problem. It only gave us a way to sufficiently represent all the product information in a vector space. This section will cover the infrastructure challenges, the options we considered and how we ended up choosing FAISS.

Topics Covered:

  • Challenges, Constraints
  • Re-evaluating Elasticsearch
  • Evaluating FAISS
  • Key benchmarks
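
For readers new to vector search, the following is a minimal pure-Python sketch of exact nearest-neighbour search by cosine similarity, the kind of result an exact (flat) vector index returns. It does not use FAISS, and the listing IDs and three-dimensional embeddings are hypothetical; real embeddings would come from the learned representation and have many more dimensions.

```python
import heapq
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def top_k(query, vectors, k=2):
    """Return the k (doc_id, cosine similarity) pairs most similar to the query."""
    q = normalize(query)
    scored = (
        (sum(a * b for a, b in zip(q, normalize(vec))), doc_id)
        for doc_id, vec in vectors.items()
    )
    return [(doc_id, score) for score, doc_id in heapq.nlargest(k, scored)]


# Hypothetical embeddings for three product listings.
embeddings = {
    "listing-1": [0.9, 0.1, 0.0],
    "listing-2": [0.8, 0.2, 0.1],
    "listing-3": [0.0, 0.1, 0.9],
}

results = top_k([1.0, 0.1, 0.0], embeddings, k=2)
print(results)
```

This scan touches every stored vector for every query, so its cost grows linearly with catalog size. That cost is what makes dedicated vector-search infrastructure worth evaluating at the scale of millions of products.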

Conclusion [~ 2 mins]


Familiarity with text search paradigms will be a good-to-have (not essential).

Speaker bio

Abishek is a member of the data science team at Semantics3, which offers data and AI solutions for ecommerce marketplaces (catalog generation & enrichment, seller on-boarding) and logistics companies (HTS/tariff classification, attribute enrichment). Among these, Abishek is the lead data scientist working on product matching and catalog generation.





  • Zainab Bawa (@zainabbawa) a year ago

    Thanks for the submission, Abhishek. Upload your preview video and draft slides by 1 May to complete evaluation.

  • abishekk92 (@abishekk92) Proposer a year ago

    Zainab, I have added a video preview as well. Thanks!

  • abishekk92 (@abishekk92) Proposer a year ago

    Zainab and the panel, following are the key areas that one can look forward to in the talk.

    • Traditionally, search has been approached in a textual context, at times with a handful of hashing tricks to deal with multi-modal data. Such techniques, however, fail badly on data that requires a richer and finer representation, such as eCommerce products. The audience can look forward to hearing about representation learning in the context of search.
    • Scaling and serving vector search hasn’t always been easy, at least not until Microsoft open-sourced SPTAG Server[0]. Since our architecture predates SPTAG, the audience can look forward to hearing how my team evaluated the available options and the choices we eventually made to productionize the solution.

    [0] - SPTAG -

  • Zainab Bawa (@zainabbawa) a year ago

    Hello @abishekk92,

    Here is the feedback that came up in the review:

    1. The proposal is particularly interesting because it is more of a comparison of Elasticsearch versus FAISS, which brings more practical relevance than academics.
    2. You need to provide a deep dive on different network architecture models for the talk to be more in-depth.
    3. The infrastructure scaling part of the slides needs a deep dive.
    4. The “key benchmark” slide contains a placeholder for precision and other comparisons. Share real numbers to complete the slides.

    Overall, our assessment is that the proposal is interesting, but the slides are still work-in-progress. Until the material is completed on the slide, we can’t make a final decision on the talk. Therefore, if you share the completed slides by 8 June, then we can make a final decision.

  • Abhishek Balaji (@booleanbalaji) a year ago

    Hi Abishek,

    We’re scheduling a rehearsal for your talk. You’ll receive information about the rehearsal and the schedule on email. Do make sure to incorporate all the feedback suggested in your presentation.

  • Abhishek Balaji (@booleanbalaji) a year ago

    Thanks for going through the rehearsal Abishek. Here’s the feedback:

    • Time taken: 29 mins
    • Pace of the talk can be much faster.
    • Use points and speaker notes as cue/prompts for speaking
    • The talk is all about “What we’ve done”, but doesn’t cover much of the journey and the choices made along the way. For someone in the audience, hearing about the choices and considerations is much more valuable.
    • Put down the takeaways more clearly and in a relatable manner.
    • Explain and illustrate where this fits into a larger architecture model
    • Spend more time on the network diagram and introducing LSTMs. Explain why they were the right choices
    • Work on presentation skills - need to be faster paced, more confident and energetic to grasp the attention of the audience.
    • Presentation still lacks novelty or generic takeaways in this talk.
    • Learnings from the presentation are specific to the usecase and might not be relatable with the audience.

    Abishek, do update your slides based on the feedback shared and get back by Jun 20, 2019. We’ll evaluate the revised slides to make a decision on your talk.

    • abishekk92 (@abishekk92) Proposer a year ago

      Abhishek and the panel, thanks for taking the time to provide the feedback. I have reworked the slides with what I think would be relevant for the audience. Let me know if this is what you had in mind.

      • Abhishek Balaji (@booleanbalaji) a year ago

        Ack, will get this evaluated again.

  • Abhishek Balaji (@booleanbalaji) a year ago

    Hi Abishek, we’re confirming this talk for the conference. We’ll be adding to the schedule and communicating on further updates. We’ll be scheduling a second round of rehearsal next week.

  • abishekk92 (@abishekk92) Proposer a year ago

    Hi Zainab,
    Thanks for the comment. Just getting around to this, have shared my draft slides. I should be able to upload a preview video before tomorrow, I hope that’s alright.

    • Zainab Bawa (@zainabbawa) a year ago

      Received both.

      A question that will come up in the review: We’ve had quite a few talks on “needles in the haystack” at The Fifth Elephant in the past. Check for videos. What is it about your proposal which is novel and different?
