The Fifth Elephant 2019

Gathering of 1000+ practitioners from the data ecosystem

Scalable NLP Pipeline for Building Catalogue for MSMEs

Submitted by Deepak Sharma (@deepaksharmaclustr) on Apr 15, 2019

Session type: Lecture
Session type: Tutorial
Status: Rejected


We want to build a catalogue for millions of MSMEs across India. To achieve this, we are bootstrapping the catalogue from the raw product descriptions in the inventories of our current customers. This is a rich source of product entities. However, since this data is specific to each customer, it is highly contextual, with little common grammar. This makes it extremely difficult to identify a product entity from its raw description. We attempt to solve this problem by deduplicating at scale. We dedupe the product descriptions by finding a vectorized representation of the data, identifying the k-nearest neighbours of each product description to create an adjacency graph, and using graph-based clustering methods to find clusters. We use an active learning approach to tune our parameters: we present similar products from different clusters and different products from the same cluster for labeling.
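A minimal sketch of the dedupe idea described above, under stated assumptions: character trigram counts stand in for the vectorized representation, neighbours are found by brute-force cosine similarity, and connected components stand in for the graph-based clustering method. The function names, the `k` and `min_sim` parameters, and the toy threshold are all hypothetical and are not taken from the proposal.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=3):
    """Count character n-grams of a lower-cased, space-padded description."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def knn_graph(descriptions, k=2, min_sim=0.5):
    """Link each description to its k most similar neighbours above min_sim."""
    vectors = [char_ngrams(d) for d in descriptions]
    graph = {i: set() for i in range(len(descriptions))}
    for i, vi in enumerate(vectors):
        scored = sorted(
            ((cosine(vi, vj), j) for j, vj in enumerate(vectors) if j != i),
            reverse=True,
        )
        for sim, j in scored[:k]:
            if sim >= min_sim:
                graph[i].add(j)
                graph[j].add(i)  # keep the adjacency graph undirected
    return graph


def connected_components(graph):
    """Group mutually reachable nodes; a simple stand-in for graph clustering."""
    seen, clusters = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        clusters.append(sorted(component))
    return clusters


if __name__ == "__main__":
    sample = [
        "Guru Essence EDP 100ml",
        "guru essence edp 100 ml",
        "Lal Prive Rose Royale 100ml",
        "Lal Prive Rose Royal 100ml",
    ]
    print(connected_components(knn_graph(sample)))
```

The brute-force neighbour search compares every pair of descriptions, so it is only suitable for illustrating the idea on a handful of rows.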


Provide millions of small enterprises access to a structured catalogue
Faster, improved search
Aggregation services like HSN, tax rate, etc.

Bootstrap catalogue from raw product descriptions available with existing customers
Existing customers create “masters” for inventory management and invoicing
These masters are product descriptions and hierarchies, but they are highly specific to the customer (since the core product imposes minimal structure on this ontology)
e.g. GK Adv mat 10 PCs, Ks - Deo on. 225, Dab Real Pin Rs99
Extremely rich data covering the breadth and depth of SKUs, including things we cannot find elsewhere
e.g. Guru Essence EDP 100ml, Lal Prive Rose Royale 100ml

Highly contextual Product Descriptions with very little common grammar
Uncommon abbreviations, transliterations, misspellings, etc. of attributes
No imposed structure or restrictions on attributes (the Category attribute can hold a Category, Brand, Company or any other hierarchy the user finds useful)
So, unlike in ecommerce data, attributes cannot be resolved directly without statistical attribute extraction
High volume with high product variation
No attribute ontology (Brands, Categories etc)

ML Task
We have identified 3 stages: Dedupe/Cluster, Publish, Map
We are focused on the first stage
Key ML Task in the first stage
Find all unique product representations
Cluster/dedupe input representations into unique product representations
Map SKUs to their attributes to allow mapping and aggregations (Brand, Category, Unit of Measurement, etc.)
We also need to
Do it at scale
Continuously update as new data comes in

Our aim is to find an intermediate representation of raw product description data which will allow us to group these into micro-clusters which can be curated and published.
We first perform attribute extraction/role labeling of the raw product description. We do this using Deep Semantic Parsing models together with rule based models.
We explore various distributional representations of these extracted attributes and use them to find nearest neighbors.
We then use the adjacency graph built from the nearest neighbors to find clusters of SKUs.
We use active learning to continuously update clusters by labeling a small sparse sample and using feedback to improve parsing and clustering models.
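The active learning step could select pairs for labeling roughly as follows. This is a sketch under assumptions, not the proposers' implementation: `pairs_to_label`, its `budget` parameter, and the token-overlap `jaccard` similarity are hypothetical names introduced for illustration, and the exhaustive pair enumeration is only practical on a small sample.

```python
from itertools import combinations


def jaccard(a, b):
    """Token-set overlap, used here only as a toy similarity measure."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def pairs_to_label(labels, similarity, budget=2):
    """Pick the pairs where cluster assignment and similarity disagree most.

    labels: dict mapping item -> cluster id (assumed output of clustering)
    similarity: function (item, item) -> float, higher means more alike
    """
    cross, within = [], []
    for a, b in combinations(sorted(labels), 2):
        score = similarity(a, b)
        if labels[a] == labels[b]:
            within.append((score, a, b))
        else:
            cross.append((score, a, b))
    cross.sort(reverse=True)  # similar products that ended up in different clusters
    within.sort()             # dissimilar products that ended up in the same cluster
    return {
        "maybe_merge": [(a, b) for _, a, b in cross[:budget]],
        "maybe_split": [(a, b) for _, a, b in within[:budget]],
    }


if __name__ == "__main__":
    toy_labels = {
        "guru essence edp 100ml": 0,
        "guru essence edp 100 ml": 1,
        "lal prive rose royale 100ml": 1,
    }
    print(pairs_to_label(toy_labels, jaccard, budget=1))
```

Annotator answers on the `maybe_merge` and `maybe_split` pairs would then serve as the feedback used to adjust the parsing and clustering models.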

We parse and label the raw product description using a Deep Semantic Model (BiLSTM-CRF) and domain rules to extract attributes
We create a distributed representation of the extracted attributes.
To avoid N^2 comparisons, we calculate nearest neighbours using Locality Sensitive Hashing on the word vectors.
We improve similarity by using lexical features of the product description text.
We use community detection methods to cluster the product descriptions. The idea is that similar product titles will lie in a densely connected part of the graph.
We sample from clusters by picking similar products from different clusters and dissimilar products from the same cluster. We use this feedback to update the parsing and clustering models.
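To illustrate how hashing avoids comparing every pair of vectors, here is a small sketch of one common LSH family, random hyperplane hashing for cosine similarity. The proposal does not say which LSH scheme is used, so this choice is an assumption, as are the function names and the `n_bits`, `n_tables`, and `seed` parameters.

```python
import random
from collections import defaultdict
from itertools import combinations


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random Gaussian hyperplanes in dim dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def signature(vector, hyperplanes):
    """One bit per hyperplane: which side of it the vector falls on."""
    return tuple(
        sum(v * h for v, h in zip(vector, plane)) >= 0.0 for plane in hyperplanes
    )


def candidate_pairs(vectors, n_bits=8, n_tables=4, seed=0):
    """Return index pairs that share a bucket in at least one hash table."""
    dim = len(vectors[0])
    pairs = set()
    for table in range(n_tables):
        planes = make_hyperplanes(dim, n_bits, seed=seed + table)
        buckets = defaultdict(list)
        for idx, vec in enumerate(vectors):
            buckets[signature(vec, planes)].append(idx)
        for members in buckets.values():
            # Only vectors in the same bucket are ever compared.
            pairs.update(combinations(members, 2))
    return pairs


if __name__ == "__main__":
    toy_vectors = [[1.0, 0.0, 0.5], [2.0, 0.0, 1.0], [-1.0, 0.0, -0.5]]
    print(candidate_pairs(toy_vectors))
```

Exact similarity would then be computed only for the candidate pairs. More bits per signature make buckets stricter, while more tables recover near neighbours that a single table happens to separate.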


Basics of Machine Learning.

Speaker bio

Masters in mechanical engineering, trained in industrial automation and process control
Lead data scientist with more than 10 years of experience in building ML solutions for manufacturing, automotive, ecommerce and consulting.
Multiple conference publications in IEEE and ASME
Have been an instructor and mentor for multiple data science courses





  • Anwesha Sarkar (@anweshaalt) a year ago

    Thank you for your submission. Submit your preview video by 23rd March (latest). It helps us to provide a fair evaluation of the proposal and close the review process.

  • Deepak Sharma (@deepaksharmaclustr) Proposer a year ago

    Should that not be 23rd April?

  • Zainab Bawa (@zainabbawa) a year ago (edited a year ago)

    Thanks for sharing the slides and preview video, Deepak. Here are some of the comments from the review:

    1. The proposed talk begins with a description of Clustr, which gives the impression that this talk is a pitch for Clustr’s products for MSMEs. We do not permit company pitches on stage unless a talk is explicitly classified as a sponsored talk.
    2. Then the proposed talk goes on to describe the solution that is being built. The slides as well as the preview video end with a description of the solution, without clarifying why this solution is being described and/or what is the key insight that you are trying to share with participants.
    3. Overall, the goals and takeaways of this proposal are unclear. Help us understand this.
    • Deepak Sharma (@deepaksharmaclustr) Proposer a year ago

      Thank you for the review
      1. I think my approach was to explain the problem and provide a background. It has been my experience that without providing an insight into the problem and the type of data we are working with, it is very difficult for people to appreciate the complexity of the problem. I can reduce the description of Clustr’s goals and focus more on the problem. I will update that in the slides.
      2. The key takeaway I want to present is the ML pipeline we have built and how it has helped us extract value from very raw product descriptions. The ML modules we have built feed into various business applications like Product Search, Inventory Management, Invoicing, etc. Its impact on customer satisfaction etc. would take some time to measure, but I can share some examples of how our pipeline has been able to handle this kind of sparse, contextual product description.
      3. I will share more on key takeaways and goals, and add that to the presentation.

      • Deepak Sharma (@deepaksharmaclustr) Proposer a year ago

        When is the deadline to submit the updated presentation?

        • Zainab Bawa (@zainabbawa) a year ago

          23 May. The updated slides have to be uploaded here.

      • Zainab Bawa (@zainabbawa) a year ago

        Comments to your responses:

        1. What IS the problem? The problem has to be either about “scalable NLP pipelines and abstracted to a general problem statement beyond Clustr” OR about “unique challenges in building catalogue for MSMEs” or something else. This is where a lot of thinking is required for your proposal.
        2. A takeaway can be WHY you have built the ML pipelines in the way that you have (giving reasons for your approach based on the problem statement, how you thought about the problem, what approaches did you compare with, why you decided to go with the present tech stack). Explaining HOW you built the ML pipeline isn’t a takeaway. It is only a description of your present solution. Describing your solution doesn’t help the audience unless you foreground it in the larger context of the problem itself. The audience needs to primarily develop an appreciation of the problem; solutions are secondary.

        One way to structure the talk is to convert it into a war story, sharing detailed experiences of what it took for you to build scalable NLP pipelines (and explaining why you chose to scale the NLP pipelines) and the battles you fought to get to the present situation.

        You need to focus more time on thinking about the problem. Unless this is clear, we will only end up iterating versions of the slides without addressing the moot point.

  • Deepak Sharma (@deepaksharmaclustr) Proposer a year ago

    I have updated the presentation and restructured it based on above comments. I have also changed the focus of the presentation. Please let me know your thoughts.

    • Zainab Bawa (@zainabbawa) a year ago

      Thanks Deepak. The revised slides look clearer. We have requested reviewers from the community to review slides.

  • Zainab Bawa (@zainabbawa) a year ago

    Meanwhile, take us through the problem statement, by explaining it here so that we get a clearer picture of the problem. The slides provide multiple hints, but hearing it from you, here, will help us a great deal.

  • Abhishek Balaji (@booleanbalaji) a year ago

    Feedback from reviewers:

    • Two things would be interesting in this presentation:
      • Active learning: What is your decision making and annotation systems?
      • Bi-LSTM/CRF model: The fact that you used it is not interesting, but how you tuned it to work for your context/dataset would be.

    A talk centered around these two components would be very interesting to watch at the event. The rest of the stuff - pipeline, problem statement, NLP techniques like edit distance / soundex - is kind of par for the course for anybody working on ecommerce data.

  • Abhishek Balaji (@booleanbalaji) a year ago

    More feedback:

    Overall comment: needs major/dramatic revision, make your presentation a lot more visual, substantially reduce text on the slides, have a crisp story to your presentation

    • fifthel is a tech conference, title can be much more interesting : uncovering structured catalogues from highly unstructured text data (as an example)
    • pic : MSME vs B2C - who are the buyers and who are the sellers
    • goal slide is the goal of your company, what is the goal of your talk, what is it that you want audience to remember?
    • approach slide is coming too early, where is the problem? I am lost
    • core single line problem statement is absent, what is it that you are solving precisely?
    • for each of the challenges, show visual example to crisply convey your point of view: what is deduplication? what is extraction of key attributes? have separate and very clear slides for these and such terms.
    • “first iteration slide” : what do you expect audience to do? stare at your slide, read your slide, listen to you reading this slide, listen to you talking about this slide, listen only to you and not look at the slide? it will take 2 minutes to read + digest + understand so much text on one slide
    • lexical similarity seems out of place
    • in general, a nice story to the overall presentation is missing: what is the essence of first vs second vs third iteration? why do you need three iterations? what is changing across iterations? what are the key takeaways?
    • if I am a data engineer or machine learning engineer, what are the learnings for me?
  • Abhishek Balaji (@booleanbalaji) 11 months ago

    Moving this to reject since the timelines don't fit for The Fifth Elephant 2019.
