The Fifth Elephant 2019

Gathering of 1000+ practitioners from the data ecosystem


Scalable NLP Pipeline for Building Catalogue for MSMEs


Deepak Sharma


We want to build a catalogue for millions of MSMEs across India. To achieve this, we are bootstrapping the catalogue from the raw product descriptions in the inventories of our current customers. This is a rich source of product entities. However, since this data is specific to each customer, it is highly contextual and has little common grammar, which makes it extremely difficult to identify a product entity from its raw description. We attempt to solve this problem by deduplicating at scale: we find a vectorized representation of the product descriptions, identify the k-nearest neighbours of each description to create an adjacency graph, and use graph-based clustering methods to find clusters. We use an active learning approach to tune our parameters, presenting similar products from different clusters and different products from the same cluster for labeling.
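As a rough illustration of the labeling step described above, the sketch below picks the most similar pairs that landed in different clusters and the least similar pairs that share a cluster. The function names, the toy vectors, the cluster labels and the `budget` parameter are all invented for this example; the brute-force loop over every pair is only workable for small samples and is not a claim about how the production pipeline selects pairs.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pairs_for_labeling(vectors, labels, budget=2):
    """Pick the pairs a human reviewer is most likely to correct.

    Returns (cross, within):
      cross  - most similar pairs that sit in different clusters
      within - least similar pairs that sit in the same cluster
    Each entry is a (similarity, i, j) tuple.
    """
    cross, within = [], []
    for i, j in combinations(range(len(vectors)), 2):
        sim = cosine(vectors[i], vectors[j])
        (within if labels[i] == labels[j] else cross).append((sim, i, j))
    cross.sort(reverse=True)  # highest similarity first
    within.sort()             # lowest similarity first
    return cross[:budget], within[:budget]


# Toy 2-D vectors and cluster labels, invented for illustration only.
vectors = [(1.0, 0.0), (0.9, 0.1), (0.8, 0.3), (0.0, 1.0), (0.7, 0.7)]
labels = [0, 0, 1, 1, 1]
cross, within = pairs_for_labeling(vectors, labels, budget=1)
```

Here `cross` holds a candidate merge (items 1 and 2 are close but split across clusters) and `within` holds a candidate split (items 2 and 3 share a cluster but are far apart). A reviewer's answers on such pairs are the feedback used to adjust the models.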


Provide millions of small enterprises access to a structured catalogue
Faster, improved search
Aggregation services such as HSN, tax rate, etc.

Bootstrap catalogue from raw product descriptions available with existing customers
Existing customers create “masters” for inventory management and invoicing
These masters are product descriptions and hierarchies, but they are highly specific to the customer (since the core product imposes minimal structure on this ontology)
- e.g. GK Adv mat 10 PCs, Ks - Deo on. 225, Dab Real Pin Rs99
Extremely rich data covering the breadth and depth of SKUs, including items we cannot find elsewhere
- e.g. Guru Essence EDP 100ml, Lal Prive Rose Royale 100ml

Highly contextual Product Descriptions with very little common grammar
Uncommon abbreviations, transliterations, misspellings, etc. of attributes
No imposed structure or restrictions on attributes (the Category attribute can hold a value that is a category, brand, company or any other hierarchy the user finds useful)
So, unlike in ecommerce data, attributes cannot be resolved directly without statistical attribute extraction
High volume with high product variation
No attribute ontology (Brands, Categories etc)

ML Task
We have identified three stages: Dedupe/Cluster, Publish, Map
We are focused on the first stage
Key ML Task in the first stage
Find all unique product representations
Cluster/dedupe input representations into unique product representations
Map SKUs to their attributes to allow mapping and aggregations (Brand, Category, Unit of Measurement, etc.)
We also need to
Do it at scale
Continuously update as new data comes in

Our aim is to find an intermediate representation of raw product description data which will allow us to group these into micro-clusters which can be curated and published.
We first perform attribute extraction/role labeling of the raw product description. We do this using Deep Semantic Parsing models together with rule-based models.
We explore various distributional representations of these extracted attributes and use them to find nearest neighbours.
We then use the adjacency graph built from the nearest neighbours to find clusters of SKUs.
We use active learning to continuously update clusters by labeling a small sparse sample and using feedback to improve parsing and clustering models.
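The representation, nearest-neighbour and clustering steps above can be sketched end to end on a handful of strings. This is a minimal sketch under loud assumptions: character trigram counts stand in for the learned distributional representations, the nearest-neighbour search is brute force, and connected components of the neighbour graph stand in for whichever graph clustering method is actually used. The descriptions, the function names and the `k` and `min_sim` values are invented for illustration.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Bag of character n-grams; tolerant of abbreviations and misspellings."""
    s = " " + text.lower() + " "
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_graph(vectors, k=2, min_sim=0.3):
    """Undirected adjacency sets linking each item to its k most similar items."""
    graph = {i: set() for i in range(len(vectors))}
    for i, v in enumerate(vectors):
        sims = [(cosine(v, w), j) for j, w in enumerate(vectors) if j != i]
        for sim, j in sorted(sims, reverse=True)[:k]:
            if sim >= min_sim:  # drop weak links so unrelated items stay apart
                graph[i].add(j)
                graph[j].add(i)
    return graph


def connected_components(graph):
    """Group mutually reachable nodes (a crude stand-in for community detection)."""
    seen, clusters = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        clusters.append(sorted(component))
    return clusters


# Invented descriptions: two products, each written three different ways.
descriptions = [
    "blue pen 10 pcs", "blue pen 10pcs", "blue pens 10 pcs",
    "rose soap 100g", "rose soap 100 g", "rose soaps 100g",
]
vectors = [char_ngrams(d) for d in descriptions]
clusters = connected_components(knn_graph(vectors, k=2, min_sim=0.3))
```

On this toy input the three spellings of each product end up in one cluster. Connected components will merge two products as soon as a single spurious edge joins them, which is one reason a real pipeline would prefer a community detection method that looks at edge density.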

We parse and label the raw product description using a Deep Semantic Model (BiLSTM-CRF) and domain rules to extract attributes
We create a distributed representation of the extracted attributes.
To avoid N^2 computations, we calculate nearest neighbours using Locality-Sensitive Hashing on the word vectors.
We improve the similarity estimates by using lexical features drawn from the text of the product description.
We use community detection methods to cluster the product descriptions. The idea is that similar product titles will lie in a densely connected region of the graph.
We sample from clusters by picking similar products from different clusters and dissimilar products from the same cluster. We use this feedback to update the parsing and clustering models.
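To make the hashing step concrete, here is a minimal sketch of one common Locality-Sensitive Hashing family, random hyperplanes for cosine similarity. The outline does not say which hash family is used, so this choice is an assumption, as are the function names, the `n_bits` and `n_tables` values and the toy vectors. Each vector gets a short bit signature per table, and only vectors that share a bucket in some table are compared, which avoids scoring every pair.

```python
import random
from collections import defaultdict


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random Gaussian hyperplanes in a dim-dimensional space."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def signature(vector, hyperplanes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(
        1 if sum(h * x for h, x in zip(plane, vector)) >= 0 else 0
        for plane in hyperplanes
    )


def candidate_pairs(vectors, n_bits=4, n_tables=8, seed=0):
    """Index pairs (i, j), i < j, that share a bucket in at least one table."""
    dim = len(vectors[0])
    pairs = set()
    for table in range(n_tables):
        planes = make_hyperplanes(dim, n_bits, seed=seed + table)
        buckets = defaultdict(list)
        for idx, vec in enumerate(vectors):
            buckets[signature(vec, planes)].append(idx)
        # Only items inside the same bucket are ever compared.
        for members in buckets.values():
            for a in range(len(members)):
                for b in range(a + 1, len(members)):
                    pairs.add((members[a], members[b]))
    return pairs


# Toy vectors, invented for illustration: item 1 points the same way as
# item 0, item 2 points the opposite way.
vectors = [(1.0, 2.0, 3.0), (2.0, 4.0, 6.0), (-1.0, -2.0, -3.0)]
pairs = candidate_pairs(vectors)
```

Items 0 and 1 have the same direction, so they fall on the same side of every hyperplane and always collide, while item 2 falls on the opposite side and never shares a bucket with them. More bits per signature make buckets stricter, and more tables recover near neighbours that a single table would miss.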


Basics of Machine Learning.

Speaker bio

Master's degree in mechanical engineering, trained in industrial automation and process control
Lead data scientist with more than 10 years of experience in building ML solutions for manufacturing, automotive, ecommerce and consulting.
Multiple conference publications in IEEE and ASME
Have been an instructor and mentor for multiple data science courses