Scalable NLP Pipeline for Building Catalogue for MSMEs

Accepting submissions till 15 Jun 2019, 01:00 PM

NIMHANS Convention Centre, Bengaluru

##The eighth edition of The Fifth Elephant will be held in Bangalore on 25 and 26 July. A thousand data scientists, ML engineers, data engineers and analysts will gather at the NIMHANS Convention Centre in Bangalore to discuss:

Model management, including data cleaning, instrumentation and productionizing data science.
Bad data and case studies of failure in building data products.
Identifying and handling fraud + data security at scale
Applications of data science in agriculture, media and marketing, supply chain, geo-location, SaaS and e-commerce.
Feature engineering and ML platforms.
What it takes to create data-driven cultures in organizations of different scales.

##Highlights:

1. Meet Peter Wang, co-founder of Anaconda Inc, and learn about why data privacy is the first step towards robust data management; the journey of building Anaconda; and Anaconda in enterprise.
2. Talk to the Fulfillment and Supply Group (FSG) team from Flipkart, and learn about their work with platform engineering where ground truths are the source of data.
3. Attend tutorials on Deep Learning with RedisAI and on TransmogrifAI, Salesforce’s open source AutoML library.
4. Discuss interesting problems to solve with data science in agriculture, SaaS perspective on multi-tenancy in Machine Learning (with the Freshworks team), bias in intent classification and recommendations.
5. Meet data science, data engineering and product teams from sponsoring companies to understand how they are handling data and leveraging intelligence from data to solve interesting problems.

##Why should you attend?

Network with peers and practitioners from the data ecosystem
Share approaches to solving expensive problems such as cleanliness of training data, model management and versioning data
Demo your ideas in the demo session
Join Birds of Feather (BOF) sessions to have productive discussions on focussed topics. Or, start your own Birds of Feather (BOF) session.

##Full schedule published here: https://hasgeek.com/fifthelephant/2019/schedule

##Contact details:
For more information about The Fifth Elephant, sponsorships, or anything else, call +91-7676332020 or email info@hasgeek.com.

Hosted by

The Fifth Elephant

The Fifth Elephant, known as one of the best data science and Machine Learning conferences in Asia, has transitioned into a year-round forum for conversations about data and ML engineering, data science in production, and data security and privacy practices.

Submitted Apr 15, 2019

Session type: Lecture Session type: Tutorial

We want to build a catalogue for millions of MSMEs across India. To achieve this, we are bootstrapping the catalogue from the raw product descriptions in the inventories of our current customers. This is a rich source of product entities. However, since this data is specific to each customer, it is highly contextual and shares little common grammar, which makes it extremely difficult to identify a product entity from its raw description. We attempt to solve this problem by deduplicating at scale. We dedupe the product descriptions by finding a vectorized representation of the data, identifying the k-nearest neighbours of each product description to create an adjacency graph, and using graph-based clustering methods to find clusters. We use an active learning approach to tune our parameters: reviewers are given similar products from different clusters and different products from the same cluster for labeling.
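
The pair selection behind the active learning step can be sketched roughly as follows. This is a minimal illustration, not the pipeline described in this proposal: the character-trigram vectors stand in for whatever learned representation is actually used, and the names `trigram_vector`, `pairs_to_label` and `budget` are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def trigram_vector(text):
    """Character-trigram counts: an assumed stand-in for a learned representation."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def pairs_to_label(descriptions, cluster_ids, budget=2):
    """Pick the pairs a human reviewer should look at.

    Returns two lists of (similarity, i, j) tuples:
    - the most similar pairs that sit in different clusters (possible missed merges)
    - the least similar pairs that share a cluster (possible wrong merges)
    """
    vectors = [trigram_vector(d) for d in descriptions]
    across, within = [], []
    for i, j in combinations(range(len(descriptions)), 2):
        sim = cosine(vectors[i], vectors[j])
        if cluster_ids[i] == cluster_ids[j]:
            within.append((sim, i, j))
        else:
            across.append((sim, i, j))
    across.sort(reverse=True)  # highest similarity first
    within.sort()              # lowest similarity first
    return across[:budget], within[:budget]
```

The brute-force loop over all pairs is only workable for a small sample; at catalogue scale the candidate pairs would have to come from an approximate nearest-neighbour step.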

Outline

Goal
Provide millions of small enterprises access to a structured catalogue
Faster, improved search
Aggregation services such as HSN, tax rate, etc.

Approach
Bootstrap the catalogue from raw product descriptions available with existing customers
Existing customers create “masters” for inventory management and invoicing
These masters are product descriptions and hierarchies, but they are highly specific to the customer (since the core product imposes minimal structure on this ontology)

**e.g. GK Adv mat 10 PCs, Ks - Deo on. 225, Dab Real Pin Rs99**
Extremely rich data covering the breadth and depth of SKUs, including things we cannot find elsewhere
e.g. Guru Essence EDP 100ml, Lal Prive Rose Royale 100ml

Challenges
Highly contextual product descriptions with very little common grammar
Uncommon abbreviations, transliterations, misspellings, etc. of attributes
No imposed structure or restrictions on attributes (a Category attribute can hold a value that is a category, brand, company or any other hierarchy the user finds useful)
So, unlike in ecommerce data, attributes cannot be resolved directly without statistical attribute extraction
High volume with high product variation
No attribute ontology (Brands, Categories, etc.)

ML Task
We have identified 3 stages: Dedupe/Cluster, Publish, Map
We are focused on the first stage
Key ML tasks in the first stage
Find all unique product representations
Cluster/dedupe input representations into unique product representations
Map SKUs to their attributes to allow mapping and aggregations (Brand, Category, Unit of Measurement, etc.)
We also need to
Do it at scale
Continuously update as new data comes in

Approach
Our aim is to find an intermediate representation of the raw product description data that allows us to group descriptions into micro-clusters, which can then be curated and published.
We first perform attribute extraction/role labeling on the raw product descriptions, using Deep Semantic Parsing models together with rule-based models.
We explore various distributional representations of these extracted attributes and use them to find nearest neighbors.
We then use the adjacency graph built from the nearest neighbors to find clusters of SKUs.
We use active learning to continuously update clusters by labeling a small, sparse sample and using the feedback to improve the parsing and clustering models.
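
The step from nearest neighbours to clusters can be sketched as below. This is an illustration under assumptions, not the method used in this work: the proposal does not name a specific clustering algorithm, so a deterministic variant of label propagation is swapped in, and `knn_graph`, `label_propagation` and the dense `similarity` matrix are hypothetical.

```python
from collections import Counter, defaultdict


def knn_graph(similarity, k):
    """Undirected k-nearest-neighbour graph from a dense similarity matrix."""
    n = len(similarity)
    graph = {i: set() for i in range(n)}
    for i in range(n):
        ranked = sorted((j for j in range(n) if j != i),
                        key=lambda j: similarity[i][j], reverse=True)
        for j in ranked[:k]:
            graph[i].add(j)
            graph[j].add(i)
    return graph


def label_propagation(graph, max_rounds=20):
    """Each node repeatedly adopts the most common label among its neighbours.

    Nodes are visited in sorted order and ties go to the smallest label,
    which keeps the result deterministic for this sketch.
    """
    labels = {node: node for node in graph}
    for _ in range(max_rounds):
        changed = False
        for node in sorted(graph):
            if not graph[node]:
                continue  # isolated node keeps its own label
            votes = Counter(labels[nbr] for nbr in graph[node])
            top = max(votes.values())
            best = min(label for label, count in votes.items() if count == top)
            if labels[node] != best:
                labels[node] = best
                changed = True
        if not changed:
            break
    clusters = defaultdict(set)
    for node, label in labels.items():
        clusters[label].add(node)
    return sorted(clusters.values(), key=min)
```

Unweighted label propagation with this tie-breaking tends to merge groups that are joined by even a single bridging edge, so a real pipeline would likely weight edges by similarity or prune weak edges first.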

Details
We parse and label the raw product descriptions using a Deep Semantic Model (BiLSTM-CRF) and domain rules to extract attributes.
We create distributed representations of the extracted attributes.
To avoid N^2 pairwise computations, we calculate nearest neighbours using Locality-Sensitive Hashing on the word vectors.
We improve similarity by using lexical features around the text of the product description.
We use community detection methods to cluster the product descriptions. The idea is that similar product titles will lie in a highly connected region of the graph.
We sample from clusters by picking similar products from different clusters and dissimilar products from the same cluster. We use this feedback to update the parsing and clustering models.
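
To show how hashing sidesteps the N^2 comparison, here is a minimal sketch of one common Locality-Sensitive Hashing family, random hyperplanes for cosine similarity. The proposal does not say which LSH scheme is used, so the scheme itself, the function names and the parameters `n_tables`, `n_bits` and `seed` are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def build_lsh_index(vectors, n_tables=4, n_bits=8, seed=0):
    """Hash every vector into n_tables buckets using random hyperplanes.

    Each bit of a signature records which side of a random hyperplane the
    vector falls on, so vectors with a small angle between them tend to
    share a signature in at least one table.
    """
    rng = random.Random(seed)
    dim = len(vectors[0])
    index = []
    for _ in range(n_tables):
        planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]
        signatures = [
            tuple(sum(p * x for p, x in zip(plane, vec)) >= 0.0 for plane in planes)
            for vec in vectors
        ]
        buckets = defaultdict(set)
        for item, signature in enumerate(signatures):
            buckets[signature].add(item)
        index.append((signatures, buckets))
    return index


def candidate_neighbours(index, item):
    """Items sharing a bucket with `item` in at least one table."""
    found = set()
    for signatures, buckets in index:
        found |= buckets[signatures[item]]
    found.discard(item)
    return found
```

Exact similarities then only need to be computed between an item and its candidates. More bits per table make buckets stricter, and more tables recover neighbours that a single table would miss.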

Requirements

Basics of Machine Learning.

Speaker bio

Master's in mechanical engineering, trained in industrial automation and process control
Lead data scientist with more than 10 years of experience in building ML solutions for manufacturing, automotive, ecommerce and consulting
Multiple conference publications in IEEE and ASME
Instructor and mentor for multiple data science courses