Anthill Inside 2019

A conference on AI and Deep Learning

Unsupervised Catalog Generation with Clustering, Reinforcement and More

Submitted by Govind Chandrasekhar (@gc20) on Apr 5, 2019

Status: Rejected

Abstract

This presentation looks at how to generate a product catalog from an ecommerce website given only the website's homepage URL. Techniques explored include URL clustering, regex generation, reinforcement learning and supervised classification.

Outline

Presentation structure:

  • Intro: What the problem is, why it’s useful and its roots in the Semantic Web movement.

  • Identifying Product URLs: The need to identify product pages from just their URLs. Using URL signatures + clustering + regex generation + supervised classification to solve this problem.

  • Spidering Strategy: Optimal strategy for spidering through the website to find product URLs, using reinforcement learning techniques.

  • Content Extraction: Techniques for extracting structured data from HTML + rendered webpages, notably through the use of bounding boxes. Then, we look at identifying and extracting product variations through the use of headless browsers.
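
To make the "Identifying Product URLs" step a little more concrete, here is a minimal sketch of one way URL signatures, clustering and regex generation could fit together. It is not taken from the talk: the tokenisation rules, the placeholder names (`<num>`, `<alnum>`, `<slug>`), the function names and the `shop.example.com` URLs are all assumptions made for illustration, and the supervised classification step that would decide which cluster holds product pages is left out.

```python
import re
from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical placeholder tokens and the regex fragment each one expands to.
PLACEHOLDER_PATTERNS = {"<num>": r"\d+", "<alnum>": r"[^/]+", "<slug>": r"[^/]+"}


def url_signature(url):
    """Reduce a URL path to a coarse signature, one token per path segment."""
    tokens = []
    for segment in urlparse(url).path.split("/"):
        if not segment:
            continue  # skip empty pieces from leading/trailing slashes
        if segment.isdecimal():
            tokens.append("<num>")    # purely numeric ID
        elif any(ch.isdigit() for ch in segment):
            tokens.append("<alnum>")  # mixed text and digits, e.g. a slug with an ID
        elif "-" in segment:
            tokens.append("<slug>")   # hyphenated text without digits
        else:
            tokens.append(segment)    # keep plain words as literals
    return tuple(tokens)


def cluster_by_signature(urls):
    """Group URLs that share the same signature (exact-match clustering)."""
    clusters = defaultdict(list)
    for url in urls:
        clusters[url_signature(url)].append(url)
    return dict(clusters)


def signature_to_regex(signature):
    """Turn a signature into a compiled regex over URL paths."""
    parts = [PLACEHOLDER_PATTERNS.get(tok, re.escape(tok)) for tok in signature]
    return re.compile("^/" + "/".join(parts) + "/?$")


# Made-up URLs, as if collected while spidering a site.
urls = [
    "https://shop.example.com/product/blue-shirt-123",
    "https://shop.example.com/product/red-shoes-456",
    "https://shop.example.com/category/shoes",
    "https://shop.example.com/category/shirts",
    "https://shop.example.com/about",
]

clusters = cluster_by_signature(urls)
product_regex = signature_to_regex(("product", "<alnum>"))
```

In this sketch, the two `/product/...` URLs fall into one cluster, and the regex generated from that cluster's signature can then be applied to the paths of unseen URLs. A real system would likely need fuzzier clustering than exact signature matching, for example so that `/category/shoes` and `/category/shirts` end up together.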

Speaker bio

Govind is a co-founder of Semantics3. Semantics3 offers data- and AI-based enterprise solutions for ecommerce marketplaces (catalog generation & enrichment, seller on-boarding) and logistics companies (HTS/tariff classification, attribute enrichment). We’re a 7+ year old Y Combinator-backed startup based in Bengaluru, San Francisco and Singapore.

Our data-science team works on problems like product categorization, product matching, named entity recognition and unsupervised content extraction.

Slides

https://docs.google.com/presentation/d/e/2PACX-1vSQoXGj0ZxG8tkWR-47oqABSsWjCq0rrecVlUgHtaHl9104FNImSHsUDp6h5IVJG9wJAHv_KNwt4-EK/pub?start=false&loop=false&delayms=3000

Preview video

https://www.youtube.com/watch?v=azZ8EW8F0cI
