Anthill Inside 2019

A conference on AI and Deep Learning


Unsupervised Catalog Generation with Clustering, Reinforcement and More

Submitted by Govind Chandrasekhar (@gc20) on Friday, 5 April 2019



This presentation will look at how you can generate product catalogs from ecommerce websites using just the homepage URL of the website. Techniques explored include URL clustering, regex generation, reinforcement learning and supervised classification.


Presentation structure:

  • Intro: What the problem is, why it’s useful and its roots in the Semantic Web movement.

  • Identifying Product URLs: The need to identify product pages from just their URLs. Using URL signatures + clustering + regex generation + supervised classification to solve this problem.

  • Spidering Strategy: Optimal strategy for spidering through the website to find product URLs, using reinforcement learning techniques.

  • Context Extraction: Techniques for extracting structured data from HTML + rendered webpages, notably through the use of bounding boxes. Then, we look at variation identification and extraction through the use of headless browsers.
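
As a rough illustration of the "URL signatures + clustering + regex generation" idea in the outline above, here is a minimal Python sketch. It is not the speaker's implementation: the token classes (`<num>`, `<id>`, `<slug>`), the function names and the `shop.example.com` URLs are all assumptions made for this example. The supervised classification step, which would decide which clusters correspond to product pages, is left out.

```python
import re
from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical token classes; a real system would likely use a richer set.
PLACEHOLDER_PATTERNS = {
    "<num>": r"\d+",               # segment made only of digits
    "<id>": r"[^/]*\d[^/]*",       # segment containing at least one digit
    "<slug>": r"[^/\d]*-[^/\d]*",  # hyphenated segment without digits
}


def url_signature(url: str) -> str:
    """Collapse a URL path into a coarse signature of segment types."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    tokens = []
    for segment in segments:
        if re.fullmatch(r"\d+", segment):
            tokens.append("<num>")
        elif re.search(r"\d", segment):
            tokens.append("<id>")
        elif "-" in segment:
            tokens.append("<slug>")
        else:
            tokens.append(segment)  # keep short literal segments as-is
    return "/" + "/".join(tokens)


def cluster_urls(urls):
    """Group URLs that share the same signature."""
    clusters = defaultdict(list)
    for url in urls:
        clusters[url_signature(url)].append(url)
    return dict(clusters)


def signature_to_regex(signature: str) -> re.Pattern:
    """Turn a signature into a regex that matches URL paths in that cluster."""
    parts = [
        PLACEHOLDER_PATTERNS.get(token, re.escape(token))
        for token in signature.split("/")
        if token
    ]
    return re.compile("^/" + "/".join(parts) + "/?$")


if __name__ == "__main__":
    # Made-up URLs for illustration only.
    sample_urls = [
        "https://shop.example.com/p/blue-cotton-shirt/12345",
        "https://shop.example.com/p/red-wool-scarf/67890",
        "https://shop.example.com/category/mens-clothing",
        "https://shop.example.com/category/womens-shoes",
        "https://shop.example.com/about",
    ]
    for signature, members in cluster_urls(sample_urls).items():
        print(signature, len(members), signature_to_regex(signature).pattern)
```

Once a cluster has been labelled as "product" (by a classifier or by hand), its generated regex could be used to test newly discovered URLs without fetching the pages, which is what makes the URL-only approach attractive for steering a crawler.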

Speaker bio

Govind is a co-founder of Semantics3. Semantics3 offers data- and AI-based enterprise solutions for ecommerce marketplaces (catalog generation & enrichment, seller on-boarding) and logistics companies (HTS/tariff classification, attribute enrichment). We’re a 7+ year old, Y Combinator-backed startup based in Bengaluru, San Francisco and Singapore.

Our data-science team works on problems like product categorization, product matching, named entity recognition and unsupervised content extraction.




  • Anwesha Sarkar (@anweshaalt) 6 months ago

    Thank you for submitting the proposal. Please submit your slides and preview video by 20th April at the latest; this helps us close the review process.

    • Govind Chandrasekhar (@gc20) Proposer 6 months ago

      Haven’t been able to get to this yet. Aiming for tomorrow. Hope that’s fine!

  • Zainab Bawa (@zainabbawa) Reviewer 5 months ago (edited 5 months ago)

    Here are the comments from the review of the slides and preview video:

    1. In its current form, the proposed talk is shallow. It needs to go 1-2 levels deeper in each section. For example, explain how URL clustering really works, or how reinforcement learning is used to find a spidering strategy that uncovers the structure of the catalog.
    2. For each section, explain: What is the problem? What was the thought process for solving it, and why did you choose the approach that you eventually used? What didn’t work, and why? What worked?
    3. Alternatively, repurpose the proposal for data engineers by showing how you do scraping at scale, the challenges involved, and the tooling and thinking behind your approach. Moving away from ML engineering to data engineering will bring more focus to the proposal.

    We’ll need revised slides, incorporating the above comments, on or before 21 May, in order to close the decision on the proposal.

    • Govind Chandrasekhar (@gc20) Proposer 5 months ago

      Thanks for the feedback Zainab. Slides updated.
      I’ve gone in-depth for the 3 key algorithms, and included instances of alternative approaches that did/didn’t work out.
      Didn’t want to go down the data engineering path. I think there’s more of a story with the ML ideas, and how they come together to solve the overarching business problem.
