The Fifth Elephant round the year submissions for 2019

Submit a talk on data, data science, analytics, business intelligence, data engineering and ML engineering

Websites to Datasets

Submitted by Aayushi Pathak (@09aayushi) (proposing) on May 9, 2019

This is a proposal requesting someone to speak on this topic. If you’d like to speak, leave a comment.

Session type: Short talk of 20 mins
Status: Rejected


As a provider of Competitive Intelligence as a Service to eCommerce businesses and consumer brands, DataWeave aggregates and analyses product catalog data from eCommerce websites each day at massive scale. Once aggregated, this data is fed into a complex process of extraction, transformation, machine learning, and analyses. These operations are performed on a consistent basis to provide our customers with easily consumable and actionable insights.
To be precise, we aggregate over 200 million data points across 2000+ web sources to deliver 200+ reports each day.
The web sources span multiple verticals, including eCommerce, travel, classified listings, and mobile apps. Building a generic aggregator that collects data from many websites across many domains is a significant challenge.


Why do we need data aggregators?
A Singapore-based VC firm wants to analyze how its portfolio businesses are performing in India and wants to crawl the mobile apps of a few of those businesses, along with those of their competitors.
An investment firm in the US wants to take stock of how a web-based B2C business is growing every quarter, thereby enabling an informed buy/sell decision before an earnings call.
A brand wants to analyze its share of voice on eCommerce websites and track pricing violations on online marketplaces.
Smart organizations are looking for ways to capture and store data (both internal and third-party) at scale, process it efficiently, and generate actionable insights consistently. This talk will shed light on how we aggregate data at massive scale and convert unstructured web data into consumable insights. We will also cover several problems we encounter along the way and how we solve them.

Talk flow:
1. Evolution of the platform
2. Data Collection at scale
* horizontal scaling
* politeness policy
* bot blocking
3. Mobile App crawling
4. Correctness and completeness
5. Types of datasets
* Images
* Managing the datasets

Speaker bio

Mithun, Data Architect, DataWeave

I work as an architect in the data platforms team at DataWeave, a provider of Competitive Intelligence as a Service for eCommerce businesses and consumer brands. I design and manage data aggregation at scale, which involves writing crawlers, extracting structured data, and more.
I have 10 years of experience in the software industry, with extensive experience in building web crawlers for complex web environments.



  • Abhishek Balaji (@booleanbalaji) a year ago

    Hello Aayushi/Mithun,

    Thank you for submitting a proposal. To proceed with evaluation, we need to see detailed slides for your proposal. Your slides must cover the following:

    • Problem statement/context, which the audience can relate to and understand. The problem statement has to be a problem (based on this context) that can be generalized for all.
    • What were the tools/options available in the market to solve this problem? How did you evaluate alternatives, and what metrics did you use for the evaluation?
    • Why did you pick the option that you did?
    • Explain how the situation was before the solution you picked/built and how it changed after you implemented it. Show before-after scenario comparisons & metrics.
    • What compromises/trade-offs did you have to make in this process?
    • What is the one takeaway that you want participants to go back with at the end of this talk? What is it that participants should learn/be cautious about when solving similar problems?
    • Is the tool free/open-source? If not, what can the audience take away from the talk?

    We need to see the updated slides on or before 21 May in order to close the decision on your proposal. If we do not receive an update by 21 May, we’ll move the proposal for consideration at a future event.

    • Aayushi Pathak (@09aayushi) Proposer a year ago

      @Abhishek, the slide has been updated. Kindly review it.

