The Fifth Elephant winter edition 2019

Winter edition of India's most renowned conference on big data and data science

The Fifth Elephant is rated as India’s best conference on big data, data science and application of data to real-life use cases.

It is a conference for practitioners, by practitioners. The Fifth Elephant completed its seventh edition in Bangalore, on 26 and 27 July 2018. The Bangalore edition caters to data and ML engineers, architects, technologists, data scientists, product managers, researchers and business decision-makers.

Hosted by

The Fifth Elephant, known as one of the best data science and machine learning conferences in Asia, has transitioned into a year-round forum for conversations about data and ML engineering, data science in production, and data security and privacy practices.

Kumar Puspesh

@puspesh

Data Pipeline on Day 1 of your Startup: Cost and Scale sensitive!

Submitted Nov 2, 2018

We are a gaming company making mass-market social games. In a consumer market where user experience is the key, we had to rely heavily on data from Day 1 of every game or product launch. This is why we built our data infrastructure in parallel with our games and products and had it ready for production use from the very beginning. We relied heavily on ready-to-use systems but, being a startup, also had to be cost-sensitive. Setting up a full data lake and a heavy-duty HDFS cluster was ruled out because of the cost and maintenance overhead. Instead, we set up a lightweight data collection pipeline that feeds central queues, from which data is ingested in real time into our warehouse of choice, Redshift (chosen for its ease of use). Scaling such a system also has cost overheads as the product grows, so we had to design data retention and data querying capabilities such that we neither pay hefty bills nor are limited in querying real-time data from our users.
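The collection step described above (a lightweight client pushing events to a central queue) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the pipeline presented in the talk: the class name `ThinStatsClient`, the `batch_size` threshold, the event fields, and the use of an in-process `queue.Queue` as a stand-in for the central queue are all hypothetical.

```python
import json
import queue
import time


class ThinStatsClient:
    """Hypothetical thin client: buffers events and flushes them in batches."""

    def __init__(self, sink, batch_size=3):
        self.sink = sink              # stand-in for the central queue (assumption)
        self.batch_size = batch_size  # hypothetical flush threshold
        self._buffer = []

    def track(self, name, **fields):
        # Record one event; flush automatically once the buffer is full.
        event = {"event": name, "ts": time.time(), **fields}
        self._buffer.append(event)
        if len(self._buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        # Send all buffered events as one newline-delimited JSON payload.
        if not self._buffer:
            return
        payload = "\n".join(json.dumps(e, sort_keys=True) for e in self._buffer)
        self.sink.put(payload)
        self._buffer = []


central_queue = queue.Queue()
client = ThinStatsClient(central_queue, batch_size=3)

# Four events with batch_size=3: one automatic flush, then one manual flush.
for level in range(1, 5):
    client.track("level_complete", user_id=42, level=level)
client.flush()

batches = []
while not central_queue.empty():
    batches.append(central_queue.get())
```

Batching on the client keeps the number of writes to the queue low, which is one plausible way to stay cost-sensitive at high event volumes. A downstream consumer would then drain the queue and load the batches into the warehouse.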

Outline

Rough outline

  1. Business requirements
  2. Use case
  • Having a scalable system for data ingestion
  • Data design - specific or generic?
  • Querying interface - why stick to SQL?
  • Query interface users - skills, requirements and expectations
  3. Data ingestion
  • High-throughput stats service
  • Thin client: Badger
  • High-throughput ingestion backend
  • Hot loading to Redshift
  4. Data warehousing
  • Data design in Redshift and data lake
  • Tuning for scale
  • Taking care of the querying patterns of product managers and data scientists
  5. S3 as data lake
  • On-demand data loading and querying: OnDemand Table(s)
  • Gotchas
  • Flexibility for complicated analysis: ad hoc Redshift cluster(s)
  • Gotchas
  6. Scaling up
  • Typical bottlenecks and solutions we tried
  7. Learnings

Speaker bio

I am Kumar Puspesh, CTO and Co-Founder of Moonfrog, India’s top mobile gaming company. We had to design a large-scale data infrastructure from day 1 of our company to cater to our product needs. A cost-sensitive yet scalable approach helped us achieve large scale as a gaming company in India in a short amount of time. It also taught us many ingenious ways of building large-scale infrastructure customized for the business and its users (rather than buying a generic paid solution and then changing our usage and requirements to fit it).
