10 steps to build-your-own data pipeline - for day 1 of your startup

This submission has been added to the schedule

Submitted Jan 14, 2019

Session type: Full talk of 40 mins

We are a gaming company making mass-market social games. In a consumer market where user experience is key, we had to rely heavily on data from day 1 of every game or product launch. That is why we built our data infrastructure in parallel with our games and products and had it ready for production use from the very beginning. We relied heavily on ready-to-use systems, but as a startup we also had to be cost sensitive. Setting up a full data lake and a heavy-duty HDFS cluster was ruled out because of cost and maintenance overhead. Instead, we set up a lightweight data collection pipeline that feeds central queues, from which data is ingested in real time into Redshift, our warehouse of choice (chosen for its ease of use). Scaling such a system also carries cost overheads as the product grows, so we had to design data retention and data querying so that we neither pay hefty bills nor are limited in querying real-time data from our users.

Outline

Be clear about Requirements and Constraints

Having a scalable system for data ingestion
Data design (Specific or Generic)
Querying interface - why stick to SQL?

Take time to Design Data

Walking through an example of generic table design

Sort out Data production part first

Identify all possible data producers (and understand their requirements). In our case:
  Android/iOS app
    Cannot keep sending each event over the network
    Cannot lose data even if the app crashes or is killed
    Keep data collection out of the context of the application itself
  Microservice(s)
    Cannot keep sending each event over the network
    Keep data collection agnostic of the microservice itself
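
The producer requirements above (batch instead of sending every event, and survive a crash or kill) can be illustrated with a small sketch. This is not the pipeline described in the talk; it only assumes that events are appended to a local file before anything else and shipped in batches. The class name `DurableEventBuffer`, the `send` callback, and the file layout are all hypothetical.

```python
import json
import os
from typing import Callable, List


class DurableEventBuffer:
    """Write each event to local disk first, then ship events in batches."""

    def __init__(self, path: str, batch_size: int,
                 send: Callable[[List[dict]], bool]):
        self.path = path
        self.batch_size = batch_size
        self.send = send  # hypothetical uploader; returns True on success
        # Events left behind by a previous run (for example after a crash)
        # are counted so they go out with the next batch.
        self._count = len(self._pending())

    def _pending(self) -> List[dict]:
        if not os.path.exists(self.path):
            return []
        with open(self.path, "r", encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]

    def track(self, event: dict) -> None:
        # Persist before doing anything else, so a crash after this
        # point does not lose the event.
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(event) + "\n")
            f.flush()
            os.fsync(f.fileno())
        self._count += 1
        if self._count >= self.batch_size:
            self.flush()

    def flush(self) -> int:
        """Send everything on disk as one batch; keep the file on failure."""
        batch = self._pending()
        if not batch or not self.send(batch):
            return 0
        os.remove(self.path)
        self._count = 0
        return len(batch)
```

Because the application only calls `track`, the collection logic stays separate from the app or microservice code, which is the spirit of the last requirement in each list.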

Design v1.0 of Data pipeline

How and why we chose “anti-pattern”

Choose/Design Data warehouse

Data design in Redshift
Compression ON for certain columns
Tuning for scale
Taking care of Querying patterns of Product Managers and Data scientists
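
As a rough illustration of per-column compression and key choices in Redshift, the sketch below builds a `CREATE TABLE` statement from a column list. The table name, the column names (`attr_1`, `value_1`, and so on), and the chosen encodings are assumptions made for this example, not the schema from the talk. The statement uses Redshift's `ENCODE`, `DISTKEY`, and `SORTKEY` clauses.

```python
from typing import List, Optional, Tuple

# (column name, SQL type, compression encoding or None for Redshift's default)
Column = Tuple[str, str, Optional[str]]


def build_create_table(table: str, columns: List[Column],
                       distkey: str, sortkeys: List[str]) -> str:
    """Return a Redshift CREATE TABLE statement with per-column encodings."""
    col_lines = []
    for name, sql_type, encoding in columns:
        line = f"    {name} {sql_type}"
        if encoding:
            line += f" ENCODE {encoding}"
        col_lines.append(line)
    return (
        f"CREATE TABLE {table} (\n"
        + ",\n".join(col_lines)
        + "\n)\n"
        + f"DISTKEY ({distkey})\n"
        + f"SORTKEY ({', '.join(sortkeys)});"
    )


# Hypothetical generic event table: a few fixed columns plus generic
# attribute/value slots that any event type can reuse.
EVENT_COLUMNS: List[Column] = [
    ("event_time", "TIMESTAMP", "raw"),      # leading sort key, left uncompressed
    ("user_id", "BIGINT", "zstd"),
    ("event_name", "VARCHAR(64)", "bytedict"),
    ("attr_1", "VARCHAR(256)", "zstd"),
    ("value_1", "BIGINT", "zstd"),
]

DDL = build_create_table("events", EVENT_COLUMNS,
                         distkey="user_id",
                         sortkeys=["event_time", "user_id"])
```

Sorting on time and distributing on the user is one plausible fit for queries that filter by date range and aggregate per user; the right keys depend on the actual querying patterns of the product managers and data scientists.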

Open up: Enable many Data Interfaces

On demand Data loading and querying: OnDemand Table(s)
Flexibility for complicated analysis: Adhoc redshift cluster(s)

Understand, Tune & Repeat

Optimize for Usage

Added more columns at generic level e.g.
More examples

Optimize for Cost & Ops

Retention policies of data
Not all events are of same importance
But all events should be accessible if required
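
One way to read the retention points above is as a per-event policy: important events stay in the warehouse longer, and nothing is deleted outright, only moved to cheaper storage from which it can be loaded on demand. The sketch below assumes exactly that; the event names, retention periods, and tier names are made up for illustration.

```python
from datetime import date

# Hypothetical retention periods (in days) for events kept in the warehouse.
RETENTION_DAYS = {
    "purchase": 365,
    "session_start": 90,
    "debug_click": 7,
}
DEFAULT_RETENTION_DAYS = 30


def storage_tier(event_name: str, event_date: date, today: date) -> str:
    """Return where an event should live: 'warehouse' or 'archive'.

    Events are never dropped; older ones move to the archive so they
    remain accessible if required.
    """
    keep_for = RETENTION_DAYS.get(event_name, DEFAULT_RETENTION_DAYS)
    age_days = (today - event_date).days
    return "warehouse" if age_days <= keep_for else "archive"
```

For example, with `today = date(2019, 7, 1)`, a `debug_click` from `date(2019, 6, 1)` would be archived, while a `purchase` from the same day would still be in the warehouse.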

Upgrade to v2.0 of Data pipeline

Speaker bio

I am Kumar Puspesh, CTO and Co-Founder of Moonfrog, India’s top mobile gaming company. We had to design large-scale data infrastructure from day 1 of our company to cater to our product needs. A cost-sensitive yet scalable approach helped us reach large scale as a gaming company in India in a short amount of time. It also taught us many ingenious ways of building large-scale infrastructure customized for the business and its users (rather than buying a generic paid solution and then changing our usage and requirements to fit it).

Slides

https://docs.google.com/presentation/d/1qYkGQLzK8UO-f59TFgkj1GROAmFLkotoQWMZNYXJbFQ/edit#slide=id.g4a9ee349ba_2_75

The Fifth Elephant 2019