Rootconf 2017

On service reliability

Anshu Prateek

@anshprat

Life @ Freecharge on November 9th - An SRE perspective.

Submitted Jan 20, 2017

In this talk we will look at the various challenges SRE @ Freecharge faced in the wake of the demonetization campaigns: scaling, monitoring, releases and, above all, working towards making SRE itself redundant!

Outline

November 8th will be a date every Indian will remember, at least for this decade. Payment companies seem to have been at the forefront of the race to take India to digital money. And the pace at which various changes have happened can be compared to that of an F1 race! In an F1 race, the pit crew needs to work at its best to ensure that the driver can win. We as SRE often play the role of pit crew, mechanic, R&D and a lot more.


Capacity planning

A review of the existing capacity for various key components - login, wallet, other backends - and how we added capacity where required (login services traffic and utilization jumped 2x overnight). Some other backends saw up to a 3-4x traffic increase. We will see how the various backends were scaled, and how we decided between horizontal and vertical scaling. (5 minutes)
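As a rough illustration of the sizing arithmetic behind such a review, here is a minimal sketch. It assumes load scales linearly with traffic and is spread evenly across instances; the function name, the 60% target utilization and the example fleet sizes are hypothetical and not Freecharge's actual figures.

```python
import math


def instances_needed(current_instances, current_utilization,
                     traffic_multiplier, target_utilization=0.6):
    """Estimate how many instances are needed after a traffic jump.

    Assumes (hypothetically) that load grows linearly with traffic and
    is balanced evenly across identical instances.
    """
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    # Total work expressed in "fully busy instance" units after the jump.
    load = current_instances * current_utilization * traffic_multiplier
    # Round before ceil() so float noise cannot add a spurious instance.
    return math.ceil(round(load / target_utilization, 6))


# Hypothetical fleet: 10 login nodes at 40% utilization, traffic doubles.
print(instances_needed(10, 0.4, 2.0))
# Hypothetical backend: 10 nodes at 30% utilization, traffic grows 4x.
print(instances_needed(10, 0.3, 4.0))
```

If the resulting count is impractical for a stateful component, that is one signal that vertical scaling or an architectural change may be the better option.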


Load testing

The numbers that we used to see only during heavy campaigns became a thing of every afternoon (organically!). Throughput was getting capped at a certain xxxx requests at the topmost layer. These topmost-layer calls are further amplified at the various backend layers. We needed to find and fix these chokepoints. (3 mins)
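The amplification effect can be sketched as follows. This is a simplified model, assuming each backend layer receives a fixed number of calls per frontend request; the layer names, fan-out factors and capacities are hypothetical placeholders, not measured values.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    name: str
    fan_out: float       # calls to this layer per frontend request (assumed)
    capacity_rps: float  # requests/second the layer can sustain (assumed)


def find_chokepoints(frontend_rps, layers):
    """Return (name, load, capacity) for every layer pushed past capacity."""
    over = []
    for layer in layers:
        load = frontend_rps * layer.fan_out
        if load > layer.capacity_rps:
            over.append((layer.name, load, layer.capacity_rps))
    return over


def max_frontend_rps(layers):
    """Highest frontend rate that no layer's amplified load exceeds."""
    return min(layer.capacity_rps / layer.fan_out for layer in layers)


# Hypothetical stack: the shared datastore sees 4.5 calls per request.
layers = [
    Layer("frontend", 1, 5000),
    Layer("wallet", 2, 8000),
    Layer("datastore", 4.5, 9000),
]
print(find_chokepoints(3000, layers))
print(max_frontend_rps(layers))
```

In this toy model the layer with the lowest capacity-to-fan-out ratio sets the ceiling for the whole stack, which is why a deep, heavily shared backend can cap frontend throughput well before the frontend itself is saturated.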


Architectural changes

We found (or rather, were already aware of) the first bottleneck at the last layer, which runs on top of a mongo replica. This setup is used by various services across the company. As a result, it sees 4-5x amplification of the frontend traffic. We looked at various ways to resolve the issue, since code changes would take time. We will discuss the various options that we reviewed, the one that we zeroed in on and how we got it up and running within 18 hours. Also, how an effort to save 2 hours ended up adding 6 more hours to the operation! (15 mins)


Oncall/outages/response/COE/Postmortems

We took a hard look at the combined results of the load testing and the first campaign after that. It led to a company-wide exercise of capacity review and more architectural optimizations. (5 mins)

Monitoring - The load testing and outages above highlighted the already known gaps in monitoring. We will discuss what these gaps were, how they impacted us and how we are working on closing them. (2 mins)

Security - With increased visibility, attacks on other fintech companies increased as well. We reviewed and strengthened our security setups.

Tuning - One specific example where we reduced the latency from 200ms to 1.77ms!

What next... Docker! We will see why we are working towards using Docker containers.

And the end goal - to make SRE redundant! (How and why?!)
(10 mins)

Speaker bio

XY!, Ex-Aerospike, Ex-Reliance Jio and now at Freecharge. I've seen and worked at scales of all levels - from thousands of machines, to millions of TPS in sub-millisecond time, to working on the world's largest startup targeting a Billion+ people! These various experiences are helping me ensure that Freecharge remains the fastest wallet out there!

Slides

https://docs.google.com/presentation/d/1wdpSS4jAc4crKDPq0LKnGhRp093cffhv23ciYSoDs8o/edit?usp=sharing

