Capacity planning at scale
Submitted by Raj Shekhar (@lunatech) on Friday, 10 March 2017
Full talk of 40 mins duration
Depending on your scale and architecture, capacity planning could mean gigabytes of RAM, cloud VMs, or physical machines at your colo.
For companies with a large and dedicated user base, getting capacity planning wrong can cause outages and subsequently lead to loss of trust. Provisioning excess capacity, on the other hand, leads to inflated bills and high operating costs.
This talk covers how to understand the services being offered, the effect a service has on downstream services, how to assess the risks, and how to plan for failures.
Thursday Night Football games have a large audience in the USA. In 2016, Twitter acquired the rights to stream some of these games. The Site Reliability team worked with the engineers to make sure the streaming occurred without issues.
Twitter services are built using a microservices architecture, and each service relies on multiple other services to get its work done. In such a large distributed system, capacity planning becomes a major challenge for site reliability.
This talk covers the challenges the Site Reliability team tackled when setting up the infrastructure:
- Determine the primary drivers for increased load
- Determine correlation (and hopefully causality) between the traffic and load on other microservices
- Risk analysis and Plan B
- Verification of the capacity planning
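The correlation step above can be sketched as a simple least-squares fit between the primary traffic driver and the load it induces on a downstream service. This is a minimal illustration, not the team's actual tooling; the traffic numbers and the 30% headroom factor are made-up assumptions.

```python
# Illustrative sketch: project downstream load from front-end traffic.
# All numbers below are hypothetical, not figures from the talk.

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return slope, mean_y - slope * mean_x

# Observed front-end request rate (req/s) vs. downstream RPC rate (req/s),
# e.g. sampled from load tests at increasing traffic levels.
frontend = [1000, 2000, 3000, 4000]
downstream = [3100, 6050, 9020, 12080]  # roughly 3 RPCs per request

slope, intercept = fit_line(frontend, downstream)

# Project downstream load at the expected event peak, plus 30% headroom.
peak_frontend = 10_000
projected = slope * peak_frontend + intercept
required_capacity = projected * 1.3
print(f"fan-out ~{slope:.2f}x, provision for {required_capacity:.0f} req/s")
```

A fit like this only captures correlation; verifying the projection with load tests against the real services (the last step above) is what separates a plan from a guess.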
This talk is for people who want to learn how to do capacity planning for a new feature launch, and about the tools that can be used for analysis and capacity verification.
I have been working as a site reliability engineer since 2005, first at Yahoo! and currently at Twitter. I worked on the launch of NFL live video streaming in 2016.