Rootconf 2017

On service reliability

Capacity planning at scale

Submitted Mar 10, 2017

Depending on your scale and architecture, capacity planning could mean sizing gigabytes of RAM, cloud VMs, or physical machines at your colo.

For companies with a large and dedicated user base, getting capacity planning wrong can cause outages and, in turn, a loss of user trust. Provisioning excess capacity leads to an inflated bill and high operating costs.

This talk covers how to understand the services being offered and the effect each service has on downstream services, how to assess the risks, and how to plan for failures.

Outline

Thursday Night Football games have a lot of viewers in the USA. In 2016, Twitter acquired the rights to stream some of these games. The Site Reliability team worked with the engineers to make sure that the streaming went off without issues.

Twitter services are built using a microservices architecture, and each service relies on multiple other services to get its work done. In such a large distributed system, capacity planning becomes a major challenge for site reliability.

This talk presents the challenges that the Site Reliability team tackled when setting up the infrastructure:

  • Determine the primary drivers for increased load
  • Determine correlation (and hopefully causality) between the traffic and load on other microservices
  • Risk analysis and Plan B
  • Verification of the capacity planning
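
As a rough illustration of the correlation step, the sketch below computes a Pearson correlation and a least-squares line between front-end traffic and the load seen by one downstream service, then projects the downstream load at a higher traffic level. It is not the method or tooling used at Twitter; the function names, the sample numbers, and the 8000 requests-per-second target are all hypothetical, and a linear relationship is an assumption that has to be checked against real data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length, at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / sqrt(var_x * var_y)


def fit_line(xs, ys):
    """Least-squares slope and intercept for ys as a linear function of xs."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length, at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        raise ValueError("cannot fit a line when xs is constant")
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov / var_x
    return slope, mean_y - slope * mean_x


# Hypothetical measurements: front-end requests/s and the requests/s
# observed on one downstream service over the same intervals.
frontend_rps = [1000, 2000, 3000, 4000, 5000]
downstream_rps = [2100, 4050, 6100, 7900, 10050]

r = pearson(frontend_rps, downstream_rps)
slope, intercept = fit_line(frontend_rps, downstream_rps)

# Projected downstream load at a hypothetical peak of 8000 requests/s.
projected = slope * 8000 + intercept
print(f"correlation={r:.3f} projected_downstream_rps={projected:.0f}")
```

A high correlation only suggests that the downstream service scales with front-end traffic; it does not establish causality, and the projection should be verified with a load test before capacity is committed.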

This talk is for people who want to learn how to do capacity planning for a new feature release, and which tools can be used for analysis and capacity verification.

Speaker bio

I have been working as a site reliability engineer since 2005, first at Yahoo! and currently at Twitter. I worked on the launch of NFL Live Video in 2016.
