Rootconf Hyderabad edition

On SRE, systems engineering and distributed systems


Achieving repeatable, extensible and self serve infrastructure at Gojek

Submitted by Tasdik Rahman (@tasdikrahman-gojek) on Thursday, 22 August 2019

Section: Full talk (40 mins) Category: Automation


Abstract

In a very short time, GO-JEK has grown into a community of more than one million drivers handling 3 million+ orders every day. To keep supporting this growth, hundreds of microservices run and communicate across multiple data centers to serve the best experience to our customers.

In this talk, we’ll cover our approach to assembling Infrastructure as Code that simplifies the maintenance of an increasingly complex microservices architecture, and how we have enabled developers to maintain their own infrastructure by giving them the tooling to do so, in order to keep up with the increasing scale.

Motivation

Building infrastructure is, without a doubt, a complex problem that evolves over time. Maintainability, scalability, observability, fault tolerance, and performance are some of the aspects that demand improvement over and over again.

One of the reasons it is so complex is the need for high availability. Most components are deployed as clusters, with hundreds of microservices and thousands of machines running. As a result, no one knew what the managed infrastructure looked like, how the running machines were configured, what changes had been made, or how networks were connected to each other. In a nutshell, we were lacking observability into our infrastructure, and when there was a failure in the system, it was hard to tell what could have brought it down.

Time and again, we used to get swamped with service requests such as:
- creating an ES cluster
- creating a RabbitMQ cluster
- creating VMs with boilerplate for the type of application they would host
- increasing the disk size for a box
- creating a Postgres master/slave setup
This made us realise that we were effectively becoming the bottleneck for developers and their pace of going from ideation to production.

Goals

  • Reduce developer toil.
  • Automate repetitive tasks.
  • Have a central config for managing infrastructure, for improved auditability and repeatability.
  • Move to a self-serve model for infrastructure, removing the systems team as a bottleneck.
  • Have templates for the infrastructure that gets created on a day-to-day basis, and automate the systems person out of the creation process.

We have been using Terraform for our IaC in bits and pieces for a while now, but what we were lacking was structure and consistency. Different teams had different repositories. Modules were scattered all over the place or embedded inside the projects themselves. They were complex, and there were lots and lots of bash scripts.
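To sketch the kind of consistency we were after (the module path and arguments below are illustrative, not our actual repositories), a project would consume shared modules from a single versioned source rather than carrying its own copy:

```hcl
# Hypothetical example: consuming a shared, versioned module instead of
# copying module code into every project repository.
module "orders_api_vm" {
  # Pinning to a tag keeps every team on a known, reviewed module version.
  source = "git::https://github.com/example-org/terraform-modules.git//compute/vm?ref=v1.4.0"

  name         = "orders-api"
  environment  = "production"
  machine_type = "n1-standard-4"
  disk_size_gb = 100
}
```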

It was challenging and error-prone to create infrastructure manually and maintain it. We needed to switch from updating our infrastructure by hand to embracing Infrastructure as Code and running it inside our CI/CD platform.

Infrastructure as Code allows you to take advantage of software development best practices. You can safely and predictably create, change, and improve infrastructure. Every network, every server, every database, every open network port can be written in code, committed to version control, peer-reviewed, and then updated as many times as necessary.
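As a minimal sketch (the resource names, zone, and labels here are hypothetical, and a GCP-style provider is assumed purely for illustration), even a single VM expressed as code becomes a reviewable, repeatable artifact rather than a hand-built box:

```hcl
# Hypothetical example: a VM described as code, committed to version control
# and peer-reviewed before being applied.
resource "google_compute_instance" "payments_worker" {
  name         = "payments-worker-01"
  machine_type = "n1-standard-2"
  zone         = "asia-southeast1-a"

  boot_disk {
    initialize_params {
      image = "ubuntu-os-cloud/ubuntu-1804-lts"
    }
  }

  network_interface {
    network = "default"
  }

  labels = {
    team    = "payments"
    service = "payments-worker"
  }
}
```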

Project Olympus is the GOJEK infrastructure engineering team's initiative to solve these problems and achieve the goals mentioned above.

The self-serve model for infrastructure was achieved with Proctor (https://github.com/gojek/proctor), our automation orchestrator. With it, developers can provision the infrastructure they need on their own: the tooling abstracts away the work we previously did manually, such as adding boilerplate, security tools, and packages to machines, setting up logging and monitoring, creating DNS entries for services, creating components (RabbitMQ clusters, ES clusters), increasing disk sizes, and so on.
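To give a flavour of what such a template could look like underneath (the module and its variables below are hypothetical illustrations, not our actual code; in practice Proctor drives the provisioning rather than developers editing Terraform directly), a component like an ES cluster reduces to a handful of inputs:

```hcl
# Hypothetical example: a templated component module that a self-serve
# workflow could fill in, instead of a ticket to the systems team.
module "search_es_cluster" {
  source = "git::https://github.com/example-org/terraform-modules.git//elasticsearch/cluster?ref=v2.1.0"

  cluster_name = "search"
  environment  = "integration"
  node_count   = 3
  machine_type = "n1-highmem-4"
  data_disk_gb = 500

  # Logging, monitoring and DNS wiring live inside the module, so every
  # cluster created from it comes with the same baseline.
  enable_monitoring = true
}
```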

Outline

  • Goals
  • Architecture and discussion around:
    - Olympus, which hosts our cloud infrastructure config
    - Terraform module structuring and how we run it in our CI/CD platform
    - How Proctor (https://github.com/gojek/proctor) helps us achieve the self-serve model for infrastructure
  • Lessons learnt
  • Impact

Speaker bio

Tasdik is a Product Engineer at Gojek, where he works with the systems team. He is a contributor to oVirt (under Red Hat), and before Gojek he was part of the early systems team at Razorpay. He has presented talks at national and international conferences including PyCon Taiwan, DevOpsDays India, and DevConf India. A full list of talks with slides/videos can be found at https://tasdikrahman.me/speaking/.


Comments

  • Anwesha Sarkar (@anweshaalt) Reviewer a month ago

    Hello Tasdik,

    Here is the feedback from today’s rehearsal:

    1. Do not start your talk with a “so”.
    2. Include a separate introduction slide introducing yourself.
    3. The talk ended abruptly.
    4. Need to include a take away slide.
    5. Include a conclusion slide.
    6. Explain proctor with Talina’s questions and use case.
    7. Is the CI/CD aspect of Proctor open-sourced? Mention this in the slide.
    8. Explain the deletion aspect.
    9. Include a demo.
    10. Include the architecture diagram.
    11. Include an end slide having your contact credentials.

    Submit your revised slides by 27th September 2019.

    Regards
    Anwesha

  • Joy Bhattacherjee (@hashfyre) 21 days ago

    Hi Tasdik,

    I felt that there were three parts mashed into one.

    1. Generic organic growth of startups resulting in chaotic disjoint complex-systems
    2. Proctor and its architecture, as a solution to the first set of chaos
    3. Adoption of change and developer education on tooling

    The first section felt a bit non-informational to me, given that it took 10+ mins and 30+ slides to arrive at Proctor (second section).
    The use of memes as a welcome refresher felt overdone, as they were present in the first section, which was low-complexity. I felt they lost their impact through consecutive usage, rather than being used as punctuation in the talk for relief.

    I really enjoyed section 2, Proctor and its architecture and the problem-solving aspects. Maybe also explain how another org might use Proctor, and not only how Gojek uses it. The content is already good, so no feedback there.

    Section 3 again forayed into the territory of opinions rather than engineering, since no solution was offered other than an acknowledgement of the problems faced by most ops folks in every org. I felt this section too could be cut short. I would have liked to hear about what engineering practices Gojek adopts to counter such issues.
