Rootconf Mini 2024

Geeking out on systems and security since 2012

Nidhi Agarwal

Enhancing resiliency through CI/CD at Zomato: Advanced Automation and Real-Time Safeguards

Submitted Oct 30, 2024

Abstract

At Zomato’s scale, the CI/CD pipeline must support 500+ engineers and handle 600+ deployments across 300+ services every day. Efficient pipelines are critical for streamlining development and ensuring secure deployments.

In this session, we’ll cover how we revamped our CI/CD setup using self-hosted GitHub Actions to meet these demands.

We’ll explore:

Problems with the existing setup on AWS CodePipeline and CodeBuild

  • Dependency on the SRE team: Adding a new CI check or deployment pipeline requires spawning AWS resources, making the process dependent on SRE intervention.
  • Fragmented workflow: Developers need to switch between GitHub and AWS to trigger or monitor CI checks and deployments while also managing separate access controls for AWS.
  • Newcomers face a challenging learning curve due to the complexity of the setup.
  • Lack of trigger visibility and traceability. It was also difficult to customize common pipelines or accept override inputs from individual services.
  • The absence of canary deployments was a major limitation, requiring us to deploy only during low-traffic periods.
  • Missing features like Revert, Auto Abort, Manual Approval, etc.

How we orchestrated our self-hosted GitHub Actions infrastructure

  • The full job lifecycle, from placing jobs on runners to monitoring and alerting on failures.
  • Avoiding resource wastage with a controller that maintains the pool of runners, and leveraging spot instances without affecting the developer experience.
  • A fully private architecture that allows internal service communication and supports integration tests.
  • Observability: runner-, workflow-, and job-level monitoring.
  • Custom features we built to improve job runtimes and developer experience, for example package caching, proto caching, notifications on failures, and dynamic resource allocation to jobs.
  • Next Steps:
    • Docker Image Caching
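
The idea of a controller that maintains the runner pool can be illustrated with a minimal sketch. This is not Zomato's implementation: the `PoolState` fields, the `desired_change` function, and the `min_idle` and `max_total` parameters are all hypothetical, and the sketch assumes the controller can observe how many runners are idle or busy and how many jobs are queued.

```python
from dataclasses import dataclass


@dataclass
class PoolState:
    """Hypothetical snapshot of the runner pool at one point in time."""
    idle_runners: int   # runners registered and waiting for work
    busy_runners: int   # runners currently executing a job
    queued_jobs: int    # jobs waiting for a runner


def desired_change(state: PoolState, min_idle: int = 2, max_total: int = 50) -> int:
    """Return how many runners to add (positive) or remove (negative)."""
    total = state.idle_runners + state.busy_runners
    # Aim for enough idle runners to absorb the queue plus a small warm buffer.
    target_idle = state.queued_jobs + min_idle
    delta = target_idle - state.idle_runners
    if delta > 0:
        # Scale up, but never beyond the pool cap.
        return max(0, min(delta, max_total - total))
    # Scale down idle runners only, so running jobs are never interrupted.
    return -min(-delta, state.idle_runners)
```

For instance, with 1 idle runner, 10 busy runners, and 5 queued jobs, this sketch asks for 6 more runners. With 10 idle runners and an empty queue, it asks to remove 8. Removing only idle runners is one way to keep scale-down from disturbing developers' running jobs.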

An overview of the new CI/CD flow and how the developer experience has improved, covering these features:

  • Unified Platform: No more tool-switching between GitHub and AWS CodeBuild.
  • How we used reusable workflows to run customizable linters in services written in different languages.
  • Overview of merge and release workflows, including developer forks, branch management (dev, master branches), and the pull request flow.
  • Canary Deployment support
  • Continuous Monitoring of Services during deployment, with automated rollback or manual decisions to abort or continue in case of errors.
  • One-click Revert
  • Deploying services to multiple accounts and regions within the same deployment.
  • Dynamic deployment access management
  • Hotfix deployments
  • Next Steps:
    • Standardize build spec
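
The continuous-monitoring item above (automated rollback, or a manual decision to abort or continue) can be sketched as a simple decision function. This is an illustrative assumption, not the logic used at Zomato: the error-rate inputs, the `warn_ratio` and `abort_ratio` thresholds, and the `Decision` names are all hypothetical.

```python
from enum import Enum


class Decision(Enum):
    CONTINUE = "continue"    # keep rolling the canary forward
    ASK_HUMAN = "ask_human"  # pause and wait for a manual abort/continue choice
    ROLLBACK = "rollback"    # abort automatically and restore the old version


def canary_decision(baseline_error_rate: float,
                    canary_error_rate: float,
                    warn_ratio: float = 1.5,
                    abort_ratio: float = 3.0,
                    min_error_rate: float = 0.001) -> Decision:
    """Compare the canary's error rate with the baseline's (hypothetical thresholds)."""
    # Ignore noise when the canary is essentially error-free.
    if canary_error_rate <= min_error_rate:
        return Decision.CONTINUE
    # Floor the baseline so a zero baseline does not cause division by zero.
    ratio = canary_error_rate / max(baseline_error_rate, min_error_rate)
    if ratio >= abort_ratio:
        return Decision.ROLLBACK
    if ratio >= warn_ratio:
        return Decision.ASK_HUMAN
    return Decision.CONTINUE
```

In this sketch, a canary whose error rate is several times the baseline is rolled back without waiting for anyone, while a moderate regression is escalated to a person, mirroring the split between automated rollback and manual decisions described in the list.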

Key Takeaways

  • Building orchestration for self-hosted GitHub Actions and observability around it.
  • Enhancing the CI/CD experience for developers while ensuring robustness.

Target Audience

  • DevOps and SRE engineers interested in CI/CD automation.
  • Anyone planning to implement self-hosted GitHub Actions orchestration for their workflows.
