How AppDynamics runs a Multi-tenant Kubernetes+Helm cluster with continuous deployment & monitoring

10 May 2018 (Thu), 08:15 AM – 05:25 PM IST

11 May 2018 (Fri), 08:30 AM – 06:20 PM IST

NIMHANS Convention Centre, Bengaluru

##About Rootconf 2018 and who should attend:

Rootconf is India’s best conference on DevOps, SRE and IT infrastructure. Rootconf attracts systems and operations engineers to share real-world knowledge about building reliable systems.

The 2018 edition is a single track conference. Day 1 – 10 May – features talks on security. Colin Charles (chief evangelist at Percona Foundation), Pukhraj Singh (former national cybersecurity manager at UIDAI), Shamim Reza (open source enthusiast), Alisha Gurung (network engineer at Bhutan Telecom) and Derick Thomas (former network engineer at VSNL and Airtel Bharti) will touch on important aspects of infrastructure, database, network and enterprise security.

Day 2 – 11 May – is filled with case studies and stories about legacy code, immutable infrastructure, root-cause analysis, handling dependencies and monitoring. Talks from Exotel, Kayako, Intuit, Helpshift and DigitalOcean, among others, will help you evaluate DevOps tools and architecture patterns.

If you are a:

DevOps programmer
Systems engineer
Architect
VP of engineering
IT manager

you should attend Rootconf.

Birds of a Feather (BOF) sessions at Rootconf 2018 will cover the following topics:

DevSecOps
Microservices - tooling, architecture, costs and culture
Mistakes that startups make when planning infrastructure
Handling technical debt
How to plan a container strategy for your organization
Evaluating AWS for scale
Future of DevOps

Rootconf is a conference for practitioners, by practitioners.

The call for proposals is closed. If you are interested in speaking at Rootconf events in 2018, submit a proposal here: rootconf.talkfunnel.com/rootconf-round-the-year-2018/

##Venue:

NIMHANS Convention Centre, Lakkasandra, Hombegowda Nagar, Bengaluru, Karnataka 560029.

Schedule, event details and tickets: https://rootconf.in/2018

For more information about Rootconf, sponsorships or outstation events, contact support@hasgeek.com or call 7676332020.

Hosted by

Rootconf

Rootconf is a community-funded platform for activities and discussions on the following topics: Site Reliability Engineering (SRE); infrastructure costs, including cloud costs and their optimization; and security, including cloud security.

Submitted Mar 11, 2018

Section: Full talk

Technical level: Intermediate

AppDynamics develops application performance management (APM) solutions that help resolve problems in highly distributed applications. Our platform dynamically collects millions of performance data points across users’ applications and infrastructure. Scaling our data platform architecture and keeping it reliable and fault-resilient is therefore crucial to the company’s success.

The talk starts with the scale at which our data platform operates and the pain points teams faced as we onboarded more and more customers. It then covers the container orchestration frameworks we evaluated and why we chose Kubernetes. Next, it discusses the design requirements that give each platform subteam enough freedom and isolation to develop a service from scratch and deploy it to production on its own, which significantly increased team velocity. It also describes the workflow we developed to build a PaaS framework that different teams use to run their services, each deployed as just another tenant on a multi-tenant Kubernetes cluster.

The talk then goes through the capabilities given to a newly onboarded team in the cluster. Every team can use the cluster’s common CI/CD pipeline and its alerting, monitoring and logging framework independently of the others. The talk then showcases how we manage a canary-like Kubernetes setup for production deployments. It concludes with the lessons learned while building this workflow and fighting a few production fires.
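
To make the canary idea concrete, here is a minimal Python sketch of weighted routing between two clusters with fallback. It is an illustration only: the proposal does not say how traffic is split, so the function `pick_cluster`, the hash-based bucketing, and the `canary_percent` and `primary_healthy` parameters are all assumptions, not the AppDynamics implementation.

```python
import hashlib


def pick_cluster(request_key: str, canary_percent: int,
                 primary_healthy: bool = True) -> str:
    """Choose "canary" or "primary" for a request (hypothetical helper).

    request_key: any stable identifier, e.g. an account or session id.
    canary_percent: share of traffic (0-100) sent to the canary cluster.
    primary_healthy: when False, all traffic falls back to the canary.
    """
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    if not primary_healthy:
        # The canary doubles as a standby during a primary outage.
        return "canary"
    # Hash the key so the same caller always lands on the same cluster.
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return "canary" if bucket < canary_percent else "primary"


if __name__ == "__main__":
    for key in ("tenant-1", "tenant-2", "tenant-3"):
        print(key, pick_cluster(key, canary_percent=10))
```

Hashing a stable key rather than drawing a random number keeps each caller pinned to one cluster, which makes canary regressions easier to attribute.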

Outline

Introduction
- About the speaker and AppDynamics
Scale at which the platform team operates
- Millions of metrics uploaded per minute
- Reliability and resiliency guarantees
Pain points with the earlier architecture
- Typical limitations of a monolith: team collisions, scalability
- Zero-downtime upgrades were not possible
- Frequent outages
Why Kubernetes?
- Comparison points with other orchestration frameworks
- Why Kubernetes stood out among them
Initial design requirements
- Reduce boilerplate code while running a new service
- Team resource isolation
- Ability to run on AWS and on-premise
- Provide CI/CD, logging, alerting, monitoring capabilities to the teams
Workflow from a code commit to the final deployment
- A pull-request to the Helm chart repository
- The TeamCity build pipeline runs a minikube cluster to verify cluster health with the PR's changes applied
- The chart artifact is uploaded to AWS S3 on a successful PR merge
- An SQS event is triggered by the new S3 insert, which kicks off a Jenkins pipeline
- Jenkins then automatically deploys the Helm chart to the staging environments
- Production deployments remain manual (discussed later)
Alerting, Logging and Monitoring
- How all container logs are collected and sent to Splunk
- How we use Prometheus as well as AppDynamics monitoring to collect all application-level and cluster-level metrics
- How we use Alertmanager to route alerts to Slack, email and PagerDuty
Production canary setup
- How we route traffic between two Kubernetes clusters, one acting as canary and the other as primary
- How it helps in testing production deployments
- How it also acts as a standby to fall back on in case of an outage on the primary
Lessons Learned
- Initial hurdles faced
- Challenges in onboarding new teams
- A few insights into production issues we have seen
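
The commit-to-deployment workflow in the outline can be sketched as the step a deploy job might run when an SQS message announces a new chart in S3. This is a hedged illustration, not the pipeline described in the talk: it assumes S3 notifies SQS directly (so the message body is the standard S3 event JSON), and the key layout `<team>/<chart>-<version>.tgz`, the function names, and the use of the team name as the Kubernetes namespace are hypothetical.

```python
import json
from urllib.parse import unquote_plus


def charts_from_sqs_body(body: str) -> list[tuple[str, str]]:
    """Return (bucket, key) pairs for chart archives newly created in S3."""
    event = json.loads(body)
    charts = []
    for record in event.get("Records", []):
        # Only react to object creation, not deletes or other events.
        if not record.get("eventName", "").startswith("ObjectCreated:"):
            continue
        bucket = record["s3"]["bucket"]["name"]
        # Object keys in S3 event notifications are URL-encoded.
        key = unquote_plus(record["s3"]["object"]["key"])
        if key.endswith(".tgz"):
            charts.append((bucket, key))
    return charts


def helm_command_for(key: str) -> list[str]:
    """Build a helm invocation for a key shaped like <team>/<chart>-<version>.tgz.

    Assumes the archive has already been downloaded to the working
    directory and that the version contains no hyphen.
    """
    namespace, filename = key.split("/", 1)
    release = filename[: -len(".tgz")].rsplit("-", 1)[0]
    return ["helm", "upgrade", "--install", release, filename,
            "--namespace", namespace]


if __name__ == "__main__":
    sample = json.dumps({"Records": [{
        "eventName": "ObjectCreated:Put",
        "s3": {"bucket": {"name": "example-charts"},
               "object": {"key": "team-a/ingest-api-1.4.2.tgz"}},
    }]})
    for _bucket, key in charts_from_sqs_body(sample):
        print(" ".join(helm_command_for(key)))
```

Using `helm upgrade --install` keeps the step idempotent: the same command handles both a team's first release and later upgrades.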

Requirements

Basic knowledge of Kubernetes terms

Speaker bio

Prateek is a Senior Software Engineer at AppDynamics and is part of the platform infrastructure team. His primary responsibilities include designing systems to help teams run their services smoothly on the Kubernetes cluster. He also automates cluster setup on both AWS and on-premise.

Prior to AppDynamics, Prateek completed his bachelor’s degree at IIT Kharagpur and his master’s degree at UT Austin. He has worked at IBM, Flipkart and Yelp.com as an infrastructure engineer.

His interests lie in distributed systems such as Cassandra, Kafka, ZooKeeper and Elasticsearch, and in distributed tracing systems such as Zipkin.