SRE Conf 2023

Availability and reliability 24/7: the SRE life

Schedule for the conference on 24 November is published.

Why SRE Conf?

When an organization moves from product-market fit or a beta phase to production rollout, or grows from its first x customers to 10x or 100x customers, it typically starts running into challenges with system stability and resiliency. These challenges change with every phase of growth. So does the need for an SRE team and/or a DevOps team, and the role these teams play.
Unfortunately, there is no one-size-fits-all answer to what roles these teams should play, or which tools they should use to track the metrics and processes involved. But some common building blocks apply, in similar and different forms, to most teams. The idea of SRE Conf is to come together to learn about these building blocks, and to share and learn about the themes that fall under the SRE umbrella.

SRE Conf tracks

SRE Conf is a two-track conference. The “Culture, career, and evolution” track focuses on leadership, team, and organizational topics, while the “Stories from the trenches” track covers real-world scenarios and lessons learned, helping engineers and engineering teams upskill by learning from the experiences of their industry peers.

Culture, career, and evolution

  1. SRE vs. DevOps vs. Platform Engineering teams in organizations.
  2. Hiring and building SRE teams.
  3. Blameless postmortems.
  4. Role of AI in SRE/DevOps/Platforms.
  5. FinOps and cost optimization.
  6. SRE anti-patterns.

Stories from the trenches

  1. Incident management.
  2. Change management.
  3. Scalability and performance.
  4. SLA/SLO and golden signals.
  5. Security and DevSecOps.
  6. Systems and networking.

Key takeaways for participants

  1. Improved understanding of organizational needs and requirements.
  2. Enhanced optimization skills.
  3. Networking opportunities.
  4. Knowledge sharing and community building.

Who should participate

  • Members of SRE, DevOps, or platform teams.
  • Software developers or managers responsible for services running on any cloud platform or in an on-prem data center.
  • Technology leaders of engineering teams that manage critical systems requiring minimal to zero downtime.

Speaking

If you are interested in speaking at the conference, submit your talk idea here. The editors - Sarika Atri, Safeer CM and Saurabh Hirani - will review your talk description and give feedback.

Speakers will also receive feedback and assistance during rehearsals from past speakers such as Sitaram Shelke.

Guidelines for speaking, speaker honorarium policy, and travel grant policy details are published here.

About the editors

The conference themes were set up by Sarika Atri and Safeer CM. Together with Saurabh Hirani, they have:

  1. Reviewed the talks.
  2. Set up the editorial workflow.
  3. Finalized talk selections.
  4. Curated the schedule.

Sarika Atri is a Software Architect with over twenty years of experience in the industry. Sarika was a reviewer for the Rootconf Cloud Costs Optimization conference held in July 2023.
Safeer CM is a Senior Staff SRE at Flipkart. He is the author of Architecting Cloud-Native Serverless Solutions, published by Packt.
Saurabh Hirani is a former editor of Rootconf and a passionate member of the community. Saurabh is an SRE at Last9.io. He has a keen interest in mentoring speakers.

Become a Rootconf Member to join

SRE Conf is a community-funded conference. It will be held in-person. Attendance is open to Rootconf members only. Support this conference with a membership. If you have questions about participation, post a comment here.

Sponsorship

Sponsorship slots are open for:

  1. Tool and solutions providers.
  2. Companies interested in tech branding for hiring.

Email sponsorship queries to sales@hasgeek.com

Contact information

Join the Rootconf Telegram group at https://t.me/rootconf or follow @rootconf on Twitter.
For inquiries, contact Rootconf at +91-7676332020.

Hosted by

Rootconf is a community-funded platform for activities and discussions on the following topics: Site Reliability Engineering (SRE); infrastructure costs, including cloud costs and their optimization; and security, including cloud security.

Chinmay Naik

@chinmay_naik

Lessons learned while managing a Terabyte scale database for three nines of uptime

Submitted Oct 16, 2023

Title

Lessons learned while self-managing a Terabyte scale database for three nines of uptime

Abstract

Managing uptime for stateless systems is relatively easy: you can scale (horizontally and vertically) by throwing more hardware at them. However, maintaining the uptime and reliability of stateful systems (such as databases) is hard. This talk covers some lessons learned while managing production databases with terabytes of data to achieve three nines of uptime.
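
For context, "three nines" refers to a 99.9% availability target. The sketch below shows the downtime budget that target implies. The function name and the choice of a 30-day month and a 365-day year are illustrative assumptions, not details from the talk.

```python
def downtime_budget_minutes(availability_percent: float, period_days: float) -> float:
    """Return the downtime (in minutes) allowed by an availability target.

    availability_percent: target such as 99.9 for "three nines".
    period_days: length of the measurement window in days (assumed here).
    """
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    period_minutes = period_days * 24 * 60
    # The unavailable fraction of the window is the error budget.
    return period_minutes * (100 - availability_percent) / 100


if __name__ == "__main__":
    for label, days in (("30-day month", 30), ("365-day year", 365)):
        minutes = downtime_budget_minutes(99.9, days)
        print(f"99.9% over a {label}: about {minutes:.1f} minutes of downtime")
```

At 99.9%, this works out to roughly 43 minutes per 30-day month, or about 8.8 hours per year, which is the total budget for incidents and planned maintenance combined.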

I had to manage the uptime and scalability of a self-managed production MySQL cluster for a fintech company (Flip.id). The transactional MySQL database cluster was 1.5 TB in size and grew by 6 GB per day. There were six database nodes, each with 32 vCPUs, 128 GB RAM, and 2 TB disks.
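
The figures above give a rough sense of the disk runway. The sketch below is a back-of-the-envelope estimate only. It assumes decimal units (1 TB = 1000 GB), linear growth, that each node holds a full copy of the data, and that nothing else (binary logs, temporary tables, backups) uses the disk. The function name and the `headroom_fraction` parameter are hypothetical.

```python
import math


def days_until_full(disk_capacity_gb: float,
                    current_size_gb: float,
                    daily_growth_gb: float,
                    headroom_fraction: float = 0.0) -> int:
    """Estimate whole days until data reaches the usable disk capacity.

    headroom_fraction reserves part of the disk (for example 0.2 keeps
    20% free), which is an assumption for illustration.
    """
    if daily_growth_gb <= 0:
        raise ValueError("daily_growth_gb must be positive")
    usable_gb = disk_capacity_gb * (1 - headroom_fraction)
    remaining_gb = usable_gb - current_size_gb
    if remaining_gb <= 0:
        return 0  # already at or past the usable limit
    return math.floor(remaining_gb / daily_growth_gb)


if __name__ == "__main__":
    # Numbers from the abstract: 2 TB disks, 1.5 TB of data, 6 GB/day growth.
    print(days_until_full(2000, 1500, 6))                         # no headroom
    print(days_until_full(2000, 1500, 6, headroom_fraction=0.2))  # keep 20% free
```

Under these assumptions the disks would fill in about 83 days, and in about 16 days if 20% of each disk is kept free, which suggests why disk scalability appears among the challenges.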

Some of the challenges I had to solve:

  • Observability and uptime monitoring
  • Scalability (disks, compute, etc.)
  • Read-write traffic routing across various DB nodes
  • Schema migrations
  • Managing database security
  • Controlling replication lag
  • Making data available for analytics use cases
  • Numerous prod incidents related to database uptime and performance

Each of the points above is worthy of a talk in itself. However, I’ll cover all of these challenges and how we solved them in our case. We started with very limited observability and gradually transitioned to three nines of uptime for the database. We eventually moved from a self-managed cluster to a managed cloud service (GCP’s CloudSQL), but that story is for another time. 😀

What’s in it for you?

You’ll learn tools and patterns that you can apply in your own work if you’re managing any stateful system (databases, queues, etc.).

You’ll learn the importance of:

  • System decoupling (I’ll cover ProxySQL and how it helped us decouple components)
  • Operationally simple tools (gh-ost and how it simplified schema migrations for us)
  • Dev-prod parity and testing your approaches at prod scale in your staging environment
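
To illustrate the kind of decoupling a proxy layer can provide, here is a toy read-write router in plain Python. It is not ProxySQL configuration and not the setup described in the talk. The class names, node names, and the 5-second lag threshold are all hypothetical. The point is only that the application asks the router for a node and never needs to know the cluster topology.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    is_primary: bool = False
    replication_lag_seconds: float = 0.0  # hypothetical measured lag


class ReadWriteRouter:
    """Send writes to the primary and reads to replicas with acceptable lag."""

    def __init__(self, nodes, max_lag_seconds: float = 5.0):
        primaries = [n for n in nodes if n.is_primary]
        if len(primaries) != 1:
            raise ValueError("expected exactly one primary node")
        self.primary = primaries[0]
        self.replicas = [n for n in nodes if not n.is_primary]
        self.max_lag_seconds = max_lag_seconds
        self._counter = 0  # round-robin position across healthy replicas

    def route(self, sql: str) -> Node:
        statement = sql.lstrip().upper()
        # Naive classification: plain SELECTs are reads, everything else
        # (including SELECT ... FOR UPDATE) goes to the primary.
        is_read = statement.startswith("SELECT") and "FOR UPDATE" not in statement
        if not is_read:
            return self.primary
        healthy = [r for r in self.replicas
                   if r.replication_lag_seconds <= self.max_lag_seconds]
        if not healthy:
            return self.primary  # fall back rather than serve stale reads
        node = healthy[self._counter % len(healthy)]
        self._counter += 1
        return node


if __name__ == "__main__":
    cluster = [
        Node("db1", is_primary=True),
        Node("db2", replication_lag_seconds=1.0),
        Node("db3", replication_lag_seconds=30.0),  # lagging, skipped for reads
    ]
    router = ReadWriteRouter(cluster)
    print(router.route("INSERT INTO payments VALUES (1)").name)
    print(router.route("SELECT * FROM payments").name)
```

Because routing decisions live in one place, nodes can be added, removed, or taken out of rotation when they lag without changing application code, which is the decoupling benefit in miniature.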

There will be a lot of diagrams and storytelling instead of just bullet point slides for you to read. I recently spoke at RubyConf India about lessons for managing trade-offs between over-engineering and the Big-ball-of-mud when building software systems. I am attaching the video of that talk to give the review committee an idea about how I speak.
