Review of migrating distributed systems infrastructure

Feb 2025

3 Mon

4 Tue 10:00 AM – 11:00 AM IST

5 Wed

6 Thu

7 Fri

8 Sat

9 Sun

Submitted Feb 4, 2025

Review date and time - 4 February 2025, 10 AM - 11 AM
Presenter - Priya Ananthasankar
Reviewers - Talina Shrotiya, Madhusudhan Sambojhu

Summary of the presentation: timely migrations & designing a migration framework

Introduction

Speaker: Priya Ananthasankar, Principal Software Engineer at Microsoft, US.

Disclaimer: Views expressed are personal.
Focus: Migrations within distributed systems, not on-prem to cloud.
Challenge: Moving from self-managed infrastructure to managed services (e.g., AKS).
Key concern: Loss of deep debugging ability in managed services.

Why timely migrations matter

Legacy systems become resilient over time but face constraints (security, patching, etc.).
Migrating to a managed service can be intimidating but is often necessary.
Key challenge: Ensuring the new service functions reliably before full migration.

Methodology for migration

1. Charting the course (A/B experimentation)

A/B Experiment: Route some traffic to the new service (B) while maintaining fallback to the old service (A).
Approaches:
- Client-level A/B test: Direct some requests to B while maintaining A as a fallback.
- Service-level A/B test: Gradually shift traffic to B internally.
- Feature flagging: Enable the new infrastructure for select users.
- Strangler Fig Pattern: Replace infrastructure components progressively.
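
A minimal Python sketch of the client-level A/B test with fallback described above. It is an illustration under stated assumptions, not the presenter's implementation: `call_with_fallback`, `call_new`, `call_old`, and `new_traffic_fraction` are hypothetical names, and a real client would also need timeouts and would fall back only on errors that are safe to replay against the old service.

```python
import random
from typing import Callable, Tuple


def call_with_fallback(
    request: str,
    call_new: Callable[[str], str],   # hypothetical client for service B
    call_old: Callable[[str], str],   # hypothetical client for service A
    new_traffic_fraction: float,      # share of requests sent to B, 0.0 to 1.0
    rng: random.Random,
) -> Tuple[str, str]:
    """Return (response, path), where path is 'new', 'fallback', or 'old'."""
    if rng.random() < new_traffic_fraction:
        try:
            return call_new(request), "new"
        except Exception:
            # B failed: serve the request from A so the caller is unaffected.
            return call_old(request), "fallback"
    # Request was not selected for the experiment: use A directly.
    return call_old(request), "old"
```

Recording the returned path for every request is what would feed a fallback-rate metric: the share of `'fallback'` among requests that were routed to the new service.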

2. Rollout Strategy

Two Approaches:
1. By region size: Start small and scale up.
2. By capacity: Migrate regions where the new infrastructure can handle the load.
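
The two rollout orderings can be sketched in Python as follows. This is a hedged illustration: the `Region` fields, the requests-per-second unit, and the 20% `headroom` factor are assumptions made for the example and were not given in the talk.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Region:
    name: str
    peak_load: int      # assumed unit: requests per second at peak
    new_capacity: int   # what the new infrastructure can serve in this region


def order_by_size(regions: List[Region]) -> List[Region]:
    """Approach 1: migrate the smallest regions first, then scale up."""
    return sorted(regions, key=lambda r: r.peak_load)


def eligible_by_capacity(regions: List[Region], headroom: float = 1.2) -> List[Region]:
    """Approach 2: migrate only regions whose new capacity covers peak load
    with some headroom (the 1.2 factor is an arbitrary example value)."""
    return [r for r in regions if r.new_capacity >= r.peak_load * headroom]
```

The two can also be combined, for example by filtering on capacity first and then ordering the eligible regions by size.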

3. Designing a migration compass

Fallback Metrics: Track how often requests revert to the old service.
Infrastructure Health Probes: Ensure new resources (VMs, containers) maintain stability.
Identifying Noisy Neighbors: Monitor anomalies in resource lifetimes (e.g., short-lived containers).
Service Limits & Retries: Prevent cascading failures from retry storms.
Scaling Constraints: Every resource (IPs, ports) has scale limits—factor them in.
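
To make the "Service Limits & Retries" point concrete, here is a small Python sketch of bounded retries with capped exponential backoff and jitter, a common way to avoid retry storms. The function names, defaults, and the choice of full jitter are assumptions for illustration; the talk did not specify a retry policy.

```python
import random
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def backoff_delays(max_retries: int, base: float = 0.1, cap: float = 5.0) -> List[float]:
    """Upper bounds for each retry delay: base * 2^i, never above cap."""
    return [min(cap, base * 2 ** i) for i in range(max_retries)]


def call_with_bounded_retries(
    op: Callable[[], T],
    max_retries: int = 3,
    base: float = 0.1,
    cap: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    jitter: Callable[[], float] = random.random,
) -> T:
    """Call op, retrying at most max_retries times, then re-raise the error."""
    if max_retries < 0:
        raise ValueError("max_retries must be >= 0")
    delays = backoff_delays(max_retries, base, cap)
    for attempt in range(max_retries + 1):
        try:
            return op()
        except Exception:
            if attempt == max_retries:
                # Retry budget exhausted: surface the failure instead of
                # adding more load to a service that is already struggling.
                raise
            # Randomize the wait so many clients do not retry in lockstep.
            sleep(delays[attempt] * jitter())
    raise AssertionError("unreachable")
```

The hard cap on attempts is the important part: without it, every failed request multiplies the load on the new service exactly when it is least able to absorb it.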

Execution & completion criteria

Monitor dashboards tracking fallback rates and performance.
Gradually increase traffic to new infrastructure.
Define exit criteria: Migration is complete when:
- Fallbacks reach zero.
- SLA parity is achieved with the old system.
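
The exit criteria above could be checked along these lines. This Python sketch rests on assumptions that the summary does not state: "SLA parity" is interpreted here as p99 latency within a tolerance of the old system and an error rate no worse than the old system, and the `WindowStats` fields and the 5% tolerance are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WindowStats:
    """Metrics for one observation window (for example, one day)."""
    total_requests: int
    fallback_requests: int
    new_p99_ms: float
    old_p99_ms: float
    new_error_rate: float
    old_error_rate: float


def fallback_rate(w: WindowStats) -> float:
    """Share of requests in the window that reverted to the old service."""
    return w.fallback_requests / w.total_requests if w.total_requests else 0.0


def migration_complete(windows: List[WindowStats], latency_tolerance: float = 1.05) -> bool:
    """True only if every window shows zero fallbacks and SLA parity."""
    if not windows:
        return False  # no evidence yet
    for w in windows:
        if w.total_requests == 0 or w.fallback_requests > 0:
            return False
        if w.new_p99_ms > w.old_p99_ms * latency_tolerance:
            return False
        if w.new_error_rate > w.old_error_rate:
            return False
    return True
```

Requiring the condition to hold across several consecutive windows, rather than at a single instant, guards against declaring completion during a quiet period.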

Conclusion

The framework is generalized to allow adaptation across different migrations.
Open to discussion on refining the migration “compass” approach.

Talina’s feedback

Overall Feedback

State the purpose upfront: the talk is about migrating the orchestration of infrastructure from self-managed to fully managed, and it does not cover data migration or infrastructure state migration.
Mention keywords during the talk that help highlight control plane and data plane capabilities and roles during the migration.
Stress the technical details while walking through the methodology: dig deeper into each step, ideally with concrete examples.
Add transition slides to ensure the story-telling is smooth.
Add a conclusion slide to help close off the talk. This slide should provide insights on how you were able to migrate to a fully-managed service in your experience.

Slide 1: Why migrations are intimidating

List the actual reasons why migrations are intimidating, such as:
- Running in production and impacting customers
- Business dependency on the system
- Multiple versions and states to manage
- Edge cases and SLAs to consider
Consider splitting this slide to elaborate on these problems in more detail to set the stage for the migration challenges.

Slide 2: A/B experiment approaches

Name each approach explicitly (e.g., Client-based A/B Experiment).
Guide the audience on when to use each approach versus when not to use it.
While keeping it generic, allow the audience to identify the best fit for their use case.

At this point, the talk skimmed through a lot of diagrams without giving the audience time to understand the technical aspects. Spend more time explaining each use case in detail.

Migration orchestration: self-managed to fully managed

Clearly explain that the key change is from self-managed orchestration to a fully managed service.
Emphasize that the data plane remains unchanged, while the control plane is now orchestrated automatically.
Use an example (e.g., EC2 with Docker containers) to illustrate how the orchestration is handled.

Observability and monitoring

Mention the architecture of control plane and data plane communication.
Explain how metrics, liveness, and health checks are sent from deployed infrastructure.
When discussing the monitoring dashboard, clarify how multiple EC2 instances are monitored at the control plane level.
Use the term “control plane” more often to reinforce its role in the architecture.
Also cover consumer-based metrics that allow the consumer of the fully managed service to monitor its behavior and performance. This highlights how you overcame the challenge that the consumer no longer controls the orchestration of the infrastructure and has to rely on the managed service to do so.

Conclusion: highlighting impact

Mention the impact and value derived from personal experience.
Use this to strengthen the conclusion with a personal touch.

Overall feedback

It was a well-structured and insightful presentation.
Additional refinements on clarity and framing of key concepts would enhance audience understanding.

Madhusudhan’s feedback

General presentation feedback

Strengths: Good presentation pace and clarity in delivery.
Improvement Area: The talk was too abstract at times, which caused a disconnect for a diverse audience.
Suggestion: Balance generic explanations with specific examples to clarify the scope.

Defining migration scope clearly

Many audience members were unclear about what was being migrated.
Clarify upfront:
- Are we migrating a system, database, or monolithic architecture?
- Are we moving from REST APIs to Lambda functions?
- Is the focus on containerization or broader migration strategies?
The talk primarily referenced container-based migrations—either explicitly broaden the scope or structure the discussion under a clear umbrella.

Incremental versus full migration

Call out whether the migration applies to:
- Incremental migrations: Suitable for application services and API endpoints.
- Full-shot migrations: Necessary for databases where all data must be transferred before switching over.
Defining these distinctions early would help avoid confusion.

A/B Testing and decision framework

More time should be spent on explaining A/B testing and decision-making frameworks.
Include stronger examples:
- What exactly is being tested?
- When should one approach be chosen over another?
- How do we determine the best fit for different scenarios?

Serverless migration terminology disconnect

The talk included elements of serverless migration but never explicitly mentioned “serverless.”
Fix: Ensure the terminology aligns with audience expectations, especially when discussing serverless containers.

Suggested improvements

Introduce concrete examples to illustrate general migration concepts:
- Noisy neighbor

Pramod Biligiri’s feedback and questions

General Feedback

Strengths: Good presentation.
Improvement Area: A running example (even a made-up one) would help make concepts more concrete.

Architecture considerations

Does this migration strategy only work for stateless services, or can it handle stateful services interacting with a database?
If a service writes to a database, how should the migration be designed to accommodate that?

Infrastructure health and observability

Some metrics were unclear—what insights were learned along the way?
What would someone miss if they didn’t track these metrics?
A/B testing discussion was interesting—highlight unintuitive insights that people might overlook.

Perspective of the end user

The talk focused on metrics from the maintainer’s point of view.
Consider also addressing:
- How users switching from self-managed to fully managed services perceive and measure health.
- How consumers of the fully managed system can track failure scenarios without direct infrastructure control.

Exit criteria and decommissioning

The term “exit criteria” was unclear—does it mean project completion or just migration completion?
Clarify decommissioning strategy:
- When and how to shut down old infrastructure.
- How to ensure zero fallback before decommissioning.
- Define what “done” means in a practical sense.

Observability vs. offline reconciliation

Observability gives real-time insights, but is there periodic reconciliation?
Suggestion: Use offline checks or QA (e.g., data comparison, request validation) to ensure migration integrity.
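
One way the offline data comparison suggested here might look is sketched below in Python. The shape of the data (records keyed by an ID, exported from the old and the new system for the same period) and the `reconcile` function are assumptions for illustration only.

```python
from typing import Dict, List


def reconcile(old_records: Dict[str, str], new_records: Dict[str, str]) -> Dict[str, List[str]]:
    """Compare two exports keyed by record ID and report the differences."""
    old_keys, new_keys = set(old_records), set(new_records)
    return {
        # Present in the old system but never reached the new one.
        "missing_in_new": sorted(old_keys - new_keys),
        # Present only in the new system (possible duplicates or stray writes).
        "unexpected_in_new": sorted(new_keys - old_keys),
        # Present in both, but with different contents.
        "mismatched": sorted(
            k for k in old_keys & new_keys if old_records[k] != new_records[k]
        ),
    }
```

A periodic job running such a comparison would complement real-time dashboards by catching silent divergence that does not show up as an error or a fallback.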

Fallbacks and complexity

The fallback mechanism was not trivial and could benefit from more explanation.
Fallbacks within the same system are hard—switching back to the older system adds even more complexity.
Possible latency and SLA impacts should be discussed.
Some organizations handle fallback via an Ops team manually triggering the rollback—consider including this scenario.

Suggested improvements

Use a concrete example (even hypothetical) to illustrate key migration challenges.
Clarify terminology around exit criteria and decommissioning.
Expand fallback discussion with practical examples and real-world cases.
Balance observability with offline reconciliation for a more robust migration strategy.

Final thoughts

Overall, a well-presented talk.
Addressing these points would improve clarity and make the concepts more actionable for different audiences.

Apeksha Contractor’s feedback

Apeksha works primarily on infrastructure and was interested in how the migration strategy applies to infrastructure components like:

Databases
NGINX (or equivalent connectivity gateways)
Queuing systems

Migration challenges for infrastructure components

Unlike stateless services, infrastructure components require full migration—partial migrations are not always feasible.
Key concerns:
- How queuing systems or other infra components fit into the migration strategy.
- Handling cases where on-premise and cloud components cannot function in parallel (e.g., databases).
- Guardrails needed before fully committing to migration.

Exit criteria & decommissioning for infra

Asked about validation steps before decommissioning infrastructure.
Backup & rollback strategies:
- How to handle latency issues or failures that require a rollback to on-premise.
- How long to keep legacy infrastructure running before fully decommissioning.
- Ensuring live replication from cloud to on-premise in case of rollback.

Suggestions for presentation improvements

Make migration methodology more expressive:
- Use icons in slides to visually represent each migration step.
- Add transition slides between steps to smooth the flow.
Clarify step-by-step process:
- Define the key steps engineers follow when migrating to fully managed systems.
- Highlight the interim state where self-managed and fully managed infra coexist.
- Explain how observability ensures a smooth transition before deprecating the old system.

Final thoughts

Apeksha emphasized the difference between application and infrastructure migration and the need for distinct strategies.
Suggested improvements to presentation structure for better clarity and engagement.

Hybrid access (members only)

Hosted by

Rootconf

We care about site reliability, cloud costs, security and data privacy