This tech talk proposes to dive into the evolution of Tecton’s real-time compute stack, a journey that started with sidecar processes, moved through serverless architecture, and ultimately matured into a native service deployed on virtual machines (VMs). The session will (hopefully) outline the challenges, lessons learned, and engineering decisions made at each stage.
I’d like to have the following rough sections:
- Tecton’s Realtime Data Stack: Introduce Tecton and its real-time data processing requirements for machine learning (ML) and feature serving, including how Tecton executes user-defined post-processing code in real-time feature retrieval APIs (an illustrative sketch of such a transformation appears after this outline).
- Initial Architecture - Sidecar Process: Explain how the compute stack initially relied on a sidecar model, what this architecture entailed, and its advantages in simplicity and quick iteration during early-stage development.
- Resource Contention: Describe the limitations of running sidecars alongside primary services, especially concerning resource isolation, security posture, network latencies, and scaling issues.
- Operational Complexity: Describe how managing large-scale, sidecar-based microservices introduced operational overhead and complexity.
- Poor customer experience: Tecton dictated the environment and available libraries. Users couldn’t customize them because the environment was baked into our service code.
- The Appeal of Serverless: Discuss the decision to explore serverless functions (AWS Lambda) to handle real-time compute, reducing the overhead of managing servers and improving cost efficiency.
- We utilized AWS Lambda to execute the aforementioned user-defined post-processing code (a minimal handler sketch appears after this outline).
- We also extended our usage of AWS Lambda to build an API-driven ingestion service.
- Benefits: Elastic scaling, more flexible and user-managed Python environments for post-processing code, and a better security posture.
- Limitations: Cold starts, poor performance even with warm starts, concurrency limits, debugging complexity, and vendor lock-in. These challenges affected our real-time SLAs and generally led to a poor user experience.
- What did we get wrong about serverless?
- For customers, performance is paramount.
- In our case, with a strict SLA but diverse workloads, serverless implementations are a non-starter. The variance in latencies, even on the happy path, is too large.
- The cost of Lambda functions is actually higher than that of a lean service serving the same workload, since execution times were higher on Lambda.
- Embedding state in Lambda (using Lambda layers) is high friction. Users’ changes take multiple minutes to be reflected in their production workloads, which is too long (see the layer-publishing sketch after this outline).
- The Shift to Native Services: After outgrowing serverless solutions, explain the move towards a native service on VMs. This section will cover why VMs were chosen over containers or Kubernetes in this specific case.
- Performance Gains: Dramatically reduced latencies, better control over resource allocation, and more predictable performance for real-time ML feature serving, all while providing the same (in fact, a better) user product experience.
- Operational Improvements: Describe how the switch to VMs simplified monitoring, debugging, and scaling at the infrastructure level.
- Cloud Portability: With the right abstractions in place, we were able to deploy these services to GCP with very little lift.
- Architectural Trade-offs: Key trade-offs between these architectures (sidecars vs. serverless vs. VMs), as we interpreted them.
- Product lessons learned
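
To make the first section concrete, here is a minimal sketch of the kind of user-defined post-processing that runs at feature retrieval time. This is illustrative only; the function name, feature names, and payload shapes below are hypothetical, not Tecton's actual SDK.

```python
# Illustrative only: a hypothetical user-defined post-processing function of the kind
# a feature platform runs at retrieval time. All names and payload shapes are made up.
from typing import Any, Dict


def post_process(features: Dict[str, Any], request: Dict[str, Any]) -> Dict[str, Any]:
    """Combine precomputed feature values with request-time data into final features."""
    # Derive a ratio feature from two precomputed aggregates, guarding against division by zero.
    clicks = features.get("user_clicks_7d", 0)
    impressions = features.get("user_impressions_7d", 0)
    ctr_7d = clicks / impressions if impressions else 0.0

    # Use a request-time value (e.g. the candidate item's price) alongside stored features.
    price = request.get("item_price", 0.0)
    affordable = price <= features.get("user_median_order_value", 0.0)

    return {"ctr_7d": ctr_7d, "is_affordable": affordable}


if __name__ == "__main__":
    # Example invocation with hypothetical retrieved features and a request payload.
    print(post_process(
        {"user_clicks_7d": 12, "user_impressions_7d": 400, "user_median_order_value": 30.0},
        {"item_price": 19.99},
    ))
```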
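
For the serverless section, a minimal sketch of how such user code might be wrapped in an AWS Lambda handler; the event shape and module name are assumptions for illustration, not our production interface.

```python
# Illustrative only: a minimal AWS Lambda handler wrapping the user's post-processing
# function. The event shape and the user_code module are hypothetical.
import json

from user_code import post_process  # hypothetical module containing the user's function


def handler(event, context):
    # The feature server sends retrieved feature values plus request-time data;
    # the Lambda runs the user's transformation and returns the result.
    features = event.get("features", {})
    request_data = event.get("request", {})
    result = post_process(features, request_data)
    return {"statusCode": 200, "body": json.dumps(result)}
```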
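
And to make the Lambda-layers friction concrete, a sketch (using boto3) of what every user environment change implies: re-zipping the environment, publishing a new layer version, updating the function configuration, and waiting for the change to propagate. The function name, layer name, and zip path are placeholders, not our actual setup.

```python
# Illustrative only: the update path when user state lives in a Lambda layer.
# Names and paths are placeholders.
import boto3

lambda_client = boto3.client("lambda")

# 1. Publish a new layer version containing the user's updated environment/code.
with open("user_env.zip", "rb") as f:
    layer = lambda_client.publish_layer_version(
        LayerName="user-postprocessing-env",
        Content={"ZipFile": f.read()},
        CompatibleRuntimes=["python3.9"],
    )

# 2. Point the serving function at the new layer version.
lambda_client.update_function_configuration(
    FunctionName="realtime-postprocessing",
    Layers=[layer["LayerVersionArn"]],
)

# 3. Wait for the configuration update to finish before traffic sees the change.
lambda_client.get_waiter("function_updated").wait(FunctionName="realtime-postprocessing")
```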