Lessons from building and operating thousands of customer-specific models at Atlassian

Submitted May 30, 2025

I am submitting for: Speaking at the Fifth Elephant 2025 Annual Conference
Type of submission: 30 mins talk
Choose the topic your submission falls under: Data & ML Infrastructure track

In the complex world of 24/7 SaaS, operations such as alert on-call and incident management are crucial. The variety of tools, architectural choices (such as monoliths vs. microservices), and operational practices makes a universal approach impractical.

Atlassian’s Jira Service Management (JSM) supports thousands of customers with diverse tech stacks and stringent privacy policies, presenting unique challenges. Each enterprise customer needs customized models trained on specific data, while ensuring real-time performance, reliability, and cost-effectiveness at scale.

At Atlassian, we’ve developed a scalable framework that trains, stores, deploys, and serves thousands of customer-specific models with sub-second latency.

Session Overview

In this session, we will explore Atlassian’s scalable machine learning framework that trains, stores, deploys, and serves customer-specific models in real time. We’ll walk through how our platform is built to support enterprise-grade operations, from automated model training and efficient lifecycle management to real-time inference and robust observability. Using our Alert Grouping case study as an example, we’ll illustrate how this framework has powered machine learning use cases in AIOps to reduce on-call fatigue and improve incident detection.

ML Platform Capabilities

Customer-Specific Model Training – Each customer’s unique data patterns are met with tailored models that adapt autonomously and securely, all without compromising strict privacy standards.

Model Lifecycle Management – Leveraging Atlassian’s data lake and dedicated high-capacity clusters, our system continuously processes millions of operational data points and performs bulk training operations while supporting smooth model retrieval and updates.

Real-Time Inference – By focusing on lightweight, pattern-based inference models, we achieve sub-second latency across tenants, even during peak loads. Tenant-level caching and data-residency-compliant stores further improve performance and reliability.
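To make the tenant-level caching idea concrete, here is a minimal sketch of a least-recently-used cache keyed by tenant ID. It is an illustration only, not the framework's actual implementation: the class name `TenantModelCache`, the `loader` callback (standing in for a fetch from a region-appropriate model store), and the default capacity are all assumptions.

```python
from collections import OrderedDict
from typing import Callable, Generic, TypeVar

M = TypeVar("M")


class TenantModelCache(Generic[M]):
    """LRU cache keyed by tenant ID; calls `loader` on a cache miss."""

    def __init__(self, loader: Callable[[str], M], capacity: int = 1000) -> None:
        # `loader` is hypothetical: it stands in for fetching a tenant's
        # model from whatever store holds it.
        self._loader = loader
        self._capacity = capacity
        self._models: "OrderedDict[str, M]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, tenant_id: str) -> M:
        if tenant_id in self._models:
            # Mark this tenant as most recently used.
            self._models.move_to_end(tenant_id)
            self.hits += 1
            return self._models[tenant_id]
        self.misses += 1
        model = self._loader(tenant_id)
        self._models[tenant_id] = model
        if len(self._models) > self._capacity:
            # Evict the least recently used tenant's model.
            self._models.popitem(last=False)
        return model

    def invalidate(self, tenant_id: str) -> None:
        """Drop a tenant's cached model, e.g. after it has been retrained."""
        self._models.pop(tenant_id, None)
```

Keeping hit and miss counters on the cache makes it straightforward to report cache effectiveness per serving node without logging any tenant data.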

Observability and Governance – Robust monitoring tools track latency, accuracy, and model lineage in real time without exposing sensitive data. Automated retraining, triggered by significant data drift, ensures compliance and sustained performance.
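As a sketch of what drift-triggered retraining can look like, the example below computes the Population Stability Index (PSI) between a baseline sample and a recent sample of one numeric feature, and flags retraining when it exceeds a threshold. The proposal does not say which drift measure is used, so PSI, the function names `psi` and `should_retrain`, the bin count, and the 0.2 threshold (a common rule of thumb) are all assumptions made for illustration.

```python
import math
from typing import List, Sequence


def psi(baseline: Sequence[float], current: Sequence[float], bins: int = 10) -> float:
    """Population Stability Index of `current` against `baseline`."""
    lo, hi = min(baseline), max(baseline)
    # Equal-width bins over the baseline range; fall back to width 1.0
    # when every baseline value is identical.
    width = (hi - lo) / bins or 1.0

    def fractions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Clamp out-of-range values into the first or last bin.
            counts[min(max(idx, 0), bins - 1)] += 1
        eps = 1e-6  # avoids log(0) for empty bins
        return [max(c / len(values), eps) for c in counts]

    b, c = fractions(baseline), fractions(current)
    return sum((ci - bi) * math.log(ci / bi) for bi, ci in zip(b, c))


def should_retrain(
    baseline: Sequence[float], current: Sequence[float], threshold: float = 0.2
) -> bool:
    """Assumed policy: retrain only when PSI exceeds the threshold."""
    return psi(baseline, current) > threshold
```

A check like this can run per tenant on aggregate feature statistics, so the decision to retrain does not require anyone to inspect raw customer data.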

Case Study: Alert Grouping

Using our framework, we drive smart incident management with Alert Grouping, a feature combining machine learning and generative AI to cluster alerts and deliver actionable insights for on-call engineers. The results speak for themselves:

• Over 2,000 enterprise customers benefit from compact, meaningful alert groups.
• Intra-group compactness averages 0.96, signifying high clustering quality.
• Optimized training (p90 at 1.025 hours) with retraining activated only upon relevant drift ensures efficiency and scalability.
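The proposal does not define how intra-group compactness is measured. One plausible reading, shown here purely as an assumption, is the mean cosine similarity between each alert's vector representation and the centroid of its group, averaged across groups. The function names and the use of cosine similarity are illustrative and are not confirmed by the source.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def _cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def _centroid(vectors: List[Vector]) -> List[float]:
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def group_compactness(vectors: List[Vector]) -> float:
    """Mean cosine similarity of each member to the group centroid (assumed metric)."""
    center = _centroid(vectors)
    return sum(_cosine(v, center) for v in vectors) / len(vectors)


def mean_compactness(groups: Dict[str, List[Vector]]) -> float:
    """Average compactness over all non-empty groups."""
    scores = [group_compactness(v) for v in groups.values() if v]
    if not scores:
        raise ValueError("no non-empty groups to score")
    return sum(scores) / len(scores)
```

Under this reading, a score close to 1 means the alerts within each group point in nearly the same direction in feature space, which is consistent with the claim of high clustering quality.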

Key Takeaways

Building Proactive Systems – Learn how a well-architected ML framework does more than just react to incidents—it predicts and prevents them. Gain insights into designing infrastructure that continuously monitors and optimizes performance, reducing downtime and on-call stress.

Enterprise Data Security and Privacy – Discover methods for balancing personalized machine learning capabilities with stringent enterprise security standards, ensuring that customer-specific models are both high-performing and compliant.

Scalability with Flexibility – Understand the importance of a multi-tenant, adaptable architecture that scales seamlessly across diverse customer environments. This session will illuminate strategies for managing millions of data points and thousands of models without sacrificing efficiency.

Real-World Impact on Operations – Through the Alert Grouping case study, see concrete evidence of how advanced ML solutions can improve operational outcomes, streamline incident management, and ultimately boost on-call engineer productivity.

Ideal For

Engineering professionals, DevOps teams, IT operations experts, heads of engineering, incident management specialists, and data scientists.
