As organizations race to integrate large language models into their products and workflows, a new requirement is emerging: the need to host private LLMs in a scalable, secure, and operationally manageable way.
This talk presents a practical, cloud-agnostic architecture for hosting private LLMs with strong security isolation and efficient AI operations at scale.
We’ll explore how to enforce isolation, establish secure private network boundaries, and build a hardened control plane to manage LLM lifecycle and infrastructure state.
Central to this architecture is a model-agnostic access layer, or gateway, which decouples downstream systems from specific model APIs. It provides a consistent interface across model types and versions, while enabling operational features like request authentication, batching, standard and semantic caching, and routing.
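As a rough illustration of that decoupling (not the talk's actual gateway implementation), the sketch below maps a requested model name to a backend base URL and forwards an OpenAI-style chat call; the model names and internal URLs are placeholders.

```python
# Minimal gateway routing sketch: callers speak one OpenAI-style interface,
# and the gateway picks the backend that serves the requested model.
import requests

MODEL_BACKENDS = {                      # placeholder model names and URLs
    "llama-3-8b": "http://vllm-llama.internal:8000/v1",
    "mistral-7b": "http://vllm-mistral.internal:8000/v1",
}

def route_chat(model: str, messages: list[dict], api_key: str) -> dict:
    """Forward a chat completion to whichever backend serves `model`."""
    base_url = MODEL_BACKENDS[model]    # unknown models fail fast with KeyError
    resp = requests.post(
        f"{base_url}/chat/completions",
        headers={"Authorization": f"Bearer {api_key}"},
        json={"model": model, "messages": messages},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()

# Downstream code never needs to know which backend answered:
# route_chat("llama-3-8b", [{"role": "user", "content": "hello"}], api_key="...")
```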
In addition to architecture, we’ll explore the operational challenges of managing private LLMs in production, such as GPU resource scaling, long-tail latency under concurrent load, scaling under unpredictable traffic, and cost optimization techniques.
Attendees will learn:
- How to design a secure isolation layer for private LLMs using cloud-native constructs.
- How to implement private, low-latency access using cloud-native networking primitives.
- The role of a model-agnostic AI Gateway (Access Layer) in:
  - Unifying access across different LLM backends
  - Managing API key auth and RBAC
  - Implementing standard, semantic and conversational caching
  - Aggregating requests for efficient batching
- Operational strategies for:
  - Orchestration and Upgrades
  - Reducing long-tail latency
  - Controlling cost under bursty traffic
  - Autoscaling Strategies
  - Performance and Cost Tradeoffs
This session is designed for:
- Platform Engineers building secure AI infrastructure
- MLOps / DevOps Engineers managing the deployment and scaling of LLMs
- Cloud Infra and SRE Teams responsible for performance, availability, and cost control
- AI Engineers deploying private models in enterprise, internal, or regulated settings
- Anyone designing or running LLM infrastructure beyond prototypes
- Deploying LLMs effectively for multiple customers (tenants) goes beyond simple model hosting.
- Key challenges: Ensuring scalability, robust security, tenant data isolation, cost management, and integrating value-added features.
- Existing solutions often lack integrated, enterprise-grade capabilities, forcing organizations to build complex frameworks themselves for:
  - Performance/Cost Optimization (Caching, Batching)
  - Model Agnosticism & Upgrades
  - Operational Needs (Auth, RBAC, Monitoring, Secure Networking)
- Presenting a robust, multi-tenant LLM platform architecture built on AWS.
- Designed for scalability, security, cost-efficiency, and ease of use for tenants.
- Strategy: AWS Account-per-Customer.
- Benefits:
  - Strict data separation and isolation.
  - Simplified per-tenant billing and cost tracking.
  - Enables secure, customer-specific networking (PrivateLink).
  - Facilitates meeting compliance requirements.
- Managed via a central control plane using cross-account IAM roles.
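A minimal sketch of that cross-account pattern, assuming a dedicated role in each tenant account (the account ID and role name below are placeholders, not the platform's actual values):

```python
# Control plane reaching into one tenant account: assume a cross-account IAM
# role via STS and build a boto3 session from the temporary credentials.
import boto3

def tenant_session(account_id: str, role_name: str = "LLMPlatformControlPlane") -> boto3.Session:
    """Return a boto3 session scoped to a single tenant account."""
    sts = boto3.client("sts")
    creds = sts.assume_role(
        RoleArn=f"arn:aws:iam::{account_id}:role/{role_name}",
        RoleSessionName="control-plane",
        DurationSeconds=3600,
    )["Credentials"]
    return boto3.Session(
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )

# e.g. tenant_session("123456789012").client("ec2").describe_instances()
```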
- Model Serving Layer:
  - Leverages optimized toolkits (e.g., vLLM, NVIDIA NIM) for standardized inference APIs and performance.
  - Model Serving Agent (on EC2): Manages model lifecycle (deploy, start/stop, update), reports health, collects metrics (for CloudWatch/Prometheus), and routes requests.
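A rough sketch of the agent's health/metrics probe, assuming a vLLM-style server that exposes /health and a Prometheus /metrics endpoint locally; how the agent ships the snapshot upstream is left out.

```python
# Model Serving Agent probe: check the local inference server and collect its
# Prometheus metrics so they can be forwarded to CloudWatch/Prometheus.
import requests

SERVER_URL = "http://localhost:8000"    # local vLLM / NIM-style server

def probe() -> dict:
    """Collect a minimal health/metrics snapshot from the local model server."""
    snapshot = {"healthy": False, "metrics": ""}
    try:
        snapshot["healthy"] = requests.get(f"{SERVER_URL}/health", timeout=5).ok
        snapshot["metrics"] = requests.get(f"{SERVER_URL}/metrics", timeout=5).text
    except requests.RequestException:
        pass                            # leave healthy=False on any failure
    return snapshot

# The agent would run this on a timer and report the result to the control
# plane and the metrics pipeline.
```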
- Networking & Secure Access:
  - Public Access: Standard ALB + ASG + Route53 setup for stable public endpoints.
  - Private Access (Preferred): AWS PrivateLink for secure, private connectivity from customer VPCs to the LLM service (avoids CIDR conflicts, simplifies security).
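A hedged sketch of the PrivateLink wiring with boto3; all ARNs and IDs are placeholders, and in practice the two calls run in different accounts.

```python
# Provider side: publish an endpoint service backed by the LLM service's NLB.
import boto3

ec2 = boto3.client("ec2")

service = ec2.create_vpc_endpoint_service_configuration(
    NetworkLoadBalancerArns=["arn:aws:elasticloadbalancing:us-east-1:111111111111:loadbalancer/net/llm-nlb/abc"],
    AcceptanceRequired=True,            # provider approves each tenant connection
)["ServiceConfiguration"]

# Consumer side (runs in the customer's account/VPC): create an interface
# endpoint to that service, so traffic never leaves the AWS network.
ec2.create_vpc_endpoint(
    VpcEndpointType="Interface",
    VpcId="vpc-0123456789abcdef0",
    ServiceName=service["ServiceName"],
    SubnetIds=["subnet-0123456789abcdef0"],
    SecurityGroupIds=["sg-0123456789abcdef0"],
)
```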
The AI Gateway is an intermediary service that provides crucial features before requests reach the model server.
- Authentication & Authorization: API Key management, Role-Based Access Control (RBAC).
- Intelligent Caching:
  - Standard Caching: Key/Value store for identical prompts.
  - Semantic Caching: Vector DB lookup for similar/paraphrased prompts.
- Request Batching: Aggregates requests for improved throughput and cost-efficiency (especially if not native to the toolkit).
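The standard and semantic caching above might look roughly like the following; the embed() call, the 0.92 similarity threshold, and the in-memory stores are stand-ins for a real embedding model, a key/value store such as Redis, and a vector database.

```python
# Two-tier cache lookup: exact match first, then semantic similarity search.
import hashlib
import numpy as np

exact_cache: dict[str, str] = {}                   # prompt hash -> cached response
semantic_cache: list[tuple[np.ndarray, str]] = []  # (prompt embedding, cached response)

def embed(text: str) -> np.ndarray:
    """Placeholder: call a real embedding model here."""
    raise NotImplementedError

def cached_response(prompt: str, threshold: float = 0.92) -> str | None:
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key in exact_cache:                         # standard caching: identical prompt
        return exact_cache[key]
    query = embed(prompt)
    for vec, response in semantic_cache:           # semantic caching: similar prompt
        sim = float(np.dot(query, vec) / (np.linalg.norm(query) * np.linalg.norm(vec)))
        if sim >= threshold:
            return response
    return None                                    # miss: forward the request to the model
```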
- Challenge: Standard metrics (CPU/Network) don’t accurately reflect LLM load (GPU is the bottleneck).
- Solution: GPU-Centric Auto Scaling using AWS Auto Scaling Groups (ASGs); see the sketch after this list.
  - Collect GPU Utilization (%) via nvidia-smi on instances.
  - Publish as Custom CloudWatch Metrics.
  - Configure ASG Scaling Policies (Target Tracking/Step Scaling) based on these custom GPU metrics.
- Benefits: Accurate scaling, better performance, cost optimization by avoiding over/under-provisioning.
- User Control: Allow tenants to enable/disable auto-scaling and set min/max instance limits.
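The sketch referenced above: read GPU utilization with nvidia-smi and publish it as a custom CloudWatch metric that an ASG target-tracking policy can scale on. The namespace, metric name, and 70% target are illustrative choices, not the platform's actual configuration.

```python
# Runs on each GPU instance (e.g. every 60s via cron or a systemd timer).
import subprocess
import boto3

cloudwatch = boto3.client("cloudwatch")

def publish_gpu_utilization(asg_name: str) -> None:
    # Query per-GPU utilization as plain numbers, one line per GPU.
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=utilization.gpu", "--format=csv,noheader,nounits"],
        text=True,
    )
    utilizations = [float(line) for line in out.strip().splitlines()]
    cloudwatch.put_metric_data(
        Namespace="LLMPlatform/GPU",
        MetricData=[{
            "MetricName": "GPUUtilization",
            "Dimensions": [{"Name": "AutoScalingGroupName", "Value": asg_name}],
            "Value": sum(utilizations) / len(utilizations),
            "Unit": "Percent",
        }],
    )

# A target-tracking policy on the ASG then keeps average GPUUtilization near a
# chosen target, e.g. 70%, scaling instances out and in accordingly.
```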
With the current set of databases, customers can only access data in a structured format. If they want to leverage that data with LLMs, they typically need to write a custom client that retrieves data from the database and sends it to the LLM.
Couchbase Server already supports User-Defined Functions (UDFs) via SQL++. We utilized UDFs to directly invoke LLMs and return the responses to the user.
However, we encountered a challenge: UDFs do not natively support authentication. While we had firewalls in place for our AI functions, relying on firewalls alone is not sufficient for robust security. To address this, we implemented AWS STS to generate temporary tokens, providing an additional layer of secure access.
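One way such a check could look, as a hedged sketch rather than Couchbase's actual implementation (the expected role name is an assumption): the AI-function endpoint rebuilds an STS client from the caller-supplied temporary credentials and verifies the identity they resolve to, so passing the firewall alone is not enough.

```python
# Validate caller-supplied temporary STS credentials before serving a request.
import boto3

def is_authorized(access_key: str, secret_key: str, session_token: str,
                  expected_role: str = "CouchbaseAIFunctionsRole") -> bool:
    sts = boto3.client(
        "sts",
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key,
        aws_session_token=session_token,
    )
    try:
        arn = sts.get_caller_identity()["Arn"]   # fails if the token is invalid or expired
    except Exception:
        return False
    # Assumed-role ARNs look like arn:aws:sts::<account>:assumed-role/<role>/<session>.
    return f":assumed-role/{expected_role}/" in arn
```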
Now that we support embedding models, we wanted to provide customers with a way to vectorize their existing data in Couchbase Server.
Couchbase Server includes its own protocol, DCP (Database Change Protocol), which streams document mutations to clients. One such client is the Eventing Service—an existing feature in Couchbase that allows users to write custom JavaScript logic to handle document mutations.
To deliver a seamless experience without reinventing the wheel, we chose to leverage the Eventing Service (already a DCP consumer) to vectorize customer data efficiently.
We also provided customers a way to add data to the database from PDFs, text documents, and other file formats. We built our own service, UDS, which extracts JSON documents from these files and inserts them into the database.
We also wanted to give customers a way to query data using natural language via the Agent Catalog. The Agent Catalog manages their queries, and customers can integrate their own agents with it. When a natural-language request comes in, the Agent Catalog performs a vector search to find the most relevant query, which the agents can then execute.
- Building a successful multi-tenant LLM platform requires thoughtful architecture beyond basic deployment.
- Combining AWS best practices (Account-per-Tenant, PrivateLink) with custom components (Model Serving Agent, AI Gateway) and intelligent scaling (GPU metrics) delivers a powerful solution.
- Empowers customers to leverage LLMs securely and efficiently without managing the underlying complexity.