Ayan Sharma (@ayan_sharma)

From Data to Dialogue: Making Databases Conversational and Intelligent

Submitted Apr 30, 2025

Overview

As organizations race to integrate large language models into their products and workflows, a new requirement is emerging: the need to host private LLMs in a scalable, secure, and operationally manageable way.

This talk presents a practical, cloud-agnostic architecture for hosting private LLMs with strong security isolation and efficient AI operations at scale.

We’ll explore how to enforce isolation, establish secure private network boundaries, and build a hardened control plane to manage LLM lifecycle and infrastructure state.

Central to this architecture is a model-agnostic access layer or a gateway, which decouples downstream systems from specific model APIs. It provides a consistent interface across model types and versions, while enabling operational features like request authentication, batching, standard and semantic caching, and routing.
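
As a rough sketch, the gateway can be modeled as a thin, model-agnostic interface over pluggable backends; the class and method names below are illustrative, not taken from any existing library.

```python
from abc import ABC, abstractmethod

class ModelBackend(ABC):
    """One concrete subclass per model API (OpenAI-compatible, vLLM, NIM, ...)."""
    @abstractmethod
    def generate(self, prompt: str, **params) -> str: ...

class AIGateway:
    """Model-agnostic entry point: auth, caching, and routing happen here,
    so downstream callers never depend on a specific model API."""
    def __init__(self, backends: dict[str, ModelBackend], cache, authn):
        self.backends, self.cache, self.authn = backends, cache, authn

    def chat(self, api_key: str, model: str, prompt: str, **params) -> str:
        self.authn.check(api_key)                  # request authentication
        if (hit := self.cache.get(model, prompt)) is not None:
            return hit                             # standard/semantic cache hit
        reply = self.backends[model].generate(prompt, **params)  # routing
        self.cache.put(model, prompt, reply)
        return reply
```

Swapping a model or version then means registering a new ModelBackend, with no change to downstream callers.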

In addition to architecture, we’ll explore the operational challenges of managing private LLMs in production, such as GPU resource scaling, long-tail latency under concurrent load, scaling under unpredictable traffic, and cost optimization techniques.


Takeaways

Attendees will learn:

  • How to design a secure isolation layer for private LLMs using cloud-native constructs.
  • How to implement private, low-latency access using cloud-native networking primitives.
  • The role of a model-agnostic AI Gateway (Access Layer) in:
    • Unifying access across different LLM backends
    • Managing API key auth and RBAC
    • Implementing standard, semantic and conversational caching
    • Aggregating requests for efficient batching
  • Operational strategies for:
    • Orchestration and upgrades
    • Reducing long-tail latency
    • Controlling cost under bursty traffic
    • Autoscaling strategies
    • Performance and cost tradeoffs

Audience

This session is designed for:

  • Platform Engineers building secure AI infrastructure
  • MLOps / DevOps Engineers managing the deployment and scaling of LLMs
  • Cloud Infra and SRE Teams responsible for performance, availability, and cost control
  • AI Engineers deploying private models in enterprise, internal, or regulated settings
  • Anyone designing or running LLM infrastructure beyond prototypes

The Challenge: Why Multi-Tenant LLM Serving is Hard (Problem Statement)

  • Deploying LLMs effectively for multiple customers (tenants) goes beyond simple model hosting.
  • Key challenges: Ensuring scalability, robust security, tenant data isolation, cost management, and integrating value-added features.
  • Existing solutions often lack integrated, enterprise-grade capabilities, forcing organizations to build complex frameworks themselves for:
    • Performance/Cost Optimization (Caching, Batching)
    • Model Agnosticism & Upgrades
    • Operational Needs (Auth, RBAC, Monitoring, Secure Networking)

Our Solution: A Layered AWS Architecture (Overview)

  • Presenting a robust, multi-tenant LLM platform architecture built on AWS.
  • Designed for scalability, security, cost-efficiency, and ease of use for tenants.

Core Principle: Secure Tenant Isolation

  • Strategy: AWS Account-per-Customer.
  • Benefits:
    • Strict data separation and isolation.
    • Simplified per-tenant billing and cost tracking.
    • Enables secure, customer-specific networking (PrivateLink).
    • Facilitates meeting compliance requirements.
  • Managed via a central control plane using cross-account IAM roles.
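
A minimal sketch of that pattern, assuming a role named ControlPlaneAccess (the name is illustrative) provisioned in every tenant account and trusting the control-plane account:

```python
import boto3

def tenant_session(tenant_account_id: str, region: str) -> boto3.Session:
    """Return a boto3 session scoped to a single tenant's AWS account."""
    sts = boto3.client("sts")
    creds = sts.assume_role(
        RoleArn=f"arn:aws:iam::{tenant_account_id}:role/ControlPlaneAccess",
        RoleSessionName="control-plane",
    )["Credentials"]
    return boto3.Session(
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
        region_name=region,
    )
```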

Key Architectural Components

  • Model Serving Layer:
    • Leverages optimized toolkits (e.g., vLLM, NVIDIA NIM) for standardized inference APIs and performance.
    • Model Serving Agent (on EC2): Manages model lifecycle (deploy, start/stop, update), reports health, collects metrics (for CloudWatch/Prometheus), and routes requests.
  • Networking & Secure Access:
    • Public Access: Standard ALB + ASG + Route53 setup for stable public endpoints.
    • Private Access (Preferred): AWS PrivateLink for secure, private connectivity from customer VPCs to the LLM service (avoids CIDR conflicts, simplifies security).
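
On AWS, the PrivateLink wiring reduces to two calls, one per side of the connection; every identifier below is a placeholder:

```python
import boto3

ec2 = boto3.client("ec2")

# Provider side: expose the LLM service (fronted by an NLB) as an endpoint service.
svc = ec2.create_vpc_endpoint_service_configuration(
    NetworkLoadBalancerArns=["<nlb-arn>"],   # placeholder NLB ARN
    AcceptanceRequired=True,                 # provider approves each connection
)

# Tenant side (run in the customer's account): create an interface endpoint
# into their own VPC, so traffic never traverses the public internet.
ep = ec2.create_vpc_endpoint(
    VpcEndpointType="Interface",
    ServiceName=svc["ServiceConfiguration"]["ServiceName"],
    VpcId="<customer-vpc-id>",
    SubnetIds=["<customer-subnet-id>"],
)
```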

AI Gateway: The Value-Add Layer

  • An intermediary service that provides key features before requests reach the model server.
  • Authentication & Authorization: API Key management, Role-Based Access Control (RBAC).
  • Intelligent Caching:
    • Standard Caching: Key/Value store for identical prompts.
    • Semantic Caching: Vector DB lookup for similar/paraphrased prompts (sketched after this list).
  • Request Batching: Aggregates requests for improved throughput and cost-efficiency (especially if not native to the toolkit).
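
A minimal in-memory sketch of the semantic-caching lookup, assuming a hypothetical embed() client; a production version would do this lookup in a vector database rather than a Python list:

```python
import numpy as np

class SemanticCache:
    def __init__(self, embed, threshold: float = 0.92):
        self.embed, self.threshold = embed, threshold
        self.entries: list[tuple[np.ndarray, str]] = []  # (prompt vector, reply)

    def get(self, prompt: str) -> str | None:
        v = self.embed(prompt)
        for u, reply in self.entries:
            # cosine similarity between the new prompt and a cached prompt
            sim = float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
            if sim >= self.threshold:
                return reply  # paraphrase of a cached prompt: reuse the reply
        return None

    def put(self, prompt: str, reply: str) -> None:
        self.entries.append((self.embed(prompt), reply))
```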

Smart Scaling & Cost Management

  • Challenge: Standard metrics (CPU/Network) don’t accurately reflect LLM load (GPU is the bottleneck).
  • Solution: GPU-Centric Auto Scaling using AWS Auto Scaling Groups (ASGs).
    • Collect GPU Utilization (%) via nvidia-smi on instances.
    • Publish as Custom CloudWatch Metrics.
    • Configure ASG Scaling Policies (Target Tracking/Step Scaling) based on these custom GPU metrics (see the sketch after this list).
  • Benefits: Accurate scaling, better performance, cost optimization by avoiding over/under-provisioning.
  • User Control: Allow tenants to enable/disable auto-scaling and set min/max instance limits.
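
A sketch of the metric-publishing step, run periodically on each GPU instance (the namespace and metric name are illustrative):

```python
import subprocess
import boto3

def publish_gpu_utilization(asg_name: str) -> None:
    """Read per-GPU utilization via nvidia-smi and push the average to
    CloudWatch as a custom metric keyed by Auto Scaling group."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    utils = [float(line) for line in out.strip().splitlines()]
    boto3.client("cloudwatch").put_metric_data(
        Namespace="LLM/Serving",
        MetricData=[{
            "MetricName": "GPUUtilization",
            "Dimensions": [{"Name": "AutoScalingGroupName", "Value": asg_name}],
            "Value": sum(utils) / len(utils),
            "Unit": "Percent",
        }],
    )
```

A target-tracking policy on this custom metric (for example, holding average GPU utilization near 70%) then drives the ASG up and down.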

Benefits & Use Cases

AI Functions

Databases today expose data only in structured form. If customers want to leverage that data with LLMs, they typically need to write a custom client that retrieves the data from the database and sends it to the LLM.

Couchbase Server already supports User-Defined Functions (UDFs) via SQL++. We utilized UDFs to directly invoke LLMs and return the responses to the user.

However, we encountered a challenge: UDFs do not natively support authentication. While we had firewalls in place for our AI functions, relying on firewalls alone is not sufficient for robust security. To address this, we implemented AWS STS to generate temporary tokens, providing an additional layer of secure access.
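
The full token flow isn't spelled out here, but the core STS call that mints expiring credentials looks like this (the role ARN and duration are placeholders):

```python
import boto3

def issue_temporary_token(role_arn: str, duration: int = 900) -> dict:
    """Mint short-lived credentials for an AI-function call; after
    `duration` seconds (15 minutes here) the token stops working."""
    creds = boto3.client("sts").assume_role(
        RoleArn=role_arn,
        RoleSessionName="ai-function",
        DurationSeconds=duration,
    )["Credentials"]
    return {
        "token": creds["SessionToken"],
        "expires": creds["Expiration"].isoformat(),
    }
```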

Vectorization Service

Now that we support embedding models, we wanted to provide customers with a way to vectorize their existing data in Couchbase Server.

Introduction to DCP

Couchbase Server includes its own protocol, DCP (Database Change Protocol), which streams document mutations to clients. One such client is the Eventing Service—an existing feature in Couchbase that allows users to write custom JavaScript logic to handle document mutations.

To deliver a seamless experience without reinventing the wheel, we chose to leverage the Eventing Service (already a DCP consumer) to vectorize customer data efficiently.
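
The real handler is JavaScript running inside Couchbase Eventing; as a language-neutral sketch, the per-mutation logic amounts to the following, with embed, dst, and the description field all standing in as assumptions:

```python
def on_document_mutation(doc: dict, embed, dst) -> None:
    """Logic run for each document mutation streamed over DCP: compute an
    embedding for the document's text and write the vector back."""
    text = doc.get("description", "")
    if not text:
        return                        # nothing to vectorize
    doc["embedding"] = embed(text)    # attach the vector to the document
    dst.upsert(doc["id"], doc)        # write back for later vector search
```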

UDS (Unstructured Data Service)

We also provided customers a way to add data to the database from PDFs, text documents, and similar files. We built our own service, UDS, which extracts JSON documents from these files and inserts them into the database.
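
UDS internals aren't public; a minimal sketch of the extraction step, using pypdf as a stand-in extraction library and producing one JSON document per page:

```python
from pypdf import PdfReader

def pdf_to_json_docs(path: str) -> list[dict]:
    """Turn a PDF into JSON documents ready for insertion into the database."""
    reader = PdfReader(path)
    return [
        {"source": path, "page": i, "text": page.extract_text() or ""}
        for i, page in enumerate(reader.pages)
    ]
```

Each resulting document can then be inserted through the usual SDK write path (e.g., collection.upsert(key, doc) in the Couchbase Python SDK).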

Agent Catalog

We wanted to give customers a way to query data using natural language, via the Agent Catalog.

The Agent Catalog manages a library of queries, and customers can register their own agents with it. When a user submits a natural-language request, the catalog performs a vector search to find the most relevant query (see the sketch below), which the agent then executes.
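
A sketch of that lookup, with embed() and the catalog's shape as assumptions:

```python
import numpy as np

def most_relevant_query(request: str, catalog: list[dict], embed) -> str:
    """Vector-search the query catalog for the entry closest to the user's
    natural-language request; each entry holds a precomputed vector."""
    v = embed(request)

    def score(entry: dict) -> float:
        u = entry["vector"]
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    return max(catalog, key=score)["query"]  # query the agent will execute
```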

Conclusion

  • Building a successful multi-tenant LLM platform requires thoughtful architecture beyond basic deployment.
  • Combining AWS best practices (Account-per-Tenant, PrivateLink) with custom components (Model Serving Agent, AI Gateway) and intelligent scaling (GPU metrics) delivers a powerful solution.
  • Empowers customers to leverage LLMs securely and efficiently without managing the underlying complexity.
