From 23d35a9820054b40b0e9ee1fc07dd469689cabe6 Mon Sep 17 00:00:00 2001 From: Omar Abdelwahab Date: Wed, 3 Dec 2025 12:52:34 -0800 Subject: [PATCH] Updated deployment documentation --- docs/docs/concepts/distributions.mdx | 173 +++++++++++++++- docs/docs/deploying/index.mdx | 189 +++++++++++++++++- .../starting_llama_stack_server.mdx | 171 ++++++++++++++-- docs/docs/getting_started/quickstart.mdx | 11 + 4 files changed, 525 insertions(+), 19 deletions(-) diff --git a/docs/docs/concepts/distributions.mdx b/docs/docs/concepts/distributions.mdx index 5680996644..ef6ba47d92 100644 --- a/docs/docs/concepts/distributions.mdx +++ b/docs/docs/concepts/distributions.mdx @@ -7,10 +7,175 @@ sidebar_position: 3 # Distributions -While there is a lot of flexibility to mix-and-match providers, often users will work with a specific set of providers (hardware support, contractual obligations, etc.) We therefore need to provide a _convenient shorthand_ for such collections. We call this shorthand a **Llama Stack Distribution** or a **Distro**. One can think of it as specific pre-packaged versions of the Llama Stack. Here are some examples: +## What is a Distribution? -**Remotely Hosted Distro**: These are the simplest to consume from a user perspective. You can simply obtain the API key for these providers, point to a URL and have _all_ Llama Stack APIs working out of the box. Currently, [Fireworks](https://fireworks.ai/) and [Together](https://together.xyz/) provide such easy-to-consume Llama Stack distributions. +A **Llama Stack Distribution** (or **Distro**) is a pre-configured package that bundles: +- A specific set of **API providers** (inference, memory, safety, etc.) +- **Configuration files** (`run.yaml`) with sensible defaults +- **Dependencies** needed to run those providers -**Locally Hosted Distro**: You may want to run Llama Stack on your own hardware. Typically though, you still need to use Inference via an external service. You can use providers like HuggingFace TGI, Fireworks, Together, etc. for this purpose. Or you may have access to GPUs and can run a [vLLM](https://github.com/vllm-project/vllm) or [NVIDIA NIM](https://build.nvidia.com/nim?filters=nimType%3Anim_type_run_anywhere&q=llama) instance. If you "just" have a regular desktop machine, you can use [Ollama](https://ollama.com/) for inference. To provide convenient quick access to these options, we provide a number of such pre-configured locally-hosted Distros. +Think of distributions as "starter templates" that package everything you need for specific use cases. -**On-device Distro**: To run Llama Stack directly on an edge device (mobile phone or a tablet), we provide Distros for [iOS](/docs/distributions/ondevice_distro/ios_sdk) and [Android](/docs/distributions/ondevice_distro/android_sdk) +### Why Distributions? + +While Llama Stack offers flexibility to mix-and-match providers, most users work with specific combinations based on: +- **Hardware availability** (GPU vs CPU) +- **Deployment environment** (cloud vs edge vs local) +- **Provider preferences** (open-source vs managed services) + +Distributions provide a convenient shorthand for these common combinations, saving you from manually configuring each component. 
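+
+In practice, the shorthand is just the distribution name. As a minimal sketch (using the `starter` distribution described below), you can launch a pre-packaged distribution without wiring up each provider yourself:
+
+```bash
+# Launch the pre-configured `starter` distribution locally
+llama stack run starter
+
+# Or pull the same distribution as a container image
+docker pull llamastack/distribution-starter
+```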
+ +## Distribution vs Deployment + +It's important to understand the difference: + +| Concept | What it is | Example | +|---------|-----------|---------| +| **Distribution** | _What_ you're running (the package) | `starter`, `meta-reference-gpu` | +| **Deployment** | _How/Where_ you're running it | Docker container, K8s cluster, library mode | + +**Example**: You might choose the `starter` distribution and deploy it: +- As a Docker container for testing +- In a Kubernetes cluster for production +- As a Python library for development + +## Types of Distributions + +### 1. Remotely Hosted Distributions + +**Best for**: Production use without infrastructure management + +These distributions are fully managed by third-party providers. You simply: +1. Obtain an API key +2. Point to their URL +3. Get all Llama Stack APIs working instantly + +**Available Providers**: +- [Fireworks.ai](https://fireworks.ai/) - Production-ready managed Llama Stack +- [Together.xyz](https://together.xyz/) - Scalable hosted solution + +**When to use**: +- ✅ You want to focus on building, not infrastructure +- ✅ You need production reliability without DevOps overhead +- ✅ You're okay with using a managed service + +**Learn more**: [Remote-Hosted Distributions](/docs/distributions/remote_hosted_distro/) + +### 2. Self-Hosted Distributions + +**Best for**: Custom infrastructure, specific hardware, or self-contained deployments + +Run Llama Stack on your own infrastructure with control over all components. + +#### `distribution-starter` +**Recommended for beginners and remote inference** + +- **Inference**: Ollama (local CPU) or remote providers (Fireworks, Together, vLLM, TGI) +- **Hardware**: Any machine (CPU is sufficient) +- **Use cases**: Prototyping, development, remote inference deployments + +```bash +docker pull llamastack/distribution-starter +``` + +**Learn more**: [Starter Distribution Guide](/docs/distributions/self_hosted_distro/starter) + +#### `distribution-meta-reference-gpu` +**For GPU-powered local inference** + +- **Inference**: Meta Reference implementation (PyTorch-based) +- **Hardware**: NVIDIA GPU required (24GB+ VRAM recommended) +- **Use cases**: Maximum control, on-premises GPU deployments + +```bash +docker pull llamastack/distribution-meta-reference-gpu +``` + +**Learn more**: [Meta Reference GPU Guide](/docs/distributions/self_hosted_distro/meta-reference-gpu) + +#### `nvidia` Distribution +**For NVIDIA NeMo Microservices** + +- **Inference**: NVIDIA NIM (NeMo Inference Microservices) +- **Hardware**: NVIDIA GPU with NeMo support +- **Use cases**: Enterprise NVIDIA stack integration + +**Learn more**: [NVIDIA Distribution Guide](/docs/distributions/self_hosted_distro/nvidia) + +**When to use self-hosted**: +- ✅ You have specific hardware requirements (GPUs, on-prem) +- ✅ You need full control over data and infrastructure +- ✅ You want to customize provider configurations +- ✅ You're subject to data residency requirements + +### 3. On-Device Distributions + +**Best for**: Mobile apps and edge computing + +Run Llama Stack directly on mobile devices with optimized on-device inference. 
+ +**Available SDKs**: +- [iOS SDK](/docs/distributions/ondevice_distro/ios_sdk) - Native Swift implementation +- [Android SDK](/docs/distributions/ondevice_distro/android_sdk) - Native Kotlin implementation + +**When to use**: +- ✅ You're building mobile applications +- ✅ You need offline/edge inference capabilities +- ✅ You want low-latency responses on devices + +## Choosing a Distribution + +### Decision Tree + +```mermaid +graph TD + A[What are you building?] --> B{Mobile app?} + B -->|Yes| C[Use iOS/Android SDK] + B -->|No| D{Have GPU hardware?} + D -->|Yes| E[Use meta-reference-gpu] + D -->|No| F{Want to manage infrastructure?} + F -->|No| G[Use Remote-Hosted
Fireworks/Together] + F -->|Yes| H[Use starter distribution
with remote inference] + + style C fill:#90EE90 + style E fill:#87CEEB + style G fill:#FFB6C1 + style H fill:#DDA0DD +``` + +### Quick Recommendations + +| Scenario | Distribution | Deployment Mode | +|----------|--------------|-----------------| +| Just starting out | `starter` | Local Docker | +| Developing with remote APIs | `starter` | Library mode | +| Production with GPUs | `meta-reference-gpu` | Kubernetes | +| Production without managing infra | Remote-hosted | Fireworks/Together | +| Mobile app | iOS/Android SDK | On-device | + +## Next Steps + +1. **Explore distributions**: [Available Distributions](/docs/distributions/list_of_distributions) +2. **Choose deployment mode**: [Starting Llama Stack Server](/docs/distributions/starting_llama_stack_server) +3. **Deploy to production**: [Deploying Llama Stack](/docs/deploying/) +4. **Build applications**: [Building Applications](/docs/building_applications/) + +## Common Questions + +### Can I switch distributions later? + +Yes! Llama Stack's standardized APIs mean your application code remains the same regardless of which distribution you use. You can: +- Develop with `starter` + Ollama locally +- Test with `starter` + Docker container +- Deploy with `meta-reference-gpu` in Kubernetes + +### Can I customize a distribution? + +Absolutely! See: +- [Building Custom Distributions](/docs/distributions/building_distro) +- [Customizing Configuration](/docs/distributions/customizing_run_yaml) + +### What's the difference between a distribution and a provider? + +- **Provider**: Implementation of a specific API (e.g., Ollama for inference) +- **Distribution**: Bundle of multiple providers configured to work together diff --git a/docs/docs/deploying/index.mdx b/docs/docs/deploying/index.mdx index eaa0e2612c..29124e7e0a 100644 --- a/docs/docs/deploying/index.mdx +++ b/docs/docs/deploying/index.mdx @@ -10,5 +10,190 @@ import TabItem from '@theme/TabItem'; # Deploying Llama Stack -[**→ Kubernetes Deployment Guide**](./kubernetes_deployment.mdx) -[**→ AWS EKS Deployment Guide**](./aws_eks_deployment.mdx) +This guide helps you understand how to deploy Llama Stack across different environments—from local development to production. + +## Understanding Llama Stack Deployment + +Llama Stack can be deployed in multiple ways, each suited for different stages of your development lifecycle: + +```mermaid +graph LR + A[Development] --> B[Testing] + B --> C[Staging] + C --> D[Production] + + A -.-> E[Local / Library Mode] + B -.-> F[Docker Container] + C -.-> G[Kubernetes] + D -.-> H[Kubernetes / Cloud-Managed] +``` + +### Deployment Modes + +Llama Stack supports three primary deployment modes: + +1. **Library Mode** (Development) + - No server required + - Perfect for prototyping and local testing + - Uses external inference services (Fireworks, Together, etc.) + - See: [Using Llama Stack as a Library](/docs/distributions/importing_as_library) + +2. **Container Mode** (Testing & Staging) + - Pre-built Docker images with all providers + - Consistent environment across systems + - Easy to start with `docker run` + - See: [Available Distributions](/docs/distributions/list_of_distributions) + +3. **Kubernetes Mode** (Production) + - Scalable and highly available + - Production-grade orchestration + - Support for AWS EKS and other K8s platforms + - See deployment guides below + +## Choosing Your Deployment Strategy + +### Start Here: Development + +If you're just starting out: + +1. 
**Begin with the [Quickstart Guide](/docs/getting_started/quickstart)** - Get Llama Stack running locally in minutes +2. **Use Library Mode or `starter` Distribution** - No complex infrastructure needed +3. **Prototype your application** with Ollama or remote inference providers + +### Moving to Testing & Staging + +Once your application is working locally: + +1. **Containerize with Docker** - Package your app with a distribution +2. **Choose the right distribution** for your hardware: + - `distribution-starter` for CPU/remote inference + - `distribution-meta-reference-gpu` for GPU inference + - See: [Available Distributions](/docs/distributions/list_of_distributions) +3. **Configure environment variables** for your providers + +### Production Deployment + +For production workloads: + +1. **Deploy to Kubernetes** using our guides below +2. **Configure monitoring and logging** - See [Telemetry](/docs/building_applications/telemetry) +3. **Set up proper secrets management** for API keys +4. **Consider managed hosting** for turnkey production: + - [Fireworks.ai](https://fireworks.ai) + - [Together.xyz](https://together.xyz) + +## Production Deployment Guides + +### Self-Hosted Production + +
+ +

+#### 🚢 Kubernetes Deployment
+
+Deploy Llama Stack to any Kubernetes cluster with full control over infrastructure and scaling.
+
+[**→ Kubernetes Deployment Guide**](./kubernetes_deployment.mdx)
+
+#### ☁️ AWS EKS Deployment
+
+Deploy on Amazon EKS with optimized configurations for AWS infrastructure.
+
+[**→ AWS EKS Deployment Guide**](./aws_eks_deployment.mdx)

+
+
+ +### Managed Hosting + +For production deployments without infrastructure management, consider **remote-hosted distributions** from: +- [Fireworks.ai](https://fireworks.ai) - Fully managed Llama Stack API +- [Together.xyz](https://together.xyz) - Production-ready hosting + +See: [Remote-Hosted Distributions](/docs/distributions/remote_hosted_distro/) + +## Common Deployment Patterns + +### Pattern 1: Local Development → Cloud Production + +```bash +# 1. Develop locally with Ollama +ollama run llama3.2:3b +llama stack run starter + +# 2. Test with containers +docker run -p 8321:8321 llamastack/distribution-starter + +# 3. Deploy to Kubernetes (see K8s guide) +kubectl apply -f llama-stack-deployment.yaml +``` + +### Pattern 2: Remote Inference Throughout + +```bash +# 1. Develop with remote inference (no local models) +export FIREWORKS_API_KEY=your_key +llama stack run starter + +# 2. Deploy with same remote providers +# → No model downloads or GPU requirements +# → Consistent behavior across environments +``` + +### Pattern 3: Edge/On-Device Deployment + +For mobile or edge devices: +- [iOS SDK](/docs/distributions/ondevice_distro/ios_sdk) +- [Android SDK](/docs/distributions/ondevice_distro/android_sdk) + +## Key Concepts + +### Distributions vs Deployments + +- **Distribution**: A pre-configured package of Llama Stack with specific providers (e.g., `starter`, `meta-reference-gpu`) +- **Deployment**: How you run that distribution (library mode, Docker container, K8s cluster) + +Think of distributions as "what" and deployments as "where/how". + +### Configuration Management + +All deployment modes use a `run.yaml` configuration file: + +```yaml +apis: + - agents + - inference + - memory + - safety + +providers: + inference: + - type: ollama + config: + url: ${env.OLLAMA_URL:http://localhost:11434} +``` + +See: [Configuration Reference](/docs/distributions/configuration) + +## Next Steps + +1. **New to Llama Stack?** Start with the [Quickstart Guide](/docs/getting_started/quickstart) +2. **Ready for containers?** Check [Available Distributions](/docs/distributions/list_of_distributions) +3. **Going to production?** Follow the [Kubernetes](#production-deployment-guides) or [AWS EKS](#production-deployment-guides) guides +4. **Need help choosing?** See our [Distribution Decision Flow](/docs/distributions/list_of_distributions#decision-flow) + +## Support + +- [GitHub Discussions](https://github.com/llamastack/llama-stack/discussions) - Community support +- [GitHub Issues](https://github.com/llamastack/llama-stack/issues) - Bug reports +- [Example Applications](https://github.com/llamastack/llama-stack-apps) - Reference implementations diff --git a/docs/docs/distributions/starting_llama_stack_server.mdx b/docs/docs/distributions/starting_llama_stack_server.mdx index ed19644440..92e92bd661 100644 --- a/docs/docs/distributions/starting_llama_stack_server.mdx +++ b/docs/docs/distributions/starting_llama_stack_server.mdx @@ -7,33 +7,178 @@ sidebar_position: 7 # Starting a Llama Stack Server -You can run a Llama Stack server in one of the following ways: +Llama Stack servers can be run in three primary ways, each suited for different stages of your development lifecycle: -## As a Library: +## Deployment Lifecycle Overview -This is the simplest way to get started. Using Llama Stack as a library means you do not need to start a server. This is especially useful when you are not running inference locally and relying on an external inference service (eg. fireworks, together, groq, etc.) 
See [Using Llama Stack as a Library](importing_as_library) +```mermaid +graph LR + A[Development] --> B[Testing] --> C[Production] + A -.->|Library Mode| D[No Server] + B -.->|Container Mode| E[Docker/Podman] + C -.->|K8s Mode| F[Kubernetes] +``` + +Choose the deployment mode that matches your current development stage: + +| Stage | Mode | When to Use | +|-------|------|-------------| +| **Development** | Library Mode | Prototyping, no local models needed | +| **Testing** | Container Mode | Consistent testing across environments | +| **Production** | Kubernetes Mode | Scalable, production-ready deployments | + +--- + +## 1. Library Mode (Development) + +**Best for**: Quick prototyping without running a server + +This is the simplest way to get started. Using Llama Stack as a library means you do not need to start a server. This is especially useful when you are not running inference locally and relying on an external inference service (e.g. Fireworks, Together, Groq, etc.) + +**How it works**: +- Import Llama Stack directly into your Python code +- No server process needed +- Perfect for Jupyter notebooks and scripts + +**Example**: +```python +from llama_stack_client import LlamaStackClient + +# Use remote inference provider directly +client = LlamaStackClient( + base_url="https://llama-stack.fireworks.ai", + api_key="your-api-key" +) +``` + +**Learn more**: [Using Llama Stack as a Library](importing_as_library) + +--- + +## 2. Container Mode (Testing & Staging) + +**Best for**: Testing before production deployment + +Another simple way to start interacting with Llama Stack is to spin up a container (via Docker or Podman) which is pre-built with all the providers you need. We provide a number of pre-built images so you can start a Llama Stack server instantly. + +**How it works**: +- Pull a pre-built distribution image +- Run with `docker run` or `podman run` +- Configure via environment variables + +**Example**: +```bash +# Pull the starter distribution +docker pull llamastack/distribution-starter + +# Run the server +docker run -p 8321:8321 \ + -e OLLAMA_URL=http://host.docker.internal:11434 \ + llamastack/distribution-starter +``` + +**Choosing a distribution**: Which distribution to choose depends on your hardware. See [Available Distributions](./list_of_distributions) for details. + +**When to use containers**: +- ✅ Testing your application in a consistent environment +- ✅ Sharing setups with team members +- ✅ CI/CD pipelines +- ✅ Staging environments before Kubernetes +--- + +## 3. Kubernetes Mode (Production) + +**Best for**: Production deployments at scale + +If you have built a container image and want to deploy it in a Kubernetes cluster instead of starting the Llama Stack server locally, we provide comprehensive deployment guides. 
+ +**How it works**: +- Deploy container images to K8s clusters +- Configure scaling, monitoring, and high availability +- Manage with kubectl or Helm charts + +**Example**: +```bash +# Deploy to Kubernetes +kubectl apply -f llama-stack-deployment.yaml + +# Scale deployment +kubectl scale deployment llama-stack --replicas=3 +``` + +**Production deployment guides**: +- [Kubernetes Deployment Guide](../deploying/kubernetes_deployment) - General K8s deployment +- [AWS EKS Deployment Guide](../deploying/aws_eks_deployment) - AWS-specific setup + +**When to use Kubernetes**: +- ✅ Production workloads requiring high availability +- ✅ Auto-scaling based on load +- ✅ Multi-region deployments +- ✅ Enterprise-grade monitoring and logging + +--- -## Container: +## Choosing Your Deployment Mode -Another simple way to start interacting with Llama Stack is to just spin up a container (via Docker or Podman) which is pre-built with all the providers you need. We provide a number of pre-built images so you can start a Llama Stack server instantly. You can also build your own custom container. Which distribution to choose depends on the hardware you have. See [Selection of a Distribution](./list_of_distributions) for more details. +### Development Phase +Start with **Library Mode** if: +- You're prototyping or learning Llama Stack +- You're using remote inference providers (no local models) +- You don't need a standalone server -## Kubernetes: +### Testing Phase +Move to **Container Mode** when: +- You need consistent testing environments +- Multiple team members are developing +- You're integrating with CI/CD pipelines +- You want to test with local models (Ollama, vLLM) -If you have built a container image and want to deploy it in a Kubernetes cluster instead of starting the Llama Stack server locally. See [Kubernetes Deployment Guide](../deploying/kubernetes_deployment) for more details. +### Production Phase +Deploy with **Kubernetes Mode** when: +- You need production-grade reliability +- Auto-scaling is required +- You have DevOps infrastructure in place +- You need multi-region or high-availability setups +--- + +## Configure Logging + +Control log output via environment variables before starting the server (works for all deployment modes): + +```bash +# Set per-component log levels +export LLAMA_STACK_LOGGING=server=debug,core=info + +# Mirror logs to a file +export LLAMA_STACK_LOG_FILE=/path/to/log +``` -## Configure logging +**Supported categories**: `all`, `core`, `server`, `router`, `inference`, `agents`, `safety`, `eval`, `tools`, `client` -Control log output via environment variables before starting the server. +**Levels**: `debug`, `info`, `warning`, `error`, `critical` (default: `info`) -- `LLAMA_STACK_LOGGING` sets per-component levels, e.g. `LLAMA_STACK_LOGGING=server=debug,core=info`. -- Supported categories: `all`, `core`, `server`, `router`, `inference`, `agents`, `safety`, `eval`, `tools`, `client`. -- Levels: `debug`, `info`, `warning`, `error`, `critical` (default is `info`). Use `all=` to apply globally. -- `LLAMA_STACK_LOG_FILE=/path/to/log` mirrors logs to a file while still printing to stdout. +**Examples**: +```bash +# Debug all components +export LLAMA_STACK_LOGGING=all=debug + +# Debug server, info for everything else +export LLAMA_STACK_LOGGING=server=debug,all=info +``` Export these variables prior to running `llama stack run`, launching a container, or starting the server through any other pathway. +--- + +## Next Steps + +1. 
**New to Llama Stack?** Start with [Library Mode](#1-library-mode-development) +2. **Ready to test?** Try [Container Mode](#2-container-mode-testing--staging) +3. **Going to production?** Follow the [Kubernetes guides](#3-kubernetes-mode-production) +4. **Need help choosing?** See [Choosing Your Deployment Mode](#choosing-your-deployment-mode) + ```{toctree} :maxdepth: 1 :hidden: diff --git a/docs/docs/getting_started/quickstart.mdx b/docs/docs/getting_started/quickstart.mdx index 0761a6e9bd..f0de9e101c 100644 --- a/docs/docs/getting_started/quickstart.mdx +++ b/docs/docs/getting_started/quickstart.mdx @@ -10,6 +10,17 @@ Get started with Llama Stack in minutes! Llama Stack is a stateful service with REST APIs to support the seamless transition of AI applications across different environments. You can build and test using a local server first and deploy to a hosted endpoint for production. +## Development → Production Journey + +This quickstart gets you started with **local development**. Once you're ready to scale: + +1. **Testing**: Run in [Docker containers](/docs/distributions/list_of_distributions) for consistent testing +2. **Production**: Deploy to [Kubernetes](/docs/deploying/) or use [managed hosting](/docs/distributions/remote_hosted_distro/) + +The beauty of Llama Stack: **your application code stays the same** across all these environments! + +--- + In this guide, we'll walk through how to build a RAG application locally using Llama Stack with [Ollama](https://ollama.com/) as the inference [provider](/docs/providers/inference) for a Llama Model.