Ayushmore1214/llm-k8s-deployment

K8s LLM Observability Stack

Deploying a model is the easy part. Keeping it alive under load? That's where the engineering begins. Most people can run an LLM in a terminal. Few can run one on Kubernetes that doesn't melt under pressure.

This project demonstrates how to deploy an open-source LLM (TinyLlama) on a Kubernetes cluster using Kind, with monitoring provided by Prometheus and Grafana.

Prerequisites

  • Docker installed and running
  • kubectl installed
  • Helm installed
  • Python 3.8+ installed

1. Environment Cleanup and Cluster Setup

If you previously had K3s or other configurations, clean them up first to avoid port and permission conflicts.

# Remove old configuration directories
sudo rm -rf /etc/rancher

# Reset Kubeconfig environment variable
unset KUBECONFIG

# Install Kind binary (Linux AMD64)
curl -Lo ./kind https://kind.sigs.k8s.io/dl/v0.22.0/kind-linux-amd64
chmod +x ./kind
sudo mv ./kind /usr/local/bin/kind

# Create the cluster
kind create cluster --name llm-deploy

# Link kubectl to the new Kind cluster
kind export kubeconfig --name llm-deploy

2. Deploy LLM Infrastructure

We use Ollama to serve the LLM inside the cluster.

# Apply the Kubernetes manifest
kubectl apply -f llm-stack.yaml

# Monitor the pod status until it is 'Running'
kubectl get pods -w

# Download the TinyLlama model into the running pod
kubectl exec -it $(kubectl get pods -l app=ollama -o name) -- ollama pull tinyllama
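The contents of llm-stack.yaml are not shown in this README. As a rough sketch of what such a manifest might contain, the Python below builds a Deployment and a Service as plain dictionaries and renders them as JSON, which kubectl also accepts. The `app=ollama` label, the `ollama-service` name, and port 11434 come from the commands in this guide. The `ollama/ollama` image, the single replica, and the function names `ollama_manifests` and `render` are assumptions made for illustration, not the repository's actual manifest.

```python
import json


def ollama_manifests(name="ollama", service_name="ollama-service",
                     port=11434, image="ollama/ollama"):
    """Build a Deployment + Service pair as a Kubernetes v1 List (assumed layout)."""
    labels = {"app": name}  # matches the `-l app=ollama` selector used above
    deployment = {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": 1,  # assumption: one pod is enough for a local Kind cluster
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [{
                        "name": name,
                        "image": image,  # assumption: the public Ollama image
                        "ports": [{"containerPort": port}],
                    }]
                },
            },
        },
    }
    service = {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": service_name},
        "spec": {
            "selector": labels,
            "ports": [{"port": port, "targetPort": port}],
        },
    }
    return {"apiVersion": "v1", "kind": "List", "items": [deployment, service]}


def render(manifests):
    """Serialize the manifests to JSON text that `kubectl apply -f -` can read."""
    return json.dumps(manifests, indent=2)


if __name__ == "__main__":
    print(render(ollama_manifests()))
```

Saved under a hypothetical name such as manifests.py, the output could be piped into `kubectl apply -f -`. A real deployment would likely also set resource requests and limits and mount a volume for downloaded models.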

3. Install Monitoring Stack

Use Helm to deploy the kube-prometheus-stack chart, which bundles Prometheus and Grafana, with a reduced Prometheus memory request so it fits on a local machine.

# Add and update the Prometheus community repo
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install the monitoring stack with a set password and low memory requests
helm install obs prometheus-community/kube-prometheus-stack \
  --set prometheus.prometheusSpec.resources.requests.memory=300Mi \
  --set grafana.adminPassword=admin

4. Application Access (Port Forwarding)

Because the services are inside the Kubernetes network, you must forward the ports to access them locally. Open three separate terminal tabs for these commands:

Tab 1: LLM API

kubectl port-forward svc/ollama-service 11434:11434

Tab 2: Grafana Dashboard

kubectl port-forward svc/obs-grafana 3000:80

Tab 3: Launch Chatbot UI

# Install UI dependencies
pip install streamlit requests

# Run the application
python3 -m streamlit run app.py
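The source of app.py is not reproduced here. To illustrate the kind of request a chat UI would send through the Tab 1 port-forward, here is a minimal sketch that calls Ollama's `/api/generate` endpoint using only the standard library. The helper names `build_generate_request` and `generate` are hypothetical, and the 120-second timeout is an assumption chosen because CPU-only inference can be slow.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # reachable once the Tab 1 port-forward is running


def build_generate_request(prompt, model="tinyllama", base_url=OLLAMA_URL):
    """Build a POST request for Ollama's /api/generate endpoint without sending it."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, model="tinyllama", base_url=OLLAMA_URL, timeout=120):
    """Send the prompt and return the generated text from the JSON reply."""
    request = build_generate_request(prompt, model, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)["response"]
```

With the port-forward active and the model pulled, `generate("Why is the sky blue?")` should return the model's answer as a string. Setting `"stream": False` asks Ollama for one complete JSON object instead of a stream of chunks, which keeps the parsing simple.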

5. Usage and Monitoring

  1. Open the Chatbot UI in your browser (usually http://localhost:8501).
  2. Open Grafana in your browser at http://localhost:3000.
  3. Log in to Grafana with:
    • User: admin
    • Password: admin
  4. Navigate to Dashboards -> Kubernetes / Compute Resources / Pod and select the ollama pod.
  5. Interact with the Chatbot and observe the CPU and Memory usage spikes in the Grafana dashboard.

If the usage graphs stay saturated long after a prompt completes, the dashboard shows whether CPU or memory is where your infra is struggling.
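The same CPU and memory numbers can also be read straight from Prometheus instead of the Grafana UI. The sketch below builds PromQL queries over the cAdvisor metrics `container_cpu_usage_seconds_total` and `container_memory_working_set_bytes` and sends them to the Prometheus HTTP API. It assumes a fourth port-forward, for example `kubectl port-forward svc/prometheus-operated 9090:9090` (the service name may differ depending on the chart version), that the pod runs in the `default` namespace, and that its name starts with `ollama`. All function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

PROMETHEUS_URL = "http://localhost:9090"  # assumption: Prometheus is port-forwarded here


def cpu_query(pod_prefix, namespace="default", window="1m"):
    """PromQL for CPU cores used by matching pods, averaged over `window`."""
    return (
        'sum(rate(container_cpu_usage_seconds_total'
        f'{{namespace="{namespace}", pod=~"{pod_prefix}.*", container!=""}}[{window}]))'
    )


def memory_query(pod_prefix, namespace="default"):
    """PromQL for the working-set memory (bytes) of matching pods."""
    return (
        'sum(container_memory_working_set_bytes'
        f'{{namespace="{namespace}", pod=~"{pod_prefix}.*", container!=""}})'
    )


def query_url(promql, base_url=PROMETHEUS_URL):
    """URL for an instant query against the Prometheus HTTP API."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def instant_value(promql, base_url=PROMETHEUS_URL, timeout=10):
    """Run the query and return the first sample as a float, or None if empty."""
    with urllib.request.urlopen(query_url(promql, base_url), timeout=timeout) as resp:
        body = json.load(resp)
    results = body["data"]["result"]
    return float(results[0]["value"][1]) if results else None
```

Calling `instant_value(cpu_query("ollama"))` before and during a chat prompt gives a quick numeric check of the spike you see on the dashboard. The `container!=""` matcher is there to skip the pod-level aggregate series so usage is not counted twice.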

🧹 The Clean-Up

Don't hog resources on your machine!

kind delete cluster --name llm-deploy

If you want to understand what we did in this setup in more depth, I wrote a deep-dive here: https://heyyayush.hashnode.dev/how-to-deploy-an-open-source-llm-reliably-on-kubernetes?utm_source=hashnode&utm_medium=feed

Made with ❤️ by Ayush More. If this helped you learn how to monitor AI, give it a ⭐ and find me on LinkedIn.
