Deploying a model is the easy part. Keeping it alive under load? That's where the engineering begins.

Most people can run an LLM in a terminal. Few can run one on Kubernetes that doesn't melt under pressure.

This project demonstrates how to deploy an open-source LLM (TinyLlama) on a Kubernetes cluster using Kind, with monitoring provided by Prometheus and Grafana.
- Docker installed and running
- kubectl installed
- Helm installed
- Python 3.8+ installed
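Before going further, it can help to confirm the prerequisites are actually reachable from your shell. The following is a minimal sketch, not part of the project: the helper names `missing_tools` and `python_ok` are made up for illustration, and the check only looks for the binaries on `PATH`. It cannot tell whether the Docker daemon is running.

```python
import shutil
import sys

# Command-line tools from the prerequisites list above.
REQUIRED_TOOLS = ["docker", "kubectl", "helm"]


def missing_tools(tools, which=shutil.which):
    """Return the tools that cannot be found on PATH.

    `which` is injectable so the function can be exercised without
    touching the real environment.
    """
    return [tool for tool in tools if which(tool) is None]


def python_ok(version_info=sys.version_info, minimum=(3, 8)):
    """True when the interpreter is at least the minimum (major, minor)."""
    return tuple(version_info[:2]) >= minimum
```

Calling `missing_tools(REQUIRED_TOOLS)` returns an empty list when everything is installed, and `python_ok()` reports whether the current interpreter meets the 3.8+ requirement.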
If you previously had K3s or other configurations, clean them up first to avoid port and permission conflicts.
# Remove old configuration directories
sudo rm -rf /etc/rancher
# Reset Kubeconfig environment variable
unset KUBECONFIG
# Install Kind binary (Linux AMD64)
curl -Lo ./kind https://kind.sigs.k8s.io/dl/v0.22.0/kind-linux-amd64
chmod +x ./kind
sudo mv ./kind /usr/local/bin/kind
# Create the cluster
kind create cluster --name llm-deploy
# Link kubectl to the new Kind cluster
kind export kubeconfig --name llm-deploy

We use Ollama to serve the LLM inside the cluster.
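The contents of `llm-stack.yaml` are not reproduced in this README, so the following is only a guess at its general shape, written as a Python sketch that builds an equivalent manifest. The label `app=ollama`, the Service name `ollama-service`, and port 11434 come from the commands in this guide. Everything else is an assumption: the Deployment name `ollama`, the `ollama/ollama` image, a single replica, and the absence of resource limits or persistent storage. The real file may differ on any of these.

```python
import json

APP_LABEL = {"app": "ollama"}  # matches `-l app=ollama` in the model pull command
OLLAMA_PORT = 11434            # matches the port used by the port-forward command


def build_manifest(image="ollama/ollama", replicas=1):
    """Build a hypothetical Deployment + Service pair as a Kubernetes List."""
    deployment = {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": "ollama", "labels": APP_LABEL},  # name is assumed
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": APP_LABEL},
            "template": {
                "metadata": {"labels": APP_LABEL},
                "spec": {
                    "containers": [
                        {
                            "name": "ollama",
                            "image": image,  # assumed image
                            "ports": [{"containerPort": OLLAMA_PORT}],
                        }
                    ]
                },
            },
        },
    }
    service = {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": "ollama-service"},
        "spec": {
            "selector": APP_LABEL,
            "ports": [{"port": OLLAMA_PORT, "targetPort": OLLAMA_PORT}],
        },
    }
    return {"apiVersion": "v1", "kind": "List", "items": [deployment, service]}
```

Writing `json.dumps(build_manifest(), indent=2)` to a file produces something `kubectl apply -f` can read, since kubectl accepts JSON as well as YAML. The key detail is that the Service selector and the pod template labels match, which is what lets `svc/ollama-service` route to the Ollama pod.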
# Apply the Kubernetes manifest
kubectl apply -f llm-stack.yaml
# Monitor the pod status until it is 'Running'
kubectl get pods -w
# Download the TinyLlama model into the running pod
kubectl exec -it $(kubectl get pods -l app=ollama -o name) -- ollama pull tinyllama

Use Helm to deploy a lightweight version of Prometheus and Grafana.
# Add and update the Prometheus community repo
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
# Install the monitoring stack with a set password and low memory requests
helm install obs prometheus-community/kube-prometheus-stack \
--set prometheus.prometheusSpec.resources.requests.memory=300Mi \
--set grafana.adminPassword=admin

Because the services are inside the Kubernetes network, you must forward the ports to access them locally. Open three separate terminal tabs for these commands:
Tab 1: LLM API
kubectl port-forward svc/ollama-service 11434:11434
Tab 2: Grafana Dashboard
kubectl port-forward svc/obs-grafana 3000:80
Tab 3: Launch Chatbot UI
# Install UI dependencies
pip install streamlit requests
# Run the application
python3 -m streamlit run app.py

- Open the Chatbot UI in your browser (usually http://localhost:8501).
- Open Grafana in your browser at http://localhost:3000.
- Log in to Grafana with:
  - User: admin
  - Password: admin
- Navigate to Dashboards -> Kubernetes / Compute Resources / Pod and select the ollama pod.
- Interact with the Chatbot and observe the CPU and Memory usage spikes in the Grafana dashboard.
If the usage stays in the red after a prompt, you know exactly why your infra is struggling.
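To produce a more sustained spike than a single chat message, you can send several prompts at once straight to the Ollama API. The following is a rough sketch, assuming the Tab 1 port-forward is still running on `localhost:11434` and that the `tinyllama` model has been pulled. It uses only the standard library rather than `requests`, and the helper names (`build_payload`, `generate`, `timed_call`, `run_load`) are made up for illustration. They are not part of `app.py`.

```python
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Assumes the Tab 1 port-forward (svc/ollama-service 11434:11434) is active.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(prompt, model="tinyllama"):
    """Request body for Ollama's /api/generate with streaming turned off."""
    return {"model": model, "prompt": prompt, "stream": False}


def generate(prompt, url=OLLAMA_URL, timeout=120):
    """Send one prompt and return the generated text."""
    body = json.dumps(build_payload(prompt)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]


def timed_call(send, prompt):
    """Return (seconds_elapsed, result) for a single call to send(prompt)."""
    start = time.perf_counter()
    result = send(prompt)
    return time.perf_counter() - start, result


def run_load(prompts, send=generate, workers=2):
    """Send prompts concurrently; results keep the order of `prompts`."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda p: timed_call(send, p), prompts))
```

For example, `run_load(["Explain Kubernetes in one sentence."] * 4)` sends four prompts two at a time and returns each response with its latency, which you can compare against the CPU and memory graphs in Grafana. The `send` parameter is injectable so the timing and concurrency logic can be tried without a running cluster.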
Don't hog resources on your machine! Delete the cluster when you're done:
kind delete cluster --name llm-deploy

If you want to understand what we did in this setup in more depth, I wrote a deep-dive here: https://heyyayush.hashnode.dev/how-to-deploy-an-open-source-llm-reliably-on-kubernetes?utm_source=hashnode&utm_medium=feed
Made with ❤️ by Ayush More. If this helped you learn how to monitor AI, give it a ⭐ and find me on LinkedIn.
