"Automate everything. Manual is the enemy of scale."
I'm an AI Infrastructure Engineer with 18+ years of hands-on experience designing and operating enterprise-grade, mission-critical compute infrastructure. My focus is GPU/AI infrastructure: KVM/QEMU GPU virtualization, NUMA-aware and low-latency tuning, SR-IOV, NCCL-validated multi-node GPU fabrics, and Kubernetes/OpenShift orchestration. This is backed by deep Linux expertise, Ansible/Terraform automation, and hybrid cloud experience.
- 🔧 Currently building AI/GPU infrastructure solutions at Ali Bin Ali Technology Solutions (ABATS)
- 🚀 Engineered KVM/QEMU/libvirt infrastructure for production GPU/AI workloads (NUMA, hugepages, virtio/SR-IOV)
- 🔬 Validated multi-node GPU fabric throughput with NCCL benchmarks and NUMA-aware tuning
- 📦 Built CI/CD-driven, reproducible OS image pipelines for immutable, versioned node rollouts
- ⚙️ Reduced task completion time by 40% through Ansible-driven automation
- 💸 Cut infrastructure costs by 30% via strategic VM migration (Hyper-V/VMware → KVM)
- 📉 Slashed deployment time from 120 min → 20 min using Red Hat Satellite
- ✅ Maintained 100% SLA compliance across all delivered projects
🔨 What I build and share — GPU/AI infrastructure, Ansible roles, image pipelines, and infrastructure-as-code
| Repository | Description | Tech |
|---|---|---|
| 🔬 gpu-nccl-benchmarking | Multi-node GPU fabric deployment & NCCL throughput validation on KVM/QEMU | KVM, NCCL, NUMA |
| ⚡ kvm-gpu-passthrough | GPU passthrough / vGPU setup with SR-IOV, hugepages & NUMA tuning | KVM, QEMU, libvirt |
| 🧱 k8s-hardened-node-images | Minimal, CIS-hardened Debian/Garden Linux images for Kubernetes nodes | dracut, systemd-boot, CI/CD |
| 📈 gpu-host-observability | Prometheus/Grafana dashboards for GPU host & AI workload monitoring | Prometheus, Grafana |
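
For context on what "NCCL throughput validation" usually compares: the nccl-tests project reports a bus bandwidth figure that normalizes all-reduce results across rank counts, using `busbw = algbw * 2 * (n - 1) / n`. The sketch below applies that published formula. The function name and sample numbers are illustrative assumptions, not code or measurements from the repositories listed here.

```python
def allreduce_bus_bandwidth(size_bytes: int, seconds: float, ranks: int) -> float:
    """Return all-reduce bus bandwidth in GB/s (1 GB = 1e9 bytes).

    Hypothetical helper: applies the nccl-tests convention
    busbw = algbw * 2 * (n - 1) / n, where algbw = size / time.
    """
    if ranks < 2:
        raise ValueError("all-reduce needs at least 2 ranks")
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    alg_bw = size_bytes / seconds / 1e9  # algorithm bandwidth in GB/s
    return alg_bw * 2 * (ranks - 1) / ranks


if __name__ == "__main__":
    # Illustrative input only: 1 GB reduced in 0.1 s across 8 ranks.
    print(f"{allreduce_bus_bandwidth(1_000_000_000, 0.1, 8):.2f} GB/s")
```

Comparing bus bandwidth rather than raw algorithm bandwidth makes runs with different rank counts comparable against the hardware's link speed.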
| Repository | Description | Tech |
|---|---|---|
| 🗂️ ansible-linux-baseline | Hardening & baseline configuration for RHEL/Oracle Linux | Ansible, RHEL |
| 🔄 ansible-patch-management | Automated OS patching with pre/post health checks | Ansible, Satellite |
| 📦 ansible-satellite-lifecycle | Lifecycle workflows using Red Hat Satellite 6.x | Ansible, Satellite |
| 🛡️ ansible-splunk-deploy | Deploy & configure Splunk forwarders, indexers, search heads | Ansible, Splunk |
| ☁️ ansible-azure-infra | Provision and configure Azure resources with Ansible | Ansible, Azure |
| 🖥️ ansible-kvm-migration | Automate P2V and V2V migrations to KVM/RHEV | Ansible, KVM |
| Repository | Description | Tech |
|---|---|---|
| 🔍 linux-health-check | Comprehensive server health audit scripts | Bash, Python |
| 📋 openshift-node-inspector | OpenShift node log & health diagnostics | Bash, OCP |
| 💾 hpux-ignite-automation | HP-UX Ignite backup & restore automation | Bash, HP-UX |
| 🔐 linux-security-audit | CIS benchmark compliance checker for Linux | Bash, Python |
┌─────────────────────────────────────────────────────────────────┐
│                GPU / AI INFRA & AUTOMATION WINS                 │
├────────────────────────────────────┬────────────────────────────┤
│ GPU Fabric Throughput              │ NCCL-validated 🔬          │
│ Node Rollout (CI/CD image pipe)    │ Immutable & versioned 📦   │
│ Deployment Time (RH Satellite)     │ 120 min ──► 20 min 🚀      │
│ Task Completion Time (Ansible)     │ Reduced by 40% ⚡          │
│ Infra Cost (KVM Migration)         │ Saved 30% 💰               │
│ SLA Compliance                     │ 100% ✅                    │
│ Operational Performance Gain       │ +30% 📈                    │
└────────────────────────────────────┴────────────────────────────┘
# What I build for GPU/AI infrastructure at scale
---
omkar_ai_infra_skills:
  gpu_ai:
    - KVM/QEMU GPU virtualization (passthrough & vGPU)
    - NUMA alignment, hugepages & low-latency tuning
    - SR-IOV & virtio for GPU-intensive workloads
    - NCCL multi-node fabric benchmarking
    - AI / Deep Learning workload provisioning
  orchestration:
    - Kubernetes / OpenShift cluster operations
    - GPU node provisioning & resource management
    - LXC/LXD container management
  automation_iac:
    - Ansible configuration management at scale
    - Terraform infrastructure-as-code
    - Reproducible CI/CD OS image pipelines (cloud-init, dracut, systemd-boot)
    - Zero-touch server provisioning
  observability:
    - Prometheus / Grafana GPU host & workload dashboards
    - RCA (P1/P2), runbooks & security compliance
  integrations:
    - Red Hat Satellite 6.10 / Oracle Linux Manager
    - GitHub Actions
    - Azure / AWS / OCI
    - OpenShift / OCP clusters

2007 ──► Embee Software        │ HP-UX & Windows L2 Support
2010 ──► AtoS India            │ Linux/Windows L3 Admin (7.5 yrs)
2018 ──► Vyom Labs             │ Automation & OpenShift Lead
2019 ──► KBC Technologies (QA) │ Sr. Sysadmin @ Ooredoo Telecom
2020 ──► EBLA Consultancy      │ Sr. Sysadmin & Backup Admin
2022 ──► ABATS (Present)       │ AI Infrastructure Engineer ◄── NOW
I'm always open to discussions around GPU/AI infrastructure, Linux automation, Ansible best practices, hybrid cloud architecture, or infrastructure optimization.