Diving Into Kubernetes Diagnostics: My K8sGPT Journey as a Rookie
Greetings, Kubernetes enthusiasts! As a rookie to native Kubernetes—still finding my footing in logs and YAML—I was excited to dive into K8sGPT. This CNCF Sandbox project harnesses AI to scan Kubernetes clusters, diagnose issues, and explain them in simple English, weaving in SRE expertise with cutting-edge technology. I set it up on a local VMware VM running Ubuntu Server 24.04.3, and the experience was both educational and practical. In this blog, I’ll share what K8sGPT is, how I got it running, my hands-on experiences as a beginner, and an invitation for you to explore it too.
What is K8sGPT?
K8sGPT is a tool designed to simplify Kubernetes troubleshooting with a two-phase approach. Phase 1 runs built-in checks to detect issues, while Phase 2 enriches these findings with AI explanations using LLMs like OpenAI or self-hosted options. According to its website, it codifies SRE experience into analyzers, pulling relevant data and anonymizing sensitive details to provide actionable insights. Key features include:
- Filters: Predefined rules for resources like Pods, ConfigMaps, and Deployments.
- AI Enhancements: Offers natural language explanations and remediation steps.
- Extensibility: Supports integrations (e.g., Trivy) and custom analyzers.
- Flexibility: Works with hosted or self-hosted AI backends.
- CLI Focus: Simple commands for quick analysis.
It’s a great starting point for rookies like me to learn SRE practices and a flexible tool for seasoned engineers to customize. I followed the official documentation to set it up, which provided clear guidance on installation and configuration.
My Setup: A Local VMware Adventure
I set up K8sGPT on a local VMware VM with Ubuntu Server 24.04.3, using 32 cores, 20GB RAM, and 200GB storage—perfect for a beginner’s cluster. Here’s how I got started:
1. VM and Kubernetes Install:
   - Launched the VM and connected via SSH.
   - Installed Kubernetes.
   - Checked status with `kubectl get pods -A` and saw Cilium, CoreDNS, and other components running (e.g., `cilium-envoy-chwqk 1/1 Running`).
   - Enabled kubectl completion with `source <(kubectl completion bash)`.
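Condensed into a copy-pasteable form, the cluster checks above look like this (a sketch, assuming `kubectl` is already configured for the new cluster):

```shell
# Sanity-check the fresh cluster (assumes kubectl can reach it)
kubectl get nodes
kubectl get pods -A    # Cilium, CoreDNS, etc. should show Running

# Enable completion for this session, and persist it for new shells
source <(kubectl completion bash)
echo 'source <(kubectl completion bash)' >> ~/.bashrc
```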
2. K8sGPT Install:
   - Downloaded the binary: `curl -sLO https://github.com/k8sgpt-ai/k8sgpt/releases/download/v0.4.25/k8sgpt_amd64.deb`.
   - Installed it with `sudo dpkg -i k8sgpt_amd64.deb`.
   - Confirmed with `which k8sgpt` (`/usr/bin/k8sgpt`) and `k8sgpt version` (v0.4.25).
   - Enabled completion with `source <(k8sgpt completion bash)`.
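Put together, the K8sGPT install is just a few commands (pinned to the v0.4.25 release I used):

```shell
# Download the v0.4.25 amd64 Debian package and install it
curl -sLO https://github.com/k8sgpt-ai/k8sgpt/releases/download/v0.4.25/k8sgpt_amd64.deb
sudo dpkg -i k8sgpt_amd64.deb

# Verify the binary landed and reports the expected version
which k8sgpt      # /usr/bin/k8sgpt
k8sgpt version    # v0.4.25

# Enable completion for the current session
source <(k8sgpt completion bash)
```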
3. AI Configuration:
   - Started with OpenAI: `k8sgpt auth add --backend openai --model gpt-4o-mini`.
   - Tested self-hosted LocalAI later, though 20GB RAM struggled with larger models.
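As a sketch, wiring up the OpenAI backend and running a first AI-enhanced scan looks like this (`k8sgpt auth add` prompts for an API key if you don't supply one):

```shell
# Register OpenAI as the AI backend (prompts for an API key)
k8sgpt auth add --backend openai --model gpt-4o-mini

# Check which backends are configured
k8sgpt auth list

# Run an analysis with AI explanations
k8sgpt analyze --explain
```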
The process took about 25 minutes, with Kubernetes stabilization being the longest wait. My VM’s specs worked well, but RAM was a bottleneck for self-hosted AI—consider 32GB as suggested by the lab.
My Experiences: Insights and Challenges as a Rookie
Following the lab, I broke and fixed my cluster to test K8sGPT. Here’s what I learned:
The Wins: Making Sense of Chaos
- Phase 1 (Without AI): Ran `k8sgpt analyze`, which flagged unused ConfigMaps like `kube-root-ca.crt`. Created a broken pod (`kubectl run brokenpod --image=nginx:unknown_version`), and it caught “ErrImagePull” with details. The `-s` flag showed per-analyzer timings (e.g., the Pod analyzer took 26ms), which helped me understand performance.
- Phase 2 (With AI): Added `--explain`, and the AI explained the pod issue: “Failed to pull image due to invalid tag; use ’nginx:latest’.” I initially hit an error (“AI provider not specified”), but `k8sgpt auth` fixed it. Self-hosted LocalAI worked but was slower on my CPU-only setup.
- Filters and Rules: Listed active filters with `k8sgpt filters list`, which covered Pods, Deployments, and more. Peeked at the `pod.go` source to see how it detects errors like “ErrImagePull”.
- Fixing Issues: Patched the pod with `kubectl patch pod brokenpod -p '{"spec":{"containers":[{"name":"brokenpod","image":"nginx:latest"}]}}'`; afterwards, `k8sgpt analyze --filter Pod` confirmed no errors after a short wait.
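For reference, the whole break-and-fix loop above fits in four commands (the pod name and image tags are just the ones from my test):

```shell
# Break: create a pod with a tag that doesn't exist
kubectl run brokenpod --image=nginx:unknown_version

# Detect: let K8sGPT flag the ErrImagePull
k8sgpt analyze --filter Pod

# Fix: a pod's container image is mutable, so patch it in place
kubectl patch pod brokenpod -p \
  '{"spec":{"containers":[{"name":"brokenpod","image":"nginx:latest"}]}}'

# Verify: re-run the analysis; the error should clear shortly
k8sgpt analyze --filter Pod
```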
It turned my rookie struggles into manageable steps, with AI guiding me like a mentor.
The Gotchas:
- LocalAI is Resource-Heavy: With only 20GB RAM on my VM, loading models was slow, particularly larger ones. The documentation notes that language models need significant memory—ideally 32GB or more—and my experience confirmed it. On my modest setup, a hosted AI backend was the faster choice.
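If you still want to try the self-hosted route, pointing K8sGPT at LocalAI is short; this is a sketch assuming LocalAI’s OpenAI-compatible API is listening on localhost:8080 and the model name matches one your instance actually serves:

```shell
# Register a self-hosted, OpenAI-compatible backend
# (the base URL and model name depend on your LocalAI setup)
k8sgpt auth add --backend localai \
  --baseurl http://localhost:8080/v1 \
  --model <your-model-name>

# Use it explicitly for one analysis
k8sgpt analyze --explain --backend localai
```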
Wrapping Up: K8sGPT as a Rookie’s Best Friend
K8sGPT transforms Kubernetes troubleshooting with AI-driven clarity, making it accessible even for rookies like me. My VM-based trial showed its potential, though it needed some resource tweaks. I’m excited to keep exploring it in my learning journey.
Ready to try it? Dive into the official Getting Started Guide.
Happy clustering!
Bastiaan van der Bijl