Kubernetes operator for local LLM inference with llama.cpp, vLLM, and TGI - multi-GPU, autoscaling, air-gapped, production-ready
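A minimal sketch of what deploying a model through this operator might look like. The `InferenceService` kind, `GPUShardingSpec` layer splits, KV cache type, and `extraArgs` escape hatch are all named in the commit log below, but the apiVersion, field names, and values here are assumptions, not taken from the repo's actual CRD:

```yaml
# Hypothetical InferenceService manifest (all fields illustrative).
apiVersion: inference.example.com/v1alpha1   # assumed API group/version
kind: InferenceService                       # kind referenced in commit #260
metadata:
  name: chat-model
spec:
  runtime: llama.cpp            # vLLM and TGI backends added in #273
  model:
    # Pulling from a local registry fits the "air-gapped" claim above.
    uri: oci://registry.local/models/chat-model:latest
  gpuSharding:                  # custom layer splits, per #267
    layerSplit: [16, 16]
  kvCacheType: q8_0             # KV cache type configuration, per #256
  extraArgs: ["--ctx-size", "8192"]   # escape hatch, per #256
  resources:
    limits:
      nvidia.com/gpu: 2         # multi-GPU serving
```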
Stars: 44
Forks: 7
Watchers: 44
Open Issues: 19
Overall repository health assessment
No package.json found: this is not a Node.js project (the dependency bump in #266 below indicates a Go codebase).
Commits: 155 · 30 · 25 · 3
Recent commits:
441c7c7  feat: add vLLM and TGI runtime backends with per-runtime HPA metrics (#273)
2b1c948  feat: add first-class PersonaPlex (Moshi) runtime backend (#272)
bb1576c  feat: add pluggable runtime backends for non-llama.cpp inference engines (#271)
be376c6  feat: add Grafana inference metrics dashboard (#269)
5c059a4  feat: separate image registry from repository in Helm chart (#268)
a37701c  feat: support custom layer splits from GPUShardingSpec (#267)
cc9a95e  feat!: update default CUDA image to server-cuda13 for Qwen3.5 and Blackwell support (#262)
c7c97b2  chore(deps): bump go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc from 1.42.0 to 1.43.0 (#266)
2d16502  feat: add HPA autoscaling for InferenceService (#260)
6148b89  feat: add Ollama as runtime backend for Metal agent (#258)
eaf9045  feat: add oMLX as alternative runtime backend for Metal agent (#257)
7a4b855  feat: add KV cache type configuration and extraArgs escape hatch (#256)
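Commits #260 and #273 above add HPA autoscaling for InferenceService with per-runtime metrics. A sketch of how such scaling is typically wired in Kubernetes, using the standard autoscaling/v2 HorizontalPodAutoscaler against a custom per-pod metric; the Deployment name and metric name are assumptions, not the operator's actual names:

```yaml
# Standard autoscaling/v2 HPA scaling an operator-managed Deployment on a
# custom Pods metric. Names are illustrative; the operator may generate an
# equivalent object itself from the InferenceService spec.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: chat-model-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: chat-model            # assumed operator-managed Deployment
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Pods
      pods:
        metric:
          name: inference_queue_depth   # hypothetical per-runtime metric
        target:
          type: AverageValue
          averageValue: "8"
```

Serving such a custom metric requires a metrics adapter (e.g. prometheus-adapter) exposing it through the custom.metrics.k8s.io API; the Grafana dashboard from #269 suggests the runtimes already export Prometheus metrics that could back it.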