
Dynamic Model Autoscaling
by rh-aiservices-bu
Metrics-based GPU autoscaling for LLM inference services on OpenShift AI using KEDA and vLLM.
What it does
Dynamic Model Autoscaling provides an interactive framework for managing GPU workloads on OpenShift AI. It leverages KEDA (Kubernetes Event-driven Autoscaling) to scale vLLM inference services based on real-time request queue depth, ensuring optimal resource utilization and performance.
Key features
- vLLM Metric Integration: Scales based on the `num_requests_waiting` and `num_requests_running` metrics scraped via Prometheus.
- Automated KEDA Provisioning: Automatically creates ScaledObjects and TriggerAuthentications when the autoscaler class is set to KEDA.
- Scale-to-Zero: Supports extreme cost optimization by scaling models down to zero replicas using the KEDA HTTP Add-on.
- Cold Start Management: Includes a custom interceptor to send SSE keepalive events during long LLM cold starts.
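To make the metric-driven scaling concrete, a KEDA ScaledObject wired to a vLLM queue-depth metric can be sketched as below. This is an illustrative assumption, not the charts' actual output: the resource names, namespace, threshold, Prometheus address, and metric labels are all hypothetical and should be taken from your own deployment.

```yaml
# Sketch only: names, threshold, serverAddress, and metric labels are
# assumptions for illustration, not values emitted by the Helm charts.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llama3-2-3b-autoscaler        # hypothetical name
  namespace: autoscaling-keda
spec:
  scaleTargetRef:
    name: llama3-2-3b-predictor       # hypothetical Deployment behind the model
  minReplicaCount: 1
  maxReplicaCount: 4
  triggers:
    - type: prometheus
      metadata:
        # Assumed in-cluster monitoring endpoint; substitute your own.
        serverAddress: https://thanos-querier.openshift-monitoring.svc:9091
        # Scale up when waiting requests exceed the threshold.
        query: vllm:num_requests_waiting{model_name="llama3-2-3b"}
        threshold: "10"
      authenticationRef:
        name: keda-trigger-auth-prometheus   # hypothetical TriggerAuthentication
```

With a trigger like this, KEDA adds replicas whenever the queried queue depth stays above the threshold and removes them as the queue drains.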
Installation
This application is deployed as a set of Helm charts on an OpenShift AI cluster. To deploy a model with autoscaling:
helm install llama3-2-3b helm/llama3.2-3b/ --set keda.enabled=true -n autoscaling-keda
For scale-to-zero, install the KEDA HTTP Add-on:
helm install http-add-on kedacore/keda-add-ons-http -n openshift-keda
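With the HTTP Add-on installed, scale-to-zero is typically expressed through an HTTPScaledObject that routes traffic through the add-on's interceptor. The following is a rough sketch under assumed names; the host, service, port, and timing values are illustrative, not taken from these charts.

```yaml
# Sketch only: host, service, port, and scaledownPeriod are assumptions.
apiVersion: http.keda.sh/v1alpha1
kind: HTTPScaledObject
metadata:
  name: llama3-2-3b-http
  namespace: autoscaling-keda
spec:
  hosts:
    - llama3-2-3b.example.com          # assumed external host
  scaleTargetRef:
    name: llama3-2-3b-predictor        # hypothetical Deployment
    service: llama3-2-3b-predictor     # service fronting the model
    port: 8080
  replicas:
    min: 0      # allow scaling down to zero replicas when idle
    max: 4
  scaledownPeriod: 300                 # assumed idle seconds before scaling to zero
```

Incoming requests are held by the interceptor while the model cold-starts from zero, which is where the SSE keepalive behavior described above matters.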
Quick install
helm install keda-operator helm/keda-operator/ -n openshift-keda