Set Up Autoscaling for Inference Services with KEDA
Introduction
Deploying machine learning models in a production environment presents unique challenges, and one of the most critical is ensuring that your inference service can handle varying levels of traffic with efficiency and reliability. The unpredictable nature of AI workloads—where traffic can spike dramatically and resource needs fluctuate based on factors like input sequence lengths, token generation lengths, or the number of concurrent requests—often means that traditional autoscaling methods fall short.
Relying solely on CPU or memory metrics can lead to either overprovisioning and wasted resources, or underprovisioning and a poor user experience. GPU utilization is similarly ambiguous on its own: a high value can mean the GPU is being used efficiently or that it is saturated and falling behind. Industry best practice for LLM autoscaling has therefore shifted towards workload-specific metrics.
This guide walks through setting up KServe autoscaling by leveraging KEDA (Kubernetes Event-driven Autoscaling) and custom, application-specific metrics exported by vLLM. This combination allows inference services to scale based on actual workload signals rather than generic infrastructure metrics.
KEDA extends the standard Kubernetes Horizontal Pod Autoscaler (HPA), allowing applications to scale from zero to N instances and back down based on a wide variety of event sources—including Prometheus metrics. It introduces an open and extensible framework so that KServe can scale on virtually any signal relevant to your AI model's performance.
Prerequisites
- Alauda AI Platform with KServe installed.
- KEDA (Custom Metrics Autoscaler) installed on the cluster.
- An `InferenceService` using `RawDeployment` mode with a vLLM serving runtime.
- Prometheus installed and accessible in the cluster.
Grant KServe Access to KEDA Resources
Before proceeding, apply the following RBAC resources to allow kserve-controller-manager to manage KEDA objects (ScaledObject, TriggerAuthentication, etc.):
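A sketch of the RBAC resources, assuming the controller runs as the `kserve-controller-manager` ServiceAccount in the `kserve` namespace (the `ClusterRole` name here is illustrative; adjust to your installation's conventions):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kserve-keda-manager
rules:
# Allow the KServe controller to manage KEDA objects cluster-wide.
- apiGroups: ["keda.sh"]
  resources: ["scaledobjects", "scaledobjects/finalizers", "triggerauthentications"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kserve-keda-manager
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kserve-keda-manager
subjects:
- kind: ServiceAccount
  name: kserve-controller-manager
  namespace: kserve
```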
Installation Order
If KEDA was installed after Alauda AI, restart the kserve-controller-manager pod (in the kserve namespace) after applying the RBAC above so that the controller can discover the KEDA CRDs:
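Assuming the controller Deployment is named `kserve-controller-manager`, a rolling restart is enough:

```shell
kubectl -n kserve rollout restart deployment kserve-controller-manager
```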
Steps
Stop the Running InferenceService
Before making changes, stop the running InferenceService to avoid conflicts between the existing HPA and the new KEDA-managed scaler. Add the following annotation to stop it:
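One way to add it, assuming your KServe release supports the `serving.kserve.io/stop` annotation (present in recent versions; verify against yours):

```shell
kubectl annotate inferenceservice <your-model-name> -n <your-namespace> \
  serving.kserve.io/stop="true" --overwrite
```

Remove the annotation (or set it to `"false"`) later to bring the service back up under KEDA.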
If a running InferenceService already has an HPA resource, switching to KEDA without stopping it first will cause a resource conflict.
Create the Prometheus TriggerAuthentication
KEDA requires a TriggerAuthentication resource in the same namespace as your InferenceService to authenticate with Prometheus.
The Prometheus credentials are stored in the platform secret kube-prometheus-alertmanager-basic-auth in the cpaas-system namespace. Run the following command to copy them into your namespace:
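A possible one-liner using `jq`, which renames the copy to `prom-basic-auth-secret` so it matches the naming used in the rest of this guide:

```shell
kubectl get secret kube-prometheus-alertmanager-basic-auth -n cpaas-system -o json \
  | jq '.metadata = {"name": "prom-basic-auth-secret", "namespace": "<your-namespace>"}' \
  | kubectl apply -f -
```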
Then create the TriggerAuthentication that references it:
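A minimal sketch; the `username`/`password` key names are assumed to match the keys inside the copied secret (check with `kubectl get secret prom-basic-auth-secret -o yaml`):

```yaml
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: prom-basic-auth
  namespace: <your-namespace>
spec:
  secretTargetRef:
  - parameter: username        # consumed by the KEDA Prometheus scaler
    name: prom-basic-auth-secret
    key: username
  - parameter: password
    name: prom-basic-auth-secret
    key: password
```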
The following names must be consistent across all resources:
- `prom-basic-auth-secret` — the `Secret` name; must match `secretTargetRef.name` inside the `TriggerAuthentication`.
- `prom-basic-auth` — the `TriggerAuthentication` name; must match `authenticationRef.name` in the `InferenceService` spec.
Configure the InferenceService for KEDA
After the service is stopped, update the InferenceService manifest with the KEDA autoscaling configuration:
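A sketch of the manifest, assuming the KEDA integration available in recent KServe releases (the `autoScaling` field names and the `serving.kserve.io/autoscalerClass` annotation follow that integration; verify them against your KServe version):

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: <your-model-name>
  namespace: <your-namespace>
  annotations:
    serving.kserve.io/deploymentMode: RawDeployment
    serving.kserve.io/autoscalerClass: keda      # 1. delegate scaling to KEDA
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 5
    autoScaling:
      metrics:
      - type: External
        external:
          authenticationRef:
            name: prom-basic-auth                # 2. TriggerAuthentication name
          metric:
            backend: prometheus
            # 4. internal Prometheus address
            serverAddress: http://prometheus-operated.cpaas-system.svc.cluster.local:9090
            # 3. PromQL query returning a single number (label names may vary)
            query: sum(vllm:num_requests_waiting{model_name="<your-model-name>", namespace="<your-namespace>"})
          target:
            type: Value
            value: "5"                           # 5. per-replica target
    # ...keep your existing model/runtime configuration here unchanged
```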
- Disables the built-in KServe HPA and delegates scaling to KEDA.
- References the `TriggerAuthentication` resource that holds the credentials for authenticating with Prometheus. Replace `prom-basic-auth` with the name of your actual `TriggerAuthentication`.
- A PromQL query that returns the current load as a single numeric value. Replace `<your-model-name>` and `<your-namespace>` with your actual values.
- The internal address of your Prometheus instance, e.g., `http://prometheus-operated.cpaas-system.svc.cluster.local:9090`.
- The per-replica target value. KEDA computes `ceil(metricValue / value)` to determine the desired number of replicas.
vLLM Metrics for Autoscaling
vLLM metrics are automatically collected by the platform. Choosing the right metric is arguably the most crucial part of the setup—the Prometheus query must return a single numeric value that accurately reflects the current load on your model.
The following vLLM metrics are commonly used for autoscaling:
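The exact set depends on your vLLM version, but these metric names are standard in recent vLLM releases:

| Metric | Meaning |
|---|---|
| `vllm:num_requests_waiting` | Requests queued and not yet scheduled onto the engine |
| `vllm:num_requests_running` | Requests currently being processed |
| `vllm:gpu_cache_usage_perc` | Fraction of the GPU KV cache currently in use |
| `vllm:e2e_request_latency_seconds` | End-to-end request latency (histogram) |

Queue depth (`vllm:num_requests_waiting`) is often the most direct signal that capacity is exhausted, since requests only queue once running slots are full.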
Use the sum() aggregation function to ensure the query returns a single value across all pods of your deployment. For example, to scale based on the number of waiting requests:
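One possible query, assuming the standard `vllm:num_requests_waiting` metric; the `model_name` and `namespace` label names may differ depending on how your Prometheus scrapes the pods:

```
sum(vllm:num_requests_waiting{model_name="<your-model-name>", namespace="<your-namespace>"})
```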
This sums up all pending requests across all predictor pods, giving KEDA a single aggregate signal to act on.
Verify the Setup
After applying the updated InferenceService, KServe will automatically create a KEDA ScaledObject on your behalf. Verify that everything is working:
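For example, assuming the standard KServe pod label `serving.kserve.io/inferenceservice`:

```shell
# The ScaledObject created by KServe, and the HPA that KEDA manages from it
kubectl get scaledobject -n <your-namespace>
kubectl get hpa -n <your-namespace>

# Predictor pods for the service
kubectl get pods -n <your-namespace> \
  -l serving.kserve.io/inferenceservice=<your-model-name>
```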
The HPA output will show the current metric value, the scaling threshold, and the current/desired replica counts. As inference traffic increases, the TARGETS value will rise and replicas will scale up automatically.