03 Scaling & Configuration
Goal
Auto-scale the backend based on CPU usage and manage application configuration securely with ConfigMaps and Secrets.
Estimated time
15 minutes.
Key concepts
| Concept | Purpose |
|---|---|
| ConfigMap | Store non-sensitive key-value pairs (model name, deployment settings). |
| Secret | Store sensitive data (API keys, connection strings). Values are base64-encoded, not encrypted. |
| HPA | Horizontal Pod Autoscaler — adjusts pod count based on metrics. |
| Metrics Server | Provides CPU/memory metrics for HPA decisions. |
Exercise
Step 1 — Review the ConfigMap
The ConfigMap stores the model deployment name:
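The live object can be inspected directly from the cluster. A minimal sketch, assuming the ConfigMap is named `triage-config` in the `triage` namespace (the names used in the verification checklist of this lab):

```shell
# Namespace and ConfigMap name as used in this lab's verification checklist.
NAMESPACE=triage
CONFIGMAP=triage-config

# Print the full object, including the key-value pairs under `data`.
kubectl get configmap "$CONFIGMAP" -n "$NAMESPACE" -o yaml
```

The `-o yaml` flag matters here: the default table output lists only the name, the number of data entries, and the age.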
You can update the deployment name without rebuilding the container:
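One way to do this is to patch the ConfigMap and then restart the pods that read it. This is a sketch under assumptions: the key `MODEL_DEPLOYMENT_NAME`, the value `my-model-deployment`, and the Deployment name `triage-backend` are hypothetical placeholders, so substitute the names from your own manifests.

```shell
NAMESPACE=triage
CONFIGMAP=triage-config
CONFIG_KEY=MODEL_DEPLOYMENT_NAME      # hypothetical key name
NEW_VALUE=my-model-deployment         # hypothetical value
BACKEND_DEPLOYMENT=triage-backend     # hypothetical Deployment name

# Build a JSON merge patch that changes a single key under `data`.
PATCH=$(printf '{"data":{"%s":"%s"}}' "$CONFIG_KEY" "$NEW_VALUE")
echo "$PATCH"

# Apply the patch to the ConfigMap.
kubectl patch configmap "$CONFIGMAP" -n "$NAMESPACE" --type merge -p "$PATCH"

# Restart the pods so they pick up the new value, then wait for the rollout.
kubectl rollout restart deployment "$BACKEND_DEPLOYMENT" -n "$NAMESPACE"
kubectl rollout status deployment "$BACKEND_DEPLOYMENT" -n "$NAMESPACE"
```

The restart is needed when the value is injected as an environment variable, because environment variables are read only when a container starts.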
Step 2 — Review the Secret
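Secret values are stored base64-encoded, so anyone who can read the Secret can recover the plaintext. The round trip below illustrates this with a made-up value (`not-a-real-key` is a placeholder, not a value from the workshop):

```shell
# Encode a placeholder value the way it would appear under `data` in a Secret.
# printf is used instead of echo so no trailing newline ends up in the value.
PLAINTEXT='not-a-real-key'
ENCODED=$(printf '%s' "$PLAINTEXT" | base64)
echo "$ENCODED"

# Decoding needs no key at all, which is why base64 is not encryption.
DECODED=$(printf '%s' "$ENCODED" | base64 --decode)
echo "$DECODED"
```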
Production note
In production, use the Azure Key Vault provider for the Secrets Store CSI Driver instead of storing sensitive values directly in Kubernetes Secrets. The workshop uses Secrets for simplicity.
Step 3 — Apply the Horizontal Pod Autoscaler
Review the HPA:
The HPA targets 10% average CPU utilization and scales between 2 and 10 replicas.
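The Kubernetes documentation gives the core HPA calculation as desiredReplicas = ceil(currentReplicas × currentMetricValue / targetMetricValue), with the result clamped to the configured minimum and maximum. The helper below is an illustrative sketch of that arithmetic using this lab's values (10% target, 2 to 10 replicas). The function name `desired_replicas` is made up for this example, and the real controller also applies a tolerance band and stabilization windows that are not modelled here.

```shell
# Sketch of the HPA replica calculation for this lab's settings.
# $1 = current replica count, $2 = current average CPU utilization in percent.
desired_replicas() {
  current_replicas=$1
  current_cpu=$2
  target_cpu=10    # target average CPU utilization (percent)
  min_replicas=2
  max_replicas=10

  # Integer ceiling of current_replicas * current_cpu / target_cpu.
  desired=$(( (current_replicas * current_cpu + target_cpu - 1) / target_cpu ))

  # Clamp to the configured bounds.
  if [ "$desired" -lt "$min_replicas" ]; then desired=$min_replicas; fi
  if [ "$desired" -gt "$max_replicas" ]; then desired=$max_replicas; fi
  echo "$desired"
}

desired_replicas 2 35   # 2 * 35 / 10 = 7 replicas
desired_replicas 2 80   # 16, clamped to the maximum of 10
desired_replicas 4 3    # ceil(1.2) = 2, which is also the minimum
```

With a target as low as 10%, even modest load pushes the ratio well above 1, which is why the replica count climbs quickly in this exercise.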
Scaling and in-memory state
The backend stores triage results in an in-memory dictionary. When multiple replicas are running, each pod has its own copy — a patient triaged on pod A won't appear in responses served by pod B. This is expected and illustrates why production workloads need external state (e.g. Redis, a database, or a Dapr state store). For this lesson, focus on observing the scaling behaviour rather than the UI patient list.
Step 4 — Verify the Metrics Server
AKS includes a Metrics Server by default:
If you see CPU and memory values, the Metrics Server is working.
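The check can be run with `kubectl top`. This sketch assumes the lab's `triage` namespace and that the Metrics Server runs as a Deployment named `metrics-server` in `kube-system`, which is the usual layout but worth confirming on your cluster:

```shell
NAMESPACE=triage

# Per-pod CPU (in millicores) and memory usage, as reported by the Metrics Server.
kubectl top pods -n "$NAMESPACE"

# Optional: confirm that the Metrics Server Deployment itself is present.
kubectl get deployment metrics-server -n kube-system
```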
Step 5 — Generate load to trigger scaling
Open a second terminal and send sustained requests:
```shell
export INGRESS_IP=$(kubectl get ingress triage-ingress -n triage -o jsonpath='{.status.loadBalancer.ingress[0].ip}')

# Sustained load generator: runs until you press Ctrl+C
while true; do
  for i in $(seq 1 20); do
    curl -s -X POST "http://$INGRESS_IP/api/triage" \
      -H "Content-Type: application/json" \
      -d '{"name":"Load Test","age":40,"symptoms":"headache, fatigue, mild fever"}' > /dev/null &
  done
  sleep 0.5
done
```
Step 6 — Watch the HPA respond
Over a few minutes you should see REPLICAS increase as CPU usage rises.
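To follow the change as it happens, the HPA can be watched from the first terminal. A minimal sketch, assuming the HPA lives in the `triage` namespace:

```shell
NAMESPACE=triage

# Stream HPA status updates. TARGETS shows current/target CPU utilization
# and REPLICAS shows the current pod count. Press Ctrl+C to stop watching.
kubectl get hpa -n "$NAMESPACE" --watch
```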
Step 7 — Observe scale-down
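Stop the load generator with Ctrl+C. Scale-in is deliberately slower than scale-out: unless the HPA manifest overrides it, the controller waits through a five-minute stabilization window before removing replicas, so expect the count to return to the minimum of 2 only after several minutes.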
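The scaling decisions are recorded as events on the HPA object, so they can be reviewed afterwards. A sketch, again assuming the `triage` namespace:

```shell
NAMESPACE=triage

# Show HPA details; the Events section lists each rescale and the reason for it.
kubectl describe hpa -n "$NAMESPACE"

# Confirm how many backend pods remain once scale-in has finished.
kubectl get pods -n "$NAMESPACE"
```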
What this lab demonstrates
- Externalising configuration with ConfigMaps.
- Managing secrets in Kubernetes (and why Key Vault is better for production).
- Configuring a Horizontal Pod Autoscaler with CPU targets.
- Observing automatic scale-out and scale-in under load.
Expected result
Under load, the backend scales out from 2 replicas toward the maximum of 10. After the load stops, it scales back in to 2. Configuration values are injected from the ConfigMap and Secret rather than being baked into the image.
Verification
- [ ] `kubectl get configmap triage-config -n triage -o yaml` shows the deployment name.
- [ ] `kubectl get hpa -n triage` shows the HPA with targets and current metrics.
- [ ] Under load, `REPLICAS` increases above 2.
- [ ] After load stops, `REPLICAS` returns to 2.