04 Monitoring & Observability
Goal
Enable monitoring for the AKS cluster and application using Azure Monitor Container Insights, query logs, and set up alerts.
Estimated time
15 minutes.
Official references
- Container Insights overview
- Enable Container Insights on AKS
- Log queries for Container Insights
- Recommended metric alerts for AKS
Key concepts
| Concept | Purpose |
|---|---|
| Container Insights | Collects performance and log data from AKS clusters. |
| Log Analytics workspace | Stores and queries container logs with KQL. |
| Prometheus metrics | AKS can emit metrics in Prometheus format for Grafana/Azure Monitor. |
| Alerts | Notify on pod restarts, high CPU, node pressure, etc. |
Exercise
Step 1 — Enable Container Insights
az aks enable-addons \
--resource-group $RESOURCE_GROUP \
--name $AKS_CLUSTER_NAME \
--addons monitoring
This creates a Log Analytics workspace (or attaches to an existing one) and deploys the monitoring agent to your cluster.
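If you would rather attach to a workspace you already manage, the same command accepts a workspace resource ID. The following is a minimal sketch, not a required step: the function name and the WORKSPACE_ID variable are hypothetical, and it assumes RESOURCE_GROUP and AKS_CLUSTER_NAME are set as in Step 1.

```shell
# Sketch: enable Container Insights against an existing Log Analytics workspace.
# $1 is the full resource ID of the workspace (hypothetical input for this lab).
enable_insights_with_workspace() {
  if [ -z "$1" ]; then
    echo "usage: enable_insights_with_workspace <workspace-resource-id>" >&2
    return 1
  fi
  az aks enable-addons \
    --resource-group "$RESOURCE_GROUP" \
    --name "$AKS_CLUSTER_NAME" \
    --addons monitoring \
    --workspace-resource-id "$1"
}

# Example call (commented out so the snippet can be sourced safely):
# enable_insights_with_workspace "$WORKSPACE_ID"
```

Passing the workspace explicitly keeps logs from several clusters in one place, which can simplify the KQL queries later in this lab.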
Step 2 — Verify the monitoring agent
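kubectl get pods -n kube-system | grep ama-
You should see ama-logs pods running. ama-metrics pods also appear if managed Prometheus metrics collection is enabled on the cluster.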
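Listing pods shows a point-in-time snapshot. If you want to block until the agent has finished rolling out, something like the sketch below may help. It assumes the agent runs as a DaemonSet named ama-logs and a Deployment named ama-logs-rs in kube-system; confirm those names on your cluster before relying on them. The function name is hypothetical.

```shell
# Sketch: wait for the Container Insights agent workloads to become ready.
# Assumes workload names ama-logs (DaemonSet) and ama-logs-rs (Deployment).
wait_for_monitoring_agent() {
  kubectl rollout status daemonset/ama-logs -n kube-system --timeout=120s &&
    kubectl rollout status deployment/ama-logs-rs -n kube-system --timeout=120s
}

# Example call (commented out so the snippet can be sourced safely):
# wait_for_monitoring_agent
```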
Step 3 — View live metrics in the Azure Portal
- Open the Azure Portal.
- Navigate to your AKS cluster.
- Click Monitoring → Insights.
- Explore the Cluster, Nodes, Controllers, and Containers tabs.
Step 4 — Query container logs with KQL
In the Azure Portal:
- Navigate to your AKS cluster → Monitoring → Logs.
- Run this query to see recent backend logs:
ContainerLogV2
| where ContainerName == "backend"
| where LogMessage contains "triage"
| project TimeGenerated, LogMessage
| order by TimeGenerated desc
| take 20
Step 5 — Query pod restart events
KubeEvents
| where Reason == "BackOff" or Reason == "Unhealthy" or Reason == "Failed"
| where Namespace == "triage"
| project TimeGenerated, Name, Reason, Message
| order by TimeGenerated desc
| take 10
Step 6 — Check resource utilisation
InsightsMetrics
| where Namespace == "container.azm.ms/cpuUsage"
| where Name == "cpuUsageNanoCores"
| extend Pod = tostring(parse_json(Tags).podName)
| where Pod contains "triage-backend"
| summarize AvgCPU = avg(Val) by bin(TimeGenerated, 5m)
| render timechart
Note
CPU metrics may take 10–15 minutes to appear after enabling the monitoring
addon. If the query returns no results, verify data is flowing with:
InsightsMetrics | where TimeGenerated > ago(15m) | take 5
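The same data-flow check can be run from a terminal instead of the Portal. This is a sketch under a few assumptions: the Azure CLI log-analytics query command is available in your environment, and you pass the workspace GUID (its customer ID), not the full resource ID. The function name is hypothetical.

```shell
# Sketch: run the data-flow check from the CLI.
# $1 is the Log Analytics workspace GUID (customer ID).
check_insights_data() {
  if [ -z "$1" ]; then
    echo "usage: check_insights_data <workspace-guid>" >&2
    return 1
  fi
  az monitor log-analytics query \
    --workspace "$1" \
    --analytics-query 'InsightsMetrics | where TimeGenerated > ago(15m) | take 5' \
    --output table
}

# Example call (commented out so the snippet can be sourced safely):
# check_insights_data "$WORKSPACE_GUID"
```

An empty table here suggests the agent has not shipped metrics yet, so waiting a few more minutes before retrying Step 6 is usually the next move.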
Step 7 — Create a basic alert (Portal)
- Navigate to your Log Analytics workspace → Logs.
- Run this query:
KubeEvents
| where Namespace == "triage"
| where Reason == "BackOff" or Reason == "Unhealthy" or Reason == "Failed"
- Click New alert rule (top menu).
- Set condition: results greater than 0, evaluation period 5 minutes.
- Choose an action group (or create one with your email).
- Name the rule `triage-pod-failures` and create it.
Tip
Log-based alerts query the Log Analytics workspace directly and don't require Prometheus metric collection to be fully warmed up.
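For repeatable setups, the Portal steps in Step 7 can be approximated with the Azure CLI scheduled-query command. Treat this as a sketch rather than the lab's required path: it assumes the scheduled-query command group is installed, that RESOURCE_GROUP is set, and that you supply the workspace and action group resource IDs yourself. The function name and both arguments are hypothetical.

```shell
# Sketch: create the triage-pod-failures log alert from the CLI.
# $1 = Log Analytics workspace resource ID (the alert scope)
# $2 = action group resource ID (who gets notified)
create_pod_failure_alert() {
  if [ -z "$1" ] || [ -z "$2" ]; then
    echo "usage: create_pod_failure_alert <workspace-id> <action-group-id>" >&2
    return 1
  fi
  az monitor scheduled-query create \
    --resource-group "$RESOURCE_GROUP" \
    --name triage-pod-failures \
    --scopes "$1" \
    --condition "count 'Placeholder_1' > 0" \
    --condition-query Placeholder_1='KubeEvents | where Namespace == "triage" | where Reason == "BackOff" or Reason == "Unhealthy" or Reason == "Failed"' \
    --evaluation-frequency 5m \
    --window-size 5m \
    --action-groups "$2"
}

# Example call (commented out so the snippet can be sourced safely):
# create_pod_failure_alert "$WORKSPACE_ID" "$ACTION_GROUP_ID"
```

The condition mirrors the Portal settings: fire when the query returns more than zero rows within a 5-minute window.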
Step 8 — View metrics from the CLI
kubectl top pods -n triage
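A few kubectl top variations are handy for quick checks. The sketch below assumes the cluster's metrics API is serving data (AKS normally ships a metrics server) and that the application runs in the triage namespace used elsewhere in this lab. The function name is hypothetical.

```shell
# Sketch: quick resource checks from the CLI.
show_triage_usage() {
  # Node-level CPU and memory usage.
  kubectl top nodes
  # Pods in the triage namespace, highest CPU consumers first.
  kubectl top pods -n triage --sort-by=cpu
  # Per-container breakdown for the same pods.
  kubectl top pods -n triage --containers
}

# Example call (commented out so the snippet can be sourced safely):
# show_triage_usage
```

These figures are instantaneous readings, so use the Container Insights charts when you need history.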
What this lab demonstrates
- Enabling Container Insights with a single CLI command.
- Browsing live cluster metrics in the Azure Portal.
- Querying structured container logs with KQL.
- Setting up alerts for pod failures.
- Using `kubectl top` for quick resource checks.
Expected result
Container Insights is active on the cluster. You can see live metrics in the Portal and query logs with KQL, and the triage-pod-failures alert rule fires when pods in the triage namespace report BackOff, Unhealthy, or Failed events.
Verification
- [ ] `kubectl get pods -n kube-system | grep ama-` shows monitoring agent pods.
- [ ] Azure Portal → AKS cluster → Insights shows live cluster data.
- [ ] A KQL query in the Logs blade returns container logs.
- [ ] An alert rule named `triage-pod-failures` exists.
- [ ] `kubectl top pods -n triage` shows CPU and memory usage.