04 Monitoring & Observability

Goal

Enable monitoring for the AKS cluster and application using Azure Monitor Container Insights, query logs, and set up alerts.

Estimated time

15 minutes.

Key concepts

Concept	Purpose
Container Insights	Collects performance and log data from AKS clusters.
Log Analytics workspace	Stores and queries container logs with KQL.
Prometheus metrics	AKS can emit metrics in Prometheus format for Grafana/Azure Monitor.
Alerts	Notify on pod restarts, high CPU, node pressure, etc.

Exercise

Step 1 — Enable Container Insights

az aks enable-addons \
  --resource-group $RESOURCE_GROUP \
  --name $AKS_CLUSTER_NAME \
  --addons monitoring

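This deploys the monitoring agent to your cluster and connects it to a Log Analytics workspace. If you don't pass --workspace-resource-id, the CLI uses a default workspace for the subscription and region, creating it if it doesn't exist yet.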
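
To confirm which workspace the addon ended up using, the cluster's addon profile can be read back. This is a minimal sketch that reuses the $RESOURCE_GROUP and $AKS_CLUSTER_NAME variables from the command above. The function name get_workspace_id is hypothetical, and the sketch assumes the addon profile is reported under the omsagent key, whose casing may differ between CLI versions.

```shell
# Print the resource ID of the Log Analytics workspace used by Container Insights.
# Assumption: the monitoring addon profile is exposed under the "omsagent" key.
get_workspace_id() {
  az aks show \
    --resource-group "$RESOURCE_GROUP" \
    --name "$AKS_CLUSTER_NAME" \
    --query "addonProfiles.omsagent.config.logAnalyticsWorkspaceResourceID" \
    --output tsv
}

# Example (run against a real cluster):
#   WORKSPACE_RESOURCE_ID=$(get_workspace_id)
```

An empty result suggests the addon is not enabled yet or that the key is spelled differently in your CLI version.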

Step 2 — Verify the monitoring agent

kubectl get pods -n kube-system | grep ama-

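You should see ama-logs pods (one per Linux node) and an ama-logs-rs pod in the Running state. ama-metrics pods appear only if Azure Monitor managed Prometheus is also enabled on the cluster.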
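
Rather than re-running the grep until the pods are ready, the rollout can be waited on directly. The sketch below assumes the agent runs as a DaemonSet named ama-logs in kube-system. The function name and the 120s default timeout are hypothetical choices.

```shell
# Block until every pod of the ama-logs DaemonSet is ready, or fail after a timeout.
# Assumption: the Container Insights agent DaemonSet is named "ama-logs".
wait_for_monitoring_agent() {
  local timeout="${1:-120s}"
  kubectl rollout status daemonset/ama-logs \
    --namespace kube-system \
    --timeout="$timeout"
}

# Example: wait_for_monitoring_agent 300s
```

A non-zero exit code means the agent did not become ready within the timeout, which is a good moment to inspect the pods with kubectl describe.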

Step 3 — View live metrics in the Azure Portal

Open the Azure Portal.
Navigate to your AKS cluster.
Click Monitoring → Insights.
Explore the Cluster, Nodes, Controllers, and Containers tabs.

Step 4 — Query container logs with KQL

In the Azure Portal:

Navigate to your AKS cluster → Monitoring → Logs.
Run this query to see recent backend logs:

ContainerLogV2
| where ContainerName == "backend"
| where LogMessage contains "triage"
| project TimeGenerated, LogMessage
| order by TimeGenerated desc
| take 20
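
The same query can also be run from a terminal, which is handy for scripting. This is a sketch built on az monitor log-analytics query, which on older CLI versions may need an extension installed. The function name run_kql and the one-hour timespan are hypothetical. Note that --workspace expects the workspace ID (a GUID, also called the customer ID), not the full resource ID.

```shell
# Run a KQL query against a Log Analytics workspace from the terminal.
# $1: workspace ID (GUID), $2: KQL query text.
# The query is limited to the last hour (ISO 8601 duration PT1H).
run_kql() {
  az monitor log-analytics query \
    --workspace "$1" \
    --analytics-query "$2" \
    --timespan PT1H \
    --output table
}

# Example, assuming WORKSPACE_GUID holds your workspace ID:
#   run_kql "$WORKSPACE_GUID" 'ContainerLogV2 | where ContainerName == "backend" | take 20'
```

Single quotes around the query keep the shell from interpreting the double quotes and pipes inside the KQL text.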

Step 5 — Query pod restart events

KubeEvents
| where Reason == "BackOff" or Reason == "Unhealthy" or Reason == "Failed"
| where Namespace == "triage"
| project TimeGenerated, Name, Reason, Message
| order by TimeGenerated desc
| take 10
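
Events take a little while to reach Log Analytics, so it can help to cross-check against the cluster itself. The sketch below lists Warning events straight from the API server. The function name is hypothetical, and keep in mind that Kubernetes retains events only briefly (one hour by default upstream), whereas KubeEvents keeps the history.

```shell
# List Warning events in a namespace, sorted by last occurrence (oldest first).
# $1: namespace (defaults to "triage", the namespace used in this lab).
recent_warnings() {
  local namespace="${1:-triage}"
  kubectl get events \
    --namespace "$namespace" \
    --field-selector type=Warning \
    --sort-by=.lastTimestamp
}

# Example: recent_warnings triage
```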

Step 6 — Check resource utilisation

InsightsMetrics
| where Namespace == "container.azm.ms/cpuUsage"
| where Name == "cpuUsageNanoCores"
| extend Pod = tostring(parse_json(Tags).podName)
| where Pod contains "triage-backend"
| summarize AvgCPU = avg(Val) by bin(TimeGenerated, 5m)
| render timechart

Note

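CPU metrics may take 10–15 minutes to appear after enabling the monitoring addon. If the query returns no results, verify that data is flowing with: InsightsMetrics | where TimeGenerated > ago(15m) | take 5. Container Insights also records cpuUsageNanoCores in the Perf table (ObjectName "K8SContainer"), which is worth checking if InsightsMetrics stays empty.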
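
The AvgCPU values in the chart are in nanocores, which are hard to compare with the millicore units used in Kubernetes resource requests and limits. One millicore is 1,000,000 nanocores. The helper below is a small illustrative conversion (its name is hypothetical) that uses awk so fractional averages are handled too.

```shell
# Convert a cpuUsageNanoCores value to millicores.
# 1 millicore = 1,000,000 nanocores; the result is rounded to one decimal place.
nanocores_to_millicores() {
  awk -v n="$1" 'BEGIN { printf "%.1f\n", n / 1000000 }'
}

nanocores_to_millicores 250000000   # prints 250.0
```

For example, an average of 250000000 nanocores corresponds to 250m, a quarter of one CPU core.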

Step 7 — Create a basic alert (Portal)

Navigate to your Log Analytics workspace → Logs.
Run this query:

KubeEvents
| where Namespace == "triage"
| where Reason == "BackOff" or Reason == "Unhealthy" or Reason == "Failed"

Click New alert rule (top menu).
Set condition: results greater than 0, evaluation period 5 minutes.
Choose an action group (or create one with your email).
Name the rule triage-pod-failures and create it.
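
If you prefer to prepare the action group ahead of time, it can be created from the CLI and then selected in the Portal. This is a sketch under a few assumptions: it reuses $RESOURCE_GROUP from earlier, and the function name, the short name triage, and the receiver name oncall are all hypothetical.

```shell
# Create an action group that emails one recipient when an alert fires.
# $1: action group name, $2: email address.
# "triage" is a placeholder short name (short names are limited to 12 characters);
# "oncall" is a placeholder name for the email receiver.
create_email_action_group() {
  az monitor action-group create \
    --resource-group "$RESOURCE_GROUP" \
    --name "$1" \
    --short-name triage \
    --action email oncall "$2"
}

# Example: create_email_action_group triage-alerts you@example.com
```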

Tip

Log-based alerts query the Log Analytics workspace directly and don't require Prometheus metric collection to be fully warmed up.

Step 8 — View metrics from the CLI

kubectl top nodes
kubectl top pods -n triage
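
kubectl top reads from the cluster's metrics-server, which AKS deploys by default, so it works independently of Container Insights. To find the busiest container quickly, the output can be broken down per container and sorted. The function name below is hypothetical.

```shell
# Show per-container CPU and memory usage in a namespace, highest CPU first.
# $1: namespace (defaults to "triage", the namespace used in this lab).
top_pods_by_cpu() {
  local namespace="${1:-triage}"
  kubectl top pods --namespace "$namespace" --containers --sort-by=cpu
}

# Example: top_pods_by_cpu triage
```

Passing --sort-by=memory instead ranks the same output by memory usage.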

What this lab demonstrates

Enabling Container Insights with a single CLI command.
Browsing live cluster metrics in the Azure Portal.
Querying structured container logs with KQL.
Setting up alerts for pod failures.
Using kubectl top for quick resource checks.

Expected result

Container Insights is active on the cluster. You can see live metrics in the Portal and query logs with KQL, and the triage-pod-failures alert rule fires when pods in the triage namespace report BackOff, Unhealthy, or Failed events.

Verification

[ ] kubectl get pods -n kube-system | grep ama- shows monitoring agent pods.
[ ] Azure Portal → AKS cluster → Insights shows live cluster data.
[ ] A KQL query in the Logs blade returns container logs.
[ ] An alert rule named triage-pod-failures exists.
[ ] kubectl top pods -n triage shows CPU and memory usage.