Monitoring

The platform provides monitoring across three pillars (metrics, logs, and traces), all backed by Azure-managed services.

AKS Automatic ships built-in Prometheus metrics collection via Azure Monitor Workspace. This captures:

  • Kubernetes control plane metrics (API server, scheduler, etcd)
  • Node-level metrics (CPU, memory, disk, network)
  • Pod-level metrics (container resource usage)
  • Istio mesh metrics (request volume, latency, error rate)
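As an illustration of how the Istio mesh metrics listed above might be queried, the following Python sketch builds a PromQL expression for the share of 5xx responses per service. It assumes the standard Istio metric istio_requests_total, with its response_code, destination_service_name, and destination_workload_namespace labels, is among the metrics collected; the function name and the "osdu" default namespace are illustrative, not part of the platform.

```python
def istio_error_ratio_query(namespace: str = "osdu", window: str = "5m") -> str:
    """Build a PromQL expression for the 5xx response ratio per destination service.

    Assumes the standard Istio metric `istio_requests_total` is scraped.
    """
    selector = f'destination_workload_namespace="{namespace}"'
    # Rate of requests that ended with a 5xx response code.
    errors = (
        "sum by (destination_service_name) "
        f'(rate(istio_requests_total{{{selector},response_code=~"5.."}}[{window}]))'
    )
    # Rate of all requests to the same services.
    total = (
        "sum by (destination_service_name) "
        f"(rate(istio_requests_total{{{selector}}}[{window}]))"
    )
    return f"{errors} / {total}"


if __name__ == "__main__":
    print(istio_error_ratio_query())
```

The resulting string can be pasted into a Grafana panel or any PromQL-capable query editor connected to the Azure Monitor Workspace.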

When enabled (ENABLE_GRAFANA_WORKSPACE=true), Azure Managed Grafana provides pre-built dashboards for:

  • Cluster health and node utilization
  • Pod resource consumption
  • Istio service mesh traffic
  • Karpenter node provisioning

Container Insights collects logs from all containers and forwards them to Log Analytics. Query logs via Azure Portal or az monitor log-analytics query:

// Pod logs for a specific service
ContainerLogV2
| where PodNamespace == "osdu"
| where PodName startswith "partition"
| project TimeGenerated, LogMessage
| order by TimeGenerated desc
| take 100

// Services in CrashLoopBackOff
// (reported as a container status reason, not as a pod status)
KubePodInventory
| where Namespace == "osdu"
| where ContainerStatusReason == "CrashLoopBackOff"
| summarize count() by Name, ContainerStatusReason

// OOM kills
ContainerLogV2
| where LogMessage contains "OOMKilled"
| project TimeGenerated, PodName, LogMessage
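
To run queries like these from a script rather than the portal, one option is to wrap the az monitor log-analytics query command. The sketch below only assembles the argument list; the helper name, the one-hour default timespan, and the <workspace-guid> placeholder are assumptions for illustration, and depending on the CLI version the command may require the log-analytics extension.

```python
import shlex


def build_log_query_command(workspace_id: str, kql: str, timespan: str = "PT1H") -> list[str]:
    """Return the argv for `az monitor log-analytics query` (not executed here)."""
    return [
        "az", "monitor", "log-analytics", "query",
        "--workspace", workspace_id,   # Log Analytics workspace GUID
        "--analytics-query", kql,      # the KQL text to run
        "--timespan", timespan,        # ISO 8601 duration, e.g. PT1H = last hour
        "--output", "table",
    ]


# The pod-log query for the partition service in the osdu namespace.
PARTITION_LOGS = """
ContainerLogV2
| where PodNamespace == "osdu"
| where PodName startswith "partition"
| project TimeGenerated, LogMessage
| order by TimeGenerated desc
| take 100
"""

if __name__ == "__main__":
    # <workspace-guid> is a placeholder; substitute your own workspace ID.
    cmd = build_log_query_command("<workspace-guid>", PARTITION_LOGS.strip())
    print(shlex.join(cmd))
```

Passing the list to subprocess.run(cmd, check=True) would execute it; keeping the query as a list element avoids shell-quoting problems with the pipes and quotes in KQL.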

OSDU services emit distributed traces to Application Insights via the APPLICATIONINSIGHTS_CONNECTION_STRING environment variable. This enables:

  • End-to-end request tracing across services
  • Dependency maps showing service-to-service and service-to-PaaS calls
  • Exception tracking and failure analysis
  • Performance analysis (latency percentiles, throughput)
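
When traces do not show up, a first check is whether the connection string reached the service at all. The following is a minimal sketch, assuming the usual semicolon-separated key=value format of Application Insights connection strings with a required InstrumentationKey entry; the helper names are hypothetical and not part of any OSDU service.

```python
import os


def parse_connection_string(value: str) -> dict[str, str]:
    """Split a 'Key=Value;Key=Value' connection string into a dict."""
    parts: dict[str, str] = {}
    for segment in value.split(";"):
        if not segment.strip():
            continue  # tolerate trailing or doubled semicolons
        key, sep, val = segment.partition("=")
        if not sep:
            raise ValueError(f"malformed segment: {segment!r}")
        parts[key.strip()] = val.strip()
    return parts


def check_tracing_config(env=os.environ) -> list[str]:
    """Return a list of problems found; an empty list means the basics look fine."""
    raw = env.get("APPLICATIONINSIGHTS_CONNECTION_STRING", "")
    if not raw:
        return ["APPLICATIONINSIGHTS_CONNECTION_STRING is not set"]
    fields = parse_connection_string(raw)
    problems = []
    if "InstrumentationKey" not in fields:
        problems.append("InstrumentationKey is missing")
    return problems


if __name__ == "__main__":
    for problem in check_tracing_config():
        print(problem)
```

This only validates the shape of the setting; it does not confirm that the endpoint is reachable or that the key is valid.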

Kibana is optionally exposed via the gateway module. Use it to monitor the Elasticsearch cluster:

  • Cluster health (green/yellow/red)
  • Index status and document counts
  • Search query performance
  • Shard allocation across nodes
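
The green/yellow/red status shown in Kibana is the same value that the Elasticsearch cluster health API (GET /_cluster/health) reports. As a rough sketch of how a script might interpret such a response, assuming the standard status, number_of_nodes, and unassigned_shards fields; the function name and the sample payload are illustrative:

```python
import json


def summarize_cluster_health(health: dict) -> str:
    """Turn a /_cluster/health response body into a one-line summary."""
    status = health.get("status", "unknown")
    nodes = health.get("number_of_nodes", 0)
    unassigned = health.get("unassigned_shards", 0)
    meaning = {
        "green": "all primary and replica shards are allocated",
        "yellow": "all primaries are allocated, some replicas are not",
        "red": "at least one primary shard is unallocated",
    }.get(status, "status not recognised")
    return f"{status}: {meaning} ({nodes} nodes, {unassigned} unassigned shards)"


if __name__ == "__main__":
    # Hand-written sample payload, not output captured from a real cluster.
    sample = json.loads('{"status": "yellow", "number_of_nodes": 3, "unassigned_shards": 2}')
    print(summarize_cluster_health(sample))
```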

The Airflow web UI is optionally exposed via the gateway module. Use it to monitor:

  • DAG execution status
  • Task run history and durations
  • Worker pod scheduling
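
The same DAG run information is also available programmatically. The sketch below summarizes a response shaped like the one returned by the Airflow 2 stable REST API for a DAG's runs (a dag_runs list whose entries carry state, start_date, and end_date); whether that API is reachable through the gateway is an assumption, and the function name is hypothetical.

```python
from collections import Counter
from datetime import datetime
from typing import Optional


def _parse(ts: str) -> datetime:
    # Accept a trailing "Z" on Python versions whose fromisoformat rejects it.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def summarize_dag_runs(payload: dict) -> dict:
    """Count runs per state and compute the mean duration of finished runs."""
    runs = payload.get("dag_runs", [])
    states = Counter(run.get("state", "unknown") for run in runs)
    # Only runs with both timestamps contribute to the duration average.
    durations = [
        (_parse(run["end_date"]) - _parse(run["start_date"])).total_seconds()
        for run in runs
        if run.get("start_date") and run.get("end_date")
    ]
    mean: Optional[float] = sum(durations) / len(durations) if durations else None
    return {"states": dict(states), "mean_duration_seconds": mean}
```

Runs that are still in progress have no end_date and are counted by state but left out of the duration average.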