Troubleshooting
Deployment Issues
Safeguards Gate Timeout
Symptom: post-provision.ps1 hangs waiting for Deployment Safeguards readiness.
Cause: Azure Policy/Gatekeeper is eventually consistent. Fresh clusters can take several minutes before ValidatingAdmissionPolicies are reconciled.
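Because readiness is eventually consistent, a bounded poll is usually safer than an open-ended wait. The following is a minimal bash sketch, not the logic used by post-provision.ps1: `wait_until` and `policies_present` are hypothetical helper names, and treating "at least one ValidatingAdmissionPolicy exists" as the readiness signal is an assumption.

```shell
#!/usr/bin/env bash
# wait_until <timeout_seconds> <interval_seconds> <command...>
# Re-runs the command until it succeeds or the timeout elapses.
wait_until() {
  local timeout="$1" interval="$2"
  shift 2
  local deadline=$(( SECONDS + timeout ))
  until "$@" >/dev/null 2>&1; do
    if (( SECONDS >= deadline )); then
      return 1   # timed out
    fi
    sleep "$interval"
  done
  return 0
}

# Assumed readiness check: at least one ValidatingAdmissionPolicy is listed.
policies_present() {
  [ -n "$(kubectl get validatingadmissionpolicies -o name 2>/dev/null)" ]
}

# Example (requires cluster access): poll every 15 s for up to 10 minutes.
#   wait_until 600 15 policies_present || echo "policies still not reconciled"
```

A non-zero exit after the timeout is the cue to move on to the manual checks.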
Resolution: Wait up to 10 minutes. If it persists, verify the cluster is running:
    az aks show -g <rg> -n <cluster> --query powerState
    kubectl get validatingadmissionpolicies

Terraform State Conflicts
Symptom: terraform apply fails with state lock errors.
Cause: A previous run was interrupted, leaving the state lock.
Resolution: Each layer has independent state, so identify which layer failed and release the lock there. The lock ID is printed in the error message. Confirm that no other Terraform run is still in progress before unlocking.

    # Check which layer has the lock
    cd infra && terraform force-unlock <lock-id>
    # or
    cd software/foundation && terraform force-unlock <lock-id>
    # or
    cd software/spi-stack && terraform force-unlock <lock-id>

Pod Issues
CrashLoopBackOff on OSDU Services
Symptom: Pod enters CrashLoopBackOff shortly after creation.
Common causes:
- Wrong probe port: the service's health endpoint is on a different port than the default (8081). Check ADR-0005 for the probe matrix.

      kubectl describe pod <name> -n osdu | grep -A5 "Liveness"
      kubectl logs <name> -n osdu | grep "started on port"

- Missing ConfigMap values: the service can’t connect to Azure PaaS resources.

      kubectl get configmap -n osdu
      kubectl describe pod <name> -n osdu | grep -A5 "Environment"

- Workload Identity not ready: the federated credential hasn’t propagated.

      kubectl describe pod <name> -n osdu | grep -A3 "azure.workload.identity"
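For the probe-port case, the comparison can be scripted rather than read by eye. This is a minimal bash sketch under stated assumptions: `logged_port` and `check_probe_port` are hypothetical helpers, and the service is assumed to log a line containing "started on port" followed by the port number (the same text the grep above looks for). Services with a different startup message need a different pattern.

```shell
#!/usr/bin/env bash
# Print the port from the first log line that contains "started on port".
# Accepts both "started on port 8080" and "started on port(s): 8080".
logged_port() {
  sed -n 's/.*started on port[^0-9]*\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# check_probe_port <probe_port>  (service logs on stdin)
# Exit 0 on match, 1 on mismatch, 2 if no port was found in the logs.
check_probe_port() {
  local probe_port="$1" actual
  actual="$(logged_port)"
  if [ -z "$actual" ]; then
    echo "unknown: no 'started on port' line found"
    return 2
  fi
  if [ "$actual" = "$probe_port" ]; then
    echo "ok: probe and service both use port $actual"
    return 0
  fi
  echo "mismatch: probe targets $probe_port, service listens on $actual"
  return 1
}

# Example (requires cluster access):
#   kubectl logs <name> -n osdu | check_probe_port 8081
```

Pass the port shown in the Liveness line of `kubectl describe pod` as the argument; 8081 is only the default.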
Istio Sidecar Not Injected
Symptom: Pods have 1 container instead of 2 (missing istio-proxy).
Resolution:
- Verify namespace labels:

      kubectl get namespace osdu --show-labels | grep istio

- Verify CNI chaining is enabled:

      kubectl get daemonset -n aks-istio-system | grep cni

- Restart pods to pick up sidecar:

      kubectl rollout restart deployment -n osdu
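After the restart, it can help to list exactly which pods still lack the sidecar. The sketch below is one way to do that in bash; `pods_missing_sidecar` is a hypothetical helper, and it assumes the proxy appears as a regular container named istio-proxy (as the symptom above describes), not as an init container.

```shell
#!/usr/bin/env bash
# Reads lines of "<pod-name><TAB><space-separated container names>" on stdin
# and prints the name of every pod that has no container named istio-proxy.
pods_missing_sidecar() {
  awk -F'\t' '
    NF == 0 { next }                      # skip blank lines
    {
      n = split($2, names, " ")
      found = 0
      for (i = 1; i <= n; i++) if (names[i] == "istio-proxy") found = 1
      if (!found) print $1
    }'
}

# Example (requires cluster access):
#   kubectl get pods -n osdu \
#     -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].name}{"\n"}{end}' \
#     | pods_missing_sidecar
```

Empty output means every pod in the namespace has the proxy container.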
Deployment Safeguards Rejection
Symptom: Pod creation fails with admission webhook denied the request.
Resolution: Check which policy is blocking:

    kubectl get events -n osdu --field-selector reason=FailedCreate

Common violations:
- Missing seccompProfile: ensure the Helm chart sets RuntimeDefault
- Missing resource limits: all containers must have requests and limits
- Running as root: runAsNonRoot: true must be set
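A rendered manifest can be screened for these three fields before it reaches the cluster. The bash sketch below is a coarse text check only: `check_safeguard_fields` is a hypothetical helper, it looks for the strings anywhere in the input instead of validating each container, and passing it does not guarantee admission.

```shell
#!/usr/bin/env bash
# Reads a rendered manifest on stdin and reports which of the commonly
# required fields do not appear anywhere in it. Exit 1 if any are missing.
check_safeguard_fields() {
  local manifest pattern missing=0
  manifest="$(cat)"
  for pattern in 'type: RuntimeDefault' 'runAsNonRoot: true' 'requests:' 'limits:'; do
    if ! grep -qF -- "$pattern" <<<"$manifest"; then
      echo "missing: $pattern"
      missing=1
    fi
  done
  return "$missing"
}

# Example (chart path and release name are placeholders):
#   helm template <release> <chart> | check_safeguard_fields
```

Treat a clean result as a first filter; the events query above remains the authoritative source for which policy rejected a pod.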
Middleware Issues
Elasticsearch Cluster Health Yellow/Red

    # Check cluster health
    kubectl exec -n platform elasticsearch-es-default-0 -- \
      curl -s -k https://localhost:9200/_cluster/health | jq
    # Check unassigned shards (URL quoted so the shell does not treat "?" as a glob)
    kubectl exec -n platform elasticsearch-es-default-0 -- \
      curl -s -k 'https://localhost:9200/_cat/shards?v' | grep UNASSIGNED

Redis Connection Refused
Verify Redis TLS is enabled and the client is connecting with TLS:
    kubectl exec -n platform redis-master-0 -- redis-cli --tls ping

PostgreSQL (CNPG) Not Ready

    # Check cluster status
    kubectl get cluster -n platform
    kubectl describe cluster postgresql -n platform

Connectivity Issues
Services Can’t Reach Azure PaaS
Verify Workload Identity:

    # Check service account annotation
    kubectl get sa -n osdu -o yaml | grep azure.workload.identity

    # Check pod identity injection
    kubectl get pod <name> -n osdu -o jsonpath='{.spec.containers[0].env}' | jq

DNS Resolution Failures

    # Check ExternalDNS is running
    kubectl get pods -n foundation | grep external-dns

    # Check DNS records
    kubectl logs -n foundation deployment/external-dns | grep "Desired"