ADR-0007: Karpenter NodePools for Workload Isolation

Status: Accepted
Date: 2026-04-03
Deciders: danielscholl

AKS Automatic uses Karpenter (via Node Auto-Provisioning) for dynamic node scaling. The cluster runs both stateful middleware (Elasticsearch, PostgreSQL, Redis) and stateless OSDU microservices, and may host multiple stacks side by side. Without workload isolation, all pods compete for the same nodes, and stateful workloads that require premium storage or specific VM SKUs may land on nodes that cannot support them.

  • Elasticsearch and PostgreSQL require premium storage-capable VMs for persistent volumes
  • Stateful middleware benefits from larger VMs (D4/D8) while OSDU services have lighter per-pod requirements
  • Side-by-side stacks must not cross-schedule pods onto each other’s nodes
  • Karpenter NodePools support taints, labels, and VM SKU requirements for workload targeting
  • Cost control: consolidation policies should reclaim underutilized nodes

Considered options:

  1. Single system node pool for all workloads (AKS Automatic default)
  2. Static node pools with fixed VM sizes
  3. Karpenter NodePools with per-workload-class isolation

Chosen option: Karpenter NodePools with per-workload-class isolation, because it combines dynamic scaling with workload-appropriate VM selection and taint-based isolation between stacks.

Each stack creates two NodePools:

| NodePool | Purpose | VM Family | Premium Storage | Taint |
|---|---|---|---|---|
| platform / platform-{id} | Middleware (Elasticsearch, PostgreSQL, Redis) | D-series, 4-8 vCPU | Required | workload=platform:NoSchedule |
| osdu / osdu-{id} | OSDU microservices | D-series, 4-8 vCPU, >15 GiB memory | Required | workload=osdu:NoSchedule |
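
As a rough illustration of how the table rows might translate into NodePool manifests, the Python sketch below builds one as a plain dictionary. Everything beyond the table is an assumption: the function name `build_nodepool` is hypothetical, the `karpenter.sh/v1` API version and the `AKSNodeClass` reference named `default` are guesses about the target cluster, and the `karpenter.azure.com/sku-*` requirement keys (with memory expressed in MiB) should be checked against the Azure Karpenter provider documentation before use. "4-8 vCPU" is interpreted here as "4 or 8".

```python
def build_nodepool(workload, stack_id=None, min_memory_mib=None):
    """Return a Karpenter NodePool manifest as a plain dict (illustrative only)."""
    # Name follows the table: "platform" or "platform-{id}".
    name = f"{workload}-{stack_id}" if stack_id else workload

    # Assumed Azure provider selectors; verify key names for your provider version.
    requirements = [
        {"key": "karpenter.azure.com/sku-family", "operator": "In", "values": ["D"]},
        {"key": "karpenter.azure.com/sku-cpu", "operator": "In", "values": ["4", "8"]},
        {
            "key": "karpenter.azure.com/sku-storage-premium-capable",
            "operator": "In",
            "values": ["true"],
        },
    ]
    if min_memory_mib is not None:
        # Gt takes a single integer value, passed as a string.
        requirements.append(
            {
                "key": "karpenter.azure.com/sku-memory",
                "operator": "Gt",
                "values": [str(min_memory_mib)],
            }
        )

    return {
        "apiVersion": "karpenter.sh/v1",  # assumption
        "kind": "NodePool",
        "metadata": {"name": name},
        "spec": {
            "template": {
                "spec": {
                    # Assumed node class; the ADR does not name one.
                    "nodeClassRef": {
                        "group": "karpenter.azure.com",
                        "kind": "AKSNodeClass",
                        "name": "default",
                    },
                    # Taint from the table, e.g. workload=platform:NoSchedule.
                    "taints": [
                        {"key": "workload", "value": workload, "effect": "NoSchedule"}
                    ],
                    "requirements": requirements,
                }
            }
        },
    }


# Example: the two pools for a hypothetical stack with id "blue".
platform_pool = build_nodepool("platform", "blue")
osdu_pool = build_nodepool("osdu", "blue", min_memory_mib=15 * 1024)
```

The resulting dictionaries could be serialized to YAML or JSON and applied with whatever tooling the deployment already uses.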

Pods target their NodePool via nodeSelector on the agentpool label and a matching toleration for the workload taint.
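
The pod-side half of this pairing can be sketched as follows. The helper name `scheduling_fields` is hypothetical, and the label key is assumed to be the literal `agentpool` mentioned above; the actual key and values depend on how the NodePools are labeled in the cluster. The toleration uses the standard Kubernetes fields (`key`, `operator`, `value`, `effect`).

```python
def scheduling_fields(workload, agentpool):
    """Return the pod spec fields that pin a pod to one NodePool (illustrative)."""
    return {
        # Selects only nodes carrying the matching agentpool label (assumed key).
        "nodeSelector": {"agentpool": agentpool},
        # Allows the pod onto nodes tainted workload=<workload>:NoSchedule.
        "tolerations": [
            {
                "key": "workload",
                "operator": "Equal",
                "value": workload,
                "effect": "NoSchedule",
            }
        ],
    }


# Example: fields to merge into the pod template of a middleware workload.
platform_fields = scheduling_fields("platform", "platform")
```

Both parts are needed: the toleration alone only permits scheduling onto the tainted nodes, while the nodeSelector is what keeps the pod off every other pool.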

Both NodePools use WhenEmptyOrUnderutilized consolidation with a 5-minute delay, allowing Karpenter to reclaim nodes when workloads scale down.
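
A minimal sketch of how that consolidation setting might be attached to a NodePool manifest held as a Python dictionary. The helper `with_consolidation` is hypothetical; the field names `consolidationPolicy` and `consolidateAfter` follow the Karpenter NodePool `disruption` block, and the `"5m"` duration string is assumed to be the form the 5-minute delay takes.

```python
import copy


def with_consolidation(nodepool, delay="5m"):
    """Return a copy of a NodePool dict with the disruption settings applied."""
    updated = copy.deepcopy(nodepool)  # leave the caller's manifest untouched
    updated.setdefault("spec", {})["disruption"] = {
        "consolidationPolicy": "WhenEmptyOrUnderutilized",
        "consolidateAfter": delay,
    }
    return updated


# Example with a deliberately minimal manifest.
base = {"kind": "NodePool", "metadata": {"name": "osdu"}, "spec": {}}
consolidated = with_consolidation(base)
```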

  • Good: Stateful middleware gets premium storage-capable VMs automatically
  • Good: OSDU service pods are isolated from middleware node disruption
  • Good: Per-stack NodePools prevent cross-scheduling in side-by-side deployments
  • Good: Karpenter dynamically selects the cheapest VM SKU meeting requirements
  • Bad: NodePool isolation can lead to more nodes than a shared pool (less bin-packing)
  • Bad: Agentpool label values cannot contain hyphens (AKS Karpenter restriction), requiring label/name divergence for stacks with hyphenated IDs
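
To make the label/name divergence concrete, here is one possible convention, sketched in Python. The ADR does not say how the label value is derived, so dropping hyphens is purely an assumption, and both function names are hypothetical.

```python
def nodepool_name(workload, stack_id=None):
    """NodePool name as in the table: 'osdu' or 'osdu-{id}'."""
    return f"{workload}-{stack_id}" if stack_id else workload


def agentpool_label(workload, stack_id=None):
    """Hyphen-free agentpool label value (assumed convention: drop hyphens)."""
    return workload + (stack_id or "").replace("-", "")


# Example: a stack whose id contains a hyphen.
name = nodepool_name("osdu", "eu-west")
label = agentpool_label("osdu", "eu-west")
```

Under this convention, stack IDs that differ only by hyphens (for example `a-b` and `ab`) would map to the same label value, so IDs would need to be unique after hyphen removal.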