Skip to content
GitHub stars

Controller Resource Sizing

Right-size the Lynq controller pod. This page explains why Lynq uses resources the way it does, then gives you the numbers to set requests and limits confidently.

Time to working

5 minutes to pick the right tier and update your Helm values or manager YAML.

Observed Usage (Real Data)

EnvironmentLynqNodesCPU (observed)Memory (observed)
Small50 – 5 m10 – 60 MB
Large300+< 100 m< 120 MB

Benchmark environment for the 300+ data point:

  • 1 LynqHub, multiple LynqForms across a few hundred active rows
  • Several thousand managed Kubernetes objects in cache
  • Annotation-driven skip path active (the vast majority of reconciles complete with zero child-resource API writes because lynq.sh/applied-hash matches the desired-spec hash)

CPU is bursty — it spikes briefly during periodic force-reapply cycles (every ForceReapplyInterval, default 10 min) and template-change events, then drops to near-idle. Memory grows with the number of managed objects (controller-runtime informer cache) and stays roughly flat within a stable cluster. Steady-state CPU is dominated by watch event processing, not reconciliation. See Architecture › Drift Correction.

Why Lynq Uses Resources This Way

Understanding the two resource drivers makes sizing intuitive.

CPU — driven by reconciliation workload

Lynq's CPU usage is almost entirely reconciliation work, not background polling. The controller sits idle until something triggers a reconcile, then it processes work concurrently up to the --node-concurrency limit.

Each LynqNode reconciliation involves:

  1. Template rendering — evaluate Go templates for each managed resource
  2. Dependency graph traversal — topological sort of dependIds
  3. Kubernetes API calls — one Server-Side Apply (SSA) call per resource
  4. Status update — patch the LynqNode status and conditions

Steps 3 and 4 dominate. A LynqNode that manages 9 resources makes 9 SSA calls per reconcile — roughly 9× more CPU work than a LynqNode with 1 resource. This means total managed resources across all LynqNodes is the primary CPU driver, not just the node count.

Why CPU is spiky: during a hub sync (every syncInterval, e.g. 1 minute), all out-of-date nodes are enqueued at once. CPU jumps to the peak value briefly, then falls back to near-idle as the queue drains.

Memory — driven by the controller cache

The controller-runtime framework keeps an in-memory cache of every Kubernetes object it watches. This cache enables instant reads without hitting the API server on every reconcile. It is the dominant memory consumer, and it scales with total managed object count, not just LynqNode count.

What lives in the cache (benchmark environment at 220 nodes, ~1,025 managed objects):

Cache contents5 nodes estimate220 nodes (observed)
LynqNode CRs~75 KB~3.3 MB
Managed resources (~1,025 objects)~0.1 MB~70–110 MB
LynqHub + LynqForm CRs~50 KB~50 KB
Go runtime + controller overhead~20–30 MB~20–30 MB
Total~20–30 MB~100–140 MB

Each cached Kubernetes object is a deserialized Go struct (roughly 70–110 KB). 1,025 objects × ~100 KB/object ≈ 100 MB, which matches the observed range.

Memory is stable, not spiky: once the cache is warm it stays flat. Growth happens only when total managed object count increases.

Sizing Table

Set requests to match observed usage at your expected scale. Set limits to 2–3× requests to absorb burst — especially for CPU.

The table below uses the benchmark resource density (avg ~4.7 managed resources per LynqNode). See Adjusting for Different Densities if your forms define significantly more or fewer resources.

LynqNodesCPU requestCPU limitMemory requestMemory limit
< 5025m100m64Mi128Mi
50 – 20050m200m96Mi192Mi
200 – 500100m300m128Mi256Mi
500 – 1000200m500m256Mi512Mi
1000+500m1000m512Mi1Gi

These numbers are anchored to the 300+ node production observation (< 100 m CPU steady-state, < 120 MB memory). Limits are sized 2–3× above steady-state to absorb force-reapply bursts (every ForceReapplyInterval, default 10 min) and template-change events.

Never set memory limit below memory request

If the pod is OOMKilled it restarts, losing its cache. On restart it must re-sync everything — causing a CPU spike and temporary reconciliation lag. Leave at least 2× headroom above observed memory.

Applying the limits

Helm:

yaml
# values.yaml
resources:
  requests:
    cpu: 500m
    memory: 256Mi
  limits:
    cpu: 1000m
    memory: 512Mi

Kustomize / direct manifest:

yaml
# config/manager/manager.yaml
containers:
  - name: manager
    resources:
      requests:
        cpu: 500m
        memory: 256Mi
      limits:
        cpu: 1000m
        memory: 512Mi

Scaling Projection

The accurate way to project resource usage is to estimate total managed objects — the sum across all LynqForms of (LynqNodes for that form × resources defined in that form).

total_managed = Σ (node_count_per_form × resources_per_form)

CPU (steady state)  ≈ total_managed × 0.55 m
Memory (warm cache) ≈ 30 MB + (total_managed × 0.1 MB)

Benchmark verification:

total_managed = (74 × 9) + (73 × 3) + (73 × 2) = 666 + 219 + 146 = 1,031

CPU  ≈ 1,031 × 0.55 = 567 m  → observed 500–600 m ✓
Mem  ≈ 30 + 1,031 × 0.1 = 133 MB → observed 100–140 MB ✓

Projection table (same 3-form density as benchmark, avg ~4.7 resources/node):

Active rowsLynqNodes (×3 forms)Total managedCPU (steady)Memory (warm)
~15~23~13 m~32 MB
~1750~235~130 m~54 MB
~73220~1,025~565 m~133 MB
~170500~2,345~1,290 m~265 MB
~3351,000~4,690~2,580 m~499 MB

These are steady-state estimates. Add 50–100% headroom for burst and reconciliation spikes.

Adjusting for Different Resource Densities

If your LynqForms define significantly more or fewer resources per node than the benchmark (~4.7 avg), the total_managed formula is the reliable path.

Example — heavier setup:

2 LynqForms, each with 15 resources per node. 100 active rows = 200 LynqNodes. total_managed = 2 × 100 × 15 = 3,000CPU ≈ 3,000 × 0.55 = 1,650 m — over 1.5 cores, not the ~250 m you'd get from a node-count-only estimate.

Example — lighter setup:

1 LynqForm with 3 resources per node. 500 active rows = 500 LynqNodes. total_managed = 500 × 3 = 1,500CPU ≈ 1,500 × 0.55 = 825 m — less than the 500-node row in the table above suggests.

Use kubectl top pod -n lynq-system early in your rollout to validate against the model.

What Goes Wrong with Under-sized Resources

CPU throttling

If the CPU limit is too low, the kernel rate-limits the container. Reconciliations slow down — you'll see longer reconcile durations in lynqnode_reconcile_duration_seconds without any errors. Everything still works, just slower.

Signal: container_cpu_cfs_throttled_seconds_total increases. Reconciliation P95 > 15 s.

Fix: raise the CPU limit, or reduce --node-concurrency to lower the burst peak.

Memory OOMKill

If the memory limit is too close to the working set, the pod is killed and restarted when the cache fills. After restart, the controller re-builds the cache from scratch — a cold-start reconciliation wave that temporarily spikes CPU.

Signal: pod restarts increasing (kube_pod_container_status_restarts_total). OOMKilled in kubectl describe pod.

Fix: raise the memory limit. A safe floor is:

memory limit ≥ 30 MB + (total_managed × 0.15 MB)

For the benchmark at ~1,025 managed objects: 30 + 1,025 × 0.15 ≈ 184 MB — which explains why the 200–500 node tier uses 256Mi as the request.

How to Verify Current Usage

bash
# Live resource usage
kubectl top pod -n lynq-system

# See configured limits and requests
kubectl get pod -n lynq-system -l control-plane=controller-manager \
  -o jsonpath='{range .items[*]}{.spec.containers[*].resources}{"\n"}{end}'

# Count total managed objects as a sizing check
kubectl get lynqnodes -A --no-headers | wc -l   # LynqNode count

Prometheus query for memory headroom:

promql
# Memory usage as % of limit (alert if > 80%)
container_memory_working_set_bytes{container="manager", namespace="lynq-system"}
/ container_spec_memory_limit_bytes{container="manager", namespace="lynq-system"} * 100

Concurrency and Resource Trade-off

Higher --node-concurrency increases burst CPU but decreases reconciliation latency. Lower concurrency reduces the burst peak but slows down mass reconciliations.

The table below is an estimate based on the benchmark environment (220 LynqNodes, ~1,025 managed objects). Actual values depend on your resource density and cluster API server latency.

--node-concurrencyEst. CPU at 220 nodesEst. time to reconcile all 220 nodes
5~300 m~60–90 s
10 (default)~550 m~30–45 s
20~900 m~15–25 s

If you see high CPU but nodes change infrequently, reduce concurrency before raising limits.

See Also

  • Performance Tuning — sync intervals, dependency depth, template optimization
  • Configuration--node-concurrency, --hub-concurrency, --form-concurrency flags
  • Monitoring — Prometheus metrics for reconcile duration and resource usage
  • Prometheus Queries — ready-to-use PromQL for sizing decisions