# Performance Tuning Guide

Practical optimization strategies for scaling Lynq to hundreds of nodes.
## Understanding Performance
Lynq uses three reconciliation layers:
- **Event-Driven (Immediate)**: Reacts to resource changes instantly
- **Periodic (30 seconds)**: Fast status updates and drift detection
- **Database Sync (Configurable)**: Syncs node data at defined intervals
This architecture means drift is corrected within 30 seconds of detection, database changes are picked up at the configured syncInterval (default: 1 minute), and resource changes trigger immediate reconciliation.
## Configuration Tuning

### 1. Database Sync Interval
Adjust how frequently the operator checks your database:
```yaml
apiVersion: operator.lynq.sh/v1
kind: LynqHub
metadata:
  name: my-hub
spec:
  source:
    syncInterval: 1m  # Default: 1 minute
```

Recommendations:
- **High-frequency changes**: `30s` - faster node provisioning, higher DB load
- **Normal usage**: `1m` (default) - balanced performance
- **Stable nodes**: `5m` - lower DB load, slower updates
### 2. Resource Wait Timeouts
Control how long to wait for resources to become ready:
```yaml
deployments:
  - id: app
    waitForReady: true
    timeoutSeconds: 300  # Default: 5 minutes (max: 3600)
```

Recommendations:
- **Fast services**: `60s` - quick deployments (< 1 min)
- **Normal apps**: `300s` (default) - standard deployments
- **Heavy apps**: `600s` - database migrations, complex initialization
- **Skip waiting**: set `waitForReady: false` for non-critical resources
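These settings can be mixed per resource. As a sketch (the resource ids below are illustrative, not from this guide), a critical deployment can wait with a generous timeout while a non-critical ingress skips waiting entirely:

```yaml
deployments:
  - id: app
    waitForReady: true
    timeoutSeconds: 600   # heavy app: leave room for migrations
ingresses:
  - id: preview-ingress   # illustrative id
    waitForReady: false   # non-critical: do not block reconciliation
```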
### 3. Creation Policy Optimization
Reduce unnecessary reconciliations:
```yaml
configMaps:
  - id: init-config
    creationPolicy: Once  # Create once, never reapply
```

Use cases:
- `Once`: init scripts, immutable configs, security resources
- `WhenNeeded` (default): normal resources that may need updates
## Template Optimization

### 1. Keep Templates Simple
✅ Good - efficient template:

```yaml
nameTemplate: "{{ .uid }}-app"
```

❌ Bad - complex template:

```yaml
nameTemplate: "{{ .uid }}-{{ .region }}-{{ .planId }}-{{ now | date \"20060102\" }}"
# Avoid: timestamps, random values, complex logic
```

Tips:
- Keep templates simple and predictable
- Avoid `now`, `randAlphaNum`, or other non-deterministic functions
- Use consistent naming patterns
- Cache-friendly templates improve performance
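The determinism point can be sketched with Go's `text/template`, which this templating style resembles (a sketch, not Lynq's actual rendering code; the `uid` value is sample data): a deterministic template yields the same name on every reconciliation, so desired and actual state compare cleanly.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// renderName renders a name template against node data. With no
// non-deterministic functions (now, randAlphaNum, ...), the result
// depends only on the input, so repeated reconciliations agree.
func renderName(pattern string, data map[string]string) string {
	tmpl := template.Must(template.New("name").Parse(pattern))
	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, data); err != nil {
		panic(err)
	}
	return buf.String()
}

func main() {
	data := map[string]string{"uid": "node-42"}
	fmt.Println(renderName("{{ .uid }}-app", data)) // node-42-app
}
```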
### 2. Dependency Graph Optimization
✅ Good - shallow dependency tree:

```yaml
resources:
  - id: namespace   # No dependencies
  - id: deployment  # Depends on: namespace
  - id: service     # Depends on: deployment
# Depth: 3 - resources can be created in parallel groups
```

❌ Bad - deep dependency tree:

```yaml
resources:
  - id: a  # No dependencies
  - id: b  # Depends on: a
  - id: c  # Depends on: b
  - id: d  # Depends on: c
  - id: e  # Depends on: d
# Depth: 5 - fully sequential, slow
```

Impact:
- Shallow trees enable parallel execution
- Deep trees force sequential execution
- Each level adds wait time
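The parallel-groups idea can be sketched in a few lines of Go (a simplified model of how a reconciler could batch creations, not Lynq's actual scheduler): resources at the same dependency depth can be created together, so total wait time grows with depth, not resource count.

```go
package main

import "fmt"

// levels groups resource ids by dependency depth: everything at the
// same depth could be created in parallel. deps maps each id to the
// ids it depends on (simplified: acyclic input assumed).
func levels(deps map[string][]string) map[string]int {
	depth := map[string]int{}
	var walk func(id string) int
	walk = func(id string) int {
		if d, ok := depth[id]; ok {
			return d
		}
		d := 0
		for _, p := range deps[id] {
			if pd := walk(p) + 1; pd > d {
				d = pd
			}
		}
		depth[id] = d
		return d
	}
	for id := range deps {
		walk(id)
	}
	return depth
}

func main() {
	// Shallow tree from the "good" example above: max depth 2.
	shallow := map[string][]string{
		"namespace":  {},
		"deployment": {"namespace"},
		"service":    {"deployment"},
	}
	fmt.Println(levels(shallow)) // map[deployment:1 namespace:0 service:2]
}
```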
### 3. Minimize Resource Count

Example: create 5 essential resources per node instead of 15:
```yaml
# Essential only
spec:
  namespaces: [1]
  deployments: [1]
  services: [1]
  configMaps: [1]
  ingresses: [1]
# Total: 5 resources
```

Impact:
- Fewer resources = faster reconciliation
- Less API server load
- Lower memory usage
## Scaling Considerations

### Resource Limits
Quick reference based on real benchmark data (5 nodes: 0–5m CPU / 10–60MB RAM; 220 nodes: 500–600m CPU / 100–140MB RAM):
| Node count | CPU request | CPU limit | Memory request | Memory limit |
|---|---|---|---|---|
| < 50 | 50m | 200m | 64Mi | 128Mi |
| 50–200 | 200m | 500m | 128Mi | 256Mi |
| 200–500 | 500m | 1000m | 256Mi | 512Mi |
| 500–1000 | 1000m | 2000m | 512Mi | 1Gi |
| 1000+ | 2000m | 4000m | 1Gi | 2Gi |
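For example, the 200–500 row translates into a standard Kubernetes `resources` stanza on the operator Deployment (the `manager` container name matches the alert rules in this guide; adjust to your install):

```yaml
# Operator Deployment, ~200-500 nodes
containers:
  - name: manager
    resources:
      requests:
        cpu: 500m
        memory: 256Mi
      limits:
        cpu: 1000m
        memory: 512Mi
```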
For the explanation of why resources scale this way (cache model, reconciliation burst pattern, concurrency trade-offs), see Resource Sizing.
### Database Optimization
- Add indexes to the node table:

  ```sql
  CREATE INDEX idx_is_active ON node_configs(is_active);
  CREATE INDEX idx_node_id ON node_configs(node_id);
  ```

- Use read replicas for high-frequency syncs
- Connection pooling: the operator uses persistent connections
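To check that the indexes above are actually used, a quick `EXPLAIN` sanity test helps (PostgreSQL syntax; the exact query the operator issues is an assumption based on the columns shown above):

```sql
EXPLAIN ANALYZE
SELECT * FROM node_configs WHERE is_active = true;
-- Expect an Index Scan (or Bitmap Index Scan) on idx_is_active in the
-- plan; a Seq Scan on a large table means the index is not helping.
```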
## Monitoring Performance

### Key Metrics with Thresholds
Monitor these Prometheus metrics with specific thresholds:
| Metric | Target | Warning | Critical | Action |
|---|---|---|---|---|
| Reconciliation P95 | < 5s | 5-15s | > 15s | Simplify templates, reduce dependencies |
| Reconciliation P99 | < 15s | 15-30s | > 30s | Check for blocking resources |
| Node Ready Rate | > 98% | 95-98% | < 95% | Check failed nodes, resource issues |
| Error Rate | < 1% | 1-5% | > 5% | Investigate operator logs |
| Skipped Resources | 0 | 1-5 | > 5 | Fix dependency failures |
| Conflict Count | 0 | 1-3 | > 3 | Review resource ownership |
| CPU Usage | < 50% | 50-80% | > 80% | Increase limits or reduce concurrency |
| Memory Usage | < 70% | 70-90% | > 90% | Increase limits or restart operator |
| Hub Sync Duration | < 1s | 1-5s | > 5s | Optimize DB query, add indexes |
Prometheus queries:

```promql
# Reconciliation duration P95 (target: < 5s)
histogram_quantile(0.95,
  sum(rate(lynqnode_reconcile_duration_seconds_bucket[5m])) by (le)
)

# Reconciliation duration P99 (target: < 15s)
histogram_quantile(0.99,
  sum(rate(lynqnode_reconcile_duration_seconds_bucket[5m])) by (le)
)

# Node readiness rate (target: > 98%)
sum(lynqnode_resources_ready) / sum(lynqnode_resources_desired) * 100

# Error rate (target: < 1%)
sum(rate(lynqnode_reconcile_duration_seconds_count{result="error"}[5m]))
  / sum(rate(lynqnode_reconcile_duration_seconds_count[5m])) * 100

# Conflict count by node
sum by (lynqnode) (lynqnode_resources_conflicted)

# Degraded nodes
count(lynqnode_degraded_status == 1)
```

### Sample Prometheus Alert Rules
```yaml
# config/prometheus/alerts.yaml
groups:
  - name: lynq-performance
    rules:
      - alert: LynqSlowReconciliation
        expr: histogram_quantile(0.95, sum(rate(lynqnode_reconcile_duration_seconds_bucket[5m])) by (le)) > 15
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Lynq reconciliation is slow"
          description: "P95 reconciliation time is {{ $value | humanizeDuration }}"
      - alert: LynqHighErrorRate
        expr: |
          sum(rate(lynqnode_reconcile_duration_seconds_count{result="error"}[5m]))
          / sum(rate(lynqnode_reconcile_duration_seconds_count[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Lynq error rate above 5%"
          description: "Current error rate: {{ $value | humanizePercentage }}"
      - alert: LynqLowReadyRate
        expr: sum(lynqnode_resources_ready) / sum(lynqnode_resources_desired) < 0.95
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Lynq ready rate below 95%"
          description: "Only {{ $value | humanizePercentage }} resources are ready"
      - alert: LynqHighMemory
        expr: |
          container_memory_usage_bytes{container="manager", namespace="lynq-system"}
          / container_spec_memory_limit_bytes{container="manager", namespace="lynq-system"} > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Lynq operator memory above 90%"
          description: "Memory usage: {{ $value | humanizePercentage }}"
```

See Monitoring Guide for the complete metrics reference.
## Bottleneck Identification Priority
When performance degrades, identify bottlenecks in this order:
```text
Performance Issue?
│
├─ Step 1: Check Reconciliation Duration
│    $ kubectl logs -n lynq-system deployment/lynq-controller-manager | grep "Reconciliation completed" | tail -20
│    │
│    ├─ > 15s? → Check dependency depth, waitForReady timeouts
│    └─ < 5s?  → Continue to Step 2
│
├─ Step 2: Check Hub Sync Duration
│    $ kubectl get lynqhub -o jsonpath='{range .items[*]}{.metadata.name}: {.status.lastSyncDuration}{"\n"}{end}'
│    │
│    ├─ > 5s? → Optimize DB query, add indexes
│    └─ < 1s? → Continue to Step 3
│
├─ Step 3: Check Resource Usage
│    $ kubectl top pods -n lynq-system
│    │
│    ├─ CPU > 80%?    → Increase limits or reduce concurrency
│    └─ Memory > 90%? → Increase limits or restart operator
│
└─ Step 4: Check Error Rate
     $ kubectl logs -n lynq-system deployment/lynq-controller-manager | grep -c ERROR
     │
     ├─ High error count? → Check specific error messages
     └─ Low errors?       → Performance may be within normal range
```

## Troubleshooting Slow Performance

### Symptom: Slow Node Creation
Check:
- Database query performance
- `waitForReady` timeouts
- Dependency chain depth
Solution:

```bash
# Check reconciliation times
kubectl logs -n lynq-system -l control-plane=controller-manager | grep "Reconciliation completed"

# Increase the sync interval (reduce sync frequency) if the database is slow
kubectl patch lynqhub my-hub --type=merge -p '{"spec":{"source":{"syncInterval":"2m"}}}'
```

### Symptom: High CPU Usage
Check:
- Reconciliation frequency
- Template complexity
- Total node count
Solution:

```bash
# Check CPU usage
kubectl top pods -n lynq-system

# Increase resource limits
kubectl edit deployment -n lynq-system lynq-controller-manager
```

### Symptom: Memory Growth
Possible causes:
- Too many cached resources
- Large template outputs
- Memory leak (file an issue)
Solution:

```bash
# Restart operator to clear cache
kubectl rollout restart deployment -n lynq-system lynq-controller-manager

# Monitor memory over time
kubectl top pods -n lynq-system --watch
```

## Best Practices Summary
- ✅ Start with defaults - Only optimize if you see issues
- ✅ Keep templates simple - Avoid complex logic and non-deterministic functions
- ✅ Use shallow dependency trees - Enable parallel resource creation
- ✅ Set appropriate timeouts - Balance speed vs reliability
- ✅ Monitor key metrics - Watch reconciliation duration and error rates
- ✅ Index your database - Improve sync query performance
- ✅ Use `creationPolicy: Once` - For immutable resources
## See Also
- Monitoring Guide - Complete metrics reference and dashboards
- Prometheus Queries - Ready-to-use queries
- Configuration Guide - All operator settings
- Troubleshooting Guide - Common issues and solutions
