Alert Runbooks
Comprehensive troubleshooting guide for Lynq alerts. Each alert includes diagnosis steps and resolution procedures.
Overview
This page provides detailed runbooks for all Lynq alerts, organized by severity:
- Critical Alerts: Require immediate action - production impact
- Warning Alerts: Require investigation - potential issues
- Info Alerts: Informational - awareness only
Quick Navigation
Use the table of contents (right sidebar) to jump directly to a specific alert.
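If you are triaging from Prometheus rather than this page, a quick query such as the sketch below lists the currently firing Lynq-related alerts with their severity. The monitoring namespace, service name, and alert-name filter are placeholders to adjust to your setup.
# In one terminal: port-forward to Prometheus (namespace/service are placeholders)
kubectl -n monitoring port-forward svc/prometheus 9090:9090
# In another terminal: list firing Lynq-related alerts with severity
curl -s 'http://localhost:9090/api/v1/alerts' | \
  jq '.data.alerts[] | select(.labels.alertname | test("^(Lynq|Hub|HighApply)")) | {alert: .labels.alertname, severity: .labels.severity, state: .state}'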
Critical Alerts
LynqNodeDegraded
Alert Name: LynqNodeDegraded
Severity: Critical
Threshold: lynqnode_degraded_status > 0 for 5+ minutes
Description
LynqNode CR has entered a degraded state, indicating the operator cannot successfully reconcile the node's resources. This is a critical condition preventing normal node operation.
Symptoms
- LynqNode's Ready condition is False
- status.degraded shows the specific degradation reason
- Resources may be partially applied or stuck
- Events show template, conflict, or dependency errors
Possible Causes
- Template Rendering Errors - Missing variables, syntax errors, type conversion failures
- Dependency Cycles - Circular dependencies between resources
- Resource Conflicts - Resources exist with different owners (ConflictPolicy: Stuck)
- External Dependencies - Missing ConfigMaps, Secrets, or Namespaces
- RBAC Issues - Operator lacks permissions (see the quick check after this list)
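To quickly rule the RBAC cause in or out, list what the operator's service account (the same account used in the RBAC commands later in this runbook) is allowed to do:
# List the operator service account's effective permissions in the node's namespace
kubectl auth can-i --list --as=system:serviceaccount:lynq-system:lynq -n <namespace>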
Diagnosis
# Check node status
kubectl get lynqnode <lynqnode-name> -n <namespace> -o yaml
# Review events
kubectl describe lynqnode <lynqnode-name> -n <namespace>
# Check operator logs
kubectl logs -n lynq-system -l control-plane=controller-manager | grep <lynqnode-name>
# Validate template variables
kubectl get lynqnode <lynqnode-name> -n <namespace> -o jsonpath='{.metadata.annotations}'
Resolution
For Template Errors:
# Check LynqHub extraValueMappings
kubectl get lynqhub <hub-name> -o yaml
# Verify LynqForm syntax
kubectl get lynqform <template-name> -o yaml
For Dependency Cycles:
# Review dependIds configuration
kubectl get lynqnode <lynqnode-name> -o jsonpath='{.spec.*.dependIds}'
# Remove circular dependencies in template
kubectl edit lynqform <template-name>
For Resource Conflicts:
# Check conflicting resources
kubectl get <resource-type> <resource-name> -o yaml | grep -A 10 metadata
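# (Optional sketch) Before picking an option, inspect who currently owns/manages
# the conflicting resource: ownerReferences shows the controlling object, and
# managed fields show which managers last applied it
kubectl get <resource-type> <resource-name> -o jsonpath='{.metadata.ownerReferences}'
kubectl get <resource-type> <resource-name> -o yaml --show-managed-fields | grep -B 2 -A 4 'manager:'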
# Option 1: Delete conflicting resource
kubectl delete <resource-type> <resource-name>
# Option 2: Change ConflictPolicy to Force
kubectl edit lynqform <template-name>
LynqNodeResourcesFailed
Alert Name: LynqNodeResourcesFailed
Severity: Critical
Threshold: lynqnode_resources_failed > 0 for 5+ minutes
Description
Node has one or more resources that failed to apply or became unhealthy, indicating a critical provisioning failure.
Symptoms
- status.failedResources > 0
- Specific resources not created or in error state
- Events show apply failures or timeout errors
Diagnosis
# Check failed resources count
kubectl get lynqnode <lynqnode-name> -o jsonpath='{.status.failedResources}'
# List applied resources
kubectl get lynqnode <lynqnode-name> -o jsonpath='{.status.appliedResources}'
# Check events for failures
kubectl get events --field-selector involvedObject.kind=LynqNode,involvedObject.name=<lynqnode-name>
Resolution
For Apply Failures:
# Check RBAC permissions
kubectl auth can-i create <resource-type> --as=system:serviceaccount:lynq-system:lynq
# Review resource spec in template
kubectl get lynqform <template-name> -o yaml
For Readiness Timeouts:
# Increase timeout
kubectl edit lynqform <template-name>
# Set: timeoutSeconds: 600
# Or disable readiness wait
# Set: waitForReady: false
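As a non-interactive alternative, a merge patch in the style of the conflictPolicy patch shown later in this runbook can set the same fields; the placement of timeoutSeconds and waitForReady under spec.deployments is an assumption to verify against your LynqForm schema first.
# Sketch only: verify field paths against your LynqForm spec before applying
kubectl patch lynqform <template-name> --type=merge \
  -p '{"spec":{"deployments":[{"timeoutSeconds":600,"waitForReady":false}]}}'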
LynqNodeNotReady
Alert Name: LynqNodeNotReady
Severity: Critical
Threshold: lynqnode_condition_status{type="Ready"} == 0 for 15+ minutes
Description
Node has not reached Ready state for an extended period, indicating persistent provisioning issues.
Symptoms
- Ready condition is False for 15+ minutes
- Resources may be pending, creating, or failing health checks
- LynqNode status shows ongoing reconciliation
Diagnosis
# Check Ready condition
kubectl get lynqnode <lynqnode-name> -o jsonpath='{.status.conditions[?(@.type=="Ready")]}'
# Check resource readiness
kubectl get lynqnode <lynqnode-name> -o jsonpath='{.status.readyResources}/{.status.desiredResources}'
# Identify slow resources
kubectl get all -l lynq.sh/lynqnode=<lynqnode-name>
Resolution
# Check if resources are progressing
kubectl describe <resource-type> <resource-name>
# Review readiness probes
kubectl get pod <pod-name> -o yaml | grep -A 10 readinessProbe
# Check dependencies
kubectl get lynqnode <lynqnode-name> -o jsonpath='{.spec.*.dependIds}'
LynqNodeStatusUnknown
Alert Name: LynqNodeStatusUnknown
Severity: Critical
Threshold: lynqnode_condition_status{type="Ready"} == 2 for 10+ minutes
Description
LynqNode status is Unknown, indicating potential controller or API server communication issues.
Symptoms
- Ready condition status is Unknown
- Status updates not propagating
- Controller may be unreachable or crashed
Diagnosis
# Check controller pods
kubectl get pods -n lynq-system
# Check controller logs
kubectl logs -n lynq-system -l control-plane=controller-manager --tail=100
# Check API server connectivity
kubectl get --raw /healthz
Resolution
# Restart controller if unhealthy
kubectl rollout restart deployment -n lynq-system lynq-controller-manager
# Check for resource pressure
kubectl top pods -n lynq-system
# Review recent changes
kubectl rollout history deployment -n lynq-system lynq-controller-manager
HubManyNodesFailure
Alert Name: HubManyNodesFailure
Severity: Critical
Threshold: hub_failed > 5 or hub_failed / hub_desired > 0.5 for 5+ minutes
Description
Hub has widespread node failures (>5 lynqnodes or >50% failure rate), indicating a systemic issue affecting multiple lynqnodes.
Symptoms
- High number of failed lynqnodes in hub
- Multiple lynqnodes showing similar errors
- Pattern of failures across all lynqnodes
Diagnosis
# Check hub status
kubectl get lynqhub <hub-name> -o yaml
# List failed nodes
kubectl get lynqnodes -l operator.lynq.sh/hub=<hub-name> \
--field-selector status.phase=Failed
# Check database connectivity
kubectl logs -n lynq-system -l control-plane=controller-manager | grep "database\|mysql"
Resolution
For Database Issues:
# Verify database connectivity
kubectl get secret <db-secret> -o yaml
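# (Optional sketch) Decode the credential fields to confirm they match the database;
# the key name "password" is illustrative and may differ in your secret
kubectl get secret <db-secret> -o jsonpath='{.data.password}' | base64 -d; echo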
# Check database availability
kubectl run mysql-test --rm -it --image=mysql:8 -- \
mysql -h <db-host> -u <db-user> -p<db-password> -e "SELECT 1"
# Review hub sync interval
kubectl get lynqhub <hub-name> -o jsonpath='{.spec.source.syncInterval}'
For Template Issues:
# Check template validity
kubectl get lynqform -l operator.lynq.sh/hub=<hub-name>
# Review template syntax
kubectl get lynqform <template-name> -o yaml
# Validate template rendering
kubectl describe lynqnode <any-failed-node>
Warning Alerts
LynqNodeResourcesMismatch
Alert Name: LynqNodeResourcesMismatch
Severity: Warning
Threshold: lynqnode_resources_ready != lynqnode_resources_desired (no failures) for 15+ minutes
Description
LynqNode's ready resource count doesn't match desired count, but no failures are detected. Reconciliation may be stuck or slow.
Diagnosis
# Check resource counts
kubectl get lynqnode <lynqnode-name> -o jsonpath='Ready: {.status.readyResources}, Desired: {.status.desiredResources}, Failed: {.status.failedResources}'
# Check if resources are progressing
kubectl get all -l lynq.sh/lynqnode=<lynqnode-name>
Resolution
# Check for pending resources
kubectl get events --field-selector involvedObject.name=<resource-name>
# Verify dependencies are satisfied
kubectl get lynqnode <lynqnode-name> -o jsonpath='{.spec.*.dependIds}'
# Force reconciliation
kubectl annotate lynqnode <lynqnode-name> operator.lynq.sh/reconcile="$(date +%s)" --overwrite
LynqNodeResourcesConflicted
Alert Name: LynqNodeResourcesConflicted
Severity: Warning
Threshold: lynqnode_resources_conflicted > 0 for 10+ minutes
Description
Node has resources in conflict state, usually indicating ownership conflicts with existing resources.
Diagnosis
# Check conflicted resources
kubectl get lynqnode <lynqnode-name> -o jsonpath='{.status.conflictedResources}'
# Check conflict count
kubectl get lynqnode <lynqnode-name> -o jsonpath='{.status.resourcesConflicted}'
# Review conflict events
kubectl describe lynqnode <lynqnode-name> | grep Conflict
Resolution
# Identify conflicting resources
kubectl get events --field-selector reason=ResourceConflict
# Option 1: Delete conflicting resources
kubectl delete <resource-type> <resource-name>
# Option 2: Use unique naming
kubectl edit lynqform <template-name>
# Update nameTemplate: "{{ .uid }}-{{ .planId }}-app"
# Option 3: Change to Force policy
kubectl edit lynqform <template-name>
# Set: conflictPolicy: Force
LynqNodeHighConflictRate
Alert Name: LynqNodeHighConflictRate
Severity: Warning
Threshold: rate(lynqnode_conflicts_total[5m]) > 0.1 for 10+ minutes
Description
High rate of conflicts detected, indicating recurring ownership or naming issues.
Diagnosis
# Check conflict rate
kubectl get --raw /metrics | grep lynqnode_conflicts_total
# Identify conflict patterns
kubectl logs -n lynq-system -l control-plane=controller-manager | grep -i conflict
Resolution
# Review naming templates
kubectl get lynqform <template-name> -o yaml | grep nameTemplate
# Ensure unique names per node
# Use: nameTemplate: "{{ .uid }}-{{ sha1sum .host | trunc 8 }}-app"
# Consider Force policy if appropriate
kubectl patch lynqform <template-name> --type=merge -p '{"spec":{"deployments":[{"conflictPolicy":"Force"}]}}'
HubNodesFailure
Alert Name: HubNodesFailure
Severity: Warning
Threshold: 0 < hub_failed <= 5 for 10+ minutes
Description
Hub has some failed nodes (1-5), indicating isolated provisioning issues.
Diagnosis
# List failed nodes
kubectl get lynqnodes -l operator.lynq.sh/hub=<hub-name> \
| grep -v "True"
# Check specific node
kubectl describe lynqnode <failed-lynqnode-name>
Resolution
# Investigate individual node failures
kubectl logs -n lynq-system -l control-plane=controller-manager | grep <failed-lynqnode-name>
# Check node-specific data
kubectl get lynqnode <failed-lynqnode-name> -o yaml
# Verify database row
# (Connect to database and check node_id row)HubDesiredCountMismatch
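If you have the hub's database credentials handy, a one-off client pod (as in the HubManyNodesFailure runbook above) can check the row directly; the table and column names below are illustrative and should be adjusted to your schema.
# Sketch only: table/column names (nodes, node_id, activate) are assumptions
kubectl run mysql-check --rm -it --image=mysql:8 -- \
  mysql -h <db-host> -u <db-user> -p<db-password> <db-name> \
  -e "SELECT node_id, activate FROM nodes WHERE node_id = '<failed-lynqnode-name>';"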
HubDesiredCountMismatch
Alert Name: HubDesiredCountMismatch
Severity: Warning
Threshold: hub_ready != hub_desired (no failures) for 20+ minutes
Description
Hub's ready node count doesn't match desired count, but no failures detected. Sync may be delayed.
Diagnosis
# Check hub status
kubectl get lynqhub <hub-name> -o jsonpath='Desired: {.status.desired}, Ready: {.status.ready}, Failed: {.status.failed}'
# List all lynqnodes
kubectl get lynqnodes -l operator.lynq.sh/hub=<hub-name>
# Check sync interval
kubectl get lynqhub <hub-name> -o jsonpath='{.spec.source.syncInterval}'
Resolution
# Force hub sync
kubectl annotate lynqhub <hub-name> operator.lynq.sh/sync="$(date +%s)" --overwrite
# Check database for new rows
# Verify activate=true for expected nodes
# Review hub controller logs
kubectl logs -n lynq-system -l control-plane=controller-manager | grep "registry.*<hub-name>"
LynqNodeReconciliationErrors
Alert Name: LynqNodeReconciliationErrors
Severity: Warning
Threshold: Error rate > 10% for 10+ minutes
Description
High error rate in node reconciliations, indicating controller issues, API problems, or resource contention.
Diagnosis
# Check error rate
kubectl get --raw /metrics | grep 'lynqnode_reconcile_duration_seconds_count{result="error"}'
# Review controller logs for errors
kubectl logs -n lynq-system -l control-plane=controller-manager --tail=200 | grep -i error
# Check API server health
kubectl get --raw /healthz
kubectl get --raw /readyz
Resolution
# Check controller resource usage
kubectl top pods -n lynq-system
# Increase controller resources if needed
kubectl edit deployment -n lynq-system lynq-controller-manager
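# (Optional sketch) Non-interactive alternative; the container name "manager" is an
# assumption -- check your controller deployment before running
kubectl -n lynq-system set resources deployment/lynq-controller-manager \
  -c manager --limits=cpu=1,memory=1Gi --requests=cpu=500m,memory=512Mi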
# Review concurrent reconciliation settings
kubectl get deployment -n lynq-system lynq-controller-manager -o yaml | grep concurrent
LynqNodeReconciliationSlow
Alert Name: LynqNodeReconciliationSlow
Severity: Warning
Threshold: P95 duration > 30s for 15+ minutes
Description
Slow reconciliation detected (P95 > 30s), indicating performance issues, resource contention, or complex configurations.
Diagnosis
# Check reconciliation duration
kubectl get --raw /metrics | grep lynqnode_reconcile_duration_seconds
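# (Optional sketch) With a port-forward to Prometheus in place, confirm the P95 figure
# directly; this assumes the metric is a standard histogram exposing a _bucket series
curl -s 'http://localhost:9090/api/v1/query' --data-urlencode \
  'query=histogram_quantile(0.95, sum(rate(lynqnode_reconcile_duration_seconds_bucket[15m])) by (le))'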
# Identify slow nodes
kubectl get lynqnodes --sort-by='.status.lastReconcileTime'
# Check for large templates
kubectl get lynqforms -o json | jq '.items[] | {name: .metadata.name, resources: (.spec | [.deployments, .services, .configMaps] | flatten | length)}'
Resolution
# Optimize template complexity
# - Reduce resource count per node
# - Use efficient dependency chains
# - Avoid unnecessary waitForReady
# Increase concurrency
kubectl patch deployment -n lynq-system lynq-controller-manager \
--type=json -p='[{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--node-concurrency=20"}]'
# Consider sharding by namespace
# Deploy multiple operators with namespace filters
HighApplyFailureRate
Alert Name: HighApplyFailureRate
Severity: Warning
Threshold: Apply failure rate > 20% for 10+ minutes
Description
High failure rate for resource applies, indicating template issues or RBAC permission problems.
Diagnosis
# Check apply metrics
kubectl get --raw /metrics | grep apply_attempts_total
# Identify failing resource types
kubectl logs -n lynq-system -l control-plane=controller-manager | grep "Failed to apply"
# Check RBAC for resource types
kubectl auth can-i create deployment --as=system:serviceaccount:lynq-system:lynq
Resolution
# Verify RBAC permissions
kubectl describe clusterrole lynq-role
# Add missing permissions
kubectl edit clusterrole lynq-role
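# (Optional sketch) Grant a missing resource type non-interactively; the apiGroup,
# resource, and verbs shown are illustrative -- match them to the failing resource kind
kubectl patch clusterrole lynq-role --type=json -p='[
  {"op": "add", "path": "/rules/-", "value": {"apiGroups": ["apps"], "resources": ["deployments"], "verbs": ["get", "list", "watch", "create", "update", "patch", "delete"]}}
]'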
# Validate resource templates
kubectl get lynqform <template-name> -o yaml
# Check for API deprecations
kubectl api-resources | grep <resource-kind>
Info Alerts
LynqNodeNewConflictsDetected
Alert Name: LynqNodeNewConflictsDetected
Severity: Info
Threshold: increase(lynqnode_conflicts_total[5m]) > 0 for 2+ minutes
Description
New conflicts detected in the last 5 minutes. Informational alert for conflict awareness.
Diagnosis
# Check recent conflicts
kubectl get events --sort-by='.lastTimestamp' | grep Conflict | head -20
# View conflict details
kubectl describe lynqnode <lynqnode-name> | grep -A 5 Conflict
Resolution
If conflicts persist, follow the LynqNodeResourcesConflicted or LynqNodeHighConflictRate resolution procedures above.
General Troubleshooting Tips
Quick Diagnostic Commands
# Overall operator health
kubectl get pods -n lynq-system
kubectl top pods -n lynq-system
# All node statuses
kubectl get lynqnodes -A
# Recent events
kubectl get events -A --sort-by='.lastTimestamp' | tail -50
# Operator logs (last 1 hour)
kubectl logs -n lynq-system -l control-plane=controller-manager --since=1h
Common Fixes
Force Reconciliation:
kubectl annotate lynqnode <name> operator.lynq.sh/reconcile="$(date +%s)" --overwrite
Restart Controller:
kubectl rollout restart deployment -n lynq-system lynq-controller-manager
Validate Configuration:
kubectl get lynqhub,lynqform -A -o wide
When to Escalate
- Multiple critical alerts firing simultaneously
- Repeated failures after following runbook procedures
- Suspected operator bug or API server issues
- Database connectivity or performance problems
