
Alert Runbooks ​

Comprehensive troubleshooting guide for Lynq alerts. Each alert includes diagnosis steps and resolution procedures.

Overview ​

This page provides detailed runbooks for all Lynq alerts, organized by severity:

  • Critical Alerts: Require immediate action - production impact
  • Warning Alerts: Require investigation - potential issues
  • Info Alerts: Informational - awareness only

Quick Navigation

Use the table of contents (right sidebar) to jump directly to a specific alert.


Critical Alerts ​

LynqNodeDegraded ​

Alert Name: LynqNodeDegraded
Severity: Critical
Threshold: lynqnode_degraded_status > 0 for 5+ minutes

Description ​

LynqNode CR has entered a degraded state, indicating the operator cannot successfully reconcile the node's resources. This is a critical condition preventing normal node operation.

Symptoms ​

  • LynqNode's Ready condition is False
  • status.degraded shows specific degradation reason
  • Resources may be partially applied or stuck
  • Events show template, conflict, or dependency errors

Possible Causes ​

  1. Template Rendering Errors - Missing variables, syntax errors, type conversion failures
  2. Dependency Cycles - Circular dependencies between resources
  3. Resource Conflicts - Resources exist with different owners (ConflictPolicy: Stuck)
  4. External Dependencies - Missing ConfigMaps, Secrets, or Namespaces
  5. RBAC Issues - Operator lacks permissions

Diagnosis ​

bash
# Check node status
kubectl get lynqnode <lynqnode-name> -n <namespace> -o yaml

# Review events
kubectl describe lynqnode <lynqnode-name> -n <namespace>

# Check operator logs
kubectl logs -n lynq-system -l control-plane=controller-manager | grep <lynqnode-name>

# Validate template variables
kubectl get lynqnode <lynqnode-name> -n <namespace> -o jsonpath='{.metadata.annotations}'

Resolution ​

For Template Errors:

bash
# Check LynqHub extraValueMappings
kubectl get lynqhub <hub-name> -o yaml

# Verify LynqForm syntax
kubectl get lynqform <template-name> -o yaml

For Dependency Cycles:

bash
# Review dependIds configuration
kubectl get lynqnode <lynqnode-name> -o jsonpath='{.spec.*.dependIds}'

# Remove circular dependencies in template
kubectl edit lynqform <template-name>

For Resource Conflicts:

bash
# Check conflicting resources
kubectl get <resource-type> <resource-name> -o yaml | grep -A 10 metadata

# Option 1: Delete conflicting resource
kubectl delete <resource-type> <resource-name>

# Option 2: Change ConflictPolicy to Force
kubectl edit lynqform <template-name>
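
The same ConflictPolicy change can be made non-interactively with a patch, mirroring the conflictPolicy patch shown under LynqNodeHighConflictRate below. Note that a JSON merge patch replaces the whole deployments list, so prefer kubectl edit when the form has more than one entry:

bash
# Non-interactive alternative (replaces the entire deployments list)
kubectl patch lynqform <template-name> --type=merge \
  -p '{"spec":{"deployments":[{"conflictPolicy":"Force"}]}}'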

LynqNodeResourcesFailed ​

Alert Name: LynqNodeResourcesFailed
Severity: Critical
Threshold: lynqnode_resources_failed > 0 for 5+ minutes

Description ​

The node has one or more resources that failed to apply or became unhealthy, indicating a critical provisioning failure.

Symptoms ​

  • status.failedResources > 0
  • Specific resources not created or in error state
  • Events show apply failures or timeout errors

Diagnosis ​

bash
# Check failed resources count
kubectl get lynqnode <lynqnode-name> -o jsonpath='{.status.failedResources}'

# List applied resources
kubectl get lynqnode <lynqnode-name> -o jsonpath='{.status.appliedResources}'

# Check events for failures
kubectl get events --field-selector involvedObject.kind=LynqNode,involvedObject.name=<lynqnode-name>

Resolution ​

For Apply Failures:

bash
# Check RBAC permissions
kubectl auth can-i create <resource-type> --as=system:serviceaccount:lynq-system:lynq

# Review resource spec in template
kubectl get lynqform <template-name> -o yaml

For Readiness Timeouts:

bash
# Increase timeout
kubectl edit lynqform <template-name>
# Set: timeoutSeconds: 600

# Or disable readiness wait
# Set: waitForReady: false
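
The same change can be scripted. This is a sketch only: it assumes timeoutSeconds and waitForReady sit on entries of spec.deployments (as the conflictPolicy patch later on this page suggests), and a JSON merge patch replaces the whole list, so kubectl edit is safer when the form has several entries:

bash
# Non-interactive alternative (field placement is an assumption; verify against your LynqForm schema)
kubectl patch lynqform <template-name> --type=merge \
  -p '{"spec":{"deployments":[{"timeoutSeconds":600,"waitForReady":false}]}}'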

LynqNodeNotReady ​

Alert Name: LynqNodeNotReady
Severity: Critical
Threshold: lynqnode_condition_status{type="Ready"} == 0 for 15+ minutes

Description ​

Node has not reached Ready state for an extended period, indicating persistent provisioning issues.

Symptoms ​

  • Ready condition is False for 15+ minutes
  • Resources may be pending, creating, or failing health checks
  • LynqNode status shows ongoing reconciliation

Diagnosis ​

bash
# Check Ready condition
kubectl get lynqnode <lynqnode-name> -o jsonpath='{.status.conditions[?(@.type=="Ready")]}'

# Check resource readiness
kubectl get lynqnode <lynqnode-name> -o jsonpath='{.status.readyResources}/{.status.desiredResources}'

# Identify slow resources
kubectl get all -l lynq.sh/lynqnode=<lynqnode-name>

Resolution ​

bash
# Check if resources are progressing
kubectl describe <resource-type> <resource-name>

# Review readiness probes
kubectl get pod <pod-name> -o yaml | grep -A 10 readinessProbe

# Check dependencies
kubectl get lynqnode <lynqnode-name> -o jsonpath='{.spec.*.dependIds}'
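
Once the blocker is addressed, watch readiness converge instead of polling by hand (standard kubectl, reusing the label and status fields above):

bash
# Watch the LynqNode until the Ready condition flips
kubectl get lynqnode <lynqnode-name> -n <namespace> -w

# Watch the node's resources become ready
watch kubectl get all -l lynq.sh/lynqnode=<lynqnode-name>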

LynqNodeStatusUnknown ​

Alert Name: LynqNodeStatusUnknown
Severity: Critical
Threshold: lynqnode_condition_status{type="Ready"} == 2 for 10+ minutes

Description ​

LynqNode status is Unknown, indicating potential controller or API server communication issues.

Symptoms ​

  • Ready condition status is Unknown
  • Status updates not propagating
  • Controller may be unreachable or crashed

Diagnosis ​

bash
# Check controller pods
kubectl get pods -n lynq-system

# Check controller logs
kubectl logs -n lynq-system -l control-plane=controller-manager --tail=100

# Check API server connectivity
kubectl get --raw /healthz

Resolution ​

bash
# Restart controller if unhealthy
kubectl rollout restart deployment -n lynq-system lynq-controller-manager

# Check for resource pressure
kubectl top pods -n lynq-system

# Review recent changes
kubectl rollout history deployment -n lynq-system lynq-controller-manager
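
If the controller is crash-looping, the previous container's logs usually show the reason (standard kubectl, not Lynq-specific):

bash
# Logs from the previous (crashed) container instance
kubectl logs -n lynq-system <controller-pod-name> --previous --tail=100

# Restart counts and last termination state
kubectl get pods -n lynq-system
kubectl describe pod -n lynq-system <controller-pod-name> | grep -A 5 "Last State"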

HubManyNodesFailure ​

Alert Name: HubManyNodesFailure
Severity: Critical
Threshold: hub_failed > 5 or hub_failed / hub_desired > 0.5 for 5+ minutes

Description ​

The hub has widespread node failures (>5 lynqnodes or a >50% failure rate), indicating a systemic issue affecting multiple lynqnodes.

Symptoms ​

  • High number of failed lynqnodes in hub
  • Multiple lynqnodes showing similar errors
  • Pattern of failures across all lynqnodes

Diagnosis ​

bash
# Check hub status
kubectl get lynqhub <hub-name> -o yaml

# List failed nodes
kubectl get lynqnodes -l operator.lynq.sh/hub=<hub-name> \
  --field-selector status.phase=Failed

# Check database connectivity
kubectl logs -n lynq-system -l control-plane=controller-manager | grep "database\|mysql"

Resolution ​

For Database Issues:

bash
# Verify database connectivity
kubectl get secret <db-secret> -o yaml

# Check database availability
kubectl run mysql-test --rm -it --image=mysql:8 -- \
  mysql -h <db-host> -u <db-user> -p<db-password> -e "SELECT 1"

# Review hub sync interval
kubectl get lynqhub <hub-name> -o jsonpath='{.spec.source.syncInterval}'
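
Secret values are base64-encoded, so decode them before plugging them into the connectivity test. The key name password below is an assumption; list the secret's actual keys first:

bash
# List the keys present in the secret
kubectl get secret <db-secret> -o json | jq '.data | keys'

# Decode a value (key name is an assumption)
kubectl get secret <db-secret> -o jsonpath='{.data.password}' | base64 -d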

For Template Issues:

bash
# Check template validity
kubectl get lynqform -l operator.lynq.sh/hub=<hub-name>

# Review template syntax
kubectl get lynqform <template-name> -o yaml

# Validate template rendering
kubectl describe lynqnode <any-failed-node>

Warning Alerts ​

LynqNodeResourcesMismatch ​

Alert Name: LynqNodeResourcesMismatch
Severity: Warning
Threshold: lynqnode_resources_ready != lynqnode_resources_desired (no failures) for 15+ minutes

Description ​

The LynqNode's ready resource count doesn't match its desired count, but no failures are detected. Reconciliation may be stuck or slow.

Diagnosis ​

bash
# Check resource counts
kubectl get lynqnode <lynqnode-name> -o jsonpath='Ready: {.status.readyResources}, Desired: {.status.desiredResources}, Failed: {.status.failedResources}'

# Check if resources are progressing
kubectl get all -l lynq.sh/lynqnode=<lynqnode-name>

Resolution ​

bash
# Check for pending resources
kubectl get events --field-selector involvedObject.name=<resource-name>

# Verify dependencies are satisfied
kubectl get lynqnode <lynqnode-name> -o jsonpath='{.spec.*.dependIds}'

# Force reconciliation
kubectl annotate lynqnode <lynqnode-name> operator.lynq.sh/reconcile="$(date +%s)" --overwrite

LynqNodeResourcesConflicted ​

Alert Name: LynqNodeResourcesConflicted
Severity: Warning
Threshold: lynqnode_resources_conflicted > 0 for 10+ minutes

Description ​

Node has resources in conflict state, usually indicating ownership conflicts with existing resources.

Diagnosis ​

bash
# Check conflicted resources
kubectl get lynqnode <lynqnode-name> -o jsonpath='{.status.conflictedResources}'

# Check conflict count
kubectl get lynqnode <lynqnode-name> -o jsonpath='{.status.resourcesConflicted}'

# Review conflict events
kubectl describe lynqnode <lynqnode-name> | grep Conflict

Resolution ​

bash
# Identify conflicting resources
kubectl get events --field-selector reason=ResourceConflict

# Option 1: Delete conflicting resources
kubectl delete <resource-type> <resource-name>

# Option 2: Use unique naming
kubectl edit lynqform <template-name>
# Update nameTemplate: "{{ .uid }}-{{ .planId }}-app"

# Option 3: Change to Force policy
kubectl edit lynqform <template-name>
# Set: conflictPolicy: Force
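
Before deleting anything, check who currently owns and manages the conflicting resource (standard Kubernetes metadata, not Lynq-specific):

bash
# Owner references on the conflicting resource
kubectl get <resource-type> <resource-name> -o jsonpath='{.metadata.ownerReferences}'

# Field managers (who applied which fields)
kubectl get <resource-type> <resource-name> --show-managed-fields -o yaml | grep -A 10 managedFields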

LynqNodeHighConflictRate ​

Alert Name: LynqNodeHighConflictRate
Severity: Warning
Threshold: rate(lynqnode_conflicts_total[5m]) > 0.1 for 10+ minutes

Description ​

High rate of conflicts detected, indicating recurring ownership or naming issues.

Diagnosis ​

bash
# Check conflict rate
kubectl get --raw /metrics | grep lynqnode_conflicts_total

# Identify conflict patterns
kubectl logs -n lynq-system -l control-plane=controller-manager | grep -i conflict

Resolution ​

bash
# Review naming templates
kubectl get lynqform <template-name> -o yaml | grep nameTemplate

# Ensure unique names per node
# Use: nameTemplate: "{{ .uid }}-{{ sha1sum .host | trunc 8 }}-app"

# Consider Force policy if appropriate
kubectl patch lynqform <template-name> --type=merge -p '{"spec":{"deployments":[{"conflictPolicy":"Force"}]}}'

HubNodesFailure ​

Alert Name: HubNodesFailure
Severity: Warning
Threshold: 0 < hub_failed <= 5 for 10+ minutes

Description ​

Hub has some failed nodes (1-5), indicating isolated provisioning issues.

Diagnosis ​

bash
# List failed nodes
kubectl get lynqnodes -l operator.lynq.sh/hub=<hub-name> \
  | grep -v "True"

# Check specific node
kubectl describe lynqnode <failed-lynqnode-name>

Resolution ​

bash
# Investigate individual node failures
kubectl logs -n lynq-system -l control-plane=controller-manager | grep <failed-lynqnode-name>

# Check node-specific data
kubectl get lynqnode <failed-lynqnode-name> -o yaml

# Verify database row
# (Connect to database and check node_id row)
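
A quick way to inspect the row is the same throwaway MySQL pod used elsewhere on this page. The table and column names below are placeholders; use the source table configured in your LynqHub:

bash
# Table/column names are hypothetical - substitute your hub's configured source table
kubectl run mysql-test --rm -it --image=mysql:8 -- \
  mysql -h <db-host> -u <db-user> -p<db-password> \
  -e "SELECT * FROM <nodes-table> WHERE node_id = '<node-id>';"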

HubDesiredCountMismatch ​

Alert Name: HubDesiredCountMismatch
Severity: Warning
Threshold: hub_ready != hub_desired (no failures) for 20+ minutes

Description ​

The hub's ready node count doesn't match its desired count, but no failures are detected. Sync may be delayed.

Diagnosis ​

bash
# Check hub status
kubectl get lynqhub <hub-name> -o jsonpath='Desired: {.status.desired}, Ready: {.status.ready}, Failed: {.status.failed}'

# List all lynqnodes
kubectl get lynqnodes -l operator.lynq.sh/hub=<hub-name>

# Check sync interval
kubectl get lynqhub <hub-name> -o jsonpath='{.spec.source.syncInterval}'

Resolution ​

bash
# Force hub sync
kubectl annotate lynqhub <hub-name> operator.lynq.sh/sync="$(date +%s)" --overwrite

# Check database for new rows
# Verify activate=true for expected nodes

# Review hub controller logs
kubectl logs -n lynq-system -l control-plane=controller-manager | grep "registry.*<hub-name>"

LynqNodeReconciliationErrors ​

Alert Name: LynqNodeReconciliationErrors
Severity: Warning
Threshold: Error rate > 10% for 10+ minutes

Description ​

High error rate in node reconciliations, indicating controller issues, API problems, or resource contention.

Diagnosis ​

bash
# Check error rate
kubectl get --raw /metrics | grep 'lynqnode_reconcile_duration_seconds_count{result="error"}'

# Review controller logs for errors
kubectl logs -n lynq-system -l control-plane=controller-manager --tail=200 | grep -i error

# Check API server health
kubectl get --raw /healthz
kubectl get --raw /readyz

Resolution ​

bash
# Check controller resource usage
kubectl top pods -n lynq-system

# Increase controller resources if needed
kubectl edit deployment -n lynq-system lynq-controller-manager

# Review concurrent reconciliation settings
kubectl get deployment -n lynq-system lynq-controller-manager -o yaml | grep concurrent

LynqNodeReconciliationSlow ​

Alert Name: LynqNodeReconciliationSlow
Severity: Warning
Threshold: P95 duration > 30s for 15+ minutes

Description ​

Slow reconciliation detected (P95 > 30s), indicating performance issues, resource contention, or complex configurations.

Diagnosis ​

bash
# Check reconciliation duration
kubectl get --raw /metrics | grep lynqnode_reconcile_duration_seconds

# Identify slow nodes
kubectl get lynqnodes --sort-by='.status.lastReconcileTime'

# Check for large templates
kubectl get lynqforms -o json | jq '.items[] | {name: .metadata.name, resources: (.spec | [.deployments, .services, .configMaps] | flatten | length)}'

Resolution ​

bash
# Optimize template complexity
# - Reduce resource count per node
# - Use efficient dependency chains
# - Avoid unnecessary waitForReady

# Increase concurrency
kubectl patch deployment -n lynq-system lynq-controller-manager \
  --type=json -p='[{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--node-concurrency=20"}]'

# Consider sharding by namespace
# Deploy multiple operators with namespace filters

HighApplyFailureRate ​

Alert Name: HighApplyFailureRate
Severity: Warning
Threshold: Apply failure rate > 20% for 10+ minutes

Description ​

High failure rate for resource applies, indicating template issues or RBAC permission problems.

Diagnosis ​

bash
# Check apply metrics
kubectl get --raw /metrics | grep apply_attempts_total

# Identify failing resource types
kubectl logs -n lynq-system -l control-plane=controller-manager | grep "Failed to apply"

# Check RBAC for resource types
kubectl auth can-i create deployment --as=system:serviceaccount:lynq-system:lynq
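
To review all permissions at once instead of probing verb by verb, list everything the operator's service account can do (standard kubectl):

bash
# Full permission listing for the operator service account
kubectl auth can-i --list --as=system:serviceaccount:lynq-system:lynq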

Resolution ​

bash
# Verify RBAC permissions
kubectl describe clusterrole lynq-role

# Add missing permissions
kubectl edit clusterrole lynq-role

# Validate resource templates
kubectl get lynqform <template-name> -o yaml

# Check for API deprecations
kubectl api-resources | grep <resource-kind>

Info Alerts ​

LynqNodeNewConflictsDetected ​

Alert Name: LynqNodeNewConflictsDetected
Severity: Info
Threshold: increase(lynqnode_conflicts_total[5m]) > 0 for 2+ minutes

Description ​

New conflicts detected in the last 5 minutes. Informational alert for conflict awareness.

Diagnosis ​

bash
# Check recent conflicts
kubectl get events --sort-by='.lastTimestamp' | grep Conflict | head -20

# View conflict details
kubectl describe lynqnode <lynqnode-name> | grep -A 5 Conflict

Resolution ​

If conflicts persist, escalate to LynqNodeResourcesConflicted or LynqNodeHighConflictRate resolution procedures.


General Troubleshooting Tips ​

Quick Diagnostic Commands ​

bash
# Overall operator health
kubectl get pods -n lynq-system
kubectl top pods -n lynq-system

# All node statuses
kubectl get lynqnodes -A

# Recent events
kubectl get events -A --sort-by='.lastTimestamp' | tail -50

# Operator logs (last 1 hour)
kubectl logs -n lynq-system -l control-plane=controller-manager --since=1h

Common Fixes ​

  1. Force Reconciliation:

    bash
    kubectl annotate lynqnode <name> operator.lynq.sh/reconcile="$(date +%s)" --overwrite
  2. Restart Controller:

    bash
    kubectl rollout restart deployment -n lynq-system lynq-controller-manager
  3. Validate Configuration:

    bash
    kubectl get lynqhub,lynqform -A -o wide

When to Escalate ​

  • Multiple critical alerts firing simultaneously
  • Repeated failures after following runbook procedures
  • Suspected operator bug or API server issues
  • Database connectivity or performance problems
