Troubleshooting Guide
Common issues and solutions for Lynq.
General Debugging
Check Operator Status
# Check if operator is running
kubectl get pods -n lynq-system
# View operator logs
kubectl logs -n lynq-system deployment/lynq-controller-manager -f
# Check operator events
kubectl get events -n lynq-system --sort-by='.lastTimestamp'
Check CRD Status
# List all LynqNode CRs
kubectl get lynqnodes --all-namespaces
# Describe a specific LynqNode
kubectl describe lynqnode <lynqnode-name>
# Get LynqNode status
kubectl get lynqnode <lynqnode-name> -o jsonpath='{.status}'
Common Issues
1. Webhook TLS Certificate Errors
Error:
open /tmp/k8s-webhook-server/serving-certs/tls.crt: no such file or directory
Cause: Webhook TLS certificates not found. cert-manager is REQUIRED for all installations.
cert-manager Required
cert-manager v1.13.0+ is REQUIRED for ALL installations including local development. Webhooks provide validation and defaulting at admission time.
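If cert-manager itself looks healthy but admission requests still fail, confirm the webhook configurations exist and have a CA bundle injected. A minimal check, assuming the configuration names contain "lynq" (adjust to your install):
# List Lynq webhook configurations (name pattern is an assumption)
kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations | grep -i lynq
# A populated caBundle indicates cert-manager's CA injector has done its job
kubectl get validatingwebhookconfiguration <name> -o jsonpath='{.webhooks[*].clientConfig.caBundle}' | head -c 40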
Diagnosis:
# Check if cert-manager is installed
kubectl get pods -n cert-manager
# Check if Certificate resource exists
kubectl get certificate -n lynq-system
# Check Certificate details
kubectl describe certificate -n lynq-system
# Check if secret was created
kubectl get secret -n lynq-system | grep webhook-server-cert
Solutions:
A. Install cert-manager (if not installed):
# Install cert-manager
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.13.0/cert-manager.yaml
# Wait for cert-manager to be ready
kubectl wait --for=condition=Available --timeout=300s -n cert-manager \
deployment/cert-manager \
deployment/cert-manager-webhook \
deployment/cert-manager-cainjector
# Verify cert-manager is running
kubectl get pods -n cert-manager
B. Restart operator (after cert-manager is ready):
kubectl rollout restart -n lynq-system deployment/lynq-controller-manager
# Watch rollout status
kubectl rollout status -n lynq-system deployment/lynq-controller-manager
C. Check Certificate issuance:
# Check if Certificate is Ready
kubectl get certificate -n lynq-system
# If not ready, check cert-manager logs
kubectl logs -n cert-manager -l app=cert-manager
# Check if Issuer exists
kubectl get issuer -n lynq-system
2. LynqNode Not Creating Resources
Symptoms:
- LynqNode CR exists
- Status shows desiredResources > 0
- But readyResources = 0
Diagnosis:
# Check LynqNode status
kubectl get lynqnode <name> -o yaml
# Check events
kubectl describe lynqnode <name>
# Check operator logs
kubectl logs -n lynq-system deployment/lynq-controller-manager | grep <node-name>
Common Causes:
A. Template Rendering Error
# Look for: "Failed to render resource"
kubectl describe lynqnode <name> | grep -A5 "TemplateRenderError"
Solution: Fix template syntax in LynqForm
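For illustration only (the deployments/nameTemplate fields are taken from examples elsewhere in this guide), a common fix is closing an unterminated Go template action:
deployments:
  - id: app
    # Broken: unterminated template action
    # nameTemplate: "app-{{ .uid"
    # Fixed:
    nameTemplate: "app-{{ .uid }}"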
B. Missing Variable
# Look for: "map has no entry for key"
kubectl logs -n lynq-system deployment/lynq-controller-manager | grep "missing"
Solution: Add missing variable to extraValueMappings
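A sketch of the intended shape, assuming extraValueMappings sits alongside valueMappings in the LynqHub spec; the extra key and column shown here are hypothetical:
spec:
  valueMappings:
    uid: node_id
    hostOrUrl: node_url
    activate: is_active
  extraValueMappings:
    planTier: plan_tier # exposes {{ .planTier }} to templates; column must exist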
C. Resource Conflict
# Look for: "ResourceConflict"
kubectl describe lynqnode <name> | grep "ResourceConflict"
Solution: Delete conflicting resource or use conflictPolicy: Force
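If you go the conflictPolicy: Force route, it is set per resource entry in the LynqForm; a sketch, assuming the same placement as deletionPolicy shown later in this guide:
deployments:
  - id: app
    conflictPolicy: Force # take over the conflicting resource instead of erroring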
3. Database Connection Failures
Error:
Failed to query database: dial tcp: connect: connection refused
Diagnosis:
# Check secret exists
kubectl get secret <mysql-secret> -o yaml
# Check Hub status
kubectl get lynqhub <name> -o yaml
# Test database connection from a pod
kubectl run -it --rm mysql-test --image=mysql:8 --restart=Never -- \
mysql -h <host> -u <user> -p<password> -e "SELECT 1"
Solutions:
A. Verify credentials:
kubectl get secret <mysql-secret> -o jsonpath='{.data.password}' | base64 -d
B. Check network connectivity:
kubectl exec -n lynq-system deployment/lynq-controller-manager -- \
nc -zv <mysql-host> 3306
C. Verify LynqHub configuration:
spec:
  source:
    mysql:
      host: mysql.default.svc.cluster.local # Correct FQDN
      port: 3306
      database: nodes
4. LynqNode Status Not Updating
Symptoms:
- Resources are ready in cluster
- LynqNode status shows readyResources = 0
Causes:
- Reconciliation not triggered
- Readiness check failing
Solutions:
A. Force reconciliation:
# Add annotation to trigger reconciliation
kubectl annotate lynqnode <name> force-sync="$(date +%s)" --overwrite
B. Check readiness logic:
# For Deployments
kubectl get deployment <name> -o jsonpath='{.status}'
# Check if replicas match
kubectl get deployment <name> -o jsonpath='{.spec.replicas} {.status.availableReplicas}'
C. Wait longer (resources take time to become ready; see the wait command after this list):
- Deployments: 30s - 2min
- Jobs: Variable
- Ingresses: 10s - 1min
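Rather than polling by hand, you can block until a specific child resource is ready, for example a Deployment created by the node:
# Returns once the Deployment reports the Available condition (or times out)
kubectl wait --for=condition=Available deployment/<name> --timeout=120s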
5. Template Variables Not Substituting
Symptoms:
- Template shows {{ .uid }} literally in resources
- Variables not replaced
Cause: Templates not rendered correctly
Diagnosis:
# Check rendered LynqNode spec
kubectl get lynqnode <name> -o jsonpath='{.spec.deployments[0].nameTemplate}'
Solution:
- Ensure Hub has correct valueMappings
- Check database column names match mappings
- Verify node row has non-empty values
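To verify the last two points directly against the database (table and column names follow the Hub example later in this guide; adjust to your schema):
-- Empty or NULL values here will render as empty strings in templates
SELECT node_id, node_url, is_active FROM nodes WHERE node_id = '<uid>';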
6. Slow LynqNode Provisioning
Symptoms:
- LynqNodes taking > 5 minutes to provision
- High operator CPU usage
Diagnosis:
# Check reconciliation times
kubectl logs -n lynq-system deployment/lynq-controller-manager | \
grep "Reconciliation completed" | tail -20
# Check resource counts
kubectl get lynqnodes -o json | jq '.items[] | {name: .metadata.name, desired: .status.desiredResources}'
Solutions:
A. Disable readiness waits:
waitForReady: false
B. Increase concurrency:
args:
- --node-concurrency=20 # Increase LynqNode reconciliation concurrency
- --form-concurrency=10 # Increase Template reconciliation concurrency
- --hub-concurrency=5 # Increase Hub reconciliation concurrency
C. Optimize templates (see Performance Guide)
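To see where reconciliation time is actually going (before and after tuning), reconcile-duration metrics can help, assuming the operator exposes standard controller-runtime metrics; the port below is a guess, so check the container ports on the Deployment first:
# Port 8080 is an assumption; adjust to the port the manager actually exposes
kubectl port-forward -n lynq-system deployment/lynq-controller-manager 8080:8080 &
curl -s localhost:8080/metrics | grep controller_runtime_reconcile_time_seconds_sum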
7. Memory/CPU Issues
Symptoms:
- Operator pod OOMKilled
- High CPU usage
Diagnosis:
# Check resource usage
kubectl top pod -n lynq-system
# Check for memory leaks
kubectl logs -n lynq-system deployment/lynq-controller-manager --previous
Solutions:
A. Increase resource limits:
resources:
  limits:
    cpu: 2000m
    memory: 2Gi
B. Reduce concurrency:
args:
- --node-concurrency=5 # Reduce LynqNode reconciliation concurrency
- --form-concurrency=3 # Reduce Template reconciliation concurrency
- --hub-concurrency=1 # Reduce Hub reconciliation concurrency
C. Increase requeue interval:
args:
- --requeue-interval=1m
8. Finalizer Stuck
Symptoms:
- LynqNode CR stuck in Terminating state
- Can't delete LynqNode
Diagnosis:
# Check finalizers
kubectl get lynqnode <name> -o jsonpath='{.metadata.finalizers}'
# Check deletion timestamp
kubectl get lynqnode <name> -o jsonpath='{.metadata.deletionTimestamp}'
Solutions:
A. Check operator logs for deletion errors:
kubectl logs -n lynq-system deployment/lynq-controller-manager | \
grep "Failed to delete"B. Force remove finalizer (last resort):
kubectl patch lynqnode <name> -p '{"metadata":{"finalizers":[]}}' --type=merge
Warning: This may leave orphaned resources!
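After a forced finalizer removal, list anything still carrying the node's label so you can clean it up deliberately:
# Uses the lynq.sh/node label shown in the orphaned-resources section below
kubectl get all -A -l lynq.sh/node=<lynqnode-name>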
9. Hub Not Syncing
Symptoms:
- Database has active rows
- No LynqNode CRs created
Diagnosis:
# Check Hub status
kubectl get lynqhub <name> -o yaml
# Check operator logs
kubectl logs -n lynq-system deployment/lynq-controller-manager | \
grep "Hub"Common Causes:
A. Incorrect valueMappings:
# Must match database columns exactly
valueMappings:
  uid: node_id # Column must exist
  hostOrUrl: node_url # Column must exist
  activate: is_active # Column must exist
B. No active rows:
-- Check for active nodes
SELECT COUNT(*) FROM nodes WHERE is_active = TRUE;
C. Database query error:
# Check logs for SQL errors
kubectl logs -n lynq-system deployment/lynq-controller-manager | \
grep "Failed to query"10. Multi-Form Issues
Symptoms:
- Expected 2× nodes, only seeing 1×
- Wrong desired count
Diagnosis:
# Check Hub status
kubectl get lynqhub <name> -o jsonpath='{.status}'
# Should show:
# referencingTemplates: 2
# desired: <forms> × <rows>
# Check forms reference same hub
kubectl get lynqforms -o jsonpath='{.items[*].spec.hubId}'
Solution: Ensure all forms correctly reference the hub:
spec:
  hubId: my-hub # Must match exactly
11. Orphaned Resources Not Cleaning Up
Symptoms:
- Resources removed from LynqForm still exist in cluster
- appliedResources status not updating
- Unexpected resources with node labels/ownerReferences
Diagnosis:
# Check current applied resources
kubectl get lynqnode <name> -o jsonpath='{.status.appliedResources}'
# Should show: ["Deployment/default/app@deploy-1", "Service/default/app@svc-1"]
# List resources with node labels
kubectl get all -l lynq.sh/node=<lynqnode-name>
# Find orphaned resources (retained with DeletionPolicy=Retain)
kubectl get all -A -l lynq.sh/orphaned=true
# Find orphaned resources from this node
kubectl get all -A -l lynq.sh/orphaned=true,lynq.sh/node=<lynqnode-name>
# Check resource DeletionPolicy
kubectl get lynqform <name> -o yaml | grep -A2 deletionPolicy
Common Causes:
- DeletionPolicy=Retain: Resource was intentionally retained and marked with orphan labels
- Status not syncing: AppliedResources field not updated
- Manual resource modification: OwnerReference or labels removed manually
- Operator version: Upgrade from version without orphan cleanup
Expected Behavior
Resources with DeletionPolicy=Retain are intentionally kept in the cluster and marked with orphan labels for easy identification. This is not a bug - it's the designed behavior!
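If a retained resource is later confirmed to be unneeded, the orphan labels make a targeted cleanup easy; preview the selector output before deleting anything:
# Preview first
kubectl get all -A -l lynq.sh/orphaned=true,lynq.sh/node=<lynqnode-name>
# Then delete only the kinds you actually want gone
kubectl delete deployment,service -A -l lynq.sh/orphaned=true,lynq.sh/node=<lynqnode-name>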
Solutions:
A. Verify DeletionPolicy:
# Check template definition
deployments:
  - id: old-deployment
    deletionPolicy: Delete # Should be Delete, not Retain
B. Force reconciliation:
# Trigger reconciliation by updating an annotation
kubectl annotate lynqnode <name> force-sync="$(date +%s)" --overwrite
# Watch logs
kubectl logs -n lynq-system deployment/lynq-controller-manager -f
C. Manual cleanup (if needed):
# Delete orphaned resource manually
kubectl delete deployment <orphaned-resource>
# Or remove owner reference if you want to keep it
kubectl patch deployment <name> --type=json -p='[{"op": "remove", "path": "/metadata/ownerReferences"}]'
D. Check status update:
# Verify appliedResources is being updated
kubectl get lynqnode <name> -o jsonpath='{.status.appliedResources}' | jq
# Should reflect current template resources only
Prevention:
- Use deletionPolicy: Delete for resources that should be cleaned up
- Monitor appliedResources status field regularly
- Test template changes in non-production first
- Review orphan cleanup behavior in Policies Guide
12. LynqNode Showing Degraded with ResourcesNotReady
New in v1.1.4
The ResourcesNotReady degraded condition provides granular visibility into resources that haven't reached ready state yet.
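To read just that condition without scanning the whole status, standard kubectl JSONPath filtering works:
# Prints the reason on the Degraded condition, e.g. ResourcesNotReady
kubectl get lynqnode <name> -o jsonpath='{.status.conditions[?(@.type=="Degraded")].reason}'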
Symptoms:
- LynqNode condition Degraded=True with reason ResourcesNotReady
- Not all resources showing as ready even though they exist
- readyResources < desiredResources in LynqNode status
- Ready condition shows status=False with reason NotAllResourcesReady
Diagnosis:
# Check LynqNode status
kubectl get lynqnode <name> -o jsonpath='{.status}' | jq
# Should show:
# "conditions": [
# {"type": "Ready", "status": "False", "reason": "NotAllResourcesReady"},
# {"type": "Degraded", "status": "True", "reason": "ResourcesNotReady"}
# ]
# Check resource readiness
kubectl get lynqnode <name> -o jsonpath='{.status.readyResources} / {.status.desiredResources}'
# Identify which resources are not ready
kubectl describe lynqnode <name>
# Check recent events
kubectl get events --field-selector involvedObject.name=<lynqnode-name> --sort-by='.lastTimestamp'
Common Causes:
- Resources still starting up: Normal during initial provisioning
- Resource readiness checks failing: Container failing health checks
- Dependency not satisfied: Waiting for dependent resources
- Timeout exceeded: Resource taking longer than timeoutSeconds
- Image pull errors: Container images not available
- Insufficient resources: Not enough CPU/memory in cluster
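For the last two causes in particular, a cluster-wide scan of warning events usually surfaces the problem faster than inspecting pods one by one:
# Image pull and scheduling problems show up as Warning events
kubectl get events -A --field-selector type=Warning --sort-by='.lastTimestamp' | grep -Ei 'imagepull|failedscheduling|insufficient'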
Solutions:
A. Check resource status:
# For Deployments
kubectl get deployment <name> -o jsonpath='{.status}' | jq
# Check if replicas match
kubectl get deployment <name> -o jsonpath='{.spec.replicas} desired, {.status.availableReplicas} available'
# For StatefulSets
kubectl get statefulset <name> -o jsonpath='{.status.readyReplicas}/{.spec.replicas} ready'
# For Jobs
kubectl get job <name> -o jsonpath='{.status.succeeded}'
# For Services (should be immediate)
kubectl get service <name>
# Check pod status
kubectl get pods -l app=<name>
kubectl describe pod <pod-name>
B. Check resource logs:
# Deployment pods
kubectl logs deployment/<name> --tail=50
# Job logs
kubectl logs job/<name>
# Check for errors
kubectl logs deployment/<name> --previous # If pod crashed
C. Check readiness probes:
# See if readiness probes are configured correctly
kubectl get deployment <name> -o jsonpath='{.spec.template.spec.containers[*].readinessProbe}' | jq
D. Adjust timeouts if needed:
# In LynqForm
deployments:
  - id: app
    timeoutSeconds: 600 # Increase from default 300s
    waitForReady: true # Ensure readiness checks are enabled
    spec:
      template:
        spec:
          containers:
            - name: app
              readinessProbe:
                httpGet:
                  path: /health
                  port: 8080
                initialDelaySeconds: 10 # Allow time to start
                periodSeconds: 5
E. Monitor reconciliation:
# Watch LynqNode status updates (30-second interval in v1.1.4)
watch -n 5 'kubectl get lynqnode <name> -o jsonpath="{.status.readyResources}/{.status.desiredResources} ready"'
# Watch pod status
watch kubectl get pods -l lynq.sh/node=<name>
Expected Behavior:
v1.1.4+ Fast Status Updates
- Status updates every 30 seconds (down from 5 minutes in earlier versions)
- Event-driven: Immediate reconciliation when child resources change
- Resources typically become ready within 1-2 minutes (depending on resource type)
- Degraded condition clears automatically when all resources reach ready state
When to Investigate Further:
- Resources stuck "not ready" for > 5 minutes
- readyResources count not increasing over time
- Events show repeated failures or errors
- Logs indicate application-level issues
Prevention:
- Set realistic timeoutSeconds values for resource types (Deployments: 300s, Jobs: 600s)
- Ensure resource specifications have correct readiness probes
- Test templates in non-production environments first
- Monitor lynqnode_resources_ready and lynqnode_degraded_status metrics
- Use kubectl wait for pre-flight checks:
kubectl wait --for=condition=Ready lynqnode/<name> --timeout=300s
Debugging Workflows
Debug Template Rendering
- Create test LynqNode manually:
apiVersion: operator.lynq.sh/v1
kind: LynqNode
metadata:
  name: test-node
  annotations:
    lynq.sh/uid: "test-123"
    lynq.sh/host: "test.example.com"
spec:
  # ... copy from template
- Check rendered resources:
kubectl get lynqnode -o yaml
- Check operator logs:
kubectl logs -n lynq-system deployment/lynq-controller-manager -f
Debug Database Connection
- Create test pod:
kubectl run -it --rm mysql-test --image=mysql:8 --restart=Never -- bash
- Inside pod:
mysql -h <host> -u <user> -p<password> <database> -e "SELECT * FROM nodes LIMIT 5"
Debug Reconciliation
- Enable debug logging:
# config/manager/manager.yaml
args:
- --zap-log-level=debug
- Watch reconciliation:
kubectl logs -n lynq-system deployment/lynq-controller-manager -f | \
grep "Reconciling"Getting Help
- Check operator logs
- Check LynqNode events: kubectl describe lynqnode <name>
- Check Hub status: kubectl get lynqhub <name> -o yaml
- Review Performance Guide
- Open issue: https://github.com/k8s-lynq/lynq/issues
Include in bug reports:
- Operator version
- Kubernetes version
- Operator logs
- LynqNode/Hub/Template YAML
- Steps to reproduce
