Monitoring & Observability Guide
Comprehensive guide for monitoring Lynq with Prometheus, Grafana, and Kubernetes events.
Getting Started
Accessing Metrics
Endpoint
Lynq exposes Prometheus metrics at :8443/metrics over HTTPS.
Port-forward for local testing:
# Port-forward to metrics endpoint
kubectl port-forward -n lynq-system \
deployment/lynq-controller-manager 8443:8443
# Access metrics (requires valid TLS client or use --insecure)
curl -k https://localhost:8443/metrics

Check if metrics are enabled:
# Check if metrics port is exposed
kubectl get svc -n lynq-system lynq-controller-manager-metrics-service
# Check if ServiceMonitor is deployed (requires prometheus-operator)
kubectl get servicemonitor -n lynq-system

Enabling ServiceMonitor
If using the Prometheus Operator, enable the ServiceMonitor by uncommenting the corresponding line in config/default/kustomization.yaml:

# Line 27: Uncomment this
- ../prometheus

Then redeploy:
kubectl apply -k config/default

Verify scrape job
After redeploying, confirm that a ServiceMonitor named lynq-controller-manager appears and that Prometheus discovers the target.
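A quick check from the Prometheus side once the target is discovered (the job matcher below is an assumption; adjust it to the job label your ServiceMonitor actually produces):

# Scrape target should report as up
up{job=~".*lynq.*"} == 1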
Metrics Overview
Lynq exposes 12 custom Prometheus metrics organized into five categories:
Metrics Summary
| Metric | Type | Description | Key Labels |
|---|---|---|---|
| Controller Metrics | | | |
| lynqnode_reconcile_duration_seconds | Histogram | LynqNode reconciliation duration | result |
| Resource Metrics | | | |
| lynqnode_resources_desired | Gauge | Desired resource count per node | lynqnode, namespace |
| lynqnode_resources_ready | Gauge | Ready resource count per node | lynqnode, namespace |
| lynqnode_resources_failed | Gauge | Failed resource count per node | lynqnode, namespace |
| Hub Metrics | | | |
| hub_desired | Gauge | Desired LynqNode CRs for a hub | hub, namespace |
| hub_ready | Gauge | Ready LynqNode CRs for a hub | hub, namespace |
| hub_failed | Gauge | Failed LynqNode CRs for a hub | hub, namespace |
| Apply Metrics | | | |
| apply_attempts_total | Counter | Resource apply attempts | kind, result, conflict_policy |
| Status Metrics | | | |
| lynqnode_condition_status | Gauge | LynqNode condition status (0=False, 1=True, 2=Unknown) | lynqnode, namespace, type |
| lynqnode_conflicts_total | Counter | Total resource conflicts | lynqnode, namespace, resource_kind, conflict_policy |
| lynqnode_resources_conflicted | Gauge | Current resources in conflict state | lynqnode, namespace |
| lynqnode_degraded_status | Gauge | LynqNode degraded status (0=Not degraded, 1=Degraded) | lynqnode, namespace, reason |
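For orientation, a gauge from the table appears in the scrape output roughly as follows; the HELP text and label values are illustrative, not copied from the controller:

# HELP lynqnode_resources_ready Ready resource count per node
# TYPE lynqnode_resources_ready gauge
lynqnode_resources_ready{lynqnode="acme-prod-template",namespace="default"} 10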
Detailed Queries
For comprehensive PromQL query examples, see Prometheus Query Examples.
Quick Start Queries
LynqNode Health:
# Ready nodes
lynqnode_condition_status{type="Ready"} == 1
# Degraded nodes
lynqnode_degraded_status == 1
# Resource readiness percentage
(lynqnode_resources_ready / lynqnode_resources_desired) * 100

Performance:
# P95 reconciliation latency
histogram_quantile(0.95, rate(lynqnode_reconcile_duration_seconds_bucket[5m]))
# Reconciliation rate
rate(lynqnode_reconcile_duration_seconds_count[5m])
# Error rate
rate(lynqnode_reconcile_duration_seconds_count{result="error"}[5m])

Hub Health:
# Hub health percentage
(hub_ready / hub_desired) * 100
# Total desired nodes
sum(hub_desired)

Conflicts:
# Current conflicts
sum(lynqnode_resources_conflicted)
# Conflict rate
rate(lynqnode_conflicts_total[5m])

Complete Query Reference
See Prometheus Query Examples for 50+ production-ready queries organized by use case.
Smart Reconciliation Metrics (v1.1.4+)
New in v1.1.4
v1.1.4 introduces enhanced status tracking with a 30-second requeue interval for fast status reflection.
Key Changes:
- Fast Status Updates: Child resource status changes reflected in LynqNode status within 30 seconds (down from 5 minutes)
- Event-Driven: Immediate reconciliation on watched resource changes
- Smart Predicates: Only reconcile on Generation/Annotation changes, not status-only updates
Impact on Metrics:
The 30-second requeue interval means you'll see:
- Higher reconciliation frequency: ~2 reconciles per minute per node
- Lower latency: Status changes propagate faster
- Optimized overhead: Smart predicates filter unnecessary reconciliations
Monitoring Reconciliation Patterns:
# Reconciliations per minute (expect ≈2 × node count in v1.1.4+)
rate(lynqnode_reconcile_duration_seconds_count[5m]) * 60
# P50 latency (should remain low despite faster requeue)
histogram_quantile(0.50, rate(lynqnode_reconcile_duration_seconds_bucket[5m]))
# P95 latency (watch for spikes > 30s)
histogram_quantile(0.95, rate(lynqnode_reconcile_duration_seconds_bucket[5m]))

Best Practices:
- Capacity Planning: Monitor reconciliation rate for horizontal scaling decisions (see the query sketch after this list)
- Latency Tracking: P95 latency should stay under 10s for healthy systems
- Event-Driven Behavior: Most reconciliations should be triggered by resource changes, not periodic requeues
- Watch Predicates: Verify that status-only updates don't trigger full reconciliations
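A minimal sketch for the capacity-planning check above, assuming the v1.1.4 baseline of roughly 2 reconciles per minute per node; the ×3 threshold is illustrative, not a project recommendation:

# Total reconciles per minute vs. the ~2/min-per-node periodic baseline
sum(rate(lynqnode_reconcile_duration_seconds_count[5m])) * 60
  > 3 * sum(hub_desired) * 2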
Controller-Runtime Metrics
Standard controller-runtime metrics:
# Work queue depth
workqueue_depth{name="lynqnode"}
# Work queue add rate
rate(workqueue_adds_total{name="lynqnode"}[5m])
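# Reconcile error rate (controller-runtime built-in; the controller name
# "lynqnode" is an assumption to verify against your /metrics output)
rate(controller_runtime_reconcile_errors_total{controller="lynqnode"}[5m])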
# Work queue latency
# Work queue latency (P95; this metric is a histogram)
histogram_quantile(0.95, rate(workqueue_queue_duration_seconds_bucket{name="lynqnode"}[5m]))

Metrics Collection
Prometheus ServiceMonitor
To enable ServiceMonitor, uncomment the prometheus section in config/default/kustomization.yaml:
# Uncomment this line:
#- ../prometheus

The ServiceMonitor configuration is available in config/prometheus/monitor.yaml.
Note: For production, use cert-manager for metrics TLS by enabling the cert patch in config/default/kustomization.yaml.
Manual Scrape Configuration
# prometheus.yml
scrape_configs:
  - job_name: 'lynq'
    # The endpoint is served over HTTPS (see Endpoint above); without these
    # two settings the scrape fails. If the endpoint enforces authentication,
    # a bearer token may additionally be required.
    scheme: https
    tls_config:
      insecure_skip_verify: true  # self-signed cert unless cert-manager is enabled
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - lynq-system
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_control_plane]
        action: keep
        regex: controller-manager
      - source_labels: [__meta_kubernetes_pod_container_port_name]
        action: keep
        regex: https

Logging
Log Levels
Configure via --zap-log-level:
args:
  - --zap-log-level=info   # Options: debug, info, error

Levels:
- debug: Verbose logging (template values, API calls)
- info: Standard logging (reconciliation events)
- error: Errors only
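To flip the level on a running install, one option is a JSON patch; a sketch assuming the flag sits at args index 0 (inspect the Deployment first and adjust the index):

# Inspect current args to find the flag's position
kubectl get deployment lynq-controller-manager -n lynq-system \
  -o jsonpath='{.spec.template.spec.containers[0].args}'

# Replace the flag (adjust /args/0 to its actual index)
kubectl patch deployment lynq-controller-manager -n lynq-system --type=json \
  -p='[{"op":"replace","path":"/spec/template/spec/containers/0/args/0","value":"--zap-log-level=debug"}]'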
Structured Logging
All logs are structured JSON:
{
  "level": "info",
  "ts": "2025-01-15T10:30:00.000Z",
  "msg": "Reconciliation completed",
  "lynqnode": "acme-prod-template",
  "ready": 10,
  "failed": 0,
  "changed": 2
}

Key Log Messages
Reconciliation Events
"msg": "Reconciliation completed"
"msg": "Reconciliation completed with changes"
"msg": "Failed to reconcile node"Resource Events
"msg": "Failed to render resource"
"msg": "Failed to apply resource"
"msg": "Resource not ready within timeout"Hub Events
"msg": "Deleting LynqNode (no longer in desired set)"
"msg": "Successfully deleted LynqNode"Querying Logs
# All logs
kubectl logs -n lynq-system deployment/lynq-controller-manager
# Follow logs
kubectl logs -n lynq-system deployment/lynq-controller-manager -f
# Errors only
kubectl logs -n lynq-system deployment/lynq-controller-manager | grep '"level":"error"'
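# Structured alternative: filter with jq (assumes jq is installed and every
# log line is JSON, which holds for the zap JSON encoder)
kubectl logs -n lynq-system deployment/lynq-controller-manager \
  | jq -r 'select(.level=="error") | "\(.ts) \(.msg)"'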
# Specific node
kubectl logs -n lynq-system deployment/lynq-controller-manager | grep 'acme-prod'
# Reconciliation events
kubectl logs -n lynq-system deployment/lynq-controller-manager | grep "Reconciliation completed"Events
Kubernetes events are emitted for key operations.
Viewing Events
# All LynqNode events
kubectl get events --all-namespaces --field-selector involvedObject.kind=LynqNode
# Specific LynqNode
kubectl describe lynqnode <name>
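# Warning events only
kubectl get events --all-namespaces \
  --field-selector type=Warning,involvedObject.kind=LynqNode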
# Recent events
kubectl get events --sort-by='.lastTimestamp'

Event Types
Normal Events
- TemplateApplied: Template successfully applied
- TemplateAppliedComplete: All resources applied
- LynqNodeDeleting: LynqNode deletion started
- LynqNodeDeleted: LynqNode deletion completed
Warning Events
- TemplateRenderError: Template rendering failed
- ApplyFailed: Resource apply failed
- ResourceConflict: Ownership conflict detected
- ReadinessTimeout: Resource not ready within timeout
- DependencyError: Dependency cycle detected
- LynqNodeDeletionFailed: Node deletion failed
Event Examples
# Success
TemplateAppliedComplete: Applied 10 resources (10 ready, 0 failed, 2 changed)
# Conflict
ResourceConflict: Resource conflict detected for default/acme-app (Kind: Deployment, Policy: Stuck).
Another controller or user may be managing this resource.
# Deletion
LynqNodeDeleting: Deleting Node 'acme-prod-template' (template: prod-template, uid: acme) -
no longer in active dataset. This could be due to: row deletion, activate=false, or template change.

Dashboards
Grafana Dashboard
A comprehensive Grafana dashboard is available at: config/monitoring/grafana-dashboard.json
How to import:
- Open Grafana UI
- Go to Dashboards → Import
- Upload config/monitoring/grafana-dashboard.json
- Select your Prometheus datasource
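As an alternative to the UI steps above, the dashboard can be pushed through Grafana's HTTP API; a sketch with placeholder URL and credentials, assuming jq is available:

# Wrap the dashboard JSON in the import envelope and POST it
jq -n --slurpfile d config/monitoring/grafana-dashboard.json \
  '{dashboard: $d[0], overwrite: true}' \
  | curl -s -X POST -H "Content-Type: application/json" \
      -u admin:admin -d @- http://localhost:3000/api/dashboards/db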
Dashboard includes 10 panels:
- Reconciliation Duration (Percentiles) - P50, P95, P99 latency
- Reconciliation Rate - Success vs Error rate
- Error Rate - Gauge showing current error percentage
- Total Desired LynqNodes - Sum across all registries
- Total Ready LynqNodes - Healthy node count
- Total Failed LynqNodes - Failed node count
- Resource Counts by LynqNode - Stacked area chart per node
- Hub Health - Table showing health percentage per hub
- Apply Rate by Kind - Apply attempts by resource type
- Work Queue Depth - Controller queue depths
Alerting
Prometheus Alert Rules
A comprehensive set of Prometheus alert rules is available at: config/prometheus/alerts.yaml
To deploy the alerts:
# Apply the PrometheusRule resource
kubectl apply -f config/prometheus/alerts.yaml
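# Confirm the PrometheusRule was created (requires the prometheus-operator CRDs)
kubectl get prometheusrule -n lynq-system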
# Or use kustomize
kubectl apply -k config/prometheus

Alert Categories
The alert configuration includes three severity levels:
| Severity | Alerts | Description |
|---|---|---|
| Critical | 5 alerts | Immediate action required - production impact |
| Warning | 8 alerts | Investigation needed - potential issues |
| Info | 1 alert | Informational - awareness only |
Critical Alerts:
- LynqNodeDegraded - LynqNode in degraded state
- LynqNodeResourcesFailed - LynqNode has failed resources
- LynqNodeNotReady - LynqNode not ready for extended period
- LynqNodeStatusUnknown - LynqNode condition status unknown
- HubManyNodesFailure - Many nodes failing in a hub
Warning Alerts:
- LynqNodeResourcesMismatch - Ready count doesn't match desired
- LynqNodeResourcesConflicted - Resources in conflict state
- LynqNodeHighConflictRate - High rate of conflicts
- HubNodesFailure - Some nodes failing
- HubSyncIssues - Hub sync problems
- LynqNodeReconciliationErrors - High error rate
- LynqNodeReconciliationSlow - Slow reconciliation performance
- HighApplyFailureRate - High apply failure rate
Info Alerts:
- LynqNodeNewConflictsDetected - New conflicts detected
Alert Configuration
For complete alert definitions with thresholds and runbook links, see config/prometheus/alerts.yaml.
Sample Alert Rules
Critical:
# LynqNode has failed resources
- alert: LynqNodeResourcesFailed
  expr: lynqnode_resources_failed > 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "LynqNode {{ $labels.lynqnode }} has {{ $value }} failed resource(s)"
    runbook_url: "https://lynq.sh/runbooks/node-resources-failed"

Warning:
# Resources in conflict
- alert: LynqNodeResourcesConflicted
  expr: lynqnode_resources_conflicted > 0
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "LynqNode {{ $labels.lynqnode }} has resources in conflict"
    runbook_url: "https://lynq.sh/runbooks/node-conflicts"

Alert Routing (AlertManager)
Configure AlertManager to route alerts based on severity:
# alertmanager.yml
route:
  group_by: ['alertname', 'lynqnode', 'namespace']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'default'
  routes:
    # Critical alerts to PagerDuty
    - match:
        severity: critical
      receiver: 'pagerduty'
    # Warning alerts to Slack
    - match:
        severity: warning
      receiver: 'slack'
    # Info alerts to email
    - match:
        severity: info
      receiver: 'email'

receivers:
  # The top-level route references 'default', so it must be defined;
  # leave it empty (blackhole) or add an integration
  - name: 'default'
  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: '<pagerduty-key>'
  - name: 'slack'
    slack_configs:
      - api_url: '<slack-webhook>'
        channel: '#lynq-alerts'
  - name: 'email'
    email_configs:
      - to: '<team-email>'  # requires an SMTP smarthost in the global config

Best Practices
1. Monitor Key Metrics
Essential metrics to track:
- Reconciliation duration (P95)
- Error rate
- Resource ready/failed counts
- Hub desired vs ready
2. Set Up Alerts
Minimum recommended alerts:
- Operator down (a sample rule follows this list)
- High error rate (> 10%)
- Slow reconciliation (P95 > 30s)
- Resources failed (> 0 for 5min)
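The sample rules earlier cover failed resources and conflicts, but not an operator-down condition, so here is a minimal sketch; the job label is an assumption to match against your scrape configuration:

# Operator down: metrics target absent or not up (job label assumed)
- alert: LynqOperatorDown
  expr: absent(up{job="lynq-controller-manager-metrics-service"}) or up{job="lynq-controller-manager-metrics-service"} == 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Lynq controller metrics target is down or absent"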
3. Retain Logs
Recommended log retention:
- Debug logs: 1-3 days
- Info logs: 7-14 days
- Error logs: 30+ days
4. Dashboard Review
Weekly review:
- Reconciliation performance trends
- Error patterns
- Resource health
- Capacity planning
5. Event Monitoring
Monitor events for:
- Conflicts (investigate ownership)
- Timeouts (adjust readiness settings)
- Template errors (fix templates)
Troubleshooting Metrics
Metrics Not Available
Problem: curl https://localhost:8443/metrics returns connection refused.
Solution:
Check if metrics port is configured:
kubectl get deployment -n lynq-system lynq-controller-manager -o yaml | grep metrics-bind-address

Should see:
--metrics-bind-address=:8443

Check if port is exposed:
kubectl get deployment -n lynq-system lynq-controller-manager -o yaml | grep -A 5 "ports:"

Should see containerPort 8443.
Check if service exists:
kubectl get svc -n lynq-system lynq-controller-manager-metrics-service

Check operator logs:
kubectl logs -n lynq-system deployment/lynq-controller-manager | grep metrics
No Metrics Data
Problem: Metrics endpoint works but returns no custom metrics.
Solution:
Verify metrics are registered:
curl -k https://localhost:8443/metrics | grep lynqnode_

Should see:
lynqnode_reconcile_duration_seconds, lynqnode_resources_ready, etc.

Trigger reconciliation:
# Apply a test resource
kubectl apply -f config/samples/operator_v1_lynqhub.yaml

# Wait 30s and check metrics again
curl -k https://localhost:8443/metrics | grep lynqnode_reconcile_duration_seconds_count

Check if controllers are running:
kubectl logs -n lynq-system deployment/lynq-controller-manager | grep "Starting Controller"
ServiceMonitor Not Working
Problem: Prometheus not scraping metrics.
Solution:
Check if Prometheus Operator is installed:
kubectl get crd servicemonitors.monitoring.coreos.com

Check if ServiceMonitor is created:
kubectl get servicemonitor -n lynq-system

Check ServiceMonitor labels match Prometheus selector:
kubectl get servicemonitor -n lynq-system lynq-controller-manager-metrics-monitor -o yaml

Check Prometheus logs:
kubectl logs -n monitoring prometheus-xyz
TLS Certificate Errors
Problem: x509: certificate signed by unknown authority
Solution:
For development, use --insecure or -k:
curl -k https://localhost:8443/metrics

For production, use cert-manager by enabling the cert patch in config/default/kustomization.yaml:
# Uncomment this line:
#- path: cert_metrics_manager_patch.yaml

See Also
- Prometheus Query Examples - 50+ ready-to-use PromQL queries
- config/prometheus/alerts.yaml - Complete alert rule definitions
- config/monitoring/grafana-dashboard.json - Grafana dashboard
- Performance Guide - Performance tuning
- Troubleshooting Guide - Common issues
