
Monitoring & Observability

Lynq exposes 15 Prometheus metrics at `:8443/metrics`, emits structured Kubernetes events per reconciliation, and ships a pre-built Grafana dashboard.

Setup

Prometheus

Enable the ServiceMonitor (requires the Prometheus Operator) by uncommenting the `prometheus` entry in `config/default/kustomization.yaml`:

```yaml
- ../prometheus   # uncomment this line
```

Then redeploy:

```bash
kubectl apply -k config/default
```
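For reference, the `prometheus` overlay renders a ServiceMonitor roughly like the sketch below. The resource name is an assumption based on the default kubebuilder layout; verify the exact manifest with `kubectl kustomize config/default`:

```yaml
# Illustrative sketch only -- the metadata.name is assumed; the selector and
# port name match the pod labels and port used in the manual scrape config.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: lynq-controller-manager-metrics-monitor
  namespace: lynq-system
spec:
  endpoints:
    - port: https
      scheme: https
      path: /metrics
      tlsConfig:
        insecureSkipVerify: true   # the endpoint serves a self-signed cert
  selector:
    matchLabels:
      control-plane: controller-manager
```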

Manual scrape config:

```yaml
# prometheus.yml
scrape_configs:
- job_name: lynq
  kubernetes_sd_configs:
  - role: pod
    namespaces:
      names: [lynq-system]
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_label_control_plane]
    action: keep
    regex: controller-manager
  - source_labels: [__meta_kubernetes_pod_container_port_name]
    action: keep
    regex: https
```

Test the endpoint:

```bash
kubectl port-forward -n lynq-system deployment/lynq-controller-manager 8443:8443
curl -k https://localhost:8443/metrics | grep lynqnode_
```
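If the endpoint is healthy, the output contains exposition lines for the metrics in the catalog below. An illustrative (not captured) sample, with placeholder label values:

```text
lynqnode_resources_desired{lynqnode="acme-web",namespace="default"} 10
lynqnode_resources_ready{lynqnode="acme-web",namespace="default"} 10
lynqnode_resources_failed{lynqnode="acme-web",namespace="default"} 0
```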

If the endpoint returns no metrics data, check the following:

```bash
# Check flag is set
kubectl get deployment -n lynq-system lynq-controller-manager -o yaml | grep metrics-bind-address
# Should show: --metrics-bind-address=:8443 (not "0")

# Check service exists
kubectl get svc -n lynq-system lynq-controller-manager-metrics-service

# Check ServiceMonitor (if using prometheus-operator)
kubectl get servicemonitor -n lynq-system
```

Grafana Dashboard

Import `config/monitoring/grafana-dashboard.json` via Dashboards → Import in the Grafana UI and select your Prometheus datasource. The dashboard's panels include:

  1. Reconciliation Duration (P50 / P95 / P99)
  2. Reconciliation Rate (success vs error)
  3. Error Rate gauge
  4. Total Desired / Ready / Failed LynqNodes
  5. Resource Counts by LynqNode
  6. Hub Health table
  7. Apply Rate by resource kind
  8. Work Queue Depth

Metrics Catalog

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `lynqnode_reconcile_duration_seconds` | Histogram | `result` | LynqNode reconciliation latency |
| `lynqnode_resources_desired` | Gauge | `lynqnode`, `namespace` | Desired resource count per node |
| `lynqnode_resources_ready` | Gauge | `lynqnode`, `namespace` | Ready resource count per node |
| `lynqnode_resources_failed` | Gauge | `lynqnode`, `namespace` | Failed resource count per node |
| `lynqnode_resources_conflicted` | Gauge | `lynqnode`, `namespace` | Resources in conflict state |
| `lynqnode_conflicts_total` | Counter | `lynqnode`, `namespace`, `resource_kind`, `conflict_policy` | Total conflicts detected |
| `lynqnode_condition_status` | Gauge | `lynqnode`, `namespace`, `type` | Condition status (0=False, 1=True, 2=Unknown) |
| `lynqnode_degraded_status` | Gauge | `lynqnode`, `namespace`, `reason` | Degraded status (0=healthy, 1=degraded) |
| `hub_desired` | Gauge | `hub`, `namespace` | Desired LynqNode count for a hub |
| `hub_ready` | Gauge | `hub`, `namespace` | Ready LynqNode count |
| `hub_failed` | Gauge | `hub`, `namespace` | Failed LynqNode count |
| `apply_attempts_total` | Counter | `kind`, `result`, `conflict_policy` | Resource apply attempts |
| `lynqform_rollout_updating_nodes` | Gauge | `form`, `namespace` | Nodes currently being updated (v1.1.16+) |
| `lynqform_rollout_phase` | Gauge | `form`, `namespace` | Rollout phase: 0=Idle, 1=InProgress, 2=Failed, 3=Complete (v1.1.16+) |
| `lynqform_rollout_progress` | Gauge | `form`, `namespace` | Rollout progress percentage (v1.1.16+) |

For PromQL queries using these metrics, see Prometheus Query Examples.
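As a starting point, two sketches built only from the metric names above. These assume the `result` label takes the values `success`/`error` (consistent with the dashboard's "success vs error" panel); treat them as illustrations, not the canonical queries:

```promql
# P95 reconciliation latency over 5m windows
histogram_quantile(0.95,
  sum by (le) (rate(lynqnode_reconcile_duration_seconds_bucket[5m])))

# Reconciliation error ratio (assumes result="error" marks failed reconciles)
sum(rate(lynqnode_reconcile_duration_seconds_count{result="error"}[5m]))
  /
sum(rate(lynqnode_reconcile_duration_seconds_count[5m]))
```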

Events

Kubernetes events are emitted for key lifecycle transitions.

```bash
# Events for a specific node
kubectl describe lynqnode <name>

# All LynqNode events
kubectl get events -A --field-selector involvedObject.kind=LynqNode --sort-by='.lastTimestamp'
```

| Event | Type | Meaning |
|-------|------|---------|
| `TemplateApplied` | Normal | Resources applied successfully |
| `LynqNodeDeleting` | Normal | Node deletion started |
| `LynqNodeDeleted` | Normal | Node deletion completed |
| `TemplateRenderError` | Warning | Template syntax or variable error |
| `ApplyFailed` | Warning | Resource apply failed (RBAC, quota, etc.) |
| `ResourceConflict` | Warning | SSA field-manager conflict detected |
| `ForceApply` | Warning | Ownership taken with `conflictPolicy: Force` |
| `DependencySkipped` | Warning | Resource skipped because a dependency failed |
| `ReadinessTimeout` | Warning | Resource did not become ready in time |
| `DependencyError` | Warning | Dependency cycle detected |
| `LynqNodeDeletionFailed` | Warning | Node cleanup failed |

Logging

Configure the log level via the `--zap-log-level` flag: `debug`, `info` (default), or `error`.

All logs are structured JSON:

```json
{"level":"info","ts":"2025-01-15T10:30:00Z","msg":"Reconciliation completed","lynqnode":"acme-web","ready":10,"failed":0}
```

```bash
# Follow all logs
kubectl logs -n lynq-system deployment/lynq-controller-manager -f

# Errors only
kubectl logs -n lynq-system deployment/lynq-controller-manager | grep '"level":"error"'

# Specific node
kubectl logs -n lynq-system deployment/lynq-controller-manager | grep "acme-web"
```

Alerting

Alert rules are in `config/prometheus/alerts.yaml`. Deploy with:

```bash
kubectl apply -f config/prometheus/alerts.yaml
```

| Severity | Alerts |
|----------|--------|
| Critical | LynqNodeDegraded, LynqNodeResourcesFailed, LynqNodeNotReady, LynqNodeStatusUnknown, HubManyNodesFailure |
| Warning | LynqNodeResourcesMismatch, LynqNodeResourcesConflicted, LynqNodeHighConflictRate, HubNodesFailure, HubDesiredCountMismatch, LynqNodeReconciliationErrors, LynqNodeReconciliationSlow, HighApplyFailureRate |
| Info | LynqNodeNewConflictsDetected |
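For illustration, an alert comparable to `LynqNodeDegraded` could be expressed as a PrometheusRule like the sketch below, built on the `lynqnode_degraded_status` metric from the catalog. This is not the shipped rule; the expression, `for` duration, and annotations are assumptions, and the actual definitions live in `config/prometheus/alerts.yaml`:

```yaml
# Illustrative sketch only -- see config/prometheus/alerts.yaml for the real rules.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: lynq-alerts-example
  namespace: lynq-system
spec:
  groups:
    - name: lynq.example
      rules:
        - alert: LynqNodeDegraded
          expr: lynqnode_degraded_status == 1
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "LynqNode {{ $labels.lynqnode }} is degraded ({{ $labels.reason }})"
```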

For per-alert diagnosis and resolution steps, see Alert Runbooks.

See Also