Monitoring

Prometheus metrics, health checks, operational dashboards, and alerting.

The Realtime Platform exposes Prometheus-compatible metrics, health endpoints, and built-in operational dashboards.

Health Check

GET /health

Returns {"status":"ok"} with HTTP 200. No authentication required. Use this for load balancer and orchestrator health probes.
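A probe can treat anything other than HTTP 200 with a body of {"status":"ok"} as unhealthy. The following is a minimal sketch of such a check, not platform code: isHealthyResponse, checkHealth, and the FetchLike type are hypothetical names introduced for illustration, and the base URL in the usage note is an assumption.

```typescript
// Minimal shape of the fetch function this sketch needs (hypothetical type;
// the global `fetch` in Node 18+ and in browsers satisfies it).
type FetchLike = (url: string) => Promise<{ status: number; json(): Promise<unknown> }>;

// True only for the documented healthy response: HTTP 200 and {"status":"ok"}.
function isHealthyResponse(httpStatus: number, body: unknown): boolean {
  if (httpStatus !== 200 || typeof body !== 'object' || body === null) {
    return false;
  }
  return (body as { status?: unknown }).status === 'ok';
}

// Calls GET /health; network errors and invalid JSON count as unhealthy.
async function checkHealth(baseUrl: string, fetchImpl: FetchLike): Promise<boolean> {
  try {
    const res = await fetchImpl(`${baseUrl}/health`);
    return isHealthyResponse(res.status, await res.json());
  } catch {
    return false;
  }
}
```

For example, checkHealth('http://localhost:3000', fetch) resolves to true when the service answers as documented. Injecting the fetch function keeps the check easy to exercise with a stub in tests.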

Prometheus Metrics

GET /api/metrics/prometheus

Returns metrics in Prometheus text format for scraping. Configure your Prometheus instance to scrape this endpoint.

Available Metrics

Metric                               Type       Description
realtime_events_received_total       Counter    Total events received by source and topic
realtime_events_delivered_total      Counter    Total events delivered to subscribers
realtime_events_filtered_total       Counter    Total events filtered out by subscription filters
realtime_broadcast_latency_seconds   Histogram  Event broadcast latency (p50, p95, p99)
realtime_websocket_connections       Gauge      Current active WebSocket connections
realtime_queue_depth                 Gauge      BullMQ queue depth by queue name
realtime_webhook_deliveries_total    Counter    Webhook delivery attempts by status
realtime_cdc_lag_seconds             Gauge      CDC replication lag
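To inspect a scrape by hand or in a test, the text format can be read with a small helper. This is a sketch under stated assumptions rather than a full parser: parsePrometheusText and Sample are hypothetical names, only plain "name{labels} value" lines are handled, and the label names source and topic used in the usage note are guesses based on the descriptions above, not confirmed label names.

```typescript
interface Sample {
  name: string;
  labels: Record<string, string>;
  value: number;
}

// Parses sample lines of the Prometheus text format, skipping comments
// (# HELP / # TYPE) and blank lines. Escape sequences inside label values are
// kept as-is and special values such as +Inf are not handled: a sketch only.
function parsePrometheusText(text: string): Sample[] {
  const samples: Sample[] = [];
  // metric name, optional {labels}, value, optional timestamp
  const lineRe = /^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+-?\d+)?$/;
  for (const rawLine of text.split('\n')) {
    const line = rawLine.trim();
    if (line === '' || line.startsWith('#')) continue;
    const match = lineRe.exec(line);
    if (!match) continue; // ignore lines this sketch does not understand
    const labels: Record<string, string> = {};
    const labelRe = /([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"/g;
    const labelText = match[2] || '';
    let m: RegExpExecArray | null;
    while ((m = labelRe.exec(labelText)) !== null) {
      labels[m[1]] = m[2];
    }
    samples.push({ name: match[1], labels, value: Number(match[3]) });
  }
  return samples;
}
```

Given a line such as realtime_events_received_total{source="database",topic="session.status"} 1027, the helper yields one sample with the two labels and the value 1027.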

Prometheus Configuration

scrape_configs:
  - job_name: 'realtime'
    scrape_interval: 15s
    static_configs:
      - targets: ['localhost:3000']
    metrics_path: '/api/metrics/prometheus'

Dashboard Metrics API

The platform provides pre-aggregated dashboard metrics:

GET /api/metrics/dashboard

Returns structured metrics for the three operational dashboards:

{
  "delivery": {
    "activeConnections": 42,
    "reconnectRate": 0.02,
    "eventsInPerSec": 150,
    "eventsOutPerSec": 145,
    "fanoutLatencyP50": 12,
    "fanoutLatencyP95": 45,
    "fanoutLatencyP99": 89
  },
  "pipeline": {
    "cdcLag": 250,
    "outboxLag": 100,
    "eventsCapturedPerSec": 50,
    "eventsRoutedPerSec": 48,
    "mappingFailures": 0,
    "schemaValidationFailures": 0
  },
  "reliability": {
    "queueDepth": 5,
    "retryCount": 2,
    "dlqSize": 0,
    "webhookSuccessRate": 0.99,
    "webhookFailureRate": 0.01
  }
}
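For typed clients, the response can be modelled as in the sketch below. The field names come from the sample payload; the units are not stated on this page, so the code treats every value as an opaque number. DashboardMetrics, reliabilityWarnings, and both thresholds are hypothetical and chosen only for illustration.

```typescript
// Shape of the GET /api/metrics/dashboard response, taken from the sample payload.
interface DashboardMetrics {
  delivery: {
    activeConnections: number;
    reconnectRate: number;
    eventsInPerSec: number;
    eventsOutPerSec: number;
    fanoutLatencyP50: number;
    fanoutLatencyP95: number;
    fanoutLatencyP99: number;
  };
  pipeline: {
    cdcLag: number;
    outboxLag: number;
    eventsCapturedPerSec: number;
    eventsRoutedPerSec: number;
    mappingFailures: number;
    schemaValidationFailures: number;
  };
  reliability: {
    queueDepth: number;
    retryCount: number;
    dlqSize: number;
    webhookSuccessRate: number;
    webhookFailureRate: number;
  };
}

// Illustrative thresholds only; choose values that suit your deployment.
const MAX_DLQ_SIZE = 0;
const MIN_WEBHOOK_SUCCESS_RATE = 0.95;

// Returns human-readable warnings for the reliability section.
function reliabilityWarnings(metrics: DashboardMetrics): string[] {
  const warnings: string[] = [];
  const { dlqSize, webhookSuccessRate } = metrics.reliability;
  if (dlqSize > MAX_DLQ_SIZE) {
    warnings.push(`dead-letter queue holds ${dlqSize} items`);
  }
  if (webhookSuccessRate < MIN_WEBHOOK_SUCCESS_RATE) {
    warnings.push(`webhook success rate is ${webhookSuccessRate}`);
  }
  return warnings;
}
```

With the sample payload, reliabilityWarnings returns an empty array, because dlqSize is 0 and webhookSuccessRate is 0.99.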

Hot Topics

GET /api/metrics/hot-topics

Returns the most active topics with real-time statistics:

[
  {
    "topic": "session.status",
    "eventsPerSec": 25.5,
    "subscriberCount": 150,
    "p95FanoutLatency": 32
  }
]
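A client might rank the returned topics to find where fanout is slowest. The sketch below assumes only the fields shown in the sample response; HotTopic and slowestTopics are hypothetical names, and the unit of p95FanoutLatency is not stated on this page.

```typescript
// One entry of the GET /api/metrics/hot-topics response, per the sample above.
interface HotTopic {
  topic: string;
  eventsPerSec: number;
  subscriberCount: number;
  p95FanoutLatency: number;
}

// Returns up to `limit` topics with the highest p95 fanout latency, slowest
// first. The input array is copied, so the caller's data is not reordered.
function slowestTopics(topics: HotTopic[], limit: number): HotTopic[] {
  return [...topics]
    .sort((a, b) => b.p95FanoutLatency - a.p95FanoutLatency)
    .slice(0, limit);
}
```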

Admin UI Dashboards

The Admin UI provides three dedicated monitoring dashboards:

Realtime Delivery (/dashboards/delivery)

  • Active WebSocket connections over time
  • Events per second by source (database, sync, socket)
  • Fanout latency percentiles
  • Hot topics breakdown
  • Dropped broadcasts and duplicate suppression

Database Pipeline (/dashboards/pipeline)

  • CDC lag and outbox lag
  • Events captured vs routed per second
  • Mapping failure rate
  • Schema validation failure rate
  • Filter drop percentage

Async Reliability (/dashboards/reliability)

  • BullMQ queue depth over time
  • Retry counts and dead-letter queue size
  • Webhook success/failure rates
  • Worker heartbeat status

Alerting

The Alert Center (/alerts) in the Admin UI displays system alerts and supports:

  • Configurable alert rules with severity levels (info, warning, critical)
  • Cooldown periods to prevent alert storms
  • Environment-scoped alerts
  • Alert history for audit trails
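The cooldown idea can be illustrated with a small sketch. This is not the Alert Center's implementation, which this page does not describe; AlertRule, CooldownTracker, and cooldownMs are hypothetical names. It shows one way a cooldown scoped to a rule and an environment could suppress repeat firings.

```typescript
type Severity = 'info' | 'warning' | 'critical';

// Hypothetical rule shape, loosely following the feature list above.
interface AlertRule {
  id: string;
  severity: Severity;
  environment: string;
  cooldownMs: number;
}

class CooldownTracker {
  // Last firing time in milliseconds, keyed by environment and rule id.
  private lastFiredAt = new Map<string, number>();

  // Returns true if the rule may fire at `nowMs`, and records that firing.
  shouldFire(rule: AlertRule, nowMs: number): boolean {
    const key = `${rule.environment}:${rule.id}`;
    const last = this.lastFiredAt.get(key);
    if (last !== undefined && nowMs - last < rule.cooldownMs) {
      return false; // still cooling down: suppress the repeat
    }
    this.lastFiredAt.set(key, nowMs);
    return true;
  }
}
```

Keying on both environment and rule id means a storm in one environment does not silence the same rule in another.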

Logging

All services use pino structured JSON logging:

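import { getLogger } from '@realtime/observability';

const logger = getLogger();

// Structured fields go first, followed by the human-readable message.
logger.info({ topic: 'session.status', eventId: 'evt_123' }, 'Event routed');

// `err` stands for an Error caught by the surrounding code.
logger.error({ err, webhookId: 'wh_456' }, 'Webhook delivery failed');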

Log Levels

Level   LOG_LEVEL   Description
debug   debug       Verbose debugging (development only)
info    info        Standard operational messages (default)
warn    warn        Warnings that don't affect functionality
error   error       Errors requiring attention

Set LOG_LEVEL in your environment configuration.
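Custom tooling that reads the same variable may want to validate it against the four levels above. The sketch below is an assumption about how that could look: resolveLogLevel is a hypothetical helper, and both the fallback to info for unknown values and the case-insensitive matching are choices made here, not documented platform behaviour.

```typescript
const LOG_LEVELS = ['debug', 'info', 'warn', 'error'] as const;
type LogLevel = (typeof LOG_LEVELS)[number];

// Maps a raw LOG_LEVEL value to a supported level. Missing or unrecognised
// values fall back to 'info', the documented default.
function resolveLogLevel(raw: string | undefined): LogLevel {
  const normalized = (raw ?? '').trim().toLowerCase();
  const match = LOG_LEVELS.find((level) => level === normalized);
  return match ?? 'info';
}
```

In a Node.js process this could be called as resolveLogLevel(process.env.LOG_LEVEL).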