Operations
Monitoring
Prometheus metrics, health checks, operational dashboards, and alerting.
The Realtime Platform provides comprehensive monitoring through Prometheus-compatible metrics, health endpoints, and built-in operational dashboards.
Health Check
GET /health
Returns {"status":"ok"} with HTTP 200. No authentication required. Use this for load balancer and orchestrator health probes.
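A custom probe can check both parts of this contract, the status code and the body. The following is a minimal sketch, not part of the platform: `HttpGet`, `isHealthy`, and `probeHealth` are hypothetical names, and the HTTP client is injected so that any client (or a test stub) can be used.

```typescript
// Hypothetical minimal HTTP GET helper; wrap fetch, axios, or a test stub.
type HttpGet = (url: string) => Promise<{ status: number; body: string }>;

// True only when the response matches the documented contract:
// HTTP 200 and a JSON body whose "status" field is "ok".
function isHealthy(status: number, body: string): boolean {
  if (status !== 200) return false;
  try {
    const parsed: unknown = JSON.parse(body);
    return (
      typeof parsed === "object" &&
      parsed !== null &&
      (parsed as { status?: unknown }).status === "ok"
    );
  } catch {
    return false; // a non-JSON body counts as unhealthy
  }
}

// Probes GET /health under the given base URL.
// Assumption: network errors are treated as unhealthy rather than rethrown.
async function probeHealth(baseUrl: string, get: HttpGet): Promise<boolean> {
  try {
    const res = await get(`${baseUrl}/health`);
    return isHealthy(res.status, res.body);
  } catch {
    return false;
  }
}
```

Calling `probeHealth("http://localhost:3000", get)` with a real client resolves to `true` only while the service answers as documented; the host and port here are taken from the scrape configuration example and may differ in your deployment.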
Prometheus Metrics
GET /api/metrics/prometheus
Returns metrics in Prometheus text format for scraping. Configure your Prometheus instance to scrape this endpoint.
Available Metrics
| Metric | Type | Description |
|---|---|---|
| realtime_events_received_total | Counter | Total events received by source and topic |
| realtime_events_delivered_total | Counter | Total events delivered to subscribers |
| realtime_events_filtered_total | Counter | Total events filtered out by subscription filters |
| realtime_broadcast_latency_seconds | Histogram | Event broadcast latency (p50, p95, p99) |
| realtime_websocket_connections | Gauge | Current active WebSocket connections |
| realtime_queue_depth | Gauge | BullMQ queue depth by queue name |
| realtime_webhook_deliveries_total | Counter | Webhook delivery attempts by status |
| realtime_cdc_lag_seconds | Gauge | CDC replication lag |
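For ad hoc checks without a Prometheus server, the text returned by the metrics endpoint can be read directly. The sketch below parses sample lines of the Prometheus text exposition format and sums one metric across its series. `parseSamples` and `sumMetric` are hypothetical helpers, the parser is deliberately simplified, and label names such as `queue` are assumptions, since the table above does not list them.

```typescript
interface Sample {
  name: string;
  labels: Record<string, string>;
  value: number;
}

// Parses sample lines of the Prometheus text exposition format.
// Simplification: label values containing commas, '}' or escaped quotes
// are not handled; use a full parser for untrusted input.
function parseSamples(text: string): Sample[] {
  const samples: Sample[] = [];
  const lineRe = /^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)/;
  for (const raw of text.split("\n")) {
    const line = raw.trim();
    if (line === "" || line.charAt(0) === "#") continue; // skip HELP/TYPE lines
    const m = lineRe.exec(line);
    if (!m) continue;
    const labels: Record<string, string> = {};
    if (m[2]) {
      for (const pair of m[2].split(",")) {
        const eq = pair.indexOf("=");
        if (eq < 0) continue;
        const key = pair.slice(0, eq).trim();
        labels[key] = pair.slice(eq + 1).trim().replace(/^"|"$/g, "");
      }
    }
    samples.push({ name: m[1], labels, value: Number(m[3]) });
  }
  return samples;
}

// Sums every series of one metric, e.g. queue depth across all queues.
function sumMetric(samples: Sample[], name: string): number {
  return samples
    .filter((s) => s.name === name)
    .reduce((acc, s) => acc + s.value, 0);
}
```

Summing is meaningful for gauges such as `realtime_queue_depth`; counters are usually compared across two scrapes to obtain a rate instead.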
Prometheus Configuration
scrape_configs:
  - job_name: 'realtime'
    scrape_interval: 15s
    static_configs:
      - targets: ['localhost:3000']
    metrics_path: '/api/metrics/prometheus'
Dashboard Metrics API
The platform provides pre-aggregated dashboard metrics:
GET /api/metrics/dashboard
Returns structured metrics for the three operational dashboards:
{
  "delivery": {
    "activeConnections": 42,
    "reconnectRate": 0.02,
    "eventsInPerSec": 150,
    "eventsOutPerSec": 145,
    "fanoutLatencyP50": 12,
    "fanoutLatencyP95": 45,
    "fanoutLatencyP99": 89
  },
  "pipeline": {
    "cdcLag": 250,
    "outboxLag": 100,
    "eventsCapturedPerSec": 50,
    "eventsRoutedPerSec": 48,
    "mappingFailures": 0,
    "schemaValidationFailures": 0
  },
  "reliability": {
    "queueDepth": 5,
    "retryCount": 2,
    "dlqSize": 0,
    "webhookSuccessRate": 0.99,
    "webhookFailureRate": 0.01
  }
}
Hot Topics
GET /api/metrics/hot-topics
Returns the most active topics with real-time statistics:
[
  {
    "topic": "session.status",
    "eventsPerSec": 25.5,
    "subscriberCount": 150,
    "p95FanoutLatency": 32
  }
]
Admin UI Dashboards
The Admin UI provides three dedicated monitoring dashboards:
Realtime Delivery (/dashboards/delivery)
- Active WebSocket connections over time
- Events per second by source (database, sync, socket)
- Fanout latency percentiles
- Hot topics breakdown
- Dropped broadcasts and duplicate suppression
Database Pipeline (/dashboards/pipeline)
- CDC lag and outbox lag
- Events captured vs routed per second
- Mapping failure rate
- Schema validation failure rate
- Filter drop percentage
Async Reliability (/dashboards/reliability)
- BullMQ queue depth over time
- Retry counts and dead-letter queue size
- Webhook success/failure rates
- Worker heartbeat status
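The figures behind these dashboards are also available as JSON from GET /api/metrics/dashboard, which makes scripted checks possible outside the Admin UI. The sketch below types that response after the sample payload and flags values that cross a limit. The field names come from the sample; `Thresholds` and `findBreaches` are hypothetical, and because the payload does not state units, each limit is assumed to use the same unit as the field it guards.

```typescript
// Shape of the GET /api/metrics/dashboard response, taken from the sample payload.
interface DashboardMetrics {
  delivery: {
    activeConnections: number;
    reconnectRate: number;
    eventsInPerSec: number;
    eventsOutPerSec: number;
    fanoutLatencyP50: number;
    fanoutLatencyP95: number;
    fanoutLatencyP99: number;
  };
  pipeline: {
    cdcLag: number;
    outboxLag: number;
    eventsCapturedPerSec: number;
    eventsRoutedPerSec: number;
    mappingFailures: number;
    schemaValidationFailures: number;
  };
  reliability: {
    queueDepth: number;
    retryCount: number;
    dlqSize: number;
    webhookSuccessRate: number;
    webhookFailureRate: number;
  };
}

// Hypothetical limits; tune them to your own traffic.
interface Thresholds {
  maxFanoutLatencyP99: number;
  maxCdcLag: number;
  maxDlqSize: number;
  minWebhookSuccessRate: number;
}

// Returns the dotted paths of all fields that violate their limit.
function findBreaches(m: DashboardMetrics, t: Thresholds): string[] {
  const breaches: string[] = [];
  if (m.delivery.fanoutLatencyP99 > t.maxFanoutLatencyP99) {
    breaches.push("delivery.fanoutLatencyP99");
  }
  if (m.pipeline.cdcLag > t.maxCdcLag) breaches.push("pipeline.cdcLag");
  if (m.reliability.dlqSize > t.maxDlqSize) breaches.push("reliability.dlqSize");
  if (m.reliability.webhookSuccessRate < t.minWebhookSuccessRate) {
    breaches.push("reliability.webhookSuccessRate");
  }
  return breaches;
}
```

Returning field paths rather than a single boolean keeps the result usable both for a pass/fail gate and for a readable report.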
Alerting
The Alert Center (/alerts) in the Admin UI displays system alerts and supports:
- Configurable alert rules with severity levels (info, warning, critical)
- Cooldown periods to prevent alert storms
- Environment-scoped alerts
- Alert history for audit trails
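To illustrate how a cooldown period prevents alert storms, here is a minimal sketch of the idea. It is not the Alert Center's implementation: `AlertRule`, `CooldownTracker`, and their fields are hypothetical, and cooldowns are assumed to be tracked per rule and environment.

```typescript
type Severity = "info" | "warning" | "critical";

// Hypothetical rule shape, modelled on the features listed above.
interface AlertRule {
  id: string;
  severity: Severity;
  environment: string;
  cooldownMs: number;
  condition: (value: number) => boolean;
}

class CooldownTracker {
  // Last firing time in milliseconds, keyed by "environment:ruleId".
  private lastFired: Record<string, number> = {};

  // Returns true if the rule should fire at nowMs, and records the firing.
  shouldFire(rule: AlertRule, value: number, nowMs: number): boolean {
    if (!rule.condition(value)) return false;
    const key = `${rule.environment}:${rule.id}`;
    const last = this.lastFired[key];
    if (last !== undefined && nowMs - last < rule.cooldownMs) {
      return false; // still cooling down, suppress the repeat
    }
    this.lastFired[key] = nowMs;
    return true;
  }
}
```

With a 60-second cooldown, a condition that stays true fires once and is then suppressed until 60 seconds have passed since that firing, however often it is evaluated in between.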
Logging
All services use pino structured JSON logging:
import { getLogger } from '@realtime/observability';
const logger = getLogger();
logger.info({ topic: 'session.status', eventId: 'evt_123' }, 'Event routed');
logger.error({ err, webhookId: 'wh_456' }, 'Webhook delivery failed');
Log Levels
| Level | LOG_LEVEL | Description |
|---|---|---|
| debug | debug | Verbose debugging (development only) |
| info | info | Standard operational messages (default) |
| warn | warn | Warnings that don't affect functionality |
| error | error | Errors requiring attention |
Set LOG_LEVEL in your environment configuration.
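The effect of the threshold can be sketched in a few lines. The helpers below are hypothetical and not part of `@realtime/observability`. The default of info comes from the table above; falling back to info for unrecognised values and matching case-insensitively are assumptions.

```typescript
// The four levels from the table above, ordered from most to least verbose.
const LOG_LEVELS = ["debug", "info", "warn", "error"] as const;
type LogLevel = (typeof LOG_LEVELS)[number];

// Resolves a raw LOG_LEVEL value.
// Assumption: missing or unrecognised values fall back to "info".
function resolveLogLevel(raw: string | undefined): LogLevel {
  const normalized = (raw ?? "").trim().toLowerCase();
  const idx = (LOG_LEVELS as readonly string[]).indexOf(normalized);
  return idx >= 0 ? LOG_LEVELS[idx] : "info";
}

// True when a message at `level` is emitted under the configured threshold.
function isEnabled(level: LogLevel, configured: LogLevel): boolean {
  return LOG_LEVELS.indexOf(level) >= LOG_LEVELS.indexOf(configured);
}
```

In a Node.js service the raw value would typically come from `process.env.LOG_LEVEL`. With LOG_LEVEL=warn, for example, `isEnabled("info", "warn")` is false and `isEnabled("error", "warn")` is true.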