Monitoring

The Realtime Platform exposes Prometheus-compatible metrics, a health endpoint, pre-aggregated dashboard metrics, and built-in operational dashboards in the Admin UI.

Health Check

GET /health

Returns {"status":"ok"} with HTTP 200. No authentication required. Use this for load balancer and orchestrator health probes.
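A probe client can treat anything other than HTTP 200 with a `status` of `"ok"` as unhealthy. The following is a minimal TypeScript sketch of that check. The names `HttpGet`, `isHealthyResponse`, and `probeHealth` are illustrative assumptions and are not part of the platform; only the `GET /health` path and the response body come from the description above.

```typescript
// Hypothetical helper names; only GET /health and {"status":"ok"} come from the docs.

// Minimal shape of an HTTP client, so the sketch does not depend on a specific library.
type HttpGet = (url: string) => Promise<{ status: number; json(): Promise<unknown> }>;

// Pure check: healthy only when the status code is 200 and the body is {"status":"ok"}.
function isHealthyResponse(statusCode: number, body: unknown): boolean {
  if (statusCode !== 200) return false;
  if (typeof body !== 'object' || body === null) return false;
  return (body as { status?: unknown }).status === 'ok';
}

// Calls GET /health and maps network errors or malformed bodies to "unhealthy".
async function probeHealth(baseUrl: string, get: HttpGet): Promise<boolean> {
  try {
    const res = await get(`${baseUrl}/health`);
    return isHealthyResponse(res.status, await res.json());
  } catch {
    return false;
  }
}
```

In Node 18+ or a browser, the global `fetch` can be passed as the `get` argument, for example `probeHealth('http://localhost:3000', fetch)`. Injecting the client keeps the check easy to test without a running server.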

Prometheus Metrics

GET /api/metrics/prometheus

Returns metrics in Prometheus text format for scraping. Configure your Prometheus instance to scrape this endpoint.

Available Metrics

Metric	Type	Description
`realtime_events_received_total`	Counter	Total events received by source and topic
`realtime_events_delivered_total`	Counter	Total events delivered to subscribers
`realtime_events_filtered_total`	Counter	Total events filtered out by subscription filters
`realtime_broadcast_latency_seconds`	Histogram	Event broadcast latency; use the buckets to derive percentiles such as p50, p95, and p99
`realtime_websocket_connections`	Gauge	Current active WebSocket connections
`realtime_queue_depth`	Gauge	BullMQ queue depth by queue name
`realtime_webhook_deliveries_total`	Counter	Webhook delivery attempts by status
`realtime_cdc_lag_seconds`	Gauge	CDC replication lag
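For ad hoc checks outside Prometheus, the text format can be read directly. The sketch below sums every sample of one counter or gauge across its label sets. It is a simplified parser, not a full implementation of the exposition format: it assumes label values contain no `}` characters. The function name `sumMetric` and the `queue` label in the sample text are assumptions made for illustration.

```typescript
// Sums all samples of one metric across its label sets.
// Simplification: assumes label values contain no '}' characters.
function sumMetric(exposition: string, metricName: string): number {
  let total = 0;
  for (const rawLine of exposition.split('\n')) {
    const line = rawLine.trim();
    if (line === '' || line.startsWith('#')) continue; // skip blanks, HELP and TYPE lines
    // Groups: 1 = metric name, 2 = optional {labels}, 3 = sample value.
    const match = /^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{[^}]*\})?\s+(\S+)/.exec(line);
    if (match === null || match[1] !== metricName) continue;
    const value = Number(match[3]);
    if (!Number.isNaN(value)) total += value;
  }
  return total;
}

// Hypothetical scrape output; the `queue` label name is an assumption.
const sample = [
  '# TYPE realtime_websocket_connections gauge',
  'realtime_websocket_connections 42',
  '# TYPE realtime_queue_depth gauge',
  'realtime_queue_depth{queue="webhooks"} 3',
  'realtime_queue_depth{queue="broadcast"} 2',
].join('\n');

const totalQueueDepth = sumMetric(sample, 'realtime_queue_depth'); // 5
const connections = sumMetric(sample, 'realtime_websocket_connections'); // 42
```

Histogram series such as `realtime_broadcast_latency_seconds` are exposed under suffixed names (`_bucket`, `_sum`, `_count`), so an exact-name match like this one deliberately skips them.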

Prometheus Configuration

scrape_configs:
  - job_name: 'realtime'
    scrape_interval: 15s
    static_configs:
      - targets: ['localhost:3000']
    metrics_path: '/api/metrics/prometheus'

Dashboard Metrics API

The platform provides pre-aggregated dashboard metrics:

GET /api/metrics/dashboard

Returns structured metrics grouped by the three operational dashboards (`delivery`, `pipeline`, and `reliability`):

{
  "delivery": {
    "activeConnections": 42,
    "reconnectRate": 0.02,
    "eventsInPerSec": 150,
    "eventsOutPerSec": 145,
    "fanoutLatencyP50": 12,
    "fanoutLatencyP95": 45,
    "fanoutLatencyP99": 89
  },
  "pipeline": {
    "cdcLag": 250,
    "outboxLag": 100,
    "eventsCapturedPerSec": 50,
    "eventsRoutedPerSec": 48,
    "mappingFailures": 0,
    "schemaValidationFailures": 0
  },
  "reliability": {
    "queueDepth": 5,
    "retryCount": 2,
    "dlqSize": 0,
    "webhookSuccessRate": 0.99,
    "webhookFailureRate": 0.01
  }
}
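A client can type this response and evaluate it against its own limits. The sketch below mirrors the field names from the sample response. The `Thresholds` interface, the `findConcerns` function, and every threshold value are hypothetical; the response does not state units, so limits must be expressed in whatever units the API returns.

```typescript
// Field names mirror the sample response; units are not stated in the docs.
interface DashboardMetrics {
  delivery: {
    activeConnections: number; reconnectRate: number;
    eventsInPerSec: number; eventsOutPerSec: number;
    fanoutLatencyP50: number; fanoutLatencyP95: number; fanoutLatencyP99: number;
  };
  pipeline: {
    cdcLag: number; outboxLag: number;
    eventsCapturedPerSec: number; eventsRoutedPerSec: number;
    mappingFailures: number; schemaValidationFailures: number;
  };
  reliability: {
    queueDepth: number; retryCount: number; dlqSize: number;
    webhookSuccessRate: number; webhookFailureRate: number;
  };
}

// Hypothetical limits; tune them for your deployment.
interface Thresholds {
  maxDlqSize: number;
  minWebhookSuccessRate: number;
  maxFanoutLatencyP99: number;
}

// Returns one human-readable message per limit that the snapshot violates.
function findConcerns(m: DashboardMetrics, t: Thresholds): string[] {
  const concerns: string[] = [];
  if (m.reliability.dlqSize > t.maxDlqSize) {
    concerns.push(`dlqSize ${m.reliability.dlqSize} exceeds ${t.maxDlqSize}`);
  }
  if (m.reliability.webhookSuccessRate < t.minWebhookSuccessRate) {
    concerns.push(`webhookSuccessRate ${m.reliability.webhookSuccessRate} is below ${t.minWebhookSuccessRate}`);
  }
  if (m.delivery.fanoutLatencyP99 > t.maxFanoutLatencyP99) {
    concerns.push(`fanoutLatencyP99 ${m.delivery.fanoutLatencyP99} exceeds ${t.maxFanoutLatencyP99}`);
  }
  return concerns;
}
```

With the sample response above and a `minWebhookSuccessRate` of 0.995, only the webhook success rate would be reported.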

Hot Topics

GET /api/metrics/hot-topics

Returns the most active topics with real-time statistics:

[
  {
    "topic": "session.status",
    "eventsPerSec": 25.5,
    "subscriberCount": 150,
    "p95FanoutLatency": 32
  }
]
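The response is a plain array, so ranking and rough capacity estimates can be done client-side. In the sketch below, `HotTopic` mirrors the fields of the sample entry, while `topTopics` and `maxDeliveriesPerSec` are hypothetical helpers. The delivery estimate is an upper bound that assumes every subscriber receives every event, which ignores subscription filters.

```typescript
// Mirrors one entry of the sample response.
interface HotTopic {
  topic: string;
  eventsPerSec: number;
  subscriberCount: number;
  p95FanoutLatency: number;
}

// Returns the n busiest topics by eventsPerSec without mutating the input.
function topTopics(topics: readonly HotTopic[], n: number): HotTopic[] {
  return [...topics].sort((a, b) => b.eventsPerSec - a.eventsPerSec).slice(0, n);
}

// Upper bound on deliveries per second: assumes no subscriber filters out any event.
function maxDeliveriesPerSec(t: HotTopic): number {
  return t.eventsPerSec * t.subscriberCount;
}
```

For the sample entry, `maxDeliveriesPerSec` gives 25.5 × 150 = 3825 deliveries per second at most.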

Admin UI Dashboards

The Admin UI provides three dedicated monitoring dashboards:

Realtime Delivery (`/dashboards/delivery`)

Active WebSocket connections over time
Events per second by source (database, sync, socket)
Fanout latency percentiles
Hot topics breakdown
Dropped broadcasts and duplicate suppression

Database Pipeline (`/dashboards/pipeline`)

CDC lag and outbox lag
Events captured vs routed per second
Mapping failure rate
Schema validation failure rate
Filter drop percentage

Async Reliability (`/dashboards/reliability`)

BullMQ queue depth over time
Retry counts and dead-letter queue size
Webhook success/failure rates
Worker heartbeat status

Alerting

The Alert Center (`/alerts`) in the Admin UI displays system alerts and supports:

Configurable alert rules with severity levels (info, warning, critical)
Cooldown periods to prevent alert storms
Environment-scoped alerts
Alert history for audit trails
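To illustrate how a cooldown period prevents alert storms, the sketch below suppresses repeat firings of the same rule inside its cooldown window. It is a generic illustration of the technique, not the platform's implementation: the rule schema and exact cooldown semantics are not described here, so `AlertRule`, `cooldownMs`, and `CooldownTracker` are all assumed names.

```typescript
// Severity levels as listed above; everything else in this sketch is hypothetical.
type Severity = 'info' | 'warning' | 'critical';

interface AlertRule {
  id: string;
  severity: Severity;
  cooldownMs: number; // minimum time between two firings of this rule
}

// Tracks the last firing time per rule and suppresses repeats inside the cooldown window.
class CooldownTracker {
  private lastFiredAt = new Map<string, number>();

  // Returns true when the alert should fire; nowMs is injected so the logic is testable.
  shouldFire(rule: AlertRule, nowMs: number): boolean {
    const last = this.lastFiredAt.get(rule.id);
    if (last !== undefined && nowMs - last < rule.cooldownMs) return false;
    this.lastFiredAt.set(rule.id, nowMs);
    return true;
  }
}
```

In this sketch a suppressed alert does not reset the window, so a condition that stays true fires once per cooldown period rather than never again.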

Logging

All services use pino structured JSON logging:

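import { getLogger } from '@realtime/observability';

const logger = getLogger();

// Structured fields go first, the human-readable message second.
logger.info({ topic: 'session.status', eventId: 'evt_123' }, 'Event routed');

// Pass the Error under the `err` key so pino's default serializer captures its message and stack.
const err = new Error('example failure'); // placeholder error for illustration
logger.error({ err, webhookId: 'wh_456' }, 'Webhook delivery failed');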

Log Levels

Level	`LOG_LEVEL`	Description
`debug`	`debug`	Verbose debugging (development only)
`info`	`info`	Standard operational messages (default)
`warn`	`warn`	Warnings that don't affect functionality
`error`	`error`	Errors requiring attention

Set `LOG_LEVEL` in your environment configuration. If it is not set, the level defaults to `info`.
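The level acts as a threshold: messages below the configured level are dropped. The sketch below shows that behavior with the four levels from the table. `resolveLogLevel` and `isEnabled` are hypothetical helpers, and the case-insensitive parsing and the fallback to `info` for unrecognized values are assumptions rather than documented platform behavior.

```typescript
// Levels from the table above, ordered from most to least verbose.
const LOG_LEVELS = ['debug', 'info', 'warn', 'error'] as const;
type LogLevel = (typeof LOG_LEVELS)[number];

// Reads LOG_LEVEL from an env-like object; falls back to 'info' when unset or unrecognized (assumption).
function resolveLogLevel(env: Record<string, string | undefined>): LogLevel {
  const raw = (env['LOG_LEVEL'] ?? '').trim().toLowerCase();
  const found = LOG_LEVELS.find((level) => level === raw);
  return found ?? 'info';
}

// A message is emitted when its level is at or above the configured level.
function isEnabled(configured: LogLevel, message: LogLevel): boolean {
  return LOG_LEVELS.indexOf(message) >= LOG_LEVELS.indexOf(configured);
}
```

In a Node.js service, `process.env` can be passed as the `env` argument. With `LOG_LEVEL=warn`, `info` and `debug` messages are suppressed while `warn` and `error` are kept.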