Implement Metrics and Monitoring System #27

Closed
opened 2026-02-13 18:32:09 +03:00 by NiXTheDev · 1 comment
NiXTheDev commented 2026-02-13 18:32:09 +03:00 (Migrated from github.com)

Implement a simplified internal health monitoring system for the worker pool.

Scope

Create a minimal internal health checker that monitors worker pool state for self-diagnostic purposes. No external infrastructure or HTTP endpoints.

Health Metrics to Track

  • Worker pool statistics (active workers, idle workers, queue depth)
  • Task processing duration (average, max)
  • Error rates (failed tasks vs successful)
  • Simple health status: healthy/degraded/unhealthy
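
A rough sketch of how these metrics could be accumulated in memory. The names `PoolHealthMetrics`, `TaskMetrics`, `record` and `snapshot` are hypothetical and not taken from the codebase. The sketch assumes the pool can report a duration and a success flag for each finished task.

```typescript
// Hypothetical names throughout; none of these come from the project itself.

// One point-in-time view of the metrics listed above.
interface PoolHealthMetrics {
  activeWorkers: number;
  idleWorkers: number;
  queueDepth: number;
  avgTaskDurationMs: number;
  maxTaskDurationMs: number;
  errorRate: number; // failed / (failed + successful), in the range 0..1
}

// Accumulates task outcomes in memory; nothing is persisted.
class TaskMetrics {
  private totalDurationMs = 0;
  private maxDurationMs = 0;
  private succeeded = 0;
  private failed = 0;

  // Called once per finished task (assumption: the pool knows duration and outcome).
  record(durationMs: number, ok: boolean): void {
    this.totalDurationMs += durationMs;
    this.maxDurationMs = Math.max(this.maxDurationMs, durationMs);
    if (ok) {
      this.succeeded += 1;
    } else {
      this.failed += 1;
    }
  }

  // Combines the task counters with the pool's current worker and queue counts.
  snapshot(activeWorkers: number, idleWorkers: number, queueDepth: number): PoolHealthMetrics {
    const total = this.succeeded + this.failed;
    return {
      activeWorkers,
      idleWorkers,
      queueDepth,
      avgTaskDurationMs: total === 0 ? 0 : this.totalDurationMs / total,
      maxTaskDurationMs: this.maxDurationMs,
      errorRate: total === 0 ? 0 : this.failed / total,
    };
  }
}
```

Reporting zero for the average and the error rate before any task has finished is a choice made for this sketch; a real implementation might prefer `undefined` there.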

Implementation Details

  1. Create a HealthMonitor class that collects pool statistics periodically
  2. Track metrics in-memory (no persistence needed for health checks)
  3. Expose health status via existing getStats() methods
  4. Include the health check in the /health command, if that command is added later
  5. Log health status changes at WARN level
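
The steps above might look roughly like the following. This is a sketch under assumptions, not the project's actual class. The option names (`getSnapshot`, `evaluate`, `warn`) are invented for illustration, the initial status is assumed to be healthy, and the rule that maps a snapshot to a status is passed in rather than fixed here.

```typescript
type HealthStatus = "healthy" | "degraded" | "unhealthy";

// Hypothetical minimal snapshot; the real pool stats may carry more fields.
interface PoolSnapshot {
  activeWorkers: number;
  idleWorkers: number;
  queueDepth: number;
}

interface HealthMonitorOptions {
  intervalMs: number;
  getSnapshot: () => PoolSnapshot;
  evaluate: (snapshot: PoolSnapshot) => HealthStatus;
  // Assumption: WARN-level logging is injected; falls back to console.warn.
  warn?: (message: string) => void;
}

class HealthMonitor {
  private readonly options: HealthMonitorOptions;
  private readonly warn: (message: string) => void;
  private status: HealthStatus = "healthy"; // kept in memory only
  private timer: ReturnType<typeof setInterval> | undefined;

  constructor(options: HealthMonitorOptions) {
    this.options = options;
    this.warn = options.warn ?? ((message: string) => console.warn(message));
  }

  // Runs one check and logs only when the status actually changes.
  check(): HealthStatus {
    const next = this.options.evaluate(this.options.getSnapshot());
    if (next !== this.status) {
      this.warn(`Worker pool health changed: ${this.status} -> ${next}`);
      this.status = next;
    }
    return next;
  }

  // Starts periodic collection; calling start() twice has no extra effect.
  start(): void {
    if (this.timer !== undefined) return;
    this.timer = setInterval(() => this.check(), this.options.intervalMs);
  }

  stop(): void {
    if (this.timer === undefined) return;
    clearInterval(this.timer);
    this.timer = undefined;
  }

  // Last known status, suitable for returning from getStats().
  getStatus(): HealthStatus {
    return this.status;
  }
}
```

Keeping `check()` separate from the timer lets tests drive transitions directly without waiting for an interval to elapse.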

No External Infrastructure

  • No HTTP endpoints
  • No Prometheus/metrics export
  • No additional databases
  • Pure in-memory collection

Configuration

Add environment variables:

  • HEALTH_CHECK_ENABLED - Enable health monitoring (default: true)
  • HEALTH_CHECK_INTERVAL_MS - How often to check health (default: 60000)
  • HEALTH_WORKER_THRESHOLD - Minimum worker count; below this the pool is reported as degraded (default: 1)
  • HEALTH_QUEUE_THRESHOLD - Maximum queue depth; above this the pool is reported as degraded (default: 100)
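
A possible way to read these variables and apply the two thresholds, as a sketch only. `loadHealthConfig`, `parseCount` and `evaluateHealth` are hypothetical names. Two points are assumptions: only "false" and "0" disable monitoring, and "unhealthy" means both thresholds are violated at once. The issue defines when the pool is degraded but not when it is unhealthy.

```typescript
type HealthStatus = "healthy" | "degraded" | "unhealthy";

interface HealthConfig {
  enabled: boolean;
  intervalMs: number;
  workerThreshold: number;
  queueThreshold: number;
}

// Parses an integer >= min, falling back to the default on missing or invalid input.
function parseCount(raw: string | undefined, fallback: number, min = 0): number {
  if (raw === undefined || raw.trim() === "") return fallback;
  const value = Number(raw);
  return Number.isInteger(value) && value >= min ? value : fallback;
}

// Takes the environment as a plain object so it can be tested without touching process.env.
function loadHealthConfig(env: Record<string, string | undefined>): HealthConfig {
  const enabledRaw = (env["HEALTH_CHECK_ENABLED"] ?? "true").trim().toLowerCase();
  return {
    // Assumption: only "false" and "0" disable monitoring.
    enabled: enabledRaw !== "false" && enabledRaw !== "0",
    intervalMs: parseCount(env["HEALTH_CHECK_INTERVAL_MS"], 60000, 1),
    workerThreshold: parseCount(env["HEALTH_WORKER_THRESHOLD"], 1),
    queueThreshold: parseCount(env["HEALTH_QUEUE_THRESHOLD"], 100),
  };
}

function evaluateHealth(totalWorkers: number, queueDepth: number, config: HealthConfig): HealthStatus {
  const tooFewWorkers = totalWorkers < config.workerThreshold;
  const queueTooDeep = queueDepth > config.queueThreshold;
  // Assumption: both violations together mean "unhealthy"; the issue only defines "degraded".
  if (tooFewWorkers && queueTooDeep) return "unhealthy";
  return tooFewWorkers || queueTooDeep ? "degraded" : "healthy";
}
```

In a Node.js process this would typically be called as `loadHealthConfig(process.env)`.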

Usage

Health status is available via:

  • WorkerPool.getStats().healthStatus
  • WorkerPoolV2.getStats().healthStatus
  • Log entries emitted when the status changes
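
As an illustration of consuming `getStats().healthStatus`, the sketch below reports the worst status across several pools. `StatsWithHealth`, `HasStats` and `worstStatus` are hypothetical names, and the real stats objects returned by WorkerPool and WorkerPoolV2 presumably contain more fields than the one assumed here.

```typescript
type HealthStatus = "healthy" | "degraded" | "unhealthy";

// Assumed minimal shape of what getStats() returns.
interface StatsWithHealth {
  healthStatus: HealthStatus;
}

// Anything with a getStats() method fits, so both pool classes could be passed in.
interface HasStats {
  getStats(): StatsWithHealth;
}

// Severity ordering used to pick the worst status.
const SEVERITY: Record<HealthStatus, number> = {
  healthy: 0,
  degraded: 1,
  unhealthy: 2,
};

// Returns the most severe status among the given pools ("healthy" for an empty list).
function worstStatus(pools: HasStats[]): HealthStatus {
  let worst: HealthStatus = "healthy";
  for (const pool of pools) {
    const status = pool.getStats().healthStatus;
    if (SEVERITY[status] > SEVERITY[worst]) {
      worst = status;
    }
  }
  return worst;
}
```

A caller could use this for a one-line self-diagnostic, for example when deciding what to print in a status message.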

Testing

  • Unit tests for HealthMonitor
  • Verify that the health status transitions correctly between healthy, degraded and unhealthy
  • Test threshold configurations
NiXTheDev commented 2026-02-13 18:53:42 +03:00 (Migrated from github.com)
> 3. Export metrics in Prometheus format for scraping

No external infrastructure. We do not need metrics as a whole, only a quick internal collection of pool state to gauge whether the pool is healthy.

> 4. Add optional HTTP endpoint for metrics (/metrics)

Why would we ever need this? It's not a web server, it's just a bot.
