Implement Metrics and Monitoring System #27

Closed

opened 2026-02-13 18:32:09 +03:00 by NiXTheDev · 1 comment

NiXTheDev commented

2026-02-13 18:32:09 +03:00

(Migrated from github.com)

Implement a simplified internal health monitoring system for the worker pool.

Scope

Create a minimal internal health checker that monitors worker pool state for self-diagnostic purposes. No external infrastructure or HTTP endpoints.

Health Metrics to Track

Worker pool statistics (active workers, idle workers, queue depth)
Task processing duration (average, max)
Error rates (failed tasks vs successful)
Simple health status: healthy/degraded/unhealthy
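A minimal sketch of how these metrics could be held in memory. The issue names the metrics but not any field or class names, so `HealthStatus`, `PoolSnapshot`, and `TaskMetrics` below are hypothetical, and TypeScript is assumed as the project language.

```typescript
// Hypothetical names throughout; only the metric list comes from the issue.
type HealthStatus = "healthy" | "degraded" | "unhealthy";

// One point-in-time view of the pool.
interface PoolSnapshot {
  activeWorkers: number;
  idleWorkers: number;
  queueDepth: number;
  avgTaskMs: number;
  maxTaskMs: number;
  errorRate: number; // failed / total, in the range 0..1
  healthStatus: HealthStatus;
}

// Running aggregates for task duration and error rate, kept purely in memory.
class TaskMetrics {
  private count = 0;
  private failed = 0;
  private totalMs = 0;
  private maxMs = 0;

  // Record one finished task and whether it succeeded.
  record(durationMs: number, ok: boolean): void {
    this.count += 1;
    if (!ok) this.failed += 1;
    this.totalMs += durationMs;
    this.maxMs = Math.max(this.maxMs, durationMs);
  }

  // Average, max, and error rate; all zero before any task is recorded.
  snapshot(): Pick<PoolSnapshot, "avgTaskMs" | "maxTaskMs" | "errorRate"> {
    const n = this.count;
    return {
      avgTaskMs: n === 0 ? 0 : this.totalMs / n,
      maxTaskMs: this.maxMs,
      errorRate: n === 0 ? 0 : this.failed / n,
    };
  }
}
```

Keeping only running totals (count, sum, max) avoids storing every task duration, which fits the "pure in-memory collection" constraint.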

Implementation Details

Create a HealthMonitor class that collects pool statistics periodically
Track metrics in-memory (no persistence needed for health checks)
Expose health status via existing getStats() methods
Add health check to /health command (if added later)
Log health status changes at WARN level
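The steps above could be sketched roughly as follows. Everything here is an assumption for illustration: the `HealthMonitor` constructor shape, the `PoolStats` fields, and in particular the status rule (one threshold breached means degraded, both breached means unhealthy), since the issue does not define when the pool becomes unhealthy.

```typescript
type HealthStatus = "healthy" | "degraded" | "unhealthy";

// Hypothetical subset of what the pool's getStats() might return.
interface PoolStats {
  activeWorkers: number;
  idleWorkers: number;
  queueDepth: number;
}

interface HealthThresholds {
  minWorkers: number;    // cf. HEALTH_WORKER_THRESHOLD
  maxQueueDepth: number; // cf. HEALTH_QUEUE_THRESHOLD
}

class HealthMonitor {
  private status: HealthStatus = "healthy";
  private timer: ReturnType<typeof setInterval> | undefined;
  private readonly readStats: () => PoolStats;
  private readonly thresholds: HealthThresholds;
  private readonly warn: (msg: string) => void;

  constructor(
    readStats: () => PoolStats,
    thresholds: HealthThresholds,
    warn: (msg: string) => void = (msg) => console.warn(msg),
  ) {
    this.readStats = readStats;
    this.thresholds = thresholds;
    this.warn = warn;
  }

  // Evaluate the pool once; logs at WARN only when the status changes.
  check(): HealthStatus {
    const s = this.readStats();
    const tooFewWorkers =
      s.activeWorkers + s.idleWorkers < this.thresholds.minWorkers;
    const queueTooDeep = s.queueDepth > this.thresholds.maxQueueDepth;

    // Assumed rule: both breached -> unhealthy, one breached -> degraded.
    const next: HealthStatus =
      tooFewWorkers && queueTooDeep
        ? "unhealthy"
        : tooFewWorkers || queueTooDeep
          ? "degraded"
          : "healthy";

    if (next !== this.status) {
      this.warn(`health status changed: ${this.status} -> ${next}`);
      this.status = next;
    }
    return next;
  }

  // Begin periodic checks; restarting replaces any existing timer.
  start(intervalMs: number): void {
    this.stop();
    this.timer = setInterval(() => this.check(), intervalMs);
  }

  stop(): void {
    if (this.timer !== undefined) clearInterval(this.timer);
    this.timer = undefined;
  }

  getStatus(): HealthStatus {
    return this.status;
  }
}
```

Passing the stats reader and the logger in as functions keeps the monitor independent of the pool class, so `check()` can be driven directly in tests without timers.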

No External Infrastructure

No HTTP endpoints
No Prometheus/metrics export
No additional databases
Pure in-memory collection

Configuration

Add environment variables:

HEALTH_CHECK_ENABLED - Enable health monitoring (default: true)
HEALTH_CHECK_INTERVAL_MS - Interval between health checks, in milliseconds (default: 60000)
HEALTH_WORKER_THRESHOLD - Minimum worker count; the pool is degraded below this (default: 1)
HEALTH_QUEUE_THRESHOLD - Maximum queue depth; the pool is degraded above this (default: 100)
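One way these variables might be read, as a sketch only. The `HealthConfig` shape and `loadHealthConfig` helper are hypothetical, and the parsing rules (only `false` or `0` disables monitoring, invalid numbers fall back to the default) are assumptions, not something the issue specifies.

```typescript
// Hypothetical config shape; variable names and defaults come from the issue.
interface HealthConfig {
  enabled: boolean;
  intervalMs: number;
  workerThreshold: number;
  queueThreshold: number;
}

type Env = Record<string, string | undefined>;

// Parse an integer >= min, falling back to the default on anything invalid.
function readInt(env: Env, name: string, fallback: number, min: number): number {
  const raw = env[name];
  if (raw === undefined || raw.trim() === "") return fallback;
  const n = Number(raw);
  return Number.isInteger(n) && n >= min ? n : fallback;
}

function loadHealthConfig(env: Env): HealthConfig {
  // Assumption: only an explicit "false" or "0" turns monitoring off.
  const enabledRaw = (env.HEALTH_CHECK_ENABLED ?? "true").trim().toLowerCase();
  return {
    enabled: enabledRaw !== "false" && enabledRaw !== "0",
    intervalMs: readInt(env, "HEALTH_CHECK_INTERVAL_MS", 60000, 1),
    workerThreshold: readInt(env, "HEALTH_WORKER_THRESHOLD", 1, 0),
    queueThreshold: readInt(env, "HEALTH_QUEUE_THRESHOLD", 100, 0),
  };
}

// In a Node.js process this would typically be called as:
//   const config = loadHealthConfig(process.env);
```

Taking the environment as a parameter instead of reading `process.env` directly makes the threshold configurations easy to exercise in unit tests.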

Usage

Health status available via:

WorkerPool.getStats().healthStatus
WorkerPoolV2.getStats().healthStatus
Logs when status changes
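As an illustration of consuming that field, the sketch below formats a one-line summary from any object with a `getStats()` method. The `HasStats` interface is a hypothetical stand-in for `WorkerPool` / `WorkerPoolV2`, and every field other than `healthStatus` is an assumed name.

```typescript
type HealthStatus = "healthy" | "degraded" | "unhealthy";

// Assumed shape of getStats(); only healthStatus is named in the issue.
interface StatsWithHealth {
  activeWorkers: number;
  idleWorkers: number;
  queueDepth: number;
  healthStatus: HealthStatus;
}

// Stand-in for WorkerPool / WorkerPoolV2.
interface HasStats {
  getStats(): StatsWithHealth;
}

// Build a short human-readable summary, e.g. for a log line or a
// future /health command.
function formatHealthLine(pool: HasStats): string {
  const s = pool.getStats();
  return `pool ${s.healthStatus}: ${s.activeWorkers} active, ${s.idleWorkers} idle, ${s.queueDepth} queued`;
}
```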

Testing

Unit tests for HealthMonitor
Verify that the health status transitions correctly
Test different threshold configurations
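A table-driven sketch of what such tests might cover, written without a test framework so it stays self-contained. The `classify` function and its rule (one threshold breached is degraded, both is unhealthy) are assumptions made for illustration; the real tests would target whatever rule `HealthMonitor` ends up using.

```typescript
type HealthStatus = "healthy" | "degraded" | "unhealthy";

interface Thresholds {
  minWorkers: number;
  maxQueueDepth: number;
}

// Assumed classification rule, kept pure so it is easy to test.
function classify(workers: number, queueDepth: number, t: Thresholds): HealthStatus {
  const tooFewWorkers = workers < t.minWorkers;
  const queueTooDeep = queueDepth > t.maxQueueDepth;
  if (tooFewWorkers && queueTooDeep) return "unhealthy";
  return tooFewWorkers || queueTooDeep ? "degraded" : "healthy";
}

interface Case {
  name: string;
  workers: number;
  queueDepth: number;
  thresholds: Thresholds;
  expected: HealthStatus;
}

const defaults: Thresholds = { minWorkers: 1, maxQueueDepth: 100 };

const cases: Case[] = [
  { name: "idle pool", workers: 2, queueDepth: 0, thresholds: defaults, expected: "healthy" },
  { name: "queue exactly at threshold", workers: 2, queueDepth: 100, thresholds: defaults, expected: "healthy" },
  { name: "queue above threshold", workers: 2, queueDepth: 101, thresholds: defaults, expected: "degraded" },
  { name: "no workers", workers: 0, queueDepth: 0, thresholds: defaults, expected: "degraded" },
  { name: "no workers and deep queue", workers: 0, queueDepth: 101, thresholds: defaults, expected: "unhealthy" },
  { name: "custom thresholds", workers: 3, queueDepth: 10, thresholds: { minWorkers: 4, maxQueueDepth: 5 }, expected: "unhealthy" },
];

// Returns the names of failing cases; an empty array means all passed.
function runCases(): string[] {
  return cases
    .filter((c) => classify(c.workers, c.queueDepth, c.thresholds) !== c.expected)
    .map((c) => c.name);
}
```

The boundary cases (queue depth exactly at the threshold, zero workers) are the ones most likely to expose an off-by-one in the comparison operators.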

NiXTheDev commented

2026-02-13 18:53:42 +03:00

(Migrated from github.com)

> 3. Export metrics in Prometheus format for scraping

No external infrastructure. We do not need metrics as a whole, only quick internal collection of the pool state to gauge whether it is healthy or not.

> 4. Add optional HTTP endpoint for metrics (/metrics)

Why would we ever need this? It's not a web server, it's just a bot.
