Five Best-of-Breed Components, One Cohesive Platform

Rather than relying on a single tool, GTP IT Guardian assembles the best open-source components into an integrated, enterprise-grade monitoring stack — each serving a distinct operational role.

Nagios Core

Battle-tested alert engine

Active host and service checks (ping, SSH, HTTP, disk, CPU) on a configurable schedule, raising alerts as soon as a check fails.

Prometheus

Modern time-series metrics backbone

Scrapes hundreds of metrics per second via node_exporter agents, stores them in a high-performance TSDB, and feeds Grafana and the AI engine.

Grafana

Per-tenant dashboards & visualisation

Each tenant gets their own isolated Grafana organisation with pre-provisioned dashboards scoped to their hosts — they only ever see their own data.

FastAPI Control Plane

JWT-authenticated orchestration brain

Handles tenant and host lifecycle operations, auto-provisioning Nagios configs, Prometheus file-SD entries, and Grafana organisations on every registration.

AI Engine

Isolation Forest & capacity forecasting

Builds per-host behavioural models, runs anomaly scoring every 5 minutes, and forecasts disk exhaustion via polynomial regression — catching problems before outages.

PostgreSQL & Redis

Persistent state & rate limiting

PostgreSQL stores tenants, users, hosts, billing records and refresh tokens. Redis provides sliding-window rate limiting and high-speed config caching.
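
To make the sliding-window idea concrete, here is a minimal in-memory sketch. It is not Guardian's implementation: the platform is described as using Redis for this, and the class name, limits, and in-process deque are assumptions chosen for illustration only.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class SlidingWindowLimiter:
    """In-memory sliding-window limiter (hypothetical; Guardian uses Redis)."""

    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # client key -> request timestamps

    def allow(self, key: str, now: Optional[float] = None) -> bool:
        """Record a request for `key` and report whether it is within the limit."""
        now = time.monotonic() if now is None else now
        hits = self._hits[key]
        # Drop timestamps that have slid out of the window.
        while hits and hits[0] <= now - self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True
```

Unlike a fixed window that resets on the minute, the window here moves with each request, so a burst straddling a boundary still counts against the limit.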

MinIO Object Storage

S3-compatible ML model persistence

Stores serialised Isolation Forest models and StandardScalers so the AI engine retains learned baselines across container restarts and redeployments.


Full Logical Isolation Across Every Layer

Each customer organisation gets their own isolated dashboards, alerts, and data on a single shared platform — operations teams manage centrally while tenants experience complete isolation.

PostgreSQL: All tables carry a tenant_id UUID foreign key, so tenant_admin users can only query rows scoped to their own tenant
Prometheus: Every metric scraped carries a tenant="slug" label — Grafana datasources auto-filter to that label by default
Grafana: Each tenant maps to a separate Grafana organisation — users from one org cannot see another org's dashboards or datasources
Nagios: Each tenant has an isolated config subdirectory — generated and managed by the API, tenants never touch Nagios directly
New tenant provisioning automatically creates a Grafana org, Prometheus file-SD directory, and Nagios config directory in a single API call
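
The Prometheus side of this isolation can be sketched as follows. Prometheus file-based service discovery reads JSON files containing target groups with `targets` and `labels` keys; everything else here (function names, the `hosts.json` file name, the one-directory-per-tenant layout) is an assumption for illustration, not Guardian's actual code.

```python
import json
from pathlib import Path


def build_file_sd_entry(tenant_slug: str, hosts: list[str], port: int = 9100) -> list[dict]:
    """Return one Prometheus file-SD target group labelled with the tenant slug."""
    return [{
        "targets": [f"{host}:{port}" for host in hosts],
        "labels": {"tenant": tenant_slug},
    }]


def write_file_sd(base_dir: Path, tenant_slug: str, hosts: list[str]) -> Path:
    """Write the group to <base_dir>/<tenant_slug>/hosts.json (hypothetical layout)."""
    tenant_dir = base_dir / tenant_slug
    tenant_dir.mkdir(parents=True, exist_ok=True)
    path = tenant_dir / "hosts.json"
    path.write_text(json.dumps(build_file_sd_entry(tenant_slug, hosts), indent=2))
    return path
```

Because the `tenant` label is attached at discovery time, every series scraped from those targets carries it without any change on the monitored host.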

RBAC Role Reference

super_admin: Full platform access (create/delete tenants, view all data, generate invoices, trigger model retrains)
tenant_admin: Full access to their own tenant (register/deregister hosts, view dashboards, billing, and alerts)
tenant_viewer: Read-only access to own tenant's hosts, alerts, and dashboards; ideal for NOC staff
billing_admin: Access to billing and usage data for their tenant, with no config write access
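
One way to picture the four roles is as a permission map plus a tenant-scope check. The sketch below is illustrative only: the permission strings and the `is_allowed` helper are hypothetical names derived from the role descriptions above, not Guardian's API.

```python
from enum import Enum


class Role(str, Enum):
    SUPER_ADMIN = "super_admin"
    TENANT_ADMIN = "tenant_admin"
    TENANT_VIEWER = "tenant_viewer"
    BILLING_ADMIN = "billing_admin"


# Hypothetical permission names, derived from the role descriptions.
PERMISSIONS = {
    Role.SUPER_ADMIN: {"tenants:write", "hosts:write", "hosts:read",
                       "billing:read", "models:retrain"},
    Role.TENANT_ADMIN: {"hosts:write", "hosts:read", "billing:read"},
    Role.TENANT_VIEWER: {"hosts:read"},
    Role.BILLING_ADMIN: {"billing:read"},
}


def is_allowed(role: Role, permission: str, user_tenant: str, target_tenant: str) -> bool:
    """Check a permission, confining every role except super_admin to its own tenant."""
    if role is not Role.SUPER_ADMIN and user_tenant != target_tenant:
        return False
    return permission in PERMISSIONS[role]
```

Checking tenant scope before the permission lookup means a cross-tenant request is rejected regardless of how broad the role's permissions are.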

AI-Powered Anomaly Detection & Capacity Forecasting

The Guardian AI Engine runs three ML workflows: detecting behavioural anomalies, forecasting resource exhaustion, and clustering correlated alerts into root-cause events.

Isolation Forest anomaly detection — per-host models trained on 24 hours of Prometheus data, scoring every 5 minutes across CPU, memory, disk, load, and network metrics
Capacity forecasting: polynomial regression (degree 2) on 30 days of disk usage estimates the date a partition will reach 100% utilisation
DBSCAN alert clustering — groups correlated Alertmanager alerts by label vectors, surfacing a single root-cause event from downstream alert storms
Models serialised to MinIO — baselines survive container restarts and new deployments without retraining from scratch
Manual retrain API endpoint — immediately rebuild models after bulk host registration or historical data ingestion
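
The capacity-forecasting step can be sketched with a plain least-squares quadratic fit. This is a standard-library illustration under stated assumptions (one disk-usage percentage per day, function names invented here); the AI Engine's actual fitting code is not shown in the source.

```python
import math


def fit_quadratic(ts, ys):
    """Least-squares fit of y = a + b*t + c*t^2 via the 3x3 normal equations."""
    s = [sum(t ** k for t in ts) for k in range(5)]            # sums of t^0..t^4
    r = [sum(y * t ** k for t, y in zip(ts, ys)) for k in range(3)]
    m = [[s[i + j] for j in range(3)] + [r[i]] for i in range(3)]
    for col in range(3):                                        # Gauss-Jordan elimination
        pivot = max(range(col, 3), key=lambda row: abs(m[row][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(3):
            if row != col:
                factor = m[row][col] / m[col][col]
                m[row] = [x - factor * p for x, p in zip(m[row], m[col])]
    return tuple(m[i][3] / m[i][i] for i in range(3))          # (a, b, c)


def days_until_full(ts, ys, limit=100.0):
    """Days after the last sample until the fitted curve reaches `limit`, or None."""
    a, b, c = fit_quadratic(ts, ys)
    last = ts[-1]
    if abs(c) < 1e-12:                                          # effectively linear
        roots = [(limit - a) / b] if abs(b) > 1e-12 else []
    else:
        disc = b * b - 4 * c * (a - limit)
        if disc < 0:
            return None                                         # curve never reaches limit
        roots = [(-b + sign * math.sqrt(disc)) / (2 * c) for sign in (1, -1)]
    future = [t for t in roots if t > last]
    return min(future) - last if future else None
```

For example, `days_until_full(list(range(30)), usage_percent)` takes 30 daily samples and returns the estimated number of days remaining. The result is an extrapolation of the fitted curve, so it is only as reliable as the assumption that recent growth continues.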
AI Engine — Live Status

Avg Anomaly Score: 0.09
Days to Disk Full: 142
Detection Cycle: 5 min
Forecast Confidence: 0.87

Isolation Forest: contamination=0.05 · 6 feature metrics · StandardScaler
Disk Predictor: Polynomial regression deg 2 · 30-day window · per host
Alert Clusterer: DBSCAN eps=0.5 · min_samples=3 · on-demand

Comprehensive Host & Infrastructure Alert Rules

Guardian ships with pre-configured Prometheus alert rules and Alertmanager routing that deduplicates and suppresses related alerts, preventing alert storms while ensuring no incident goes unnoticed.

Nagios active checks: ping, SSH, disk, CPU, memory, HTTP/HTTPS with configurable warning and critical thresholds
Prometheus alert rules covering CPU, memory, disk, host availability, container resources, SLA breach risk, and platform health
Alertmanager deduplication — groups by alertname + tenant + instance with configurable group wait and repeat intervals
Inhibition rules suppress matching warning alerts when a critical alert fires — eliminating downstream noise
Notification channels: Slack, Microsoft Teams, and SMTP email — configured via environment variables, no code changes required
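
The grouping and inhibition behaviour described above can be modelled in a few lines. This is a simplified stand-in for what Alertmanager does internally, not its real code; the alert dictionaries and the choice of `tenant` and `instance` as the matching labels are assumptions.

```python
from collections import defaultdict


def group_alerts(alerts):
    """Group alerts by (alertname, tenant, instance), as in the routing described above."""
    groups = defaultdict(list)
    for alert in alerts:
        labels = alert["labels"]
        key = (labels["alertname"], labels["tenant"], labels["instance"])
        groups[key].append(alert)
    return dict(groups)


def inhibit_warnings(alerts, match_on=("tenant", "instance")):
    """Drop warning alerts whose `match_on` labels equal those of a critical alert."""
    critical = {
        tuple(a["labels"][k] for k in match_on)
        for a in alerts
        if a["labels"]["severity"] == "critical"
    }
    return [
        a for a in alerts
        if not (a["labels"]["severity"] == "warning"
                and tuple(a["labels"][k] for k in match_on) in critical)
    ]
```

A host that is both above 85% and above 95% CPU would therefore produce one critical notification rather than a critical and a warning.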

Alert Rules Reference

Host Down (2 min): Critical
CPU > 95% for 5 min: Critical
CPU > 85% for 5 min: Warning
Memory > 85% for 5 min: Warning
Disk < 10% free: Critical
Disk < 20% free: Warning
Container Memory > 85%: Warning
SLA Breach Risk (<99.9% uptime): Warning
GTP API Down (1 min): Critical
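
The "for 5 min" qualifier means a threshold must be breached continuously before the alert fires. The sketch below illustrates that logic for the two CPU rules; the sample format and function names are hypothetical, and in the real stack Prometheus evaluates this itself from its rule files.

```python
def firing(samples, threshold, for_seconds):
    """Return True once value > threshold has held continuously for `for_seconds`.

    `samples` is a list of (timestamp_seconds, value) pairs, oldest first.
    """
    breach_start = None
    for ts, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = ts      # start of the current unbroken breach
        else:
            breach_start = None        # any dip resets the timer
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start >= for_seconds


def cpu_severity(samples):
    """Map CPU samples to a severity using the 95% and 85% thresholds listed above."""
    if firing(samples, 95.0, 300):
        return "critical"
    if firing(samples, 85.0, 300):
        return "warning"
    return None
```

Requiring a sustained breach is what keeps a brief CPU spike from paging anyone.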

From Demo to Production in Minutes

Guardian ships as a fully containerised stack supporting both Docker Compose and Kubernetes. A zero-config demo stack with six pre-seeded target systems lets you evaluate the full platform without any prerequisites.

Docker Compose demo with six pre-seeded hosts — start the complete platform with make demo, no credentials required
Production Kubernetes manifests with HPA auto-scaling — API scales 2–10 replicas on CPU/memory pressure, AI Engine scales 1–4 replicas
Let's Encrypt TLS with Nginx reverse proxy — automated certificate provisioning via make ssl-letsencrypt
Makefile-driven CLI tenant onboarding — make onboard-tenant provisions a full tenant in one command
Automated nightly backup — PostgreSQL, Grafana dashboards, Prometheus rules, and Nagios config archived with 30-day retention

Quick Start — Demo Stack

# Clone and launch the full demo
git clone https://github.com/your-org/gtp-saas-monitoring
cd gtp-saas-monitoring
make demo

Services Available

:3000  Grafana: per-tenant dashboards
:8000  FastAPI Control Plane + Swagger UI
:8001  AI Engine: anomaly & forecasting API
:8080  Nagios Core web interface
:9090  Prometheus TSDB
:9093  Alertmanager

Transparent Per-Host Pricing with Stripe Integration

Guardian includes built-in multi-tenant billing with Stripe — usage snapshots, invoice generation, and overage tracking all managed through the control plane API.

Free: 5 hosts included · $5.00 / host / month · $2.50 overage
Starter: 10 hosts included · $5.00 / host / month · $2.50 overage
Professional: 50 hosts included · $5.00 / host / month · AI anomaly detection included
Enterprise: hosts (configurable) · $5.00 / host / month · Custom max_hosts via API

What Makes GTP IT Guardian Unique

True Multi-Tenancy

Complete data isolation at every layer — PostgreSQL, Prometheus, Grafana, and Nagios — with automatic provisioning on tenant creation.

AI Anomaly Engine

Isolation Forest models built per host from 24 hours of Prometheus data — scores every 5 minutes and fires webhooks on threshold breach.

Enterprise Security

Short-lived JWT access tokens, rotating refresh tokens, RBAC with four roles, Redis sliding-window rate limiting, and full HTTPS with HSTS.

Built-In Billing

Stripe integration with per-host pricing, overage tracking, usage snapshots, and invoice generation — all managed through the REST API.

Docker & Kubernetes Ready

Full Kubernetes manifests with HPA auto-scaling — deploy to EKS, GKE, AKS, or self-managed clusters with cert-manager TLS.

Zero-Downtime Registration

Host registration triggers Nagios config, Prometheus file-SD, and Grafana provisioning simultaneously — monitoring begins within 60 seconds, no restarts required.

Multi-Channel Alerting

Slack, Microsoft Teams, and SMTP email — configured via environment variables with Alertmanager routing, deduplication, and silence management.

Automated Backup & Recovery

Nightly cron-scheduled backups of PostgreSQL, Grafana, Prometheus rules, and Nagios config, with a documented Docker volume restore procedure for disaster recovery.


Platform Architecture

All external traffic enters through Nginx (TLS 1.2/1.3, HSTS, rate limiting) and routes to the FastAPI control plane, Grafana, and Nagios. Prometheus scrapes every registered host via file service-discovery.

Internet / Client Browser: HTTPS :443

Nginx Reverse Proxy: TLS 1.2/1.3 · HSTS · CSP · Rate Limiting · X-Frame-Options

▼ routes to ▼

FastAPI Control Plane :8000 · JWT · RBAC · Rate Limiting · Tenant & Host Orchestration
Grafana :3000 · Per-tenant Orgs · Pre-provisioned Dashboards
Nagios Core :8080 · Active Host & Service Checks · Alert State Engine

▼ backed by ▼

Prometheus :9090 · TSDB 90d · File SD scraping
AI Engine :8001 · Isolation Forest · DBSCAN
PostgreSQL :5432 · Tenants · Users · Billing
Redis :6379 · Rate limiter · Cache
MinIO :9000 · ML model persistence

Monitored targets (node_exporter :9100 per host): Host A (tenant-1) · Host B (tenant-1) · Host C (tenant-2)

Deploy Enterprise IT Monitoring for Your Clients Today

Schedule a demo to see GTP IT Guardian in action — from the zero-config demo stack through to full Kubernetes production deployment with multi-tenant billing and AI anomaly detection.
