GTP IT Guardian

Platform Overview

Five Best-of-Breed Components, One Cohesive Platform

Rather than relying on a single tool, GTP IT Guardian assembles the best open-source components into an integrated, enterprise-grade monitoring stack - each serving a distinct operational role.

Nagios Core

Battle-tested alert engine

Runs active host and service checks (ping, SSH, HTTP, disk, CPU) on a configurable schedule and raises an alert as soon as a check fails.

Prometheus

Modern time-series metrics backbone

Scrapes hundreds of metrics per second via node_exporter agents, stores them in a high-performance TSDB, and feeds Grafana and the AI engine.

Grafana

Per-tenant dashboards & visualisation

Each tenant gets their own isolated Grafana organisation with pre-provisioned dashboards scoped to their hosts - they only ever see their own data.

FastAPI Control Plane

JWT-authenticated orchestration brain

Handles tenant and host lifecycle operations, auto-provisioning Nagios configs, Prometheus file-SD entries, and Grafana organisations on every registration.
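As a rough illustration of the Nagios half of that auto-provisioning step, the sketch below renders a host object definition and the per-tenant file it might be written to. The function names, the `generic-host` template, and the directory layout are assumptions made for this example, not the control plane's actual code.

```python
from pathlib import PurePosixPath


def render_nagios_host(tenant_slug: str, host_name: str, address: str) -> str:
    """Return a Nagios host object definition as text (hypothetical layout)."""
    return (
        "define host {\n"
        "    use        generic-host\n"  # assumes a 'generic-host' template is defined
        f"    host_name  {tenant_slug}-{host_name}\n"
        f"    alias      {host_name} ({tenant_slug})\n"
        f"    address    {address}\n"
        "}\n"
    )


def config_path(base_dir: str, tenant_slug: str, host_name: str) -> PurePosixPath:
    """One .cfg file per host inside the tenant's own subdirectory (assumed layout)."""
    return PurePosixPath(base_dir) / tenant_slug / f"{host_name}.cfg"


print(render_nagios_host("acme", "web01", "10.0.0.5"))
```

Prefixing the host name with the tenant slug is one simple way to keep names unique when several tenants register hosts with the same name.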

AI Engine

Isolation Forest & capacity forecasting

Builds per-host behavioural models, runs anomaly scoring every 5 minutes, and forecasts disk exhaustion via polynomial regression - catching problems before outages.

PostgreSQL & Redis

Persistent state & rate limiting

PostgreSQL stores tenants, users, hosts, billing records and refresh tokens. Redis provides sliding-window rate limiting and high-speed config caching.
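To make the sliding-window idea concrete, here is a minimal in-memory sketch. It is a stand-in only: the class name and parameters are hypothetical, and a Redis-backed limiter would keep the timestamps in Redis (commonly one sorted set per client) rather than in a Python dictionary.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class SlidingWindowLimiter:
    """Allow at most `limit` requests per key within any `window_seconds` span."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self._hits = defaultdict(deque)  # key -> timestamps of accepted requests

    def allow(self, key: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        hits = self._hits[key]
        # Drop timestamps that have slid out of the window.
        while hits and hits[0] <= now - self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True
```

Unlike a fixed-window counter, this approach never lets a burst straddle a window boundary, because the window is always measured back from the current request.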

MinIO Object Storage

S3-compatible ML model persistence

Stores serialised Isolation Forest models and StandardScalers so the AI engine retains learned baselines across container restarts and redeployments.

Multi-Tenancy

Full Logical Isolation Across Every Layer

Each customer organisation gets their own isolated dashboards, alerts, and data on a single shared platform - operations teams manage centrally while tenants experience complete isolation.

PostgreSQL: All tables carry a tenant_id UUID foreign key - TENANT_ADMIN users only query rows scoped to their own tenant

Prometheus: Every metric scraped carries a tenant="slug" label - Grafana datasources auto-filter to that label by default

Grafana: Each tenant maps to a separate Grafana organisation - users from one org cannot see another org's dashboards or datasources

Nagios: Each tenant has an isolated config subdirectory - generated and managed by the API, tenants never touch Nagios directly

New tenant provisioning automatically creates a Grafana org, Prometheus file-SD directory, and Nagios config directory in a single API call
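The Prometheus side of this provisioning can be sketched as follows. Prometheus file-based service discovery reads JSON files containing a list of target groups, each with `targets` and `labels`. The function name and sample addresses are assumptions; the `tenant` label and the node_exporter port 9100 come from the platform description.

```python
import json


def file_sd_entry(tenant_slug: str, hosts: list, port: int = 9100) -> list:
    """Build one file-SD target group whose targets all carry the tenant label."""
    return [
        {
            "targets": [f"{host}:{port}" for host in hosts],
            "labels": {"tenant": tenant_slug},
        }
    ]


# Example: the content that could be written to a per-tenant file-SD JSON file.
print(json.dumps(file_sd_entry("acme", ["10.0.0.5", "10.0.0.6"]), indent=2))
```

Because the label is attached at discovery time, every series scraped from those targets carries `tenant="acme"` without any change on the monitored host.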

RBAC Role Reference

super_admin

Full platform access - create/delete tenants, view all data, generate invoices, trigger model retrains

tenant_admin

Full access to their own tenant - register/deregister hosts, view dashboards, billing, and alerts

tenant_viewer

Read-only access to own tenant's hosts, alerts, and dashboards - ideal for NOC staff

billing_admin

Access to billing and usage data for their tenant - no config write access
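One way to read the role table is as a permission map plus a tenant-scope check. The sketch below is an interpretation of the descriptions above: the permission strings and the `is_allowed` helper are hypothetical names, not the platform's API.

```python
from enum import Enum


class Role(str, Enum):
    SUPER_ADMIN = "super_admin"
    TENANT_ADMIN = "tenant_admin"
    TENANT_VIEWER = "tenant_viewer"
    BILLING_ADMIN = "billing_admin"


# Hypothetical permission names derived from the role descriptions.
READ_ONLY = {"host:read", "alert:read", "dashboard:read"}
PERMISSIONS = {
    Role.SUPER_ADMIN: READ_ONLY | {
        "host:write", "billing:read", "tenant:create", "tenant:delete",
        "invoice:generate", "model:retrain",
    },
    Role.TENANT_ADMIN: READ_ONLY | {"host:write", "billing:read"},
    Role.TENANT_VIEWER: set(READ_ONLY),
    Role.BILLING_ADMIN: {"billing:read"},
}


def is_allowed(role: Role, permission: str, user_tenant: str, resource_tenant: str) -> bool:
    """Check the permission first, then restrict non-super roles to their own tenant."""
    if permission not in PERMISSIONS[role]:
        return False
    return role is Role.SUPER_ADMIN or user_tenant == resource_tenant
```

Keeping the tenant check separate from the permission lookup means a role can never gain cross-tenant access simply by being granted an extra permission.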

Intelligent Monitoring

AI-Powered Anomaly Detection & Capacity Forecasting

The Guardian AI Engine runs three ML workflows - detecting behavioural anomalies, forecasting resource exhaustion, and clustering correlated alerts into root-cause events.

Isolation Forest anomaly detection - per-host models trained on 24 hours of Prometheus data, scoring every 5 minutes across CPU, memory, disk, load, and network metrics

Capacity forecasting - polynomial regression (degree 2) on 30 days of disk usage estimates the date a partition will reach 100% utilisation

DBSCAN alert clustering - groups correlated Alertmanager alerts by label vectors, surfacing a single root-cause event from downstream alert storms

Models serialised to MinIO - baselines survive container restarts and new deployments without retraining from scratch

Manual retrain API endpoint - immediately rebuild models after bulk host registration or historical data ingestion
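The per-host anomaly workflow can be sketched with scikit-learn's `IsolationForest` and `StandardScaler`, using the `contamination=0.05` setting quoted for the engine. Everything else here is an assumption for illustration: the six feature names (network split into receive and transmit), the synthetic baseline, and a 5-minute sample resolution giving 288 rows for 24 hours.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

# Hypothetical feature order; the source only states "6 feature metrics".
FEATURES = ["cpu", "memory", "disk", "load", "net_rx", "net_tx"]


def train_host_model(samples: np.ndarray, seed: int = 0):
    """Fit a scaler and an Isolation Forest on one host's baseline window."""
    scaler = StandardScaler().fit(samples)
    model = IsolationForest(contamination=0.05, random_state=seed)
    model.fit(scaler.transform(samples))
    return scaler, model


def score(scaler, model, sample: np.ndarray):
    """Return (raw score, is_anomaly); lower raw scores are more abnormal."""
    x = scaler.transform(sample.reshape(1, -1))
    return float(model.score_samples(x)[0]), bool(model.predict(x)[0] == -1)


# Synthetic "normal" behaviour standing in for 24 hours of Prometheus data.
rng = np.random.default_rng(0)
baseline = rng.normal(
    loc=[30, 50, 40, 1.0, 200, 150],
    scale=[5, 5, 2, 0.2, 20, 20],
    size=(288, len(FEATURES)),
)
scaler, model = train_host_model(baseline)
print(score(scaler, model, np.array([99, 99, 95, 12.0, 5000, 5000])))
```

Both fitted objects would need to be persisted together, since a model scored against differently scaled inputs gives meaningless results.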

AI Engine - Live Status

Avg Anomaly Score: 0.09

Days to Disk Full: 142

Detection Cycle: 5 min

Forecast Confidence: 0.87

Isolation Forest

contamination=0.05 · 6 feature metrics · StandardScaler

Disk Predictor

Polynomial regression deg 2 · 30-day window · per host

Alert Clusterer

DBSCAN eps=0.5 · min_samples=3 · on-demand
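For the disk predictor, a degree-2 fit and its crossing of 100% can be sketched with NumPy as below. The function name, the one-year horizon, and the synthetic usage series are assumptions; the engine's real data handling and confidence scoring are not described in enough detail to reproduce.

```python
import numpy as np


def days_until_full(days: np.ndarray, usage_pct: np.ndarray, horizon_days: float = 365.0):
    """Fit usage = a*d^2 + b*d + c and return days from the last sample until
    the curve reaches 100%, or None if it does not within the horizon."""
    a, b, c = np.polyfit(days, usage_pct, 2)
    roots = np.roots([a, b, c - 100.0])  # solve a*d^2 + b*d + (c - 100) = 0
    last = days[-1]
    future = [
        r.real for r in roots
        if abs(r.imag) < 1e-9 and last < r.real <= last + horizon_days
    ]
    return (min(future) - last) if future else None


# Synthetic 30-day window: usage starts at 50% and growth accelerates slowly.
d = np.arange(30.0)
usage = 50 + 0.5 * d + 0.01 * d**2
print(days_until_full(d, usage))
```

A quadratic extrapolates aggressively, so capping the forecast horizon and reporting "no exhaustion expected" beyond it is a reasonable safeguard against misleading dates.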

Alerting

Comprehensive Host & Infrastructure Alert Rules

Guardian ships with pre-configured Prometheus alert rules and Alertmanager routing that deduplicates and suppresses redundant alerts - preventing alert storms while ensuring no incident goes unnoticed.

Nagios active checks: ping, SSH, disk, CPU, memory, HTTP/HTTPS with configurable warning and critical thresholds

Prometheus alert rules covering CPU, memory, disk, host availability, container resources, SLA breach risk, and platform health

Alertmanager deduplication - groups by alertname + tenant + instance with configurable group wait and repeat intervals

Inhibition rules suppress matching warning alerts when a critical alert fires - eliminating downstream noise

Notification channels: Slack, Microsoft Teams, and SMTP email - configured via environment variables, no code changes required
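Alertmanager itself is driven by configuration rather than code, but the grouping and inhibition semantics described above can be illustrated in a few lines of Python. The grouping labels come from the list above; which labels must match for a warning to be inhibited (`tenant` and `instance` here) is an assumption.

```python
from collections import defaultdict

GROUP_BY = ("alertname", "tenant", "instance")


def group_alerts(alerts: list) -> dict:
    """Bucket alerts that share the same alertname, tenant, and instance."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert["labels"].get(name, "") for name in GROUP_BY)
        groups[key].append(alert)
    return dict(groups)


def apply_inhibition(alerts: list, equal=("tenant", "instance")) -> list:
    """Drop warning alerts whose `equal` labels match a firing critical alert."""
    def key(alert):
        return tuple(alert["labels"].get(name) for name in equal)

    critical = {key(a) for a in alerts if a["labels"].get("severity") == "critical"}
    return [
        a for a in alerts
        if not (a["labels"].get("severity") == "warning" and key(a) in critical)
    ]
```

With this logic, a host that crosses both the 85% and 95% CPU thresholds produces one critical notification rather than a warning and a critical.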

Alert Rules Reference

Host Down (2 min): Critical

CPU > 95% for 5 min: Critical

CPU > 85% for 5 min: Warning

Memory > 85% for 5 min: Warning

Disk < 10% free: Critical

Disk < 20% free: Warning

Container Memory > 85%: Warning

SLA Breach Risk (<99.9% uptime): Warning

GTP API Down (1 min): Critical
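To put the 99.9% uptime threshold in concrete terms, the arithmetic below converts it into a downtime budget. The 30-day measurement window and the function names are assumptions; the rule's actual evaluation window is not stated.

```python
def uptime_percent(window_minutes: float, downtime_minutes: float) -> float:
    """Share of the window during which the host was up, as a percentage."""
    return 100.0 * (1.0 - downtime_minutes / window_minutes)


def downtime_budget(window_minutes: float, target_pct: float = 99.9) -> float:
    """Minutes of downtime the target allows within the window."""
    return window_minutes * (100.0 - target_pct) / 100.0


def sla_at_risk(window_minutes: float, downtime_minutes: float, target_pct: float = 99.9) -> bool:
    return uptime_percent(window_minutes, downtime_minutes) < target_pct


THIRTY_DAYS = 30 * 24 * 60  # 43,200 minutes (assumed window)
print(round(downtime_budget(THIRTY_DAYS), 1))  # roughly 43.2 minutes
```

Under that assumption, a host with 50 minutes of downtime in 30 days has already fallen below 99.9%, while one with 30 minutes has not.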

Deployment

From Demo to Production in Minutes

Guardian ships as a fully containerised stack supporting both Docker Compose and Kubernetes. A zero-config demo stack with six pre-seeded target systems lets you evaluate the full platform without any setup beyond Docker.

Docker Compose demo with six pre-seeded hosts - start the complete platform with make demo, no credentials required

Production Kubernetes manifests with HPA auto-scaling - API scales 2–10 replicas on CPU/memory pressure, AI Engine scales 1–4 replicas

Let's Encrypt TLS with Nginx reverse proxy - automated certificate provisioning via make ssl-letsencrypt

Makefile-driven CLI tenant onboarding - make onboard-tenant provisions a full tenant in one command

Automated nightly backup - PostgreSQL, Grafana dashboards, Prometheus rules, and Nagios config archived with 30-day retention
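The replica ranges above follow from how the Kubernetes Horizontal Pod Autoscaler computes its target: the current replica count is scaled by the ratio of observed to target utilisation, rounded up, then clamped to the configured bounds. The sketch below shows that core formula only; it ignores the HPA's tolerance band and stabilisation windows, and the utilisation targets used in the example are assumptions.

```python
import math


def desired_replicas(current: int, observed_util: float, target_util: float,
                     min_replicas: int, max_replicas: int) -> int:
    """Core HPA rule: ceil(current * observed / target), clamped to [min, max]."""
    raw = math.ceil(current * observed_util / target_util)
    return max(min_replicas, min(max_replicas, raw))


# API tier (2-10 replicas) at 90% CPU against an assumed 60% target.
print(desired_replicas(2, 90, 60, min_replicas=2, max_replicas=10))
```

The clamp is what keeps the API at two replicas during quiet periods and stops the AI Engine from growing past four under sustained load.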

Quick Start - Demo Stack


    # Clone and launch the full demo
    git clone https://github.com/your-org/gtp-saas-monitoring
    cd gtp-saas-monitoring
    make demo

Services Available

:3000 Grafana - Per-tenant dashboards

:8000 FastAPI Control Plane + Swagger UI

:8001 AI Engine - Anomaly & forecasting API

:8080 Nagios Core web interface

:9090 Prometheus TSDB

:9093 Alertmanager

Billing & Plans

Transparent Per-Host Pricing with Stripe Integration

Guardian includes built-in multi-tenant billing with Stripe - usage snapshots, invoice generation, and overage tracking all managed through the control plane API.

Free

5 hosts included · $5.00 / host / month · $2.50 overage

Starter

10 hosts included · $5.00 / host / month · $2.50 overage

Professional

50 hosts included · $5.00 / host / month · AI anomaly detection included

Enterprise

∞ hosts (configurable) · $5.00 / host / month · Custom max_hosts via API

Key Capabilities

What Makes GTP IT Guardian Unique

True Multi-Tenancy

Complete data isolation at every layer - PostgreSQL, Prometheus, Grafana, and Nagios - with automatic provisioning on tenant creation.

AI Anomaly Engine

Isolation Forest models built per host from 24 hours of Prometheus data - scores every 5 minutes and fires webhooks on threshold breach.

Enterprise Security

JWT short-lived tokens, rotating refresh tokens, RBAC with four roles, Redis sliding-window rate limiting, and full HTTPS with HSTS.
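The short-lived token mechanism can be illustrated with a standard-library HS256 sketch. This is for explanation only: a production service would normally use a maintained JWT library, and the claim names, the 15-minute lifetime, and the function names are all assumptions.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(secret: bytes, signing_input: str) -> str:
    return _b64url(hmac.new(secret, signing_input.encode(), hashlib.sha256).digest())


def issue_token(secret: bytes, sub: str, tenant_id: str, role: str,
                ttl_seconds: int = 900, now: Optional[int] = None) -> str:
    """Create an HS256-signed token that expires after ttl_seconds."""
    now = int(time.time()) if now is None else now
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    claims = {"sub": sub, "tenant_id": tenant_id, "role": role,
              "iat": now, "exp": now + ttl_seconds}
    payload = _b64url(json.dumps(claims).encode())
    return f"{header}.{payload}.{_sign(secret, f'{header}.{payload}')}"


def verify_token(secret: bytes, token: str, now: Optional[int] = None) -> Optional[dict]:
    """Return the claims if the signature is valid and the token is unexpired."""
    now = int(time.time()) if now is None else now
    parts = token.split(".")
    if len(parts) != 3:
        return None
    header, payload, signature = parts
    expected = _sign(secret, f"{header}.{payload}")
    # Constant-time comparison avoids leaking signature bytes through timing.
    if not hmac.compare_digest(expected.encode(), signature.encode()):
        return None
    claims = json.loads(_b64url_decode(payload))
    return claims if claims.get("exp", 0) > now else None
```

Carrying `tenant_id` and `role` in the signed claims lets each request be scoped and authorised without a database lookup, while the short expiry limits how long a leaked token stays useful.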

Built-In Billing

Stripe integration with per-host pricing, overage tracking, usage snapshots, and invoice generation - all managed through the REST API.

Docker & Kubernetes Ready

Full Kubernetes manifests with HPA auto-scaling - deploy to EKS, GKE, AKS, or self-managed clusters with cert-manager TLS.

Zero-Downtime Registration

Host registration triggers Nagios config, Prometheus file-SD, and Grafana provisioning simultaneously - monitoring begins within 60 seconds, no restarts required.

Multi-Channel Alerting

Slack, Microsoft Teams, and SMTP email - configured via environment variables with Alertmanager routing, deduplication, and silence management.

Automated Backup & Recovery

Nightly cron-scheduled backups of PostgreSQL, Grafana, Prometheus rules, and Nagios config - with documented Docker volume recovery for full disaster recovery.

Platform Architecture

All external traffic enters through Nginx (TLS 1.2/1.3, HSTS, rate limiting) and routes to the FastAPI control plane, Grafana, and Nagios. Prometheus scrapes every registered host via file service-discovery.

Internet / Client Browser - HTTPS :443

▼

Nginx Reverse Proxy

TLS 1.2/1.3 · HSTS · CSP · Rate Limiting · X-Frame-Options

▼ routes to ▼

FastAPI Control Plane :8000

JWT · RBAC · Rate Limiting
Tenant & Host Orchestration

Grafana :3000

Per-tenant Orgs
Pre-provisioned Dashboards

Nagios Core :8080

Active Host & Service Checks
Alert State Engine

▼ backed by ▼

Prometheus :9090

TSDB 90d · File SD scraping

AI Engine :8001

Isolation Forest · DBSCAN

PostgreSQL :5432

Tenants · Users · Billing

Redis :6379

Rate limiter · Cache

MinIO :9000

ML model persistence

Monitored targets (node_exporter :9100 per host)

Host A (tenant-1)

Host B (tenant-1)

Host C (tenant-2)

Deploy Enterprise IT Monitoring for Your Clients Today