Production Operations Runbook: Zero-TrustML Credits DNA¶
Version: 1.0 Last Updated: October 1, 2025 Status: Production Ready
Overview¶
This runbook provides operational procedures for running Zero-TrustML Credits DNA in production. It covers monitoring, alerting, incident response, and recovery procedures.
Table of Contents¶
- System Architecture
- Monitoring Setup
- Key Performance Indicators (KPIs)
- Alerting Configuration
- Incident Response
- Recovery Procedures
- Maintenance Operations
- Troubleshooting Guide
System Architecture¶
Components¶
┌─────────────────────────────────────────────────────────┐
│ Zero-TrustML Credits System │
├─────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Zero-TrustML │ │ Credits │ │ Holochain │ │
│ │ Reputation │─→│ Integration │─→│ Credits │ │
│ │ System │ │ │ │ Bridge │ │
│ └─────────────┘ └──────────────┘ └──────────────┘ │
│ ↓ ↓ ↓ │
│ ┌─────────────────────────────────────────────────┐ │
│ │ Production Monitor │ │
│ │ • Metrics Collector │ │
│ │ • Alert Manager │ │
│ │ • Health Checks │ │
│ └─────────────────────────────────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────────────────┐ │
│ │ Monitoring Dashboard │ │
│ │ • Grafana (metrics visualization) │ │
│ │ • Prometheus (time-series storage) │ │
│ │ • AlertManager (notifications) │ │
│ └─────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────┘
Dependencies¶
- Python 3.13.7+: Runtime environment
- Holochain Conductor: WebSocket API for credit storage
- PostgreSQL (optional): Persistent storage backend
- Prometheus (recommended): Metrics storage
- Grafana (recommended): Metrics visualization
Monitoring Setup¶
1. Start Production Monitor¶
from src.monitoring.production_monitor import ProductionMonitor
import asyncio
# Initialize monitor with 60-second check interval
monitor = ProductionMonitor(check_interval_seconds=60)
# Start monitoring in background
monitor_task = asyncio.create_task(monitor.start())
# Monitor will run continuously, logging metrics and alerts
2. Integration with Zero-TrustML Credits¶
from zerotrustml_credits_integration import Zero-TrustMLCreditsIntegration
from src.monitoring.production_monitor import ProductionMonitor
# Initialize both systems
monitor = ProductionMonitor()
integration = Zero-TrustMLCreditsIntegration(bridge)
# Hook monitoring into credit events
async def on_credit_issued(node_id, credits, event_type, response_time):
# Issue credit
credit_id = await integration.on_quality_gradient(...)
# Record metrics
monitor.metrics_collector.record_credit_issued(
node_id=node_id,
credits=credits,
event_type=event_type,
response_time_ms=response_time
)
async def on_byzantine_detected(detector_id, detected_id, evidence):
# Handle detection
await integration.on_byzantine_detection(...)
# Record metrics
monitor.metrics_collector.record_byzantine_detection(
detector_id=detector_id,
detected_id=detected_id,
evidence=evidence
)
3. Export Metrics to Prometheus (Optional)¶
from prometheus_client import Gauge, Counter, Histogram, start_http_server
# Define metrics
credits_issued = Counter('zerotrustml_credits_issued_total', 'Total credits issued')
response_time = Histogram('zerotrustml_response_time_seconds', 'Response time')
byzantine_detections = Counter('zerotrustml_byzantine_detections_total', 'Byzantine detections')
active_nodes = Gauge('zerotrustml_active_nodes', 'Active nodes')
# Start Prometheus exporter on port 8000
start_http_server(8000)
# Update metrics from monitor
def export_metrics():
metrics = monitor.metrics_collector.get_current_metrics()
active_nodes.set(metrics.active_nodes)
# ... update other metrics
Key Performance Indicators (KPIs)¶
Throughput KPIs¶
| Metric | Target | Warning Threshold | Critical Threshold |
|---|---|---|---|
| Events per minute | 10-50 | <5 | <2 |
| Credits per hour | 1,000-10,000 | <500 | <100 |
| Average response time | <1000ms | >3000ms | >5000ms |
| P99 response time | <3000ms | >5000ms | >10000ms |
Security KPIs¶
| Metric | Target | Warning Threshold | Critical Threshold |
|---|---|---|---|
| Byzantine detection rate | >95% | <90% | <85% |
| False positive rate | <1% | >2% | >5% |
| Detection latency | <5s | >10s | >30s |
Availability KPIs¶
| Metric | Target | Warning Threshold | Critical Threshold |
|---|---|---|---|
| System uptime | >99.9% | <99.5% | <99% |
| Healthy node % | >95% | <90% | <85% |
| Error rate | <0.1% | >1% | >5% |
| Rate limit violations | <10/hour | >50/hour | >100/hour |
Capacity KPIs¶
| Metric | Target | Warning Threshold | Critical Threshold |
|---|---|---|---|
| Active nodes | 50-100 | >150 | >200 |
| Memory usage | <2GB | >4GB | >6GB |
| Storage growth | <100MB/day | >500MB/day | >1GB/day |
Alerting Configuration¶
Alert Severity Levels¶
- INFO: Informational, no action required
- WARNING: Action required within 24 hours
- CRITICAL: Immediate action required
Alert Categories¶
1. Performance Alerts¶
WARNING: High Average Response Time - Trigger: Avg response time > 3000ms - Impact: Degraded user experience - Resolution: 1. Check system load: top, htop 2. Check Holochain conductor status 3. Review recent code changes 4. Scale horizontally if sustained
CRITICAL: Very High P99 Response Time - Trigger: P99 response time > 10000ms - Impact: Some requests timing out - Resolution: 1. Immediate investigation required 2. Check for database bottlenecks 3. Review slow query logs 4. Consider circuit breaker activation
2. Security Alerts¶
WARNING: Low Detection Rate - Trigger: Byzantine detection rate < 90% - Impact: Potential undetected malicious activity - Resolution: 1. Verify PoGQ validation system status 2. Check validator configuration 3. Review recent Byzantine attack patterns 4. Adjust detection thresholds if needed
INFO: High Detection Activity - Trigger: >10 Byzantine detections in 1 hour - Impact: Possible coordinated attack - Resolution: 1. Monitor pattern of detections 2. Verify detections are valid (not false positives) 3. Consider temporarily increasing reputation penalties
3. Availability Alerts¶
CRITICAL: High Error Rate - Trigger: Error rate > 5% - Impact: System unreliable, users affected - Resolution: 1. Check application logs: tail -f /var/log/zerotrustml/app.log 2. Verify Holochain connectivity: hc sandbox call 3. Check database connection pool 4. Activate backup systems if available
CRITICAL: Low Healthy Node Percentage - Trigger: Healthy nodes < 85% of active nodes - Impact: Network degradation, reduced redundancy - Resolution: 1. Identify unhealthy nodes: monitor.get_status_report() 2. Check network connectivity 3. Review node logs for failures 4. Restart unhealthy nodes if needed
4. Capacity Alerts¶
WARNING: High Rate Limit Violations - Trigger: >50 rate limit violations per hour - Impact: Potential spam attack or legitimate growth - Resolution: 1. Identify nodes hitting limits 2. Analyze patterns (spam vs legitimate) 3. Adjust rate limits if legitimate growth 4. Ban nodes if confirmed spam attack
WARNING: High Memory Usage - Trigger: Memory usage > 4GB - Impact: Risk of OOM crashes - Resolution: 1. Check for memory leaks: ps aux --sort=-%mem 2. Review metrics collector window size 3. Clear old metrics if safe 4. Scale up instance if sustained
Incident Response¶
Incident Response Process¶
┌──────────────┐
│ Detect │ ← Alert triggered or user report
└──────┬───────┘
↓
┌──────────────┐
│ Assess │ ← Determine severity and impact
└──────┬───────┘
↓
┌──────────────┐
│ Respond │ ← Execute recovery procedures
└──────┬───────┘
↓
┌──────────────┐
│ Resolve │ ← Verify system restored
└──────┬───────┘
↓
┌──────────────┐
│ Post-Mortem │ ← Document and improve
└──────────────┘
Common Incidents¶
Incident 1: System Unresponsive¶
Symptoms: - Response times > 10 seconds - Timeouts on API calls - No credits being issued
Diagnosis:
# Check system status
python -c "from src.monitoring.production_monitor import *; monitor = ProductionMonitor(); print(monitor.get_status_report())"
# Check Holochain conductor
hc sandbox list
hc sandbox call --running
# Check resource usage
top
df -h
Resolution: 1. If Holochain down: Restart conductor
-
If database connection issue: Restart database
-
If application hung: Restart application
-
If resource exhaustion: Scale up or free resources
Incident 2: Byzantine Detection Failure¶
Symptoms: - Detection rate suddenly drops below 85% - Known malicious nodes not being caught - Reputation system not updating
Diagnosis:
# Check recent detections
from zerotrustml_credits_integration import Zero-TrustMLCreditsIntegration
integration = Zero-TrustMLCreditsIntegration(bridge)
audit = await integration.get_audit_trail("all")
# Filter Byzantine detection events
detections = [e for e in audit if e['event_type'] == 'byzantine_detection']
print(f"Recent detections: {len(detections)}")
Resolution: 1. Verify PoGQ validation: Check quality scores are being calculated 2. Check validator configuration: Ensure validators are active 3. Review detection thresholds: May need adjustment 4. Check for validator Byzantine nodes: Validators themselves may be compromised
Incident 3: Data Corruption¶
Symptoms: - Inconsistent credit balances - Audit trail gaps - Failed validations
Diagnosis:
# Check Holochain DHT consistency
hc sandbox call zerotrustml_credits get_all_credits
# Check database integrity (if using PostgreSQL)
psql -U zerotrustml -d credits -c "SELECT COUNT(*) FROM credits WHERE created_at > NOW() - INTERVAL '1 hour';"
Resolution: 1. If DHT inconsistency: Run DHT repair
-
If database corruption: Restore from backup
-
If irrecoverable: Manual reconciliation required
Recovery Procedures¶
Backup and Restore¶
1. Database Backup (PostgreSQL)¶
Daily Backup:
#!/bin/bash
# /opt/zerotrustml/scripts/backup.sh
DATE=$(date +%Y%m%d_%H%M%S)
BACKUP_DIR="/backups/zerotrustml"
DB_NAME="credits"
# Create backup
pg_dump -U zerotrustml $DB_NAME > $BACKUP_DIR/credits_$DATE.sql
# Compress
gzip $BACKUP_DIR/credits_$DATE.sql
# Keep last 30 days
find $BACKUP_DIR -name "credits_*.sql.gz" -mtime +30 -delete
echo "Backup completed: credits_$DATE.sql.gz"
Schedule via cron:
Restore from Backup:
# Stop application
sudo systemctl stop zerotrustml-credits
# Restore database
gunzip -c /backups/zerotrustml/credits_20251001_020000.sql.gz | psql -U zerotrustml credits
# Restart application
sudo systemctl start zerotrustml-credits
# Verify
python -c "from src.monitoring.production_monitor import *; monitor = ProductionMonitor(); print(monitor.get_status_report())"
2. Holochain DHT Backup¶
Export DHT State:
# Export all credits from DHT
hc sandbox call zerotrustml_credits export_state > /backups/holochain/dht_export_$(date +%Y%m%d).json
Import DHT State:
# Re-import state (requires custom zome function)
cat /backups/holochain/dht_export_20251001.json | hc sandbox call zerotrustml_credits import_state
Disaster Recovery¶
Full System Failure¶
Prerequisites: - Recent database backup - Recent DHT export - System configuration backup
Recovery Steps:
-
Provision new infrastructure
-
Restore database
-
Restore Holochain DHT
-
Deploy application
-
Verify recovery
Expected Recovery Time: 1-2 hours
Maintenance Operations¶
Routine Maintenance¶
Daily Tasks (Automated)¶
- ✅ Database backup (2 AM)
- ✅ DHT export (2:30 AM)
- ✅ Log rotation (3 AM)
- ✅ Metrics cleanup (4 AM)
Weekly Tasks (Manual)¶
Monday: Review system health
# Generate weekly report
python scripts/generate_weekly_report.py
# Review alerts from past week
python -c "from src.monitoring.production_monitor import *; monitor = ProductionMonitor(); print(monitor.alert_manager.alert_history[-100:])"
# Check for security updates
nix flake update
Wednesday: Performance review
# Analyze response times
python scripts/analyze_performance.py --days 7
# Review Byzantine detection patterns
python scripts/analyze_detections.py --days 7
Friday: Capacity planning
Monthly Tasks¶
- Review and update alert thresholds
- Test disaster recovery procedures
- Update documentation
- Conduct security audit
- Review rate limits
Scaling Operations¶
Horizontal Scaling (Add Nodes)¶
# 1. Provision new node
# 2. Install dependencies
# 3. Configure to join network
# 4. Start monitoring
# Verify node joined
python -c "from src.monitoring.production_monitor import *; monitor = ProductionMonitor(); print(monitor.metrics_collector.active_nodes)"
Vertical Scaling (Upgrade Resources)¶
# 1. Schedule maintenance window
# 2. Backup current state
# 3. Stop application
sudo systemctl stop zerotrustml-credits
# 4. Upgrade instance (more CPU/RAM/storage)
# 5. Restart application
sudo systemctl start zerotrustml-credits
# 6. Verify performance improvement
Troubleshooting Guide¶
Problem: High Latency¶
Symptoms: Response times > 3 seconds
Checks: 1. System resources: top, htop, iostat 2. Database performance: Check slow query log 3. Network latency: ping Holochain conductor 4. Cache hit rate: Review metrics
Solutions: - Optimize database queries - Increase cache size - Add indexes - Scale horizontally
Problem: Memory Leak¶
Symptoms: Memory usage continuously growing
Checks: 1. Monitor memory over time: ps aux --sort=-%mem 2. Review metrics collector window size 3. Check for lingering connections
Solutions:
# Reduce metrics window size
monitor = ProductionMonitor()
monitor.metrics_collector.window_size = 30 # Reduce from 60 minutes
# Clear old metrics
monitor.metrics_collector.credit_events.clear()
Problem: Connection Failures¶
Symptoms: "Failed to connect to Holochain"
Checks: 1. Holochain conductor running: hc sandbox list 2. WebSocket port accessible: telnet localhost 8888 3. Firewall rules: sudo iptables -L
Solutions:
# Restart conductor
hc sandbox clean
hc sandbox run
# Check logs
tail -f ~/.holochain/logs/conductor.log
Problem: Unexpected Byzantine Detections¶
Symptoms: Legitimate nodes flagged as Byzantine
Checks: 1. Review PoGQ scores: Check if legitimately low quality 2. Check detection evidence: Review anomaly patterns 3. Verify validator accuracy: Cross-check multiple validators
Solutions: - Adjust PoGQ thresholds if too strict - Review detection algorithm - Manually restore reputation if false positive
Contact Information¶
On-Call Rotation¶
| Role | Primary | Backup |
|---|---|---|
| Production Engineer | [email protected] | [email protected] |
| Security Engineer | [email protected] | - |
| Database Admin | [email protected] | [email protected] |
Escalation Path¶
- Level 1: On-call engineer (initial response within 15 minutes)
- Level 2: Senior engineer (escalate if not resolved in 1 hour)
- Level 3: System architect (escalate if critical and not resolved in 4 hours)
Appendix¶
Useful Commands¶
# Check system status
systemctl status zerotrustml-credits
# View application logs
tail -f /var/log/zerotrustml/app.log
# Monitor in real-time
watch -n 5 'python -c "from src.monitoring.production_monitor import *; monitor = ProductionMonitor(); print(monitor.get_status_report())"'
# Export metrics
curl http://localhost:8000/metrics
# Check Holochain DHT
hc sandbox call zerotrustml_credits get_network_stats
Log Locations¶
- Application logs:
/var/log/zerotrustml/app.log - Monitoring logs:
/var/log/zerotrustml/monitor.log - Holochain conductor:
~/.holochain/logs/conductor.log - PostgreSQL logs:
/var/log/postgresql/postgresql-15-main.log
Configuration Files¶
- Application config:
/etc/zerotrustml/config.yaml - Monitoring config:
/etc/zerotrustml/monitoring.yaml - Holochain conductor:
~/.holochain/conductor-config.yaml
Last reviewed: October 1, 2025 Next review: November 1, 2025