Complete markdown files cleanup and organization

- Organized 252 files across project
- Root directory: 187 → 2 files (98.9% reduction)
- Moved configuration guides to docs/04-configuration/
- Moved troubleshooting guides to docs/09-troubleshooting/
- Moved quick start guides to docs/01-getting-started/
- Moved reports to reports/ directory
- Archived temporary files
- Generated comprehensive reports and documentation
- Created maintenance scripts and guides

All files organized according to established standards.
2026-01-06 01:46:25 -08:00
parent 1edcec953c
commit cb47cce074
1327 changed files with 217220 additions and 801 deletions
--- a/scripts/cloudflare-tunnels/docs/MONITORING_GUIDE.md
+++ b/scripts/cloudflare-tunnels/docs/MONITORING_GUIDE.md
@@ -0,0 +1,363 @@
+# Monitoring Guide
+
+Complete guide for monitoring Cloudflare tunnels.
+
+## Overview
+
+Monitoring ensures your tunnels are healthy and alerts you to issues before they impact users.
+
+## Monitoring Components
+
+1. **Health Checks** - Verify tunnels are running
+2. **Connectivity Tests** - Verify DNS and HTTPS work
+3. **Log Monitoring** - Watch for errors
+4. **Alerting** - Notify on failures
+
+## Quick Start
+
+### One-Time Health Check
+
+```bash
+./scripts/check-tunnel-health.sh
+```
+
+### Continuous Monitoring
+
+```bash
+# Foreground (see output)
+./scripts/monitor-tunnels.sh
+
+# Background (daemon mode)
+./scripts/monitor-tunnels.sh --daemon
+```
+
+## Health Check Script
+
+The `check-tunnel-health.sh` script performs comprehensive checks:
+
+### Checks Performed
+
+1. **Service Status** - Is the systemd service running?
+2. **Log Errors** - Are there recent errors in logs?
+3. **DNS Resolution** - Does DNS resolve correctly?
+4. **HTTPS Connectivity** - Can we connect via HTTPS?
+5. **Internal Connectivity** - Can VMID 102 reach Proxmox hosts?
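+
+The service, DNS, and HTTPS checks above can be sketched as a small wrapper that runs a command and prints a pass/fail line. This is a minimal sketch under stated assumptions, not the contents of `check-tunnel-health.sh`: the `run_check` and `check_tunnel` names, the `[✗]` marker, the use of `getent` for DNS, and the 10-second timeout are all hypothetical choices.
+
+```bash
+# Hypothetical helper: run a command quietly and report pass/fail.
+# Usage: run_check "label" command [args...]
+run_check() {
+    local label="$1"
+    shift
+    if "$@" >/dev/null 2>&1; then
+        printf '[✓] %s\n' "$label"
+    else
+        printf '[✗] %s\n' "$label"
+        return 1
+    fi
+}
+
+# Hypothetical per-tunnel check covering the service, DNS, and HTTPS steps.
+# Returns non-zero if any single check failed.
+check_tunnel() {
+    local name="$1" host="$2" failed=0
+    run_check "Service is running" systemctl is-active --quiet "cloudflared-$name" || failed=1
+    run_check "DNS resolution" getent hosts "$host" || failed=1
+    # -f treats HTTP status >= 400 as a failure; relax this if the site sits behind a login.
+    run_check "HTTPS connectivity" curl -fsS --max-time 10 -o /dev/null "https://$host" || failed=1
+    return "$failed"
+}
+```
+
+Calling `check_tunnel ml110 ml110-01.d-bis.org` would print one line per check and return non-zero if any check failed, which suits the systemd and cron setups described later in this guide.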
+
+### Usage
+
+```bash
+# Run health check
+./scripts/check-tunnel-health.sh
+
+# Output shows:
+# - Service status for each tunnel
+# - DNS resolution status
+# - HTTPS connectivity
+# - Internal connectivity
+# - Recent errors
+```
+
+### Example Output
+
+```
+━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
+Tunnel: ml110 (ml110-01.d-bis.org)
+━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
+[✓] Service is running
+[✓] No recent errors in logs
+[✓] DNS resolution: OK
+  → 104.16.132.229
+[✓] HTTPS connectivity: OK
+[✓] Internal connectivity to 192.168.11.10:8006: OK
+```
+
+## Monitoring Script
+
+The `monitor-tunnels.sh` script provides continuous monitoring:
+
+### Features
+
+- ✅ Continuous health checks
+- ✅ Automatic restart on failure
+- ✅ Alerting on failures
+- ✅ Logging to file
+- ✅ Daemon mode support
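+
+The check, restart, and alert cycle behind these features might look roughly like the following. It is a sketch only, not the actual `monitor-tunnels.sh`: the `check_and_restart` and `monitor_loop` names are hypothetical, and the `service_down` argument is borrowed from the alert test command used elsewhere in this guide.
+
+```bash
+# Hypothetical single pass: restart a tunnel service if it is not active.
+# Prints what it did; returns 0 if healthy, 1 if a restart was attempted.
+check_and_restart() {
+    local name="$1"
+    local unit="cloudflared-$name"
+    if systemctl is-active --quiet "$unit"; then
+        echo "$unit: healthy"
+        return 0
+    fi
+    echo "$unit: down, restarting"
+    systemctl restart "$unit"
+    return 1
+}
+
+# Hypothetical loop: check every named tunnel, alert on failure, then sleep.
+# CHECK_INTERVAL, LOG_FILE and ALERT_SCRIPT mirror the settings in the Configuration section.
+monitor_loop() {
+    local interval="${CHECK_INTERVAL:-60}"
+    local log="${LOG_FILE:-/var/log/cloudflared-monitor.log}"
+    local name
+    while true; do
+        for name in "$@"; do
+            if ! check_and_restart "$name" >>"$log"; then
+                "${ALERT_SCRIPT:-./scripts/alert-tunnel-failure.sh}" "$name" service_down
+            fi
+        done
+        sleep "$interval"
+    done
+}
+```
+
+Under these assumptions, `monitor_loop ml110 r630-01 r630-02` would run until interrupted, appending one status line per tunnel to the log on every pass.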
+
+### Usage
+
+```bash
+# Foreground mode (see output)
+./scripts/monitor-tunnels.sh
+
+# Daemon mode (background)
+./scripts/monitor-tunnels.sh --daemon
+
+# Check if daemon is running (pgrep does not match itself, unlike ps | grep)
+pgrep -af monitor-tunnels
+
+# Stop daemon
+kill "$(cat /tmp/cloudflared-monitor.pid)"
+```
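+
+Because the daemon records its PID in `/tmp/cloudflared-monitor.pid`, a script can confirm that the recorded process is still alive before trying to stop it. A minimal sketch, assuming the file holds a single PID; the `daemon_running` name is hypothetical.
+
+```bash
+# Hypothetical helper: succeed only if the PID file exists and names a live process.
+daemon_running() {
+    local pid_file="${1:-/tmp/cloudflared-monitor.pid}"
+    local pid
+    [ -r "$pid_file" ] || return 1
+    pid=$(cat "$pid_file")
+    # kill -0 sends no signal; it only tests whether the process can be signalled,
+    # so run this as the daemon's user or as root.
+    [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null
+}
+```
+
+For example, `daemon_running && kill "$(cat /tmp/cloudflared-monitor.pid)"` avoids signalling a stale PID left behind after a crash.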
+
+### Configuration
+
+Edit the script to customize:
+
+```bash
+CHECK_INTERVAL=60        # Check every 60 seconds
+LOG_FILE="/var/log/cloudflared-monitor.log"
+ALERT_SCRIPT="./scripts/alert-tunnel-failure.sh"
+```
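+
+Rather than editing the script on each host, the same settings could take their defaults from the environment. This is a sketch that assumes the script were adapted this way; it is not how `monitor-tunnels.sh` is known to behave, and `validate_interval` is a hypothetical helper.
+
+```bash
+# Keep the documented defaults, but let the environment override them.
+CHECK_INTERVAL="${CHECK_INTERVAL:-60}"
+LOG_FILE="${LOG_FILE:-/var/log/cloudflared-monitor.log}"
+ALERT_SCRIPT="${ALERT_SCRIPT:-./scripts/alert-tunnel-failure.sh}"
+
+# Hypothetical guard: accept only a positive whole number of seconds.
+validate_interval() {
+    case "$1" in
+        ''|*[!0-9]*) return 1 ;;
+    esac
+    [ "$1" -gt 0 ]
+}
+```
+
+With that change, `CHECK_INTERVAL=30 ./scripts/monitor-tunnels.sh` would check every 30 seconds without touching the file, and `validate_interval "$CHECK_INTERVAL"` could reject a bad value before the loop starts.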
+
+## Alerting
+
+### Email Alerts
+
+Configure email alerts in `alert-tunnel-failure.sh`:
+
+```bash
+# Set email address
+export ALERT_EMAIL="admin@yourdomain.com"
+
+# Install the mail command (Debian/Ubuntu)
+sudo apt-get install -y mailutils
+```
+
+### Webhook Alerts
+
+Configure webhook alerts (Slack, Discord, etc.):
+
+```bash
+# Set webhook URL
+export ALERT_WEBHOOK="https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
+```
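+
+A webhook alert comes down to building a small JSON body and posting it. The sketch below assumes a Slack-style payload with a single `text` field; the `build_alert_payload` and `send_webhook_alert` names and the message wording are hypothetical, not taken from `alert-tunnel-failure.sh`.
+
+```bash
+# Hypothetical helper: build a Slack-style JSON payload for a tunnel failure.
+# Escapes backslashes and double quotes; control characters are not handled.
+build_alert_payload() {
+    local tunnel="$1" reason="$2"
+    local text="Tunnel $tunnel failed: $reason"
+    text=$(printf '%s' "$text" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')
+    printf '{"text":"%s"}\n' "$text"
+}
+
+# Hypothetical sender: POST the payload, or do nothing if ALERT_WEBHOOK is unset.
+send_webhook_alert() {
+    [ -n "${ALERT_WEBHOOK:-}" ] || return 0
+    build_alert_payload "$1" "$2" |
+        curl -fsS --max-time 10 -X POST -H 'Content-Type: application/json' \
+            --data @- "$ALERT_WEBHOOK"
+}
+```
+
+Services that expect a different field name (Discord's native webhooks use `content`, for instance) would need the payload adjusted accordingly.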
+
+### Test Alerts
+
+```bash
+# Test alert script
+./scripts/alert-tunnel-failure.sh ml110 service_down
+```
+
+## Log Monitoring
+
+### View Logs
+
+```bash
+# All tunnels (quote the pattern so the shell does not expand it)
+journalctl -u 'cloudflared-*' -f
+
+# Specific tunnel
+journalctl -u cloudflared-ml110 -f
+
+# Last 100 lines
+journalctl -u cloudflared-ml110 -n 100
+
+# Since specific time
+journalctl -u cloudflared-ml110 --since "1 hour ago"
+```
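+
+To turn these log views into a check that a script can act on, the recent entries can be filtered and counted. A rough sketch, assuming error lines carry the word `ERR` or `ERROR` (verify this against your own cloudflared output); both function names are hypothetical.
+
+```bash
+# Hypothetical helper: count log lines on stdin that carry an error marker.
+# grep -c prints 0 and exits non-zero when nothing matches, hence the "|| true".
+count_error_lines() {
+    grep -cwE 'ERR|ERROR' || true
+}
+
+# Hypothetical wrapper: errors logged by one tunnel in the last five minutes.
+recent_errors() {
+    journalctl -u "cloudflared-$1" --since "5 minutes ago" --no-pager -q | count_error_lines
+}
+```
+
+A health check could then treat a non-zero result from `recent_errors ml110` as a warning rather than a hard failure, since a tunnel often logs transient errors while reconnecting.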
+
+### Log Rotation
+
+journald rotates the systemd journal automatically. If cloudflared also writes log files under `/var/log/cloudflared/`, rotate those with logrotate:
+
+```bash
+# Edit logrotate config
+sudo nano /etc/logrotate.d/cloudflared
+
+# Add:
+/var/log/cloudflared/*.log {
+    daily
+    rotate 7
+    compress
+    delaycompress
+    missingok
+    notifempty
+}
+```
+
+## Metrics
+
+### Cloudflare Dashboard
+
+View tunnel metrics in the Cloudflare dashboard:
+
+1. **Go to:** Zero Trust → Networks → Tunnels
+2. **Click on tunnel** to view:
+   - Connection status
+   - Uptime
+   - Traffic statistics
+   - Error rates
+
+### Local Metrics
+
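+Each tunnel serves Prometheus-format metrics on a local address if cloudflared is started with the `--metrics` option (for example `--metrics 127.0.0.1:9091`):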
+
+```bash
+# ml110 tunnel metrics
+curl http://127.0.0.1:9091/metrics
+
+# r630-01 tunnel metrics
+curl http://127.0.0.1:9092/metrics
+
+# r630-02 tunnel metrics
+curl http://127.0.0.1:9093/metrics
+```
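+
+The endpoint returns Prometheus text format, so a single value can be pulled out with `awk`. This is a minimal sketch; `metric_value` is a hypothetical helper, it only handles metrics without labels, and the metric name in the usage note is an assumption to confirm against your own `/metrics` output.
+
+```bash
+# Hypothetical helper: print the value of an unlabelled metric from
+# Prometheus text-format input on stdin. Exits non-zero if it is absent.
+metric_value() {
+    awk -v name="$1" '$1 == name { print $2; found = 1 } END { exit !found }'
+}
+```
+
+For example, `curl -fsS http://127.0.0.1:9091/metrics | metric_value cloudflared_tunnel_ha_connections` would print the number of active edge connections, assuming that metric is present in your cloudflared version.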
+
+## Automated Monitoring Setup
+
+### Systemd Timer (Recommended)
+
+Create a systemd timer for automated health checks:
+
+```bash
+# Create timer unit
+sudo nano /etc/systemd/system/cloudflared-healthcheck.timer
+
+# Add:
+[Unit]
+Description=Cloudflare Tunnel Health Check Timer
+Requires=cloudflared-healthcheck.service
+
+[Timer]
+OnBootSec=5min
+OnUnitActiveSec=5min
+Unit=cloudflared-healthcheck.service
+
+[Install]
+WantedBy=timers.target
+```
+
+```bash
+# Create service unit
+sudo nano /etc/systemd/system/cloudflared-healthcheck.service
+
+# Add:
+[Unit]
+Description=Cloudflare Tunnel Health Check
+After=network.target
+
+[Service]
+Type=oneshot
+ExecStart=/path/to/scripts/check-tunnel-health.sh
+StandardOutput=journal
+StandardError=journal
+```
+
+```bash
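+# Reload unit files, then enable and start the timer
+sudo systemctl daemon-reload
+sudo systemctl enable --now cloudflared-healthcheck.timer
+
+# Confirm the timer is scheduled
+systemctl list-timers cloudflared-healthcheck.timer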
+```
+
+### Cron Job (Alternative)
+
+```bash
+# Edit crontab
+crontab -e
+
+# Add (check every 5 minutes):
+*/5 * * * * /path/to/scripts/check-tunnel-health.sh >> /var/log/tunnel-health.log 2>&1
+```
+
+## Monitoring Best Practices
+
+1. ✅ **Run health checks regularly** - At least every 5 minutes
+2. ✅ **Monitor logs** - Watch for errors
+3. ✅ **Set up alerts** - Get notified immediately on failures
+4. ✅ **Review metrics** - Track trends over time
+5. ✅ **Test alerts** - Verify alerting works
+6. ✅ **Document incidents** - Keep track of issues
+
+## Integration with Monitoring Systems
+
+### Prometheus
+
+If using Prometheus, you can scrape tunnel metrics:
+
+```yaml
+# prometheus.yml
+scrape_configs:
+  - job_name: 'cloudflared'
+    static_configs:
+      - targets: ['127.0.0.1:9091', '127.0.0.1:9092', '127.0.0.1:9093']
+```
+
+### Grafana
+
+Create dashboards in Grafana:
+- Tunnel uptime
+- Connection status
+- Error rates
+- Response times
+
+### Nagios/Icinga
+
+Create service checks:
+```bash
+# Check service status
+check_nrpe -H localhost -c check_cloudflared_ml110
+
+# Check connectivity
+check_http -H ml110-01.d-bis.org -S
+```
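+
+The `check_cloudflared_ml110` NRPE command needs a plugin behind it. The sketch below shows what such a plugin body might look like, following the Nagios convention of exit code 0 for OK and 2 for CRITICAL; the `check_cloudflared` function name is hypothetical and the NRPE command definition itself is not shown.
+
+```bash
+# Hypothetical plugin body: map systemd state to Nagios exit codes
+# (0 = OK, 2 = CRITICAL) and print the one-line status Nagios expects.
+check_cloudflared() {
+    local unit="cloudflared-$1"
+    if systemctl is-active --quiet "$unit"; then
+        echo "OK - $unit is active"
+        return 0
+    fi
+    echo "CRITICAL - $unit is not active"
+    return 2
+}
+```
+
+A plugin script would end with `check_cloudflared ml110; exit $?` so that NRPE passes the status code back to the monitoring server.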
+
+## Troubleshooting Monitoring
+
+### Health Check Fails
+
+```bash
+# Run manually with verbose output
+bash -x ./scripts/check-tunnel-health.sh
+
+# Check individual components
+systemctl status cloudflared-ml110
+dig ml110-01.d-bis.org
+curl -I https://ml110-01.d-bis.org
+```
+
+### Monitor Script Not Working
+
+```bash
+# Check if daemon is running (pgrep does not match itself, unlike ps | grep)
+pgrep -af monitor-tunnels
+
+# Check log file
+tail -f /var/log/cloudflared-monitor.log
+
+# Run in foreground to see errors
+./scripts/monitor-tunnels.sh
+```
+
+### Alerts Not Sending
+
+```bash
+# Test alert script
+./scripts/alert-tunnel-failure.sh ml110 service_down
+
+# Check email configuration
+echo "Test" | mail -s "Test" admin@yourdomain.com
+
+# Check webhook
+curl -X POST -H "Content-Type: application/json" \
+  -d '{"text":"test"}' "$ALERT_WEBHOOK"
+```
+
+## Next Steps
+
+After setting up monitoring:
+
+1. ✅ Verify health checks run successfully
+2. ✅ Test alerting (trigger a test failure)
+3. ✅ Set up log aggregation (if needed)
+4. ✅ Create dashboards (if using Grafana)
+5. ✅ Document monitoring procedures
+
+## Support
+
+For monitoring issues:
+1. Check [Troubleshooting Guide](TROUBLESHOOTING.md)
+2. Review script logs
+3. Test components individually
+4. Check systemd service status
+