scripts/cloudflare-tunnels/docs/MONITORING_GUIDE.md
# Monitoring Guide

Complete guide for monitoring Cloudflare tunnels.

## Overview

Monitoring detects unhealthy tunnels and alerts you to problems before they affect users.

## Monitoring Components

1. **Health Checks** - Verify that the tunnels are running
2. **Connectivity Tests** - Verify that DNS and HTTPS work
3. **Log Monitoring** - Watch the logs for errors
4. **Alerting** - Notify you when something fails
## Quick Start

### One-Time Health Check

```bash
./scripts/check-tunnel-health.sh
```

### Continuous Monitoring

```bash
# Foreground (see output)
./scripts/monitor-tunnels.sh

# Background (daemon mode)
./scripts/monitor-tunnels.sh --daemon
```
## Health Check Script

The `check-tunnel-health.sh` script runs the following checks for each tunnel.

### Checks Performed

1. **Service Status** - Is the tunnel's systemd service running?
2. **Log Errors** - Do the logs contain recent errors?
3. **DNS Resolution** - Does the public hostname resolve?
4. **HTTPS Connectivity** - Does the hostname respond over HTTPS?
5. **Internal Connectivity** - Can VMID 102 reach the Proxmox hosts?

### Usage

```bash
# Run the health check
./scripts/check-tunnel-health.sh

# The output shows, for each tunnel:
# - Service status
# - DNS resolution status
# - HTTPS connectivity
# - Internal connectivity
# - Recent errors
```
### Example Output

```
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Tunnel: ml110 (ml110-01.d-bis.org)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
[✓] Service is running
[✓] No recent errors in logs
[✓] DNS resolution: OK
    → 104.16.132.229
[✓] HTTPS connectivity: OK
[✓] Internal connectivity to 192.168.11.10:8006: OK
```
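The five checks map onto standard tools. The following is a minimal sketch of how one tunnel's checks could be implemented; it is not the contents of `check-tunnel-health.sh`. It assumes units named `cloudflared-<name>`, cloudflared's default log format (error lines contain `ERR`), and the Proxmox web UI as the internal target. `status_line`, `check_tunnel`, and the 15-minute log window are hypothetical choices.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of per-tunnel health checks (not the real
# check-tunnel-health.sh). Assumes systemd units named cloudflared-<name>.

# Print one result line in the style of the example output.
status_line() {
    if [[ "$1" == "ok" ]]; then
        printf '[✓] %s\n' "$2"
    else
        printf '[✗] %s\n' "$2"
    fi
}

# Run the five checks for one tunnel; return non-zero if any check fails.
check_tunnel() {
    local name="$1" hostname="$2" internal_url="$3"
    local unit="cloudflared-${name}" target="${internal_url#https://}" failures=0 ip

    # 1. Service status
    if systemctl is-active --quiet "$unit"; then
        status_line ok "Service is running"
    else
        status_line fail "Service is not running"; failures=$((failures + 1))
    fi

    # 2. Log errors: lines with cloudflared's "ERR" level in the last 15 minutes
    if journalctl -u "$unit" --since "15min ago" --no-pager -q | grep -q ' ERR '; then
        status_line fail "Recent errors in logs"; failures=$((failures + 1))
    else
        status_line ok "No recent errors in logs"
    fi

    # 3. DNS resolution: first IPv4 address from the system resolver
    ip=$(getent ahostsv4 "$hostname" | awk 'NR == 1 { print $1 }')
    if [[ -n "$ip" ]]; then
        status_line ok "DNS resolution: OK"
        printf '    → %s\n' "$ip"
    else
        status_line fail "DNS resolution failed"; failures=$((failures + 1))
    fi

    # 4. HTTPS through Cloudflare; -f treats HTTP status >= 400 as a failure
    if curl -fsS -o /dev/null --max-time 10 "https://${hostname}"; then
        status_line ok "HTTPS connectivity: OK"
    else
        status_line fail "HTTPS connectivity failed"; failures=$((failures + 1))
    fi

    # 5. Internal target; -k because Proxmox typically uses a self-signed certificate
    if curl -kfsS -o /dev/null --max-time 5 "$internal_url"; then
        status_line ok "Internal connectivity to ${target}: OK"
    else
        status_line fail "Internal connectivity to ${target} failed"; failures=$((failures + 1))
    fi

    [[ $failures -eq 0 ]]
}

# Runs the checks only when given exactly three arguments.
if [[ $# -eq 3 ]]; then
    check_tunnel "$@"
fi
```

Example (the script name is hypothetical): `./tunnel-check-sketch.sh ml110 ml110-01.d-bis.org https://192.168.11.10:8006`. `getent` is used rather than `dig` so the DNS check goes through the same resolver configuration as other programs on the host.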
## Monitoring Script

The `monitor-tunnels.sh` script monitors the tunnels continuously.

### Features

- ✅ Continuous health checks
- ✅ Automatic restart on failure
- ✅ Alerting on failures
- ✅ Logging to file
- ✅ Daemon mode support

### Usage

```bash
# Foreground mode (see output)
./scripts/monitor-tunnels.sh

# Daemon mode (background)
./scripts/monitor-tunnels.sh --daemon

# Check whether the daemon is running (pgrep, unlike ps | grep, does not match itself)
pgrep -af monitor-tunnels.sh

# Stop the daemon
kill "$(cat /tmp/cloudflared-monitor.pid)"
```
### Configuration

Edit these variables in the script to customize it:

```bash
CHECK_INTERVAL=60                                  # Seconds between checks
LOG_FILE="/var/log/cloudflared-monitor.log"        # Where the monitor writes its log
ALERT_SCRIPT="./scripts/alert-tunnel-failure.sh"   # Relative to the directory the monitor is started from
```
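The restart-and-alert behavior listed under Features could look roughly like this. It is a hypothetical sketch, not the actual `monitor-tunnels.sh`: the tunnel names, the single restart attempt, the 5-second wait, and the `run` argument are assumptions. The alert script is called with the `<tunnel> <event>` arguments used elsewhere in this guide.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of a monitor loop (not the real monitor-tunnels.sh).
CHECK_INTERVAL="${CHECK_INTERVAL:-60}"
LOG_FILE="${LOG_FILE:-/var/log/cloudflared-monitor.log}"
ALERT_SCRIPT="${ALERT_SCRIPT:-./scripts/alert-tunnel-failure.sh}"
TUNNELS=(ml110 r630-01 r630-02)   # assumed tunnel names

# Append a timestamped line to the log file.
log() {
    printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$*" >> "$LOG_FILE"
}

# Check one tunnel; if it is down, restart it once and alert if it stays down.
# Returns 0 when the tunnel is (or becomes) active, 1 otherwise.
ensure_tunnel() {
    local unit="cloudflared-$1"
    if systemctl is-active --quiet "$unit"; then
        return 0
    fi

    log "$unit is down, restarting"
    systemctl restart "$unit"
    sleep 5   # give cloudflared time to come up
    if systemctl is-active --quiet "$unit"; then
        log "$unit recovered after restart"
        return 0
    fi

    log "$unit still down after restart, sending alert"
    "$ALERT_SCRIPT" "$1" service_down
    return 1
}

# Check every tunnel, then wait CHECK_INTERVAL seconds, forever.
monitor_loop() {
    local tunnel
    while true; do
        for tunnel in "${TUNNELS[@]}"; do
            ensure_tunnel "$tunnel" || true
        done
        sleep "$CHECK_INTERVAL"
    done
}

# Hypothetical entry point: record the PID, then loop.
if [[ "${1:-}" == "run" ]]; then
    echo $$ > /tmp/cloudflared-monitor.pid
    monitor_loop
fi
```

Run it with `sudo ./monitor-sketch.sh run` (hypothetical file name). Restarting once before alerting keeps a transient crash from paging anyone, while a unit that will not start still produces an alert.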
## Alerting

### Email Alerts

Configure email alerts for `alert-tunnel-failure.sh`. `export` only affects the current shell and the processes it starts, so set the variable where the monitor or alert script is launched:

```bash
# Set the recipient address
export ALERT_EMAIL="admin@yourdomain.com"

# Install a `mail` command (mailutils on Debian/Ubuntu);
# delivery also requires a working mail transfer agent on the host
sudo apt-get install -y mailutils
```
### Webhook Alerts

Configure webhook alerts (Slack, Discord, etc.):

```bash
# Set the webhook URL
export ALERT_WEBHOOK="https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
```
### Test Alerts

```bash
# Trigger a test alert for the ml110 tunnel
./scripts/alert-tunnel-failure.sh ml110 service_down
```
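As a minimal sketch (not the actual `alert-tunnel-failure.sh`), an alert script taking the `<tunnel> <event>` arguments shown above could send to both channels whenever the corresponding variable is set. The message format and the helper names are assumptions. The `{"text": ...}` payload is the format Slack incoming webhooks expect; Discord webhooks use a `content` field instead, so `build_payload` would need adapting for Discord.

```bash
#!/usr/bin/env bash
# Hypothetical alert script sketch; reads ALERT_EMAIL and ALERT_WEBHOOK.

# Escape backslashes and double quotes so the text is a valid JSON string.
json_escape() {
    printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g'
}

# Build a Slack-style webhook payload: {"text":"..."}.
build_payload() {
    printf '{"text":"%s"}' "$(json_escape "$1")"
}

# Send one alert through every configured channel.
send_alert() {
    local tunnel="$1" event="$2" message
    message="Cloudflare tunnel ${tunnel}: ${event} on $(hostname) at $(date -u '+%Y-%m-%dT%H:%M:%SZ')"

    if [[ -n "${ALERT_EMAIL:-}" ]]; then
        printf '%s\n' "$message" | mail -s "Tunnel alert: ${tunnel} ${event}" "$ALERT_EMAIL"
    fi

    if [[ -n "${ALERT_WEBHOOK:-}" ]]; then
        curl -fsS --max-time 10 -X POST -H 'Content-Type: application/json' \
            -d "$(build_payload "$message")" "$ALERT_WEBHOOK"
    fi
}

# Sends an alert only when given <tunnel> <event>.
if [[ $# -eq 2 ]]; then
    send_alert "$1" "$2"
fi
```

Example: `ALERT_WEBHOOK="https://hooks.slack.com/services/YOUR/WEBHOOK/URL" ./alert-sketch.sh ml110 service_down` (hypothetical file name).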
## Log Monitoring

### View Logs

```bash
# All tunnels (quote the pattern so the shell does not expand it)
journalctl -u 'cloudflared-*' -f

# Specific tunnel
journalctl -u cloudflared-ml110 -f

# Last 100 lines
journalctl -u cloudflared-ml110 -n 100

# Since a specific time
journalctl -u cloudflared-ml110 --since "1 hour ago"
```
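To turn the log commands above into a quick per-tunnel error count, a sketch like the following may help. It assumes cloudflared's default log format, in which error lines carry an `ERR` level marker, and units named `cloudflared-<name>`; the helper names are hypothetical.

```bash
#!/usr/bin/env bash
# Count cloudflared error lines per tunnel unit over a time window.

# Print the number of stdin lines containing the " ERR " level marker.
count_errors() {
    grep -c ' ERR ' || true   # grep -c prints 0 but exits 1 when nothing matches
}

# Print "<unit> <error count>" for every cloudflared-* service unit,
# skipping the health check service, which also matches the pattern.
error_summary() {
    local since="$1" unit count
    for unit in $(systemctl list-units --all --type=service --no-legend --plain 'cloudflared-*' \
                  | awk '$1 != "cloudflared-healthcheck.service" { print $1 }'); do
        count=$(journalctl -u "$unit" --since "$since" --no-pager -q | count_errors)
        printf '%-28s %s\n' "$unit" "$count"
    done
}

# Runs only when given a time window.
if [[ $# -eq 1 ]]; then
    error_summary "$1"
fi
```

Example: `./tunnel-errors.sh "1 hour ago"` (hypothetical file name). The match is on the message text rather than journald priority because cloudflared writes its level into the message, so journald may record its error lines at normal priority.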
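### Log Rotation

Logs that cloudflared sends to the systemd journal are managed by journald itself; its size limits (for example `SystemMaxUse=`) are set in `/etc/systemd/journald.conf`, and logrotate does not apply to them. Plain log files, such as the monitor and cron logs used in this guide or files in `/var/log/cloudflared/` if cloudflared is configured to write log files, need a logrotate rule:

```bash
# Create the logrotate config for file-based logs
sudo tee /etc/logrotate.d/cloudflared > /dev/null <<'EOF'
/var/log/cloudflared/*.log /var/log/cloudflared-monitor.log /var/log/tunnel-health.log {
    daily
    rotate 7
    compress
    delaycompress
    missingok
    notifempty
}
EOF

# Dry run: show what logrotate would do without changing anything
sudo logrotate -d /etc/logrotate.d/cloudflared
```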
## Metrics

### Cloudflare Dashboard

View tunnel metrics in the Cloudflare dashboard:

1. **Go to:** Zero Trust → Networks → Tunnels
2. **Click on a tunnel** to view:
   - Connection status
   - Uptime
   - Traffic statistics
   - Error rates
### Local Metrics

Each tunnel can serve Prometheus-format metrics on a local address, but only if a metrics listen address is configured for it (the cloudflared `--metrics` option, or the `metrics:` key in the tunnel's config file). Each tunnel needs its own port; the examples below use 9091-9093:

```bash
# ml110 tunnel metrics
curl http://127.0.0.1:9091/metrics

# r630-01 tunnel metrics
curl http://127.0.0.1:9092/metrics

# r630-02 tunnel metrics
curl http://127.0.0.1:9093/metrics
```
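For a scriptable check, the connection count can be read from the metrics output. This sketch assumes the tunnels export the `cloudflared_tunnel_ha_connections` gauge (active connections to Cloudflare's edge) without labels; metric names can change between cloudflared releases, so confirm against your own `/metrics` output first. The function names are hypothetical.

```bash
#!/usr/bin/env bash
# Report edge connections per tunnel from cloudflared's Prometheus metrics.

# Read Prometheus text on stdin and print the ha_connections value (0 if absent).
ha_connections() {
    awk '$1 == "cloudflared_tunnel_ha_connections" { value = int($2); found = 1 }
         END { print (found ? value : 0) }'
}

# Query one tunnel's metrics port; return 1 when it has no edge connections.
report_tunnel() {
    local name="$1" port="$2" conns
    conns=$(curl -fsS --max-time 5 "http://127.0.0.1:${port}/metrics" | ha_connections)
    if (( conns > 0 )); then
        printf '%s: %d edge connections\n' "$name" "$conns"
    else
        printf '%s: no edge connections (or metrics unreachable)\n' "$name"
        return 1
    fi
}

# Ports follow the examples above.
if [[ "${1:-}" == "report" ]]; then
    report_tunnel ml110 9091
    report_tunnel r630-01 9092
    report_tunnel r630-02 9093
fi
```

Run `./tunnel-connections.sh report` (hypothetical file name). A healthy tunnel normally keeps several connections to Cloudflare's edge, so 0 means it cannot serve traffic even if its systemd service is running.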
## Automated Monitoring Setup

### Systemd Timer (Recommended)

Create a systemd timer and a matching service unit to run the health check automatically. Replace `/path/to/scripts` with the actual location of the scripts.

```bash
# Create the timer unit: first run 5 minutes after boot,
# then 5 minutes after each previous activation
sudo tee /etc/systemd/system/cloudflared-healthcheck.timer > /dev/null <<'EOF'
[Unit]
Description=Cloudflare Tunnel Health Check Timer

[Timer]
OnBootSec=5min
OnUnitActiveSec=5min
Unit=cloudflared-healthcheck.service

[Install]
WantedBy=timers.target
EOF
```

The timer activates this service unit, which runs the script once per activation:

```bash
# Create the service unit
sudo tee /etc/systemd/system/cloudflared-healthcheck.service > /dev/null <<'EOF'
[Unit]
Description=Cloudflare Tunnel Health Check
Wants=network-online.target
After=network-online.target

[Service]
Type=oneshot
ExecStart=/path/to/scripts/check-tunnel-health.sh
StandardOutput=journal
StandardError=journal
EOF
```

Reload systemd so it sees the new units, then enable and start the timer:

```bash
sudo systemctl daemon-reload
sudo systemctl enable --now cloudflared-healthcheck.timer

# Confirm the next scheduled run
systemctl list-timers cloudflared-healthcheck.timer
```
### Cron Job (Alternative)

```bash
# Edit root's crontab (writing to /var/log usually requires root)
sudo crontab -e

# Add (runs every 5 minutes):
*/5 * * * * /path/to/scripts/check-tunnel-health.sh >> /var/log/tunnel-health.log 2>&1
```
## Monitoring Best Practices

1. ✅ **Run health checks regularly** - At least every 5 minutes
2. ✅ **Monitor logs** - Watch for errors
3. ✅ **Set up alerts** - Get notified immediately on failures
4. ✅ **Review metrics** - Track trends over time
5. ✅ **Test alerts** - Verify that alerting works
6. ✅ **Document incidents** - Keep track of issues
## Integration with Monitoring Systems

### Prometheus

If you use Prometheus, you can scrape the tunnel metrics. The `127.0.0.1` targets only work when Prometheus runs on the same host as the tunnels:

```yaml
# prometheus.yml
scrape_configs:
  - job_name: 'cloudflared'
    static_configs:
      - targets: ['127.0.0.1:9091', '127.0.0.1:9092', '127.0.0.1:9093']
```
### Grafana

Create dashboards in Grafana for:

- Tunnel uptime
- Connection status
- Error rates
- Response times
### Nagios/Icinga

Create service checks. `check_cloudflared_ml110` is a custom NRPE command, not a standard plugin, so it must be defined in the NRPE configuration of the monitored host:

```bash
# Check service status via NRPE
check_nrpe -H localhost -c check_cloudflared_ml110

# Check HTTPS connectivity
check_http -H ml110-01.d-bis.org -S
```
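One way to back `check_cloudflared_ml110` is a small plugin that follows the Nagios plugin convention of printing one status line and exiting with 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). The plugin name, its path, and the state-to-status mapping below are assumptions for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical Nagios/NRPE plugin: check_cloudflared.sh <tunnel-name>

# Map a "systemctl is-active" state to a Nagios status line and exit code.
nagios_result() {
    local unit="$1" state="$2"
    case "$state" in
        active)
            echo "OK - ${unit} is active"; return 0 ;;
        activating|reloading)
            echo "WARNING - ${unit} is ${state}"; return 1 ;;
        inactive|failed|deactivating)
            echo "CRITICAL - ${unit} is ${state}"; return 2 ;;
        *)
            echo "UNKNOWN - ${unit} state '${state}'"; return 3 ;;
    esac
}

# Checks the unit only when given a tunnel name.
if [[ $# -eq 1 ]]; then
    unit="cloudflared-$1"
    nagios_result "$unit" "$(systemctl is-active "$unit")"
    exit $?
fi
```

On the monitored host, an NRPE command definition such as `command[check_cloudflared_ml110]=/usr/local/lib/nagios/plugins/check_cloudflared.sh ml110` in `nrpe.cfg` would connect this plugin to the `check_nrpe` call above (the plugin path is an assumption).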
## Troubleshooting Monitoring

### Health Check Fails

```bash
# Run manually with verbose output
bash -x ./scripts/check-tunnel-health.sh

# Check individual components
systemctl status cloudflared-ml110
dig ml110-01.d-bis.org
curl -I https://ml110-01.d-bis.org
```
### Monitor Script Not Working

```bash
# Check whether the daemon is running
pgrep -af monitor-tunnels.sh

# Follow the monitor log
tail -f /var/log/cloudflared-monitor.log

# Run in the foreground to see errors
./scripts/monitor-tunnels.sh
```
### Alerts Not Sending

```bash
# Trigger a test alert
./scripts/alert-tunnel-failure.sh ml110 service_down

# Check the email configuration
echo "Test" | mail -s "Test" admin@yourdomain.com

# Check the webhook (quote the variable; -f makes curl fail on HTTP errors)
curl -fsS -X POST -H "Content-Type: application/json" \
  -d '{"text":"test"}' "$ALERT_WEBHOOK"
```
## Next Steps

After setting up monitoring:

1. ✅ Verify that health checks run successfully
2. ✅ Test alerting (trigger a test failure)
3. ✅ Set up log aggregation (if needed)
4. ✅ Create dashboards (if using Grafana)
5. ✅ Document monitoring procedures
## Support

For monitoring issues:

1. Check the [Troubleshooting Guide](TROUBLESHOOTING.md)
2. Review the script logs
3. Test components individually
4. Check the systemd service status