# Operational Runbook
## Table of Contents
1. [System Overview](#system-overview)
2. [Monitoring & Alerts](#monitoring--alerts)
3. [Common Operations](#common-operations)
4. [Troubleshooting](#troubleshooting)
5. [Disaster Recovery](#disaster-recovery)
6. [Emergency Contacts](#emergency-contacts)
7. [Change Management](#change-management)
## System Overview
### Architecture
- **Application**: Node.js/TypeScript Express server
- **Database**: PostgreSQL 14+
- **Cache/Sessions**: Redis (optional)
- **Metrics**: Prometheus format on `/metrics`
- **Health Check**: `/health` endpoint
### Key Endpoints
- API Base: `/api/v1`
- Terminal UI: `/`
- Health: `/health`
- Metrics: `/metrics`
- API Docs: `/api-docs`
## Monitoring & Alerts
### Key Metrics to Monitor
#### Payment Metrics
- `payments_initiated_total` - Total payments initiated
- `payments_approved_total` - Total payments approved
- `payments_completed_total` - Total payments completed
- `payments_failed_total` - Total payments failed
- `payment_processing_duration_seconds` - Processing latency
#### TLS Metrics
- `tls_connections_active` - Active TLS connections
- `tls_connection_errors_total` - TLS connection errors
- `tls_acks_received_total` - ACKs received
- `tls_nacks_received_total` - NACKs received
#### System Metrics
- `http_request_duration_seconds` - HTTP request latency
- `process_cpu_user_seconds_total` - Cumulative user CPU time, in seconds
- `process_resident_memory_bytes` - Resident memory size, in bytes
### Alert Thresholds
**Critical Alerts:**
- Payment failure rate > 5% over a 5-minute window
- TLS connection errors > 10 within 1 minute
- Database connection pool exhaustion
- Health check failing

**Warning Alerts:**
- Payment processing latency p95 > 30s
- Unmatched reconciliation items > 10
- TLS circuit breaker in OPEN state
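
The payment failure rate threshold can be checked by hand from two scrapes of the payment counters taken five minutes apart. The sketch below is illustrative only: it assumes the rate is defined as failed payments divided by initiated payments over the window (this runbook does not define the denominator), and `failure_rate_pct` and `is_critical` are hypothetical helper names, not part of the application.

```bash
# Hypothetical helpers: compute the payment failure rate over a window from
# counter deltas. Assumes rate = failed / initiated, which this runbook does
# not define explicitly.

# failure_rate_pct <failed_delta> <initiated_delta>
# Prints the failure rate as a percentage with two decimals.
failure_rate_pct() {
  awk -v failed="$1" -v initiated="$2" 'BEGIN {
    if (initiated <= 0) { print "0.00"; exit }   # no traffic in the window
    printf "%.2f\n", (failed / initiated) * 100
  }'
}

# is_critical <failed_delta> <initiated_delta>
# Succeeds (exit 0) when the rate exceeds the 5% critical threshold.
is_critical() {
  awk -v rate="$(failure_rate_pct "$1" "$2")" 'BEGIN { exit !(rate > 5) }'
}
```

For example, if `payments_failed_total` grew by 6 and `payments_initiated_total` grew by 100 between the two scrapes, `failure_rate_pct 6 100` prints `6.00` and `is_critical 6 100` succeeds.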
## Common Operations
### Start System
```bash
# Using npm
npm start
# Using Docker Compose
docker-compose up -d
# Verify health
curl http://localhost:3000/health
```
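The health endpoint may not respond immediately after startup, so a single `curl` can fail even when the system is coming up normally. A minimal retry sketch follows; `wait_until` is a hypothetical helper name, and the attempt count and delay are placeholder values to tune for your environment.

```bash
# Hypothetical helper: retry a probe command until it succeeds.
# wait_until <max_attempts> <delay_seconds> <command...>
wait_until() {
  local attempts="$1" delay="$2"
  shift 2
  local i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0                      # probe succeeded
    fi
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"                # wait before the next attempt
    fi
    i=$((i + 1))
  done
  return 1                          # probe never succeeded
}
```

A possible invocation is `wait_until 30 2 curl -fsS http://localhost:3000/health`, which polls every 2 seconds for up to 30 attempts and returns non-zero if the endpoint never answers successfully.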
### Stop System
```bash
# Graceful shutdown
docker-compose down
# Or send SIGTERM to process
kill -TERM <pid>
```
### Check System Status
```bash
# Health check
curl http://localhost:3000/health
# Metrics
curl http://localhost:3000/metrics
# Database connection
psql "$DATABASE_URL" -c "SELECT 1"
```
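The raw `/metrics` output is verbose, so it can help to pull out a single counter when checking status. The sketch below sums every sample of one metric from Prometheus text format on stdin. `metric_sum` is a hypothetical helper name, and the sketch assumes samples follow the usual `name{labels} value` layout.

```bash
# Hypothetical helper: sum all samples of one metric from Prometheus text
# exposition format read on stdin. Labelled series (name{...}) are included.
# metric_sum <metric_name>
metric_sum() {
  awk -v name="$1" '
    /^#/ { next }                               # skip HELP/TYPE comment lines
    {
      line = $0
      sub(/[{].*[}]/, "", line)                 # drop the label set, if any
      n = split(line, field, " ")
      if (n >= 2 && field[1] == name) { total += field[2]; found = 1 }
    }
    END { if (found) print total + 0; else exit 1 }   # exit 1: metric absent
  '
}
```

For example, `curl -s http://localhost:3000/metrics | metric_sum payments_failed_total` prints the current total across all label combinations, and exits non-zero if the metric is not exposed.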
### View Logs
```bash
# Application logs
tail -f logs/application-*.log
# Docker logs
docker-compose logs -f app
# Audit logs (database)
psql "$DATABASE_URL" -c "SELECT * FROM audit_logs ORDER BY timestamp DESC LIMIT 100"
```
### Run Reconciliation
```bash
# Via API
curl -X GET "http://localhost:3000/api/v1/payments/reconciliation/daily?date=2024-01-01" \
-H "Authorization: Bearer <token>"
# Check aging items
curl -X GET "http://localhost:3000/api/v1/payments/reconciliation/aging?days=1" \
-H "Authorization: Bearer <token>"
```
### Database Operations
```bash
# Run migrations
npm run migrate
# Rollback last migration
npm run migrate:rollback
# Seed operators
npm run seed
# Backup database
pg_dump "$DATABASE_URL" > "backup_$(date +%Y%m%d).sql"
# Restore database
psql "$DATABASE_URL" < backup_20240101.sql
```
## Troubleshooting
### Payment Stuck in Processing
**Symptoms:**
- Payment status is `APPROVED` but not progressing
- No ledger posting or message generation

**Diagnosis:**
```sql
SELECT id, status, created_at, updated_at
FROM payments
WHERE status = 'APPROVED'
AND updated_at < NOW() - INTERVAL '5 minutes';
```
**Resolution:**
1. Check application logs for errors
2. Verify compliance screening status
3. Check ledger adapter connectivity
4. Manually trigger processing if needed
### TLS Connection Issues
**Symptoms:**
- `tls_connection_errors_total` increasing
- Circuit breaker in OPEN state
- Messages not transmitting

**Diagnosis:**
```bash
# Check TLS pool stats
curl http://localhost:3000/metrics | grep tls
# Check receiver connectivity
openssl s_client -connect 172.67.157.88:443 -servername devmindgroup.com
```
**Resolution:**
1. Verify receiver IP/port configuration
2. Check certificate validity
3. Verify network connectivity
4. Review TLS pool logs
5. Reset circuit breaker if needed
### Database Connection Issues
**Symptoms:**
- Health check shows database error
- High connection pool usage
- Query timeouts

**Diagnosis:**
```sql
-- Check active connections
SELECT count(*) FROM pg_stat_activity;
-- Check database-wide statistics (numbackends, commits, rollbacks, deadlocks)
SELECT * FROM pg_stat_database WHERE datname = 'dbis_core';
```
**Resolution:**
1. Check for long-running queries
2. Review connection pool settings
3. Increase connection pool size in config if the pool is consistently exhausted
4. Restart database if needed
### Reconciliation Exceptions
**Symptoms:**
- High number of unmatched payments
- Aging items accumulating

**Resolution:**
1. Review reconciliation report
2. Check exception queue
3. Manually reconcile exceptions
4. Investigate root cause (missing ACK, ledger mismatch, etc.)
## Disaster Recovery
### Backup Procedures
**Daily Backups:**
```bash
# Database backup
pg_dump "$DATABASE_URL" | gzip > "backups/dbis_core_$(date +%Y%m%d).sql.gz"
# Audit logs export (for compliance)
psql "$DATABASE_URL" -c "\COPY audit_logs TO 'audit_logs_$(date +%Y%m%d).csv' CSV HEADER"
```
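A backup is only useful if the file is intact, so it is worth sanity-checking each compressed dump after it is written. The sketch below is one possible check, not part of the existing tooling: `verify_backup` is a hypothetical helper name, and it only confirms that the file is non-empty and that the gzip stream is readable. It does not prove that the dump restores cleanly.

```bash
# Hypothetical helper: sanity-check a compressed dump before trusting it.
# verify_backup <path-to-.sql.gz>
verify_backup() {
  local file="$1"
  # -s is true only when the file exists and has a size greater than zero
  [ -s "$file" ] || { echo "missing or empty: $file" >&2; return 1; }
  # gzip -t tests the compressed stream without writing any output
  gzip -t "$file" 2>/dev/null || { echo "corrupt gzip: $file" >&2; return 1; }
  return 0
}
```

For example, `verify_backup "backups/dbis_core_$(date +%Y%m%d).sql.gz"` can run right after the daily dump. A periodic test restore into a scratch database remains the stronger check.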
### Recovery Procedures
**Database Recovery:**
```bash
# Stop application
docker-compose stop app
# Restore database
gunzip < backups/dbis_core_20240101.sql.gz | psql "$DATABASE_URL"
# Run migrations
npm run migrate
# Restart application
docker-compose start app
```
### Data Retention
- **Audit Logs**: 7-10 years (configurable)
- **Payment Records**: Indefinite (archived after 7 years)
- **Application Logs**: 30 days
### Failover Procedures
1. **Application Failover:**
   - Deploy to secondary server
   - Update load balancer
   - Verify health checks
2. **Database Failover:**
   - Promote replica to primary
   - Update `DATABASE_URL`
   - Restart application
## Emergency Contacts
- **System Administrator**: [Contact]
- **Database Administrator**: [Contact]
- **Security Team**: [Contact]
- **On-Call Engineer**: [Contact]
## Change Management
All changes to production must:
1. Be tested in the staging environment
2. Have a documented rollback plan
3. Be approved by the technical lead
4. Be performed during a maintenance window
5. Be monitored post-deployment