Initial commit: add .gitignore and README
284
docs/operations/runbook.md
Normal file
@@ -0,0 +1,284 @@
# Operational Runbook

## Table of Contents
1. [System Overview](#system-overview)
2. [Monitoring & Alerts](#monitoring--alerts)
3. [Common Operations](#common-operations)
4. [Troubleshooting](#troubleshooting)
5. [Disaster Recovery](#disaster-recovery)
6. [Emergency Contacts](#emergency-contacts)
7. [Change Management](#change-management)

## System Overview

### Architecture
- **Application**: Node.js/TypeScript Express server
- **Database**: PostgreSQL 14+
- **Cache/Sessions**: Redis (optional)
- **Metrics**: Prometheus format on `/metrics`
- **Health Check**: `/health` endpoint

### Key Endpoints
- API Base: `/api/v1`
- Terminal UI: `/`
- Health: `/health`
- Metrics: `/metrics`
- API Docs: `/api-docs`

## Monitoring & Alerts

### Key Metrics to Monitor

#### Payment Metrics
- `payments_initiated_total` - Total payments initiated
- `payments_approved_total` - Total payments approved
- `payments_completed_total` - Total payments completed
- `payments_failed_total` - Total payments failed
- `payment_processing_duration_seconds` - Payment processing latency

#### TLS Metrics
- `tls_connections_active` - Active TLS connections
- `tls_connection_errors_total` - Total TLS connection errors
- `tls_acks_received_total` - Total ACKs received
- `tls_nacks_received_total` - Total NACKs received

#### System Metrics
- `http_request_duration_seconds` - HTTP request latency
- `process_cpu_user_seconds_total` - Cumulative user CPU time, in seconds
- `process_resident_memory_bytes` - Resident memory size, in bytes

### Alert Thresholds

**Critical Alerts:**
- Payment failure rate > 5% over a 5-minute window
- TLS connection errors > 10 within 1 minute
- Database connection pool exhausted
- Health check failing

**Warning Alerts:**
- Payment processing latency p95 > 30s
- Unmatched reconciliation items > 10
- TLS circuit breaker in OPEN state
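
The payment failure-rate threshold above can be sketched as a small shell helper. This is a minimal illustration rather than the alerting implementation: it assumes the rate is the increase in `payments_failed_total` divided by the increase in `payments_initiated_total` over the same 5-minute window, and the function names `failure_rate_pct` and `failure_rate_status` are hypothetical.

```bash
# Hypothetical helper: failure rate (percent) from two counter increases
# measured over the same window.
#   $1 = increase in payments_failed_total
#   $2 = increase in payments_initiated_total (assumed denominator)
failure_rate_pct() {
  local failed_delta="$1" initiated_delta="$2"
  awk -v f="$failed_delta" -v t="$initiated_delta" 'BEGIN {
    # No traffic in the window: report 0 instead of dividing by zero.
    if (t + 0 <= 0) { print "0.00"; exit }
    printf "%.2f\n", (f / t) * 100
  }'
}

# Hypothetical helper: classify the rate against the 5% critical threshold.
failure_rate_status() {
  local pct
  pct="$(failure_rate_pct "$1" "$2")"
  awk -v p="$pct" 'BEGIN { if (p + 0 > 5) print "CRITICAL"; else print "OK" }'
}
```

For example, `failure_rate_status 6 100` prints `CRITICAL`, because 6 failures out of 100 initiated payments is 6%. In a real deployment this comparison would more likely live in the monitoring system's alert rules than in a script.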

## Common Operations

### Start System

```bash
# Using npm
npm start

# Using Docker Compose
docker-compose up -d

# Verify health
curl http://localhost:3000/health
```
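
Immediately after startup the health endpoint may not respond yet, so a single `curl` can fail even though the system is coming up normally. A minimal sketch of a retry wrapper follows; `retry_until_ok` is a hypothetical helper name, and the attempt count and delay are arbitrary assumptions to tune per environment.

```bash
# Hypothetical helper: run a command until it succeeds or attempts run out.
# Usage: retry_until_ok <max_attempts> <delay_seconds> <command> [args...]
retry_until_ok() {
  local max_attempts="$1" delay="$2"
  shift 2
  local attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    # Run the command exactly as given; stop at the first success.
    if "$@"; then
      return 0
    fi
    # Only sleep when another attempt will follow.
    if [ "$attempt" -lt "$max_attempts" ]; then
      sleep "$delay"
    fi
    attempt=$((attempt + 1))
  done
  return 1
}
```

A possible invocation is `retry_until_ok 30 2 curl -fsS -o /dev/null http://localhost:3000/health`, which polls every 2 seconds for up to 30 attempts. The `-f` flag makes `curl` exit non-zero on HTTP error responses, so the loop keeps retrying until the endpoint returns a success status.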

### Stop System

```bash
# Graceful shutdown
docker-compose down

# Or send SIGTERM to process
kill -TERM <pid>
```

### Check System Status

```bash
# Health check
curl http://localhost:3000/health

# Metrics
curl http://localhost:3000/metrics

# Database connection
psql "$DATABASE_URL" -c "SELECT 1"
```

### View Logs

```bash
# Application logs
tail -f logs/application-*.log

# Docker logs
docker-compose logs -f app

# Audit logs (database)
psql "$DATABASE_URL" -c "SELECT * FROM audit_logs ORDER BY timestamp DESC LIMIT 100"
```

### Run Reconciliation

```bash
# Via API
curl -X GET "http://localhost:3000/api/v1/payments/reconciliation/daily?date=2024-01-01" \
  -H "Authorization: Bearer <token>"

# Check aging items
curl -X GET "http://localhost:3000/api/v1/payments/reconciliation/aging?days=1" \
  -H "Authorization: Bearer <token>"
```

### Database Operations

```bash
# Run migrations
npm run migrate

# Rollback last migration
npm run migrate:rollback

# Seed operators
npm run seed

# Backup database
pg_dump "$DATABASE_URL" > "backup_$(date +%Y%m%d).sql"

# Restore database
psql "$DATABASE_URL" < backup_20240101.sql
```

## Troubleshooting

### Payment Stuck in Processing

**Symptoms:**
- Payment status is `APPROVED` but not progressing
- No ledger posting or message generation

**Diagnosis:**
```sql
SELECT id, status, created_at, updated_at
FROM payments
WHERE status = 'APPROVED'
  AND updated_at < NOW() - INTERVAL '5 minutes';
```

**Resolution:**
1. Check application logs for errors
2. Verify compliance screening status
3. Check ledger adapter connectivity
4. Manually trigger processing if needed

### TLS Connection Issues

**Symptoms:**
- `tls_connection_errors_total` increasing
- Circuit breaker in OPEN state
- Messages not transmitting

**Diagnosis:**
```bash
# Check TLS pool stats (-s hides curl's progress meter when piping)
curl -s http://localhost:3000/metrics | grep tls

# Check receiver connectivity
openssl s_client -connect 172.67.157.88:443 -servername devmindgroup.com
```

**Resolution:**
1. Verify receiver IP/port configuration
2. Check certificate validity
3. Verify network connectivity
4. Review TLS pool logs
5. Reset circuit breaker if needed

### Database Connection Issues

**Symptoms:**
- Health check shows database error
- High connection pool usage
- Query timeouts

**Diagnosis:**
```sql
-- Check active connections
SELECT count(*) FROM pg_stat_activity;

-- Check database-level statistics (e.g. numbackends, deadlocks)
SELECT * FROM pg_stat_database WHERE datname = 'dbis_core';
```

**Resolution:**
1. Increase connection pool size in config
2. Check for long-running queries
3. Restart database if needed
4. Review connection pool settings
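
The "check for long-running queries" step can be sketched as a shell function that emits a query against `pg_stat_activity`. This is an illustrative sketch only: `long_running_sql` is a hypothetical helper name, and the 5-minute default is an assumption that mirrors the stuck-payment diagnosis window rather than a documented threshold.

```bash
# Hypothetical helper: print SQL listing non-idle sessions whose current
# query started more than N minutes ago (default 5, an assumed value).
long_running_sql() {
  local minutes="${1:-5}"
  # Accept digits only, since the value is interpolated into the SQL text.
  case "$minutes" in
    ''|*[!0-9]*)
      echo "minutes must be a non-negative integer" >&2
      return 1
      ;;
  esac
  cat <<SQL
SELECT pid, state, now() - query_start AS runtime, left(query, 80) AS query
FROM pg_stat_activity
WHERE state <> 'idle'
  AND query_start < now() - interval '${minutes} minutes'
ORDER BY runtime DESC;
SQL
}
```

One way to run it is `long_running_sql 10 | psql "$DATABASE_URL"`. Keeping the SQL generation separate from the `psql` call makes it easy to review the query before executing it against production.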

### Reconciliation Exceptions

**Symptoms:**
- High number of unmatched payments
- Aging items accumulating

**Resolution:**
1. Review reconciliation report
2. Check exception queue
3. Manually reconcile exceptions
4. Investigate root cause (missing ACK, ledger mismatch, etc.)

## Disaster Recovery

### Backup Procedures

**Daily Backups:**
```bash
# Database backup
pg_dump "$DATABASE_URL" | gzip > "backups/dbis_core_$(date +%Y%m%d).sql.gz"

# Audit logs export (for compliance); psql meta-commands are case-sensitive, so use lowercase \copy
psql "$DATABASE_URL" -c "\copy audit_logs TO 'audit_logs_$(date +%Y%m%d).csv' CSV HEADER"
```
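
A backup is only useful if it can be read back, so it may be worth checking each archive right after it is written. The following is a minimal sketch under stated assumptions: `verify_backup` is a hypothetical helper name, and it only checks that the archive is a readable, non-hollow gzip file. It does not prove that the SQL inside restores cleanly.

```bash
# Hypothetical helper: basic sanity checks on a gzip-compressed dump.
verify_backup() {
  local file="$1" bytes
  # The file must exist and be non-empty.
  if [ ! -s "$file" ]; then
    echo "backup missing or empty: $file" >&2
    return 1
  fi
  # gzip -t tests the compressed stream's integrity without writing output.
  if ! gzip -t "$file" 2>/dev/null; then
    echo "backup is not a valid gzip archive: $file" >&2
    return 1
  fi
  # If pg_dump fails while piped into gzip, the result can still be a valid
  # archive with no content, so also require some decompressed bytes.
  bytes="$(gzip -dc "$file" | wc -c | tr -d ' ')"
  if [ "$bytes" -eq 0 ]; then
    echo "backup decompresses to zero bytes: $file" >&2
    return 1
  fi
}
```

It could be called as `verify_backup "backups/dbis_core_$(date +%Y%m%d).sql.gz"` directly after the dump. Running the dump under `set -o pipefail` is a complementary safeguard, since it makes a `pg_dump` failure surface as a non-zero exit status of the pipeline.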

### Recovery Procedures

**Database Recovery:**
```bash
# Stop application
docker-compose stop app

# Restore database
gunzip < backups/dbis_core_20240101.sql.gz | psql "$DATABASE_URL"

# Run migrations
npm run migrate

# Restart application
docker-compose start app
```

### Data Retention

- **Audit Logs**: 7-10 years (configurable)
- **Payment Records**: Indefinite (archived after 7 years)
- **Application Logs**: 30 days

### Failover Procedures

1. **Application Failover:**
   - Deploy to secondary server
   - Update load balancer
   - Verify health checks

2. **Database Failover:**
   - Promote replica to primary
   - Update DATABASE_URL
   - Restart application

## Emergency Contacts

- **System Administrator**: [Contact]
- **Database Administrator**: [Contact]
- **Security Team**: [Contact]
- **On-Call Engineer**: [Contact]

## Change Management

All changes to production must:
1. Be tested in a staging environment
2. Have a documented rollback plan
3. Be approved by the technical lead
4. Be performed during a maintenance window
5. Be monitored post-deployment