Operational Runbook
System Overview
Architecture
- Application: Node.js/TypeScript Express server
- Database: PostgreSQL 14+
- Cache/Sessions: Redis (optional)
- Metrics: Prometheus format on /metrics
- Health Check: /health endpoint
Key Endpoints
- API Base: /api/v1
- Terminal UI: /
- Health: /health
- Metrics: /metrics
- API Docs: /api-docs
Monitoring & Alerts
Key Metrics to Monitor
Payment Metrics
- payments_initiated_total - Total payments initiated
- payments_approved_total - Total payments approved
- payments_completed_total - Total payments completed
- payments_failed_total - Total payments failed
- payment_processing_duration_seconds - Processing latency
TLS Metrics
- tls_connections_active - Active TLS connections
- tls_connection_errors_total - TLS connection errors
- tls_acks_received_total - ACKs received
- tls_nacks_received_total - NACKs received
System Metrics
- http_request_duration_seconds - HTTP request latency
- process_cpu_user_seconds_total - CPU usage
- process_resident_memory_bytes - Memory usage
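For ad hoc checks outside Prometheus, the counters above can be read straight from the /metrics text output. The following is a minimal sketch, not part of the application: sumMetric is a hypothetical helper, and whether this service attaches labels to its counters is an assumption, so the helper simply sums every sample that carries the given metric name.

```typescript
// Sum all samples of one metric in Prometheus text exposition output.
// Comment lines (# HELP, # TYPE) and blank lines are skipped.
// Hypothetical helper for illustration; not part of the application.
function sumMetric(metricsText: string, name: string): number {
  let total = 0;
  for (const line of metricsText.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "" || trimmed.startsWith("#")) continue;
    // Sample line shape: name{optional labels} value [optional timestamp]
    const match = /^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{.*\})?\s+(\S+)/.exec(trimmed);
    if (match && match[1] === name) {
      const value = Number(match[3]);
      // Non-numeric values (for example NaN) are ignored in this sketch.
      if (!Number.isNaN(value)) total += value;
    }
  }
  return total;
}
```

A metric that is absent from the output sums to 0, so a caller cannot tell "no failures" from "metric missing"; a production script would want to distinguish the two.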
Alert Thresholds
Critical Alerts:
- Payment failure rate > 5% in 5 minutes
- TLS connection errors > 10 in 1 minute
- Database connection pool exhaustion
- Health check failing
Warning Alerts:
- Payment processing latency p95 > 30s
- Unmatched reconciliation items > 10
- TLS circuit breaker OPEN state
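The two rate-based critical thresholds can be expressed as simple checks over counter values taken at the start and end of the window. This is a sketch under stated assumptions: the failure rate is taken to be failed payments divided by initiated payments within the window (the runbook does not define the denominator), and all function and type names are hypothetical.

```typescript
// Increase of a monotonically increasing counter between two readings.
// If the value went backwards, the process is assumed to have restarted,
// so the current value is taken as the increase since the restart.
function counterIncrease(previous: number, current: number): number {
  return current >= previous ? current - previous : current;
}

// Hypothetical shape: readings of payments_initiated_total and
// payments_failed_total taken at one point in time.
interface PaymentSample {
  initiated: number;
  failed: number;
}

// Fraction of payments initiated in the window that failed.
// Returns 0 when nothing was initiated, to avoid dividing by zero.
function paymentFailureRate(previous: PaymentSample, current: PaymentSample): number {
  const initiated = counterIncrease(previous.initiated, current.initiated);
  const failed = counterIncrease(previous.failed, current.failed);
  return initiated > 0 ? failed / initiated : 0;
}

// Critical: payment failure rate > 5% (readings taken 5 minutes apart).
function isPaymentFailureCritical(previous: PaymentSample, current: PaymentSample): boolean {
  return paymentFailureRate(previous, current) > 0.05;
}

// Critical: more than 10 TLS connection errors (readings taken 1 minute apart).
function isTlsErrorCritical(previousErrors: number, currentErrors: number): boolean {
  return counterIncrease(previousErrors, currentErrors) > 10;
}
```

In a Prometheus deployment these conditions would normally live in alerting rules rather than application code; the sketch only makes the arithmetic behind the thresholds explicit.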
Common Operations
Start System
# Using npm
npm start
# Using Docker Compose
docker-compose up -d
# Verify health
curl http://localhost:3000/health
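Right after startup the server may not yet accept connections, so a single health request can fail even though the system is coming up normally. A retry loop along these lines can be used in deployment scripts. It is a sketch only: waitForHealthy and HealthProbe are hypothetical names, and treating any 2xx response from /health as healthy is an assumption.

```typescript
// A probe resolves to true when the service reports healthy.
type HealthProbe = () => Promise<boolean>;

// Call the probe up to `attempts` times, waiting `delayMs` between tries.
// Resolves to true on the first healthy result, false if every try fails.
async function waitForHealthy(
  probe: HealthProbe,
  attempts: number = 10,
  delayMs: number = 1000,
): Promise<boolean> {
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      if (await probe()) return true;
    } catch {
      // A refused connection while the server is still starting
      // is treated as "not ready yet".
    }
    if (attempt < attempts) {
      await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
    }
  }
  return false;
}

// Example probe, assuming Node.js 18+ where fetch is available globally:
//   const probe = async () => (await fetch("http://localhost:3000/health")).ok;
//   waitForHealthy(probe).then((ok) => console.log(ok ? "healthy" : "unhealthy"));
```

Passing the probe in as a function keeps the retry logic independent of the HTTP client and easy to exercise without a running server.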
Stop System
# Graceful shutdown
docker-compose down
# Or send SIGTERM to process
kill -TERM <pid>
Check System Status
# Health check
curl http://localhost:3000/health
# Metrics
curl http://localhost:3000/metrics
# Database connection
psql $DATABASE_URL -c "SELECT 1"
View Logs
# Application logs
tail -f logs/application-*.log
# Docker logs
docker-compose logs -f app
# Audit logs (database)
psql $DATABASE_URL -c "SELECT * FROM audit_logs ORDER BY timestamp DESC LIMIT 100"
Run Reconciliation
# Via API
curl -X GET "http://localhost:3000/api/v1/payments/reconciliation/daily?date=2024-01-01" \
-H "Authorization: Bearer <token>"
# Check aging items
curl -X GET "http://localhost:3000/api/v1/payments/reconciliation/aging?days=1" \
-H "Authorization: Bearer <token>"
Database Operations
# Run migrations
npm run migrate
# Rollback last migration
npm run migrate:rollback
# Seed operators
npm run seed
# Backup database
pg_dump $DATABASE_URL > backup_$(date +%Y%m%d).sql
# Restore database
psql $DATABASE_URL < backup_20240101.sql
Troubleshooting
Payment Stuck in Processing
Symptoms:
- Payment status is APPROVED but not progressing
- No ledger posting or message generation
Diagnosis:
SELECT id, status, created_at, updated_at
FROM payments
WHERE status = 'APPROVED'
AND updated_at < NOW() - INTERVAL '5 minutes';
Resolution:
- Check application logs for errors
- Verify compliance screening status
- Check ledger adapter connectivity
- Manually trigger processing if needed
TLS Connection Issues
Symptoms:
- tls_connection_errors_total increasing
- Circuit breaker in OPEN state
- Messages not transmitting
Diagnosis:
# Check TLS pool stats
curl http://localhost:3000/metrics | grep tls
# Check receiver connectivity
openssl s_client -connect 172.67.157.88:443 -servername devmindgroup.com
Resolution:
- Verify receiver IP/port configuration
- Check certificate validity
- Verify network connectivity
- Review TLS pool logs
- Reset circuit breaker if needed
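For readers unfamiliar with the pattern, the sketch below shows what an OPEN circuit breaker and a manual reset typically mean. It is a generic illustration, not this system's implementation: the class, its method names, the failure threshold, and the cooldown are all assumptions, since the runbook does not describe how the TLS breaker is built or how it is reset.

```typescript
type BreakerState = "CLOSED" | "OPEN" | "HALF_OPEN";

// Generic circuit breaker sketch (hypothetical; not the application's code).
class CircuitBreaker {
  private state: BreakerState = "CLOSED";
  private failures = 0;
  private openedAt = 0;
  private readonly failureThreshold: number;
  private readonly cooldownMs: number;
  private readonly now: () => number;

  constructor(
    failureThreshold: number = 5,
    cooldownMs: number = 30000,
    now: () => number = () => Date.now(),
  ) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  // An OPEN breaker moves to HALF_OPEN once the cooldown has elapsed,
  // which allows a trial request through.
  getState(): BreakerState {
    if (this.state === "OPEN" && this.now() - this.openedAt >= this.cooldownMs) {
      this.state = "HALF_OPEN";
    }
    return this.state;
  }

  // Requests are rejected only while the breaker is OPEN.
  canSend(): boolean {
    return this.getState() !== "OPEN";
  }

  recordSuccess(): void {
    this.failures = 0;
    this.state = "CLOSED";
  }

  // A failed trial in HALF_OPEN, or too many failures, opens the breaker.
  recordFailure(): void {
    this.failures += 1;
    if (this.getState() === "HALF_OPEN" || this.failures >= this.failureThreshold) {
      this.state = "OPEN";
      this.openedAt = this.now();
    }
  }

  // Manual reset: close the breaker immediately and clear the failure count.
  reset(): void {
    this.failures = 0;
    this.state = "CLOSED";
  }
}
```

A manual reset only helps once the underlying cause (configuration, certificate, or network) has been fixed; otherwise the breaker reopens after the next run of failures.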
Database Connection Issues
Symptoms:
- Health check shows database error
- High connection pool usage
- Query timeouts
Diagnosis:
-- Check active connections
SELECT count(*) FROM pg_stat_activity;
-- Check database-level statistics (numbackends is the current connection count)
SELECT * FROM pg_stat_database WHERE datname = 'dbis_core';
Resolution:
- Increase connection pool size in config
- Check for long-running queries
- Restart database if needed
- Review connection pool settings
Reconciliation Exceptions
Symptoms:
- High number of unmatched payments
- Aging items accumulating
Resolution:
- Review reconciliation report
- Check exception queue
- Manually reconcile exceptions
- Investigate root cause (missing ACK, ledger mismatch, etc.)
Disaster Recovery
Backup Procedures
Daily Backups:
# Database backup
pg_dump $DATABASE_URL | gzip > backups/dbis_core_$(date +%Y%m%d).sql.gz
# Audit logs export (for compliance)
psql $DATABASE_URL -c "\COPY audit_logs TO 'audit_logs_$(date +%Y%m%d).csv' CSV HEADER"
Recovery Procedures
Database Recovery:
# Stop application
docker-compose stop app
# Restore database
gunzip < backups/dbis_core_20240101.sql.gz | psql $DATABASE_URL
# Run migrations
npm run migrate
# Restart application
docker-compose start app
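The restore command above names one specific backup file. When scripting a recovery, the most recent backup can be selected by the date embedded in the file name. This is a minimal sketch that assumes backups follow the dbis_core_YYYYMMDD.sql.gz naming produced by the daily backup command; latestBackup is a hypothetical helper.

```typescript
// Return the newest backup file name from a directory listing, or undefined
// if none match. Names are expected to look like dbis_core_YYYYMMDD.sql.gz;
// anything else is ignored.
function latestBackup(fileNames: string[]): string | undefined {
  const pattern = /^dbis_core_(\d{8})\.sql\.gz$/;
  let best: string | undefined;
  let bestStamp = "";
  for (const name of fileNames) {
    const stamp = pattern.exec(name)?.[1];
    // YYYYMMDD stamps sort chronologically when compared as strings.
    if (stamp !== undefined && stamp > bestStamp) {
      bestStamp = stamp;
      best = name;
    }
  }
  return best;
}
```

The listing itself could come from fs.readdirSync("backups") in Node.js. The helper does not verify that the chosen file is complete or uncorrupted, so it is worth checking the archive before restoring over a live database.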
Data Retention
- Audit Logs: 7-10 years (configurable)
- Payment Records: Indefinite (archived after 7 years)
- Application Logs: 30 days
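As an illustration of the 30-day application log policy, a cleanup job could decide per file whether it has aged out. The sketch assumes age is measured from the file's last modification time, which the runbook does not specify; isPastRetention is a hypothetical helper.

```typescript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// True when a file last modified at `modifiedAt` is older than the
// retention period as of `now`. Default matches the 30-day log policy.
function isPastRetention(modifiedAt: Date, now: Date, retentionDays: number = 30): boolean {
  const ageMs = now.getTime() - modifiedAt.getTime();
  return ageMs > retentionDays * MS_PER_DAY;
}
```

The same check could cover other retention periods by passing a different retentionDays value, but audit logs and payment records are subject to compliance requirements and should not be pruned by an ad hoc script.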
Failover Procedures
- Application Failover:
  - Deploy to secondary server
  - Update load balancer
  - Verify health checks
- Database Failover:
  - Promote replica to primary
  - Update DATABASE_URL
  - Restart application
Emergency Contacts
- System Administrator: [Contact]
- Database Administrator: [Contact]
- Security Team: [Contact]
- On-Call Engineer: [Contact]
Change Management
All changes to production must:
- Be tested in staging environment
- Have rollback plan documented
- Be approved by technical lead
- Be performed during maintenance window
- Be monitored post-deployment