# Ingress Architecture Risks and Hardening

**Last Updated:** 2026-01-31
**Document Version:** 1.0
**Status:** Active Documentation

---

**Date**: 2026-01-20
**Status**: Complete Risk Assessment
**Purpose**: Identify risks and hardening opportunities for ingress architecture

---

## Overview

This document identifies risks and hardening opportunities for the ingress architecture:

**Cloudflare DNS → UDM Pro port-forward → NPMplus (reverse proxy + SSL termination) → Backend VMs/services (nginx or direct ports)**

**Scope**: Identifies risks and provides hardening recommendations **without breaking production**.

---

## Identified Risks

### Risk 1: Single Point of Failure - NPMplus

**Severity**: High
**Component**: NPMplus (VMID 10233)
**Status**: Current

**Description**:
- NPMplus is a single reverse proxy container
- All ingress traffic depends on one container
- If NPMplus fails, all public-facing services become unavailable

**Impact**:
- Complete ingress outage if NPMplus container fails
- No redundancy or failover
- Single container failure affects all 19 domains

**Mitigation (Current)**:
- Container is monitored and backed up
- Configuration is documented and can be restored
- Container is running on stable Proxmox host (r630-01)

**Hardening Opportunities**:
- ✅ **HA Setup Guide Created**: Complete guide available at `docs/04-configuration/NPMPLUS_HA_SETUP_GUIDE.md`
- Deploy HA NPMplus instance (active-passive with Keepalived)
- Set up automatic failover (Keepalived virtual IP)
- Document manual failover procedures (done in backup/restore guide)

**Recommendation**:
- Review and implement HA setup guide during next maintenance window
- Set up container health monitoring
- Regular backups (done in backup/restore guide)

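
The container health monitoring recommended above could start as a small check run from cron on the Proxmox host. The sketch below assumes NPMplus runs as an LXC container managed with `pct` (this document only says "container", so that is an assumption) and that `pct status` prints a line of the form `status: running`. The alert action is a placeholder.

```bash
#!/usr/bin/env bash
# Sketch: alert when the NPMplus container is not running.
# Assumption: VMID 10233 is an LXC container, so `pct` applies (use `qm` for a VM).

# Reads `pct status` output on stdin; succeeds only for "status: running".
container_running() {
  grep -q '^status: running$'
}

# Prints an OK or ALERT line; returns non-zero when the container is down.
check_npmplus() {
  local vmid="${1:-10233}"
  if pct status "$vmid" 2>/dev/null | container_running; then
    echo "OK: container ${vmid} is running"
  else
    echo "ALERT: container ${vmid} is not running"  # placeholder: hook mail/Slack here
    return 1
  fi
}
```

Running `check_npmplus` on r630-01 every minute would cover the container state. A running container does not prove that nginx inside it is serving traffic, so an HTTP probe of one public domain remains a useful second check.
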

**HA Implementation**: See `docs/04-configuration/NPMPLUS_HA_SETUP_GUIDE.md` for complete step-by-step instructions.

---

### Risk 2: DNS-Only Mode (No Cloudflare Proxy/WAF)

**Severity**: Medium
**Component**: Cloudflare DNS
**Status**: Intentional Configuration

**Description**:
- All DNS records use "DNS Only" mode (gray cloud)
- No Cloudflare proxy, WAF, or DDoS protection
- Origin IPs (76.53.10.36) exposed directly

**Impact**:
- No DDoS protection from Cloudflare
- No WAF rules for application-layer attacks
- Origin IPs visible to attackers
- No CDN caching

**Rationale** (Intentional):
- Direct SSL termination at NPMplus required
- Cloudflare proxy would interfere with Let's Encrypt validation
- Allows direct control over SSL certificates

**Hardening Opportunities** (without breaking production):

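1. **Enable Cloudflare Access for Admin Portals**:
   - Add authentication layer for `dbis-admin.d-bis.org`
   - Add authentication layer for `secure.d-bis.org`
   - Requires these two hostnames to be served through Cloudflare (proxied record or Cloudflare Tunnel), because Access can only enforce policies on traffic that passes through Cloudflare; all other records can remain DNS-only
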
2. **Implement Rate Limiting at NPMplus**:
   - Add rate limiting for RPC endpoints (especially public RPC)
   - Configure rate limiting per IP or per domain
   - Does not require changing DNS configuration

3. **Monitor and Alert on Unusual Traffic**:
   - Set up log aggregation for NPMplus access logs
   - Configure alerts for unusual traffic patterns
   - Detect DDoS attempts early

**Not in Scope** (would require production changes):
- Enabling Cloudflare proxy (would require changing SSL termination)
- Changing to Cloudflare SSL (would require certificate changes)

**Recommendation**:
- Implement rate limiting for RPC endpoints
- Set up Cloudflare Access for admin portals
- Monitor traffic patterns and set up alerts

---

### Risk 3: Certificate Expiration

**Severity**: Medium
**Component**: SSL Certificates
**Status**: Current

**Description**:
- All 19 SSL certificates expire on **2026-04-16**
- Auto-renewal enabled but could fail
- Certificate failure would cause HTTPS outages

**Impact**:
- Services become inaccessible if certificates expire
- Browser warnings if certificates invalid
- All domains affected simultaneously (same expiration date)

**Current Mitigation**:
- Auto-renewal enabled in NPMplus
- Let's Encrypt handles renewal automatically
- Certificates valid until 2026-04-16

**Hardening Opportunities** (without breaking production):

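1. **Certificate Expiration Monitoring**:
   - Set up staged alerts before expiration, with thresholds well inside the 90-day Let's Encrypt certificate lifetime (e.g., 21/14/7 days; a 90-day threshold would fire as soon as a certificate is issued)
   - Monitor certificate status via NPMplus API
   - Alert if auto-renewal fails
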
2. **Certificate Verification Scripts**:
   - Regular verification of certificate validity
   - Automated checks for certificate expiration
   - Integration with monitoring systems

**Recommendation**:
- Set up certificate expiration alerts
- Regular verification of certificate status
- Document manual renewal procedures (done in backup/restore guide)

---

### Risk 4: Sankofa Routing Issue

**Severity**: High
**Component**: Backend Routing
**Status**: Known, Cutover Plan in Place

**Description**:
- 5 Sankofa domains route to Blockscout (192.168.11.140) but services not deployed
- Incorrect routing prevents Sankofa services from working
- Users may access wrong content

**Impact**:
- Sankofa domains don't work as intended
- Incorrect content served (Blockscout instead of Sankofa)
- SSL certificates exist but services not available

**Current Status**:
- Known issue documented
- Cutover plan created (see `SANKOFA_CUTOVER_PLAN.md`)
- Waiting for Sankofa service deployment

**Mitigation**:
- Cutover plan in place
- Will update routing once services deployed
- Temporary routing keeps domains accessible (though incorrect)

**Recommendation**:
- Complete Sankofa service deployment
- Execute cutover plan when services ready
- Update source-of-truth after cutover

---

### Risk 5: UDM Pro Port Forwarding - Manual Configuration

**Severity**: Medium
**Component**: Edge Routing
**Status**: Current

**Description**:
- Port forwarding configured manually via UDM Pro web UI
- No automation or API access
- Risk of misconfiguration during changes

**Impact**:
- Manual errors during configuration changes
- No version control or audit trail
- Difficult to verify configuration matches documentation

**Hardening Opportunities** (without breaking production):

1. **Document Exact Steps**:
   - Create detailed configuration guide
   - Document exact values for port forwarding rules
   - Create verification checklist

2. **Verification Procedures**:
   - Regular verification of port forwarding rules
   - Screenshot evidence of configuration
   - Automated connectivity tests

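
The automated connectivity tests mentioned above can be as simple as probing each forwarded port from a host outside the LAN. This is a minimal sketch: the port list (80 and 443) is an assumption, since the actual UDM Pro rules are not listed in this document, and it relies on bash's `/dev/tcp` pseudo-device plus the coreutils `timeout` command.

```bash
#!/usr/bin/env bash
# Sketch: verify that forwarded ports accept TCP connections.

# Succeeds when a TCP connection to host:port opens within 3 seconds.
port_open() {
  local host="$1" port="$2"
  timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# Checks each port in turn, prints one line per port,
# and returns non-zero if any port could not be reached.
verify_forwards() {
  local host="$1"; shift
  local port failed=0
  for port in "$@"; do
    if port_open "$host" "$port"; then
      echo "OK   ${host}:${port}"
    else
      echo "FAIL ${host}:${port}"
      failed=1
    fi
  done
  return "$failed"
}

# Example (ports are assumed, not taken from the UDM Pro configuration):
#   verify_forwards 76.53.10.36 80 443
```

Run it from outside the network. From inside the LAN the result may reflect hairpin NAT or local routing rather than the real port-forward rule.
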

**Recommendation**:
- Document exact port forwarding steps (done in verification runbook)
- Regular verification of configuration
- Screenshot evidence stored

---

### Risk 6: Backend VM Direct Access (No Nginx)

**Severity**: Low-Medium
**Component**: Backend VMs
**Status**: Intentional Configuration

**Description**:
- Some VMs accessible directly (no nginx layer)
- Besu RPC nodes (2101, 2201) expose ports 8545/8546 directly
- Node.js APIs (10150, 10151) expose port 3000 directly

**Impact**:
- Direct exposure of application ports
- No additional security layer (nginx headers, rate limiting)
- Application-level security only

**Rationale** (Intentional):
- RPC services require direct access for performance
- Node.js APIs designed for direct exposure
- Nginx layer adds unnecessary complexity for these services

**Hardening Opportunities** (without breaking production):

1. **Rate Limiting at NPMplus**:
   - Add rate limiting to RPC proxy hosts
   - Configure rate limits per IP or globally
   - Prevent abuse without adding nginx layer

2. **Security Headers at NPMplus**:
   - Add security headers via NPMplus advanced config
   - Configure CSP, X-Frame-Options, etc.
   - Apply to all proxy hosts

3. **Access Lists**:
   - Configure IP allowlists for private RPC endpoints
   - Restrict access to authorized IPs only
   - Use NPMplus access lists feature

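
Once headers are added in the NPMplus advanced config, a quick check can confirm that they actually reach clients. The sketch below reads response headers on stdin and prints every expected header that is missing. The header list is illustrative: CSP and X-Frame-Options come from the list above, while HSTS and X-Content-Type-Options are assumed additions.

```bash
#!/usr/bin/env bash
# Sketch: report which expected security headers are absent from a response.

# Reads raw HTTP response headers on stdin and prints the name of each
# expected header that does not appear (matching is case-insensitive).
missing_headers() {
  local headers name
  headers="$(tr -d '\r' | tr '[:upper:]' '[:lower:]')"
  for name in strict-transport-security x-frame-options \
              x-content-type-options content-security-policy; do
    grep -q "^${name}:" <<<"$headers" || echo "$name"
  done
}

# Example:
#   curl -sI https://dbis-admin.d-bis.org | missing_headers
```

An empty result means all four headers were present. Looping over every proxy host covers the "apply to all proxy hosts" item.
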

**Not in Scope** (would require production changes):
- Adding nginx layer to all services
- Changing backend architecture

**Recommendation**:
- Add rate limiting for RPC endpoints at NPMplus
- Configure access lists for private RPC endpoints
- Add security headers via NPMplus advanced config

---

### Risk 7: Internal TLS (Double TLS)

**Severity**: Low
**Component**: VMID 2400
**Status**: Current Configuration

**Description**:
- VMID 2400 (thirdweb-rpc-1) uses HTTPS internally (port 443)
- NPMplus terminates the public TLS session, then opens a second TLS connection to the HTTPS backend
- TLS is therefore terminated twice (once at NPMplus, once at VMID 2400)

**Impact**:
- Additional complexity in certificate management
- Two SSL certificates required (NPMplus + VMID 2400)
- Potential performance overhead

**Rationale** (Documentation Needed):
- Need to document why this is required
- May be intentional for additional security
- Or legacy configuration that could be simplified

**Hardening Opportunities** (without breaking production):

1. **Document Internal TLS Rationale**:
   - Document why VMID 2400 uses HTTPS internally
   - Verify if internal TLS is necessary
   - Document certificate management for internal TLS

2. **Monitor Internal TLS Certificate Expiration**:
   - Track internal SSL certificate expiration
   - Ensure internal certificates are renewed
   - Avoid internal certificate expiration causing outages

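
Internal certificate expiry can be tracked with `openssl x509 -checkend`, which reports whether a certificate expires within a given number of seconds. In this sketch `BACKEND` is a placeholder for the address of VMID 2400 (not stated in this document), and the 21-day threshold is an arbitrary example.

```bash
#!/usr/bin/env bash
# Sketch: warn when the internal certificate on the HTTPS backend is close to expiry.

# Reads a PEM certificate on stdin; succeeds when it expires within $1 days.
# Note: unreadable input also counts as "expiring", so fetch failures surface as warnings.
expires_within_days() {
  local days="$1"
  ! openssl x509 -noout -checkend "$(( days * 86400 ))" >/dev/null 2>&1
}

# Fetches the certificate served by host:port and prints it as PEM.
fetch_cert() {
  local host="$1" port="${2:-443}"
  openssl s_client -connect "${host}:${port}" </dev/null 2>/dev/null \
    | openssl x509 2>/dev/null
}

# Example (BACKEND is a hypothetical placeholder for VMID 2400's address):
#   if fetch_cert "$BACKEND" 443 | expires_within_days 21; then
#     echo "WARN: internal certificate expires within 21 days"
#   fi
```
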

**Recommendation**:
- Document why internal TLS is used
- Monitor internal certificate expiration
- Verify if internal TLS could be changed to HTTP (future consideration)

---

## Hardening Opportunities (Without Breaking Production)

### 1. Rate Limiting at NPMplus

**Priority**: High
**Effort**: Medium
**Impact**: High

**Implementation**:
- Configure rate limiting for RPC endpoints
- Set limits per IP (e.g., 100 requests/minute)
- Apply to all RPC proxy hosts

**Steps**:
1. Access NPMplus UI
2. Navigate to Proxy Hosts
3. Edit RPC proxy hosts (rpc-http-pub, rpc-ws-pub, etc.)
4. Configure rate limiting in the proxy host's advanced config (nginx `limit_req`; the matching `limit_req_zone` must be declared in the `http` context). Access lists are better suited to IP allow/deny rules than to request-rate limits
5. Test rate limiting behavior

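
Step 5 can be scripted by sending a short burst of requests and counting how many are rejected. This sketch assumes the limit is implemented with nginx `limit_req`, which answers rejected requests with 503 unless `limit_req_status` is set to another code such as 429. The URL in the example is a placeholder, not a real endpoint.

```bash
#!/usr/bin/env bash
# Sketch: exercise a rate limit and count rejected requests.

# Sends N JSON-RPC requests back to back and prints one HTTP status code per line.
burst() {
  local url="$1" n="${2:-150}" i
  for (( i = 0; i < n; i++ )); do
    curl -s -o /dev/null -m 5 -w '%{http_code}\n' \
      -H 'Content-Type: application/json' \
      -d '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}' \
      "$url"
  done
}

# Counts how often the given status code appears in the list on stdin.
count_code() {
  grep -c -x "$1" || true
}

# Example (hypothetical URL; use 429 instead if limit_req_status is set to 429):
#   burst https://rpc.example.invalid 150 | count_code 503
```

With a limit of 100 requests/minute per IP, a burst of 150 should yield a non-zero count of rejections. The exact number depends on the configured `burst` allowance.
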

**Benefits**:
- Protects RPC endpoints from abuse
- Mitigates application-layer request floods (volumetric DDoS still has to be absorbed upstream of NPMplus)
- Does not require backend changes

---

### 2. Cloudflare Access for Admin Portals

**Priority**: Medium
**Effort**: Medium
**Impact**: Medium

**Implementation**:
- Enable Cloudflare Access for `dbis-admin.d-bis.org`
- Enable Cloudflare Access for `secure.d-bis.org`
- Configure access policies (email allowlist, MFA, etc.)

**Steps**:
1. Access Cloudflare Zero Trust dashboard
2. Navigate to Access → Applications
3. Add application: `dbis-admin.d-bis.org`
4. Configure access policy (email allowlist, MFA)
5. Repeat for `secure.d-bis.org`

**Benefits**:
- Additional authentication layer
- MFA support
- Audit trail
- Only the two admin hostnames need to be routed through Cloudflare (proxied record or Cloudflare Tunnel), since Access only sees traffic that passes through Cloudflare; all other records can stay DNS-only

---

### 3. Certificate Expiration Monitoring

**Priority**: High
**Effort**: Low
**Impact**: High

**Implementation**:
- Set up monitoring for certificate expiration
- Configure staged alerts with thresholds well inside the 90-day Let's Encrypt certificate lifetime (e.g., 21/14/7 days before expiration; a 90-day threshold would fire as soon as a certificate is issued)
- Monitor auto-renewal status

**Steps**:
1. Create monitoring script or use existing verification scripts
2. Run daily checks of certificate expiration
3. Configure alerts (email, Slack, etc.)
4. Test alert system

**Script**:
```bash
# Run certificate verification daily
bash scripts/verify/export-npmplus-config.sh

# Check expiration dates: list domains whose certificates expire within 21 days.
# (Let's Encrypt certificates last 90 days, so a 90-day window would match every certificate.)
cat docs/04-configuration/verification-evidence/npmplus-verification-*/certificates.json | \
  jq '.[] | select(.expires | fromdateiso8601 < (now + (21 * 86400))) | .domain_names'
```

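
As a complement to the exported `certificates.json`, the certificate each domain actually serves can be checked directly. The sketch below is one possible approach, assuming `openssl` and GNU `date` are available. The 21-day threshold and the function names are illustrative.

```bash
#!/usr/bin/env bash
# Sketch: check the expiry of the certificate a domain currently serves.

# Whole days between two epoch timestamps (integer division, truncated).
days_between() {
  local now="$1" expiry="$2"
  echo $(( (expiry - now) / 86400 ))
}

# Prints the notAfter date of the served certificate as epoch seconds.
cert_expiry_epoch() {
  local host="$1" port="${2:-443}" not_after
  not_after="$(openssl s_client -servername "$host" -connect "${host}:${port}" \
    </dev/null 2>/dev/null | openssl x509 -noout -enddate 2>/dev/null | cut -d= -f2)"
  [ -n "$not_after" ] || return 1
  date -d "$not_after" +%s
}

# Prints a warning when fewer than $2 days remain (default 21).
warn_if_expiring() {
  local host="$1" threshold="${2:-21}" expiry left
  expiry="$(cert_expiry_epoch "$host")" || {
    echo "ERROR ${host}: could not read certificate"
    return 1
  }
  left="$(days_between "$(date +%s)" "$expiry")"
  if (( left < threshold )); then
    echo "WARN ${host}: ${left} days left"
  fi
}

# Example:
#   warn_if_expiring dbis-admin.d-bis.org 21
```

Checking the served certificate also catches the case where renewal succeeded but the new certificate was never loaded.
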

**Benefits**:
- Early warning of certificate expiration
- Time to fix auto-renewal issues
- Prevents unexpected outages

---

### 4. Health Check Endpoints for All Backend Services

**Priority**: Medium
**Effort**: Low-Medium
**Impact**: Medium

**Implementation**:
- Add health check endpoints to all backend services
- Configure health checks in NPMplus (if supported)
- Monitor health endpoints

**Steps**:
1. Add `/health` endpoints to all backend services
2. Configure health checks in application config
3. Set up monitoring for health endpoints
4. Configure alerts for failed health checks

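
Until every service exposes a `/health` endpoint, existing interfaces can serve as probes. The sketch below shows a generic HTTP probe and a JSON-RPC probe for the Besu nodes using the standard `eth_blockNumber` method. The URLs in the example are placeholders, and `curl` and `jq` are assumed to be installed.

```bash
#!/usr/bin/env bash
# Sketch: simple health probes for HTTP services and JSON-RPC nodes.

# Succeeds only when the URL answers with a 2xx status within 5 seconds.
http_healthy() {
  local url="$1" code
  code="$(curl -s -o /dev/null -m 5 -w '%{http_code}' "$url")" || return 1
  [[ "$code" == 2* ]]
}

# Converts a JSON-RPC hex quantity such as 0x1a to decimal.
hex_to_dec() {
  printf '%d\n' "$1"
}

# Prints the latest block number reported by a JSON-RPC endpoint;
# fails if the node does not answer with a hex quantity.
rpc_block_number() {
  local url="$1" hex
  hex="$(curl -s -m 5 -H 'Content-Type: application/json' \
    -d '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}' \
    "$url" | jq -r '.result')"
  [[ "$hex" == 0x* ]] || return 1
  hex_to_dec "$hex"
}

# Examples (hypothetical addresses):
#   http_healthy http://backend.example.invalid:3000/health || echo "API down"
#   rpc_block_number http://backend.example.invalid:8545
```

Comparing the block number between two runs also detects a node that responds but has stopped syncing.
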

**Benefits**:
- Early detection of service issues
- Proactive monitoring
- Better troubleshooting

---

### 5. Log Aggregation for NPMplus Access Logs

**Priority**: Medium
**Effort**: Medium
**Impact**: Medium

**Implementation**:
- Set up log aggregation for NPMplus access logs
- Configure log forwarding (syslog, filebeat, etc.)
- Set up log analysis and alerting

**Steps**:
1. Configure NPMplus to log to syslog or file
2. Set up log forwarder (filebeat, fluentd, etc.)
3. Configure log aggregation (ELK stack, Loki, etc.)
4. Set up alerts for unusual patterns

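
Before a full aggregation stack is in place, a quick look at the busiest clients can already reveal abuse. This sketch assumes the client IP is the first whitespace-separated field of each access log line (as in the nginx combined format). NPMplus may use a different layout, and the log path in the example is hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: list the clients with the most requests in an access log.

# Reads access log lines on stdin and prints "<count> <ip>" for the
# top N clients (default 10), busiest first.
top_clients() {
  local limit="${1:-10}"
  awk '{ hits[$1]++ } END { for (ip in hits) print hits[ip], ip }' \
    | sort -rn \
    | head -n "$limit"
}

# Example (hypothetical log path):
#   top_clients 5 < /path/to/npmplus/access.log
```

A client far above the rest of the list is a natural candidate for the rate limits or access lists described earlier.
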

**Benefits**:
- Better visibility into traffic patterns
- Detect attacks early
- Audit trail for troubleshooting

---

### 6. Document Failover Procedures

**Priority**: High
**Effort**: Low
**Impact**: High

**Implementation**:
- Document failover procedures if NPMplus fails
- Create step-by-step recovery guide
- Test failover procedures

**Status**: ✅ Done in `NPMPLUS_BACKUP_RESTORE.md`

---

## Not in Scope (Would Require Production Changes)

The following hardening measures would require production changes and are **not in scope** for this plan:

1. **Enabling Cloudflare Proxy**:
   - Would require changing SSL termination from NPMplus to Cloudflare
   - Would require reconfiguration of all SSL certificates
   - Would break current architecture

2. **Adding HA NPMplus Instance**:
   - Would require deployment of additional NPMplus container
   - Would require load balancer configuration
   - Would require database replication or shared storage

3. **Changing Backend Architecture**:
   - Adding nginx layer to all services
   - Changing RPC endpoints to use nginx
   - Would require application changes

---

## Risk Summary Table

| Risk | Severity | Status | Mitigation | Hardening Priority |
|------|----------|--------|------------|-------------------|
| Single Point of Failure (NPMplus) | High | Current | Documented | High (monitoring) |
| DNS-Only Mode | Medium | Intentional | Rate limiting, Cloudflare Access | Medium |
| Certificate Expiration | Medium | Current | Auto-renewal | High (monitoring) |
| Sankofa Routing Issue | High | Known | Cutover plan in place | High (cutover) |
| UDM Pro Manual Config | Medium | Current | Documentation | Medium (verification) |
| Backend Direct Access | Low-Medium | Intentional | Rate limiting | Medium |
| Internal TLS | Low | Current | Documentation | Low (documentation) |

---

## Hardening Implementation Priority

### High Priority (Implement First)

1. **Certificate Expiration Monitoring** - Critical for preventing outages
2. **Rate Limiting for RPC Endpoints** - Prevents abuse
3. **Document Failover Procedures** - ✅ Done

### Medium Priority

4. **Cloudflare Access for Admin Portals** - Additional security
5. **Health Check Endpoints** - Better monitoring
6. **Log Aggregation** - Better visibility

### Low Priority

7. **Document Internal TLS Rationale** - Documentation improvement

---

## Related Documentation

- **Verification Runbook**: `docs/04-configuration/INGRESS_VERIFICATION_RUNBOOK.md`
- **Backup/Restore Guide**: `docs/04-configuration/NPMPLUS_BACKUP_RESTORE.md`
- **Sankofa Cutover Plan**: `docs/04-configuration/SANKOFA_CUTOVER_PLAN.md`
- **Comprehensive Architecture**: `docs/04-configuration/DNS_NPMPLUS_VM_COMPREHENSIVE_ARCHITECTURE.md`

---

**Last Updated**: 2026-01-20
**Maintained By**: Infrastructure Team
**Status**: Complete Risk Assessment