smom-dbis-138/docs/OPERATIONS_RUNBOOK.md
2026-03-02 12:14:09 -08:00

# Operations Runbook - Complete System
**Date**: Operations Runbook
**Status**: ✅ COMPLETE
---
## Overview
This runbook provides operational procedures for:
1. Vault System Operations
2. ISO-4217 W Token System Operations
3. Bridge System Operations
4. Emergency Procedures
---
## 1. Daily Operations
### 1.1 Vault System Monitoring
#### Health Check
```bash
# Check vault health ratios
cast call $LEDGER_ADDRESS "getVaultHealth(address)" $VAULT_ADDRESS --rpc-url $RPC_URL
# Check total collateral
cast call $LEDGER_ADDRESS "totalCollateral(address)" $ASSET_ADDRESS --rpc-url $RPC_URL
# Check total debt
cast call $LEDGER_ADDRESS "totalDebt(address)" $CURRENCY_ADDRESS --rpc-url $RPC_URL
```
#### Alert Thresholds
- **Health Ratio < 120%**: Warning alert
- **Health Ratio < 110%**: Critical alert (liquidation threshold)
- **Debt Ceiling > 90%**: Warning alert
- **Oracle Staleness > 1 hour**: Critical alert
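The thresholds above can be expressed as a small check that a monitoring job might run after each poll. This is a minimal sketch, not the deployed alerting logic: the function name and its inputs are hypothetical, and it assumes the health ratio and debt ceiling utilization have already been converted to percentages (the unit returned by `getVaultHealth` is not stated in this runbook).

```python
def vault_alerts(health_ratio_pct, debt_utilization_pct, oracle_age_seconds):
    """Return a list of (severity, message) tuples for one vault poll.

    All three inputs are assumed to be pre-computed by the caller.
    """
    alerts = []
    # Health ratio: 110% is the liquidation threshold, 120% the warning level.
    if health_ratio_pct < 110:
        alerts.append(("critical", "health ratio below 110% (liquidation threshold)"))
    elif health_ratio_pct < 120:
        alerts.append(("warning", "health ratio below 120%"))
    # Debt ceiling utilization above 90% is a warning.
    if debt_utilization_pct > 90:
        alerts.append(("warning", "debt ceiling utilization above 90%"))
    # Oracle data older than one hour is critical.
    if oracle_age_seconds > 3600:
        alerts.append(("critical", "oracle data older than 1 hour"))
    return alerts
```

An empty list means no threshold was crossed for that poll.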
---
### 1.2 ISO-4217 W Token Monitoring
#### Reserve Verification
```bash
# Check reserve sufficiency for USDW
cast call $USDW_ADDRESS "isReserveSufficient()" --rpc-url $RPC_URL
# Get reserve balance
cast call $USDW_ADDRESS "verifiedReserve()" --rpc-url $RPC_URL
# Get total supply
cast call $USDW_ADDRESS "totalSupply()" --rpc-url $RPC_URL
# Calculate reserve ratio
# Reserve Ratio = (verifiedReserve / totalSupply) * 100
```
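The reserve ratio from the last comment can be computed off-chain from the two `cast call` outputs. The sketch below is illustrative only: the helper names are hypothetical, and it assumes `verifiedReserve()` and `totalSupply()` use the same decimals. It accepts either the raw hex word that `cast call` prints when no return type is given, or a decimal string, and works in basis points to avoid floating-point rounding on large integers.

```python
def parse_uint256(raw):
    """Parse a uint256 printed as 0x-prefixed hex or as a decimal string.

    Only the first whitespace-separated token is used, so trailing
    annotations such as "[1e6]" are ignored.
    """
    text = raw.strip().split()[0]
    return int(text, 16) if text.lower().startswith("0x") else int(text)


def reserve_ratio_bps(verified_reserve, total_supply):
    """Reserve ratio in basis points (10000 = 100%), or None if supply is zero."""
    if total_supply == 0:
        return None
    return verified_reserve * 10_000 // total_supply
```

For example, a verified reserve of 1,050 units against a supply of 1,000 units gives 10500 basis points (105%).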
#### Daily Reserve Check
1. **Check Reserve Oracle Reports**
```bash
cast call $RESERVE_ORACLE "getVerifiedReserve(address)" $USDW_ADDRESS --rpc-url $RPC_URL
```
2. **Verify Quorum**
```bash
cast call $RESERVE_ORACLE "isQuorumMet(address)" $USDW_ADDRESS --rpc-url $RPC_URL
```
3. **Check for Stale Reports**
- Reports older than 1 hour should be removed
- If quorum not met, investigate oracle issues
#### Alert Thresholds
- **Reserve Ratio < 100%**: CRITICAL - Minting must halt
- **Reserve Ratio < 105%**: Warning alert
- **Oracle Quorum Not Met**: Critical alert
- **Stale Reports Detected**: Warning alert
---
### 1.3 Bridge System Monitoring
#### Bridge Health Metrics
```bash
# Check bridge success rate
# Query bridge events for success/failure counts
# Check settlement times
# Monitor TransferStatusUpdated events
# Check reserve verification failures
# Monitor ReserveVerified events with sufficient=false
```
#### Alert Thresholds
- **Success Rate < 95%**: Warning alert
- **Success Rate < 90%**: Critical alert
- **Settlement Time > 1 hour**: Warning alert
- **Reserve Verification Failures**: Critical alert
- **Compliance Violations**: Critical alert
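As a rough illustration of the success-rate thresholds, the sketch below derives the rate from success and failure counts and maps it to an alert level. The function names are hypothetical, and the counts are assumed to come from whatever event query the monitoring stack uses; the event names and their fields are not specified here.

```python
def bridge_success_rate(succeeded, failed):
    """Success rate in percent, or None when there were no transfers."""
    total = succeeded + failed
    if total == 0:
        return None
    return 100.0 * succeeded / total


def classify_success_rate(rate_pct):
    """Map a success rate to 'ok', 'warning', 'critical', or 'no_data'."""
    if rate_pct is None:
        return "no_data"
    if rate_pct < 90:
        return "critical"
    if rate_pct < 95:
        return "warning"
    return "ok"
```

Returning `no_data` for an empty window keeps a quiet period from being reported as a failure.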
---
### 1.4 Reserve and Stabilization Policies (VAULT_SYSTEM_MASTER_TECHNICAL_PLAN)
The following formulas and checklists are from [VAULT_SYSTEM_MASTER_TECHNICAL_PLAN](../../../docs/VAULT_SYSTEM_MASTER_TECHNICAL_PLAN.md). Use them for sizing and operational verification.
#### Reserve Sizing Model
- **Variables:** PeakMinuteOutflow = P, StabilizationWindow = T (minutes).
- **Required reserve:** Reserve ≥ P × T.
- **Recommended safety factor:** 3–5× peak minute outflow.
- **Example:** P = 10,000, T = 5 min → Reserve ≥ 50,000; with 3× safety → 150,000.
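The sizing rule reduces to a single multiplication. The following sketch reproduces the example figures; the function name is hypothetical, and the safety factor is applied to P × T as in the example.

```python
def required_reserve(peak_minute_outflow, window_minutes, safety_factor=1):
    """Minimum reserve for a given peak outflow per minute and stabilization window."""
    return peak_minute_outflow * window_minutes * safety_factor


base = required_reserve(10_000, 5)            # P = 10,000, T = 5 min
with_safety = required_reserve(10_000, 5, 3)  # same inputs with a 3x safety factor
print(base, with_safety)
```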
#### Cantilever Stabilization Model
- **Condition:** s × f ≥ Δ (s = micro trade size, f = micro trade frequency, Δ = net imbalance per minute).
- **Dynamic rule:** If deviation > θ, set s = k × deviation (eliminates fixed frequency dependency).
- **Use:** Size and frequency of stabilization trades so throughput offsets macro flow.
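A minimal sketch of the two rules, assuming all quantities are expressed in consistent units (for example token units per minute). The function names and the `base_size` fallback are illustrative and are not taken from the technical plan.

```python
def throughput_offsets_imbalance(trade_size, trades_per_minute, net_imbalance_per_minute):
    """Condition s * f >= delta: stabilization throughput covers the net imbalance."""
    return trade_size * trades_per_minute >= net_imbalance_per_minute


def dynamic_trade_size(deviation, theta, k, base_size):
    """Dynamic rule: when |deviation| exceeds theta, use s = k * |deviation|.

    Otherwise fall back to base_size (a hypothetical default trade size).
    """
    if abs(deviation) > theta:
        return k * abs(deviation)
    return base_size
```

For instance, trades of size 100 at 6 per minute offset a net imbalance of 500 per minute, while 4 per minute do not.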
#### Bridge Liquidity Buffer
- **Rule:** BridgeReserve ≥ PeakBridgeOutflow × Latency (where Latency = bridge settlement time).
- **Use:** Ensure cross-chain bridge buffers satisfy this so outflows do not exhaust reserves during settlement.
#### Cross-chain parity and bridge buffer
- **Objective:** Maintain |Price138 − Price651940| < ArbitrageThreshold (see [CROSS_CHAIN_ARBITRAGE_DESIGN](../../../docs/07-ccip/CROSS_CHAIN_ARBITRAGE_DESIGN.md)). Cross-chain private arbitrage bots execute when deviation exceeds threshold; bridge reserve must be sized so outflows do not exhaust reserves during settlement.
- **Bridge buffer formula:** BridgeReserve ≥ PeakBridgeOutflow × Latency.
- **PeakBridgeOutflow:** Measure from bridge events (e.g. lock/release or TransferInitiated volume) over a rolling window (e.g. peak hourly or daily outflow in USD or token units).
- **Latency:** Bridge settlement time (e.g. typical time from lock on source chain to release on destination, in minutes or blocks). Use historical median or P95.
- **Sizing steps:** (1) Query bridge contract events for initiated/released amounts per time window; (2) compute peak outflow; (3) measure typical settlement latency; (4) set minimum reserve = Peak × Latency; (5) add safety factor (e.g. 1.5–2×) and document in runbook.
- **Alert when reserve below:** If BridgeReserve < PeakBridgeOutflow × Latency (or below safety threshold), trigger **Warning** alert. If reserve is falling and may breach within one settlement window, escalate to **Critical**. Integrate with existing monitoring (e.g. Prometheus + PagerDuty when monitoring stack is deployed). See [VAULT_SYSTEM_MASTER_TECHNICAL_PLAN](../../../docs/VAULT_SYSTEM_MASTER_TECHNICAL_PLAN.md) §9.
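The sizing steps can be sketched as follows, assuming the per-window outflows and observed settlement latencies have already been extracted from bridge events. The function names, the nearest-rank percentile method, and the default 1.5× safety factor are illustrative choices; outflow must be expressed per the same time unit as latency (here, per minute and minutes).

```python
def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct is an integer 1..100)."""
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct * n / 100
    return ordered[max(rank, 1) - 1]


def size_bridge_reserve(outflow_per_minute, latencies_minutes,
                        latency_pct=95, safety_factor=1.5):
    """Return (minimum, target) where minimum = peak outflow * latency."""
    peak = max(outflow_per_minute)                        # step 2: peak outflow
    latency = percentile(latencies_minutes, latency_pct)  # step 3: P95 latency
    minimum = peak * latency                              # step 4: Peak x Latency
    return minimum, minimum * safety_factor               # step 5: safety factor


def reserve_alert(bridge_reserve, target):
    """Warning when the reserve is below the safety-adjusted target."""
    return "warning" if bridge_reserve < target else "ok"
```

The escalation to Critical depends on the trend of the reserve over a settlement window and is left out of this sketch.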
#### Flash Loan Containment Checklist
- Use **TWAP deviation detection** (not single-block price).
- **Ignore single-block imbalance** for stabilizer triggers.
- Require **sustained deviation for N blocks** before rebalancing.
- **Cap per-block stabilization volume** to limit flash-driven execution.
- **Target:** Flash drain recovery < 3 blocks (per Master Plan §16).
- **On-chain:** The [Stabilizer](../../contracts/bridge/trustless/integration/Stabilizer.sol) (Phase 3 + 6) implements block delay, sustained-deviation buffer, per-block volume cap, and slippage/gas checks; deploy and configure per [CONTRACT_DEPLOYMENT_RUNBOOK](../../../docs/03-deployment/CONTRACT_DEPLOYMENT_RUNBOOK.md) § Stabilizer.
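To illustrate the checklist, the sketch below shows an off-chain model of a trigger that fires only after the TWAP deviation has stayed above a threshold for N consecutive blocks, and that caps the volume executed per block. It is not the Stabilizer contract's logic: the class name, parameters, and units are hypothetical.

```python
class SustainedDeviationTrigger:
    """Fire only after N consecutive blocks of above-threshold TWAP deviation."""

    def __init__(self, threshold, required_blocks, max_volume_per_block):
        self.threshold = threshold
        self.required_blocks = required_blocks
        self.max_volume_per_block = max_volume_per_block
        self._streak = 0  # consecutive blocks above threshold so far

    def observe(self, twap_deviation):
        """Record one block's TWAP deviation; return True if rebalancing may run."""
        if abs(twap_deviation) > self.threshold:
            self._streak += 1
        else:
            self._streak = 0  # a single in-range block resets the streak
        return self._streak >= self.required_blocks

    def capped_volume(self, desired_volume):
        """Limit stabilization volume executed in a single block."""
        return min(desired_volume, self.max_volume_per_block)
```

Because the streak resets on any in-range block, a one-block imbalance caused by a flash loan never reaches the trigger condition on its own.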
---
## 2. Weekly Operations
### 2.1 Reserve Attestation
#### Weekly Reserve Report
1. **Collect Custodial Balances**
- USDW: Check USD custodial account
- EURW: Check EUR custodial account
- GBPW: Check GBP custodial account
2. **Submit Oracle Reports**
```solidity
reserveOracle.submitReserveReport(
    tokenAddress,
    reserveBalance,
    block.timestamp
);
```
3. **Verify Consensus**
- Ensure quorum is met
- Verify consensus matches custodial balance
4. **Publish Proof-of-Reserves**
- Generate Merkle tree of reserves
- Publish on-chain hash
- Update public dashboard
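As a rough sketch of the Merkle step, the code below builds a root over per-currency reserve balances. Everything about the encoding is an assumption made for illustration: the leaf format (`SYMBOL:balance`), the use of SHA-256 (an on-chain verifier would more likely use keccak256, which is not in the Python standard library), the domain-separation prefixes, and duplicating the last node on odd levels.

```python
import hashlib


def _h(data):
    """SHA-256 digest (assumption: stand-in for the production hash function)."""
    return hashlib.sha256(data).digest()


def reserve_leaf(symbol, balance_minor_units):
    """Hypothetical leaf encoding: 'SYMBOL:balance' as UTF-8 bytes."""
    return f"{symbol}:{balance_minor_units}".encode("utf-8")


def merkle_root(leaves):
    """Merkle root over raw leaf bytes; odd levels duplicate their last node."""
    if not leaves:
        raise ValueError("at least one leaf is required")
    # Prefix leaves with 0x00 and inner nodes with 0x01 so a leaf hash
    # cannot be reinterpreted as an inner node.
    level = [_h(b"\x00" + leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])
        level = [_h(b"\x01" + level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]


leaves = [reserve_leaf("USDW", 125_000_000),
          reserve_leaf("EURW", 80_000_000),
          reserve_leaf("GBPW", 42_500_000)]
print(merkle_root(leaves).hex())
```

The balances shown are placeholder values, not real reserve figures. The hex digest is the value that would be published as the on-chain hash in this sketch.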
---
### 2.2 System Health Review
#### Review Metrics
- Total vaults created
- Total collateral locked
- Total debt issued
- W token supply per currency
- Reserve ratios
- Bridge operations count
- Success rates
#### Generate Report
- Weekly operations report
- Reserve attestation report
- Compliance status report
---
## 3. Monthly Operations
### 3.1 Security Review
#### Access Control Audit
1. Review all role assignments
2. Verify principle of least privilege
3. Check for unused roles
4. Review multi-sig configurations
#### Compliance Audit
1. Verify money multiplier = 1.0 (all W tokens)
2. Verify GRU isolation (no GRU conversions)
3. Verify ISO-4217 compliance
4. Review reserve attestations
#### Code Review
1. Review recent changes
2. Check for security updates
3. Review dependency updates
4. Verify test coverage
---
### 3.2 Performance Review
#### Gas Optimization
- Review gas usage trends
- Identify optimization opportunities
- Test optimization proposals
#### System Performance
- Review transaction throughput
- Check oracle update frequency
- Review bridge settlement times
- Analyze user patterns
---
## 4. Emergency Procedures
### 4.1 Reserve Shortfall (W Tokens)
#### Symptoms
- Reserve < Supply for any W token
- Money multiplier > 1.0 (supply exceeds verified reserve)
- Reserve verification fails
#### Immediate Actions
1. **Halt Minting**
```solidity
// Halt minting by revoking MINTER_ROLE from the minter address
mintController.revokeRole(keccak256("MINTER_ROLE"), minterAddress);
```
2. **Alert Team**
- Notify operations team
- Notify compliance team
- Prepare public statement
3. **Investigate**
- Check custodial account balance
- Verify oracle reports
- Check for accounting errors
4. **Remediation**
- If accounting error: Correct and resume
- If actual shortfall: Add reserves or halt operations
- If oracle issue: Fix oracle and resume
#### Recovery Steps
1. Verify reserve restored
2. Re-enable minting
3. Resume normal operations
4. Post-mortem review
---
### 4.2 Vault Liquidation Event
#### Symptoms
- Vault health ratio < 110%
- Liquidation triggered
#### Immediate Actions
1. **Verify Liquidation**
```bash
cast call $LIQUIDATION_ADDRESS "canLiquidate(address)" $VAULT_ADDRESS --rpc-url $RPC_URL
```
2. **Monitor Liquidation**
- Track liquidation events
- Verify collateral seized
- Verify debt repaid
3. **Post-Liquidation**
- Check remaining vault health
- Verify system stability
- Notify vault owner
---
### 4.3 Bridge Failure
#### Symptoms
- Bridge transaction fails
- Settlement timeout
- Reserve verification fails on bridge
#### Immediate Actions
1. **Check Bridge Status**
```bash
cast call $BRIDGE_REGISTRY "destinations(uint256)" $CHAIN_ID --rpc-url $RPC_URL
```
2. **Investigate Failure**
- Check transaction logs
- Verify destination chain status
- Check reserve verification
3. **Initiate Refund** (if timeout)
```solidity
bridgeEscrowVault.initiateRefund(refundRequest, hsmSigner);
bridgeEscrowVault.executeRefund(transferId);
```
4. **Resume Operations**
- Fix underlying issue
- Re-enable bridge route
- Resume normal operations
---
### 4.4 Oracle Failure
#### Symptoms
- Oracle staleness detected
- Quorum not met
- Price feed failure
#### Immediate Actions
1. **Check Oracle Status**
```bash
cast call $XAU_ORACLE "isFrozen()" --rpc-url $RPC_URL
cast call $RESERVE_ORACLE "isQuorumMet(address)" $TOKEN_ADDRESS --rpc-url $RPC_URL
```
2. **Freeze System** (if critical)
```solidity
xauOracle.freeze();
// Pause vault operations if needed
```
3. **Fix Oracle**
- Add new oracle feeds
- Remove stale reports
- Restore quorum
4. **Resume Operations**
```solidity
xauOracle.unfreeze();
```
---
### 4.5 Compliance Violation
#### Symptoms
- Money multiplier > 1.0 detected
- GRU conversion detected
- ISO-4217 violation
#### Immediate Actions
1. **Halt Operations**
- Pause minting
- Pause bridging
- Freeze affected tokens
2. **Investigate**
- Review transaction history
- Identify violation source
- Check compliance guard logs
3. **Remediation**
- Fix violation
- Restore compliance
- Resume operations
4. **Post-Mortem**
- Document violation
- Update compliance rules
- Prevent recurrence
---
## 5. Incident Response
### 5.1 Incident Classification
#### Severity Levels
**CRITICAL (P0)**:
- Reserve < Supply (money multiplier violation)
- System compromise
- Complete system failure
**HIGH (P1)**:
- Reserve ratio < 105%
- Bridge failures > 10%
- Oracle quorum failure
**MEDIUM (P2)**:
- Reserve ratio < 110%
- Bridge failures 5-10%
- Single oracle failure
**LOW (P3)**:
- Minor performance issues
- Non-critical alerts
- Documentation updates
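The quantitative criteria above can be encoded so that alerts are tagged with a severity automatically. This is a minimal sketch with a hypothetical function signature. It evaluates the levels from most to least severe and returns the first match, treating "Reserve < Supply" as a reserve ratio below 100% and falling back to P3 for incidents that meet none of the numeric criteria.

```python
def classify_incident(reserve_ratio_pct, bridge_failure_pct,
                      quorum_met=True, failed_oracles=0,
                      system_compromised=False):
    """Return 'P0'..'P3' using the severity criteria, most severe first."""
    # P0: reserve below supply, or a system compromise.
    if system_compromised or reserve_ratio_pct < 100:
        return "P0"
    # P1: reserve ratio below 105%, bridge failures above 10%, or quorum lost.
    if reserve_ratio_pct < 105 or bridge_failure_pct > 10 or not quorum_met:
        return "P1"
    # P2: reserve ratio below 110%, bridge failures of 5-10%, or one oracle down.
    if reserve_ratio_pct < 110 or bridge_failure_pct >= 5 or failed_oracles >= 1:
        return "P2"
    # P3: everything else (minor issues, non-critical alerts).
    return "P3"
```

Non-numeric criteria such as complete system failure still require a human decision and are represented here only by the `system_compromised` flag.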
---
### 5.2 Incident Response Process
#### Step 1: Detection
- Monitor alerts
- Review logs
- User reports
#### Step 2: Assessment
- Classify severity
- Assess impact
- Identify root cause
#### Step 3: Containment
- Apply emergency procedures
- Halt affected operations
- Isolate issue
#### Step 4: Resolution
- Fix root cause
- Restore operations
- Verify fix
#### Step 5: Post-Mortem
- Document incident
- Identify improvements
- Update procedures
---
## 6. Backup & Recovery
### 6.1 Backup Procedures
#### Daily Backups
- Contract state snapshots
- Configuration backups
- Access control backups
#### Weekly Backups
- Complete system state
- Oracle configuration
- Compliance rules
#### Monthly Backups
- Full system archive
- Historical data
- Audit logs
---
### 6.2 Recovery Procedures
#### Contract State Recovery
1. Identify backup point
2. Restore contract state
3. Verify restoration
4. Resume operations
#### Configuration Recovery
1. Restore configuration files
2. Verify settings
3. Test functionality
4. Resume operations
---
## 7. Monitoring Setup
### 7.1 Key Metrics
#### Vault System Metrics
- Total vaults
- Total collateral (by asset)
- Total debt (by currency)
- Average health ratio
- Liquidation events
#### W Token Metrics
- Supply per token (USDW, EURW, etc.)
- Reserve balance per token
- Reserve ratio per token
- Mint/burn events
- Redemption events
#### Bridge Metrics
- Bridge success rate
- Average settlement time
- Reserve verification success rate
- Compliance check success rate
- Transfer volume
---
### 7.2 Alert Configuration
#### Critical Alerts
```yaml
- name: Reserve Shortfall
  condition: reserveRatio < 100%
  action: halt_minting
- name: Money Multiplier Violation
  condition: reserve < supply
  action: emergency_pause
- name: Bridge Failure Rate High
  condition: successRate < 90%
  action: alert_team
```
#### Warning Alerts
```yaml
- name: Reserve Ratio Low
  condition: reserveRatio < 105%
  action: alert_team
- name: Vault Health Low
  condition: healthRatio < 120%
  action: alert_team
- name: Oracle Staleness
  condition: reportAge > 1hour
  action: alert_team
```
---
## 8. Operational Checklists
### 8.1 Daily Checklist
- [ ] Check all reserve ratios (W tokens)
- [ ] Verify oracle quorum status
- [ ] Check vault health ratios
- [ ] Review bridge success rates
- [ ] Check for critical alerts
- [ ] Review error logs
### 8.2 Weekly Checklist
- [ ] Submit reserve attestations
- [ ] Review system metrics
- [ ] Check access control roles
- [ ] Review compliance status
- [ ] Generate weekly report
- [ ] Update documentation
### 8.3 Monthly Checklist
- [ ] Security review
- [ ] Compliance audit
- [ ] Performance review
- [ ] Backup verification
- [ ] Update procedures
- [ ] Team training
---
## 9. Contact Information
### Emergency Contacts
- **Operations Team**: [Contact Info]
- **Security Team**: [Contact Info]
- **Compliance Team**: [Contact Info]
- **On-Call Engineer**: [Contact Info]
### Escalation Path
1. Operations Team (First Response)
2. Security Team (Security Issues)
3. Compliance Team (Compliance Issues)
4. Management (Critical Issues)
---
**Last Updated**: Operations Runbook Complete