# Operations Runbook - Complete System

**Date**: Operations Runbook
**Status**: ✅ COMPLETE

---

## Overview

This runbook provides operational procedures for:
1. Vault System Operations
2. ISO-4217 W Token System Operations
3. Bridge System Operations
4. Emergency Procedures

---

## 1. Daily Operations

### 1.1 Vault System Monitoring

#### Health Check
```bash
# Check vault health ratios
cast call $LEDGER_ADDRESS "getVaultHealth(address)" $VAULT_ADDRESS --rpc-url $RPC_URL

# Check total collateral
cast call $LEDGER_ADDRESS "totalCollateral(address)" $ASSET_ADDRESS --rpc-url $RPC_URL

# Check total debt
cast call $LEDGER_ADDRESS "totalDebt(address)" $CURRENCY_ADDRESS --rpc-url $RPC_URL
```

#### Alert Thresholds
- **Health Ratio < 120%**: Warning alert
- **Health Ratio < 110%**: Critical alert (liquidation threshold)
- **Debt Ceiling > 90%**: Warning alert
- **Oracle Staleness > 1 hour**: Critical alert
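
The health-ratio thresholds above can be wrapped in a small helper for scripted checks. This is a minimal sketch and not part of the deployed tooling: the function name `classify_vault_health` is hypothetical, and it assumes the value returned by `getVaultHealth` has already been decoded and scaled to a whole-number percentage (this runbook does not specify the contract's return encoding).

```bash
# Hypothetical helper: map a vault health ratio (whole-number percent)
# to the alert levels listed above.
classify_vault_health() {
  local ratio_pct="$1"
  if [ "$ratio_pct" -lt 110 ]; then
    echo "CRITICAL"   # below the 110% liquidation threshold
  elif [ "$ratio_pct" -lt 120 ]; then
    echo "WARNING"    # at or above 110% but below 120%
  else
    echo "OK"
  fi
}

# Example with a literal value; in practice pass the decoded on-chain ratio.
classify_vault_health 115
```
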

---

### 1.2 ISO-4217 W Token Monitoring

#### Reserve Verification
```bash
# Check reserve sufficiency for USDW
cast call $USDW_ADDRESS "isReserveSufficient()" --rpc-url $RPC_URL

# Get reserve balance
cast call $USDW_ADDRESS "verifiedReserve()" --rpc-url $RPC_URL

# Get total supply
cast call $USDW_ADDRESS "totalSupply()" --rpc-url $RPC_URL

# Calculate reserve ratio
# Reserve Ratio = (verifiedReserve / totalSupply) * 100
```

#### Daily Reserve Check
1. **Check Reserve Oracle Reports**
   ```bash
   cast call $RESERVE_ORACLE "getVerifiedReserve(address)" $USDW_ADDRESS --rpc-url $RPC_URL
   ```

2. **Verify Quorum**
   ```bash
   cast call $RESERVE_ORACLE "isQuorumMet(address)" $USDW_ADDRESS --rpc-url $RPC_URL
   ```

3. **Check for Stale Reports**
   - Reports older than 1 hour should be removed
   - If quorum not met, investigate oracle issues
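
The one-hour staleness rule in step 3 can be checked mechanically once a report's timestamp is known. The sketch below assumes timestamps are Unix epoch seconds; `is_report_stale` is a hypothetical name, and how the report timestamp is read from the reserve oracle is not covered by this runbook.

```bash
# Hypothetical helper: exit status 0 when a report is older than max_age
# seconds (default 3600, matching the 1-hour rule above).
is_report_stale() {
  local report_ts="$1" now_ts="$2" max_age="${3:-3600}"
  [ $(( now_ts - report_ts )) -gt "$max_age" ]
}

# Example: a report submitted two hours ago is flagged as stale.
now=$(date +%s)
if is_report_stale $(( now - 7200 )) "$now"; then
  echo "stale report: remove it and re-check quorum"
fi
```
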

#### Alert Thresholds
- **Reserve Ratio < 100%**: CRITICAL - Minting must halt
- **Reserve Ratio < 105%**: Warning alert
- **Oracle Quorum Not Met**: Critical alert
- **Stale Reports Detected**: Warning alert
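
The reserve ratio formula and the two ratio thresholds above can be combined into a short check. This is a sketch under stated assumptions: both helper names are hypothetical, and the inputs are assumed to be the `verifiedReserve()` and `totalSupply()` values already decoded to plain decimal numbers in the same units. `awk` is used so that large token amounts do not overflow shell integer arithmetic.

```bash
# Hypothetical helper: Reserve Ratio = (verifiedReserve / totalSupply) * 100
reserve_ratio_pct() {
  awk -v r="$1" -v s="$2" 'BEGIN {
    if (s == 0) { print "n/a"; exit }   # ratio is undefined with zero supply
    printf "%.2f\n", (r / s) * 100
  }'
}

# Hypothetical helper: apply the 100% and 105% thresholds.
classify_reserve_ratio() {
  awk -v r="$1" -v s="$2" 'BEGIN {
    # Cross-multiply instead of dividing to avoid rounding at the boundaries.
    if (r * 100 < s * 100) {
      print "CRITICAL"   # reserve < supply: minting must halt
    } else if (r * 100 < s * 105) {
      print "WARNING"
    } else {
      print "OK"
    }
  }'
}

reserve_ratio_pct 1030000 1000000
classify_reserve_ratio 1030000 1000000
```
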

---

### 1.3 Bridge System Monitoring

#### Bridge Health Metrics
```bash
# Check bridge success rate
# Query bridge events for success/failure counts

# Check settlement times
# Monitor TransferStatusUpdated events

# Check reserve verification failures
# Monitor ReserveVerified events with sufficient=false
```

#### Alert Thresholds
- **Success Rate < 95%**: Warning alert
- **Success Rate < 90%**: Critical alert
- **Settlement Time > 1 hour**: Warning alert
- **Reserve Verification Failures**: Critical alert
- **Compliance Violations**: Critical alert
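
Once success and failure counts have been collected from bridge events, the success-rate thresholds above reduce to a simple comparison. The sketch below is illustrative only: `bridge_success_rate_status` is a hypothetical name, the counts are assumed to cover the same monitoring window, and an empty window is treated as healthy (an assumption, since the runbook does not define that case).

```bash
# Hypothetical helper: classify bridge health from success/failure counts.
bridge_success_rate_status() {
  awk -v ok="$1" -v failed="$2" 'BEGIN {
    total = ok + failed
    if (total == 0) { print "OK"; exit }   # no transfers in the window
    if (ok * 100 < total * 90) {
      print "CRITICAL"   # success rate below 90%
    } else if (ok * 100 < total * 95) {
      print "WARNING"    # success rate below 95%
    } else {
      print "OK"
    }
  }'
}

# Example: 94 successful and 6 failed transfers in the window.
bridge_success_rate_status 94 6
```
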

---

### 1.4 Reserve and Stabilization Policies (VAULT_SYSTEM_MASTER_TECHNICAL_PLAN)

The following formulas and checklists are from [VAULT_SYSTEM_MASTER_TECHNICAL_PLAN](../../../docs/VAULT_SYSTEM_MASTER_TECHNICAL_PLAN.md). Use them for sizing and operational verification.

#### Reserve Sizing Model

- **Variables:** PeakMinuteOutflow = P, StabilizationWindow = T (minutes).
- **Required reserve:** Reserve ≥ P × T.
- **Recommended safety factor:** 3–5×, applied to the required reserve (P × T).
- **Example:** P = 10,000, T = 5 min → Reserve ≥ 50,000; with 3× safety → 150,000.
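
The sizing rule above is plain multiplication, and the worked example can be reproduced with a one-line helper. This is a sketch: `required_reserve` is a hypothetical name, and inputs are assumed to be whole numbers in the same token or currency unit that fit in the shell's integer range.

```bash
# Hypothetical helper: Reserve >= P * T, optionally scaled by a safety factor.
#   $1 = peak minute outflow (P), $2 = stabilization window in minutes (T),
#   $3 = integer safety factor (defaults to 1, i.e. the bare minimum).
required_reserve() {
  local peak_minute_outflow="$1" window_minutes="$2" safety_factor="${3:-1}"
  echo $(( peak_minute_outflow * window_minutes * safety_factor ))
}

required_reserve 10000 5      # minimum reserve from the example: 50000
required_reserve 10000 5 3    # with the 3x safety factor: 150000
```
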

#### Cantilever Stabilization Model

- **Condition:** s × f ≥ Δ (s = micro trade size, f = micro trade frequency, Δ = net imbalance per minute).
- **Dynamic rule:** If deviation > θ, set s = k × deviation (eliminates fixed frequency dependency).
- **Use:** Size and frequency of stabilization trades so throughput offsets macro flow.
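
Both rules above can be sanity-checked off-chain before changing stabilizer parameters. The sketch below is illustrative and makes two assumptions that are not in the source: f is expressed as trades per minute so that s × f has the same units as Δ, and no trade is placed (size 0) when the deviation is at or below θ. The function names are hypothetical.

```bash
# Hypothetical helper: exit status 0 when s * f >= delta.
# Inputs are integers; f is assumed to be trades per minute.
stabilization_sufficient() {
  local size="$1" freq="$2" delta="$3"
  [ $(( size * freq )) -ge "$delta" ]
}

# Hypothetical helper for the dynamic rule: s = k * deviation when deviation > theta.
# Assumption: size 0 (no trade) when deviation is at or below theta.
dynamic_trade_size() {
  awk -v dev="$1" -v theta="$2" -v k="$3" 'BEGIN {
    if (dev + 0 > theta + 0) { print k * dev } else { print 0 }
  }'
}

# Example: trades of 500 at 4 per minute against a net imbalance of 1800 per minute.
if stabilization_sufficient 500 4 1800; then
  echo "throughput covers the imbalance"
fi
dynamic_trade_size 40 25 0.5
```
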

#### Bridge Liquidity Buffer

- **Rule:** BridgeReserve ≥ PeakBridgeOutflow × Latency (where Latency = bridge settlement time).
- **Use:** Ensure cross-chain bridge buffers satisfy this so outflows do not exhaust reserves during settlement.

#### Cross-Chain Parity and Bridge Buffer

- **Objective:** Maintain |Price138 − Price651940| < ArbitrageThreshold (see [CROSS_CHAIN_ARBITRAGE_DESIGN](../../../docs/07-ccip/CROSS_CHAIN_ARBITRAGE_DESIGN.md)). Cross-chain private arbitrage bots execute when deviation exceeds threshold; bridge reserve must be sized so outflows do not exhaust reserves during settlement.
- **Bridge buffer formula:** BridgeReserve ≥ PeakBridgeOutflow × Latency.
- **PeakBridgeOutflow:** Measure from bridge events (e.g. lock/release or TransferInitiated volume) over a rolling window (e.g. peak hourly or daily outflow in USD or token units).
- **Latency:** Bridge settlement time (e.g. typical time from lock on source chain to release on destination, in minutes or blocks). Use historical median or P95.
- **Sizing steps:** (1) Query bridge contract events for initiated/released amounts per time window; (2) compute peak outflow; (3) measure typical settlement latency; (4) set minimum reserve = Peak × Latency, with the outflow rate and the latency expressed in the same time unit; (5) add safety factor (e.g. 1.5–2×) and document in runbook.
- **Alert when reserve is below threshold:** If BridgeReserve < PeakBridgeOutflow × Latency (or below safety threshold), trigger **Warning** alert. If reserve is falling and may breach within one settlement window, escalate to **Critical**. Integrate with existing monitoring (e.g. Prometheus + PagerDuty when monitoring stack is deployed). See [VAULT_SYSTEM_MASTER_TECHNICAL_PLAN](../../../docs/VAULT_SYSTEM_MASTER_TECHNICAL_PLAN.md) §9.
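
Sizing steps (4) and (5) and the Warning condition can be sketched as follows. The helper names are hypothetical. The sketch assumes peak outflow is expressed per minute and latency in minutes, and it covers only the Warning check; the Critical escalation depends on the reserve trend and is not modeled here.

```bash
# Hypothetical helper: minimum bridge reserve = peak outflow * latency * safety.
#   $1 = peak outflow per minute, $2 = latency in minutes,
#   $3 = safety factor (may be fractional, defaults to 1).
min_bridge_reserve() {
  awk -v peak="$1" -v latency="$2" -v safety="${3:-1}" 'BEGIN {
    printf "%.0f\n", peak * latency * safety
  }'
}

# Hypothetical helper: WARNING when the current reserve is below the floor.
#   $1 = current reserve, $2..$4 = same arguments as min_bridge_reserve.
bridge_reserve_status() {
  local reserve="$1" floor
  floor=$(min_bridge_reserve "$2" "$3" "${4:-1}")
  awk -v r="$reserve" -v f="$floor" 'BEGIN {
    if (r + 0 < f + 0) { print "WARNING" } else { print "OK" }
  }'
}

# Example: peak outflow 1000 per minute, 30-minute settlement, 1.5x safety factor.
min_bridge_reserve 1000 30 1.5
bridge_reserve_status 40000 1000 30 1.5
```
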

#### Flash Loan Containment Checklist

- Use **TWAP deviation detection** (not single-block price).
- **Ignore single-block imbalance** for stabilizer triggers.
- Require **sustained deviation for N blocks** before rebalancing.
- **Cap per-block stabilization volume** to limit flash-driven execution.
- **Target:** Flash drain recovery < 3 blocks (per Master Plan §16).
- **On-chain:** The [Stabilizer](../../contracts/bridge/trustless/integration/Stabilizer.sol) (Phase 3 + 6) implements block delay, sustained-deviation buffer, per-block volume cap, and slippage/gas checks; deploy and configure per [CONTRACT_DEPLOYMENT_RUNBOOK](../../../docs/03-deployment/CONTRACT_DEPLOYMENT_RUNBOOK.md) § Stabilizer.
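
To illustrate why the sustained-deviation and volume-cap items blunt a flash-loan spike, here is an off-chain sketch of the two checks. It is not the Stabilizer contract's logic: the function names are hypothetical, and deviations are assumed to be per-block TWAP deviations given as whole numbers (for example basis points), oldest first.

```bash
# Hypothetical helper: exit status 0 only when the most recent N deviations
# all exceed theta.
#   $1 = theta, $2 = N, remaining args = per-block deviations, oldest first.
sustained_deviation() {
  local theta="$1" n="$2" dev
  shift 2
  # Not enough history yet: never trigger.
  [ "$#" -ge "$n" ] || return 1
  # Drop older samples so only the most recent N remain.
  shift $(( $# - n ))
  for dev in "$@"; do
    [ "$dev" -gt "$theta" ] || return 1
  done
  return 0
}

# Hypothetical helper: cap the stabilization volume executed in one block.
cap_per_block_volume() {
  local wanted="$1" cap="$2"
  if [ "$wanted" -gt "$cap" ]; then echo "$cap"; else echo "$wanted"; fi
}

# A single-block spike (500) does not trigger; three consecutive blocks above 50 do.
sustained_deviation 50 3 0 0 500 || echo "spike ignored"
sustained_deviation 50 3 10 80 90 70 && echo "rebalance, capped at $(cap_per_block_volume 12000 5000)"
```
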

---

## 2. Weekly Operations

### 2.1 Reserve Attestation

#### Weekly Reserve Report
1. **Collect Custodial Balances**
   - USDW: Check USD custodial account
   - EURW: Check EUR custodial account
   - GBPW: Check GBP custodial account

2. **Submit Oracle Reports**
   ```solidity
   reserveOracle.submitReserveReport(
       tokenAddress,
       reserveBalance,
       block.timestamp
   );
   ```

3. **Verify Consensus**
   - Ensure quorum is met
   - Verify consensus matches custodial balance

4. **Publish Proof-of-Reserves**
   - Generate Merkle tree of reserves
   - Publish on-chain hash
   - Update public dashboard

---

### 2.2 System Health Review

#### Review Metrics
- Total vaults created
- Total collateral locked
- Total debt issued
- W token supply per currency
- Reserve ratios
- Bridge operations count
- Success rates

#### Generate Report
- Weekly operations report
- Reserve attestation report
- Compliance status report

---

## 3. Monthly Operations

### 3.1 Security Review

#### Access Control Audit
1. Review all role assignments
2. Verify principle of least privilege
3. Check for unused roles
4. Review multi-sig configurations

#### Compliance Audit
1. Verify money multiplier = 1.0 (all W tokens)
2. Verify GRU isolation (no GRU conversions)
3. Verify ISO-4217 compliance
4. Review reserve attestations

#### Code Review
1. Review recent changes
2. Check for security updates
3. Review dependency updates
4. Verify test coverage

---

### 3.2 Performance Review

#### Gas Optimization
- Review gas usage trends
- Identify optimization opportunities
- Test optimization proposals

#### System Performance
- Review transaction throughput
- Check oracle update frequency
- Review bridge settlement times
- Analyze user patterns

---

## 4. Emergency Procedures

### 4.1 Reserve Shortfall (W Tokens)

#### Symptoms
- Reserve < Supply for any W token
- Money multiplier > 1.0 (supply exceeds verified reserve)
- Reserve verification fails

#### Immediate Actions
1. **Halt Minting**
   ```solidity
   // Disable mint controller
   mintController.revokeRole(keccak256("MINTER_ROLE"), minterAddress);
   ```

2. **Alert Team**
   - Notify operations team
   - Notify compliance team
   - Prepare public statement

3. **Investigate**
   - Check custodial account balance
   - Verify oracle reports
   - Check for accounting errors

4. **Remediation**
   - If accounting error: Correct and resume
   - If actual shortfall: Add reserves or halt operations
   - If oracle issue: Fix oracle and resume

#### Recovery Steps
1. Verify reserve restored
2. Re-enable minting
3. Resume normal operations
4. Post-mortem review

---

### 4.2 Vault Liquidation Event

#### Symptoms
- Vault health ratio < 110%
- Liquidation triggered

#### Immediate Actions
1. **Verify Liquidation**
   ```bash
   cast call $LIQUIDATION_ADDRESS "canLiquidate(address)" $VAULT_ADDRESS --rpc-url $RPC_URL
   ```

2. **Monitor Liquidation**
   - Track liquidation events
   - Verify collateral seized
   - Verify debt repaid

3. **Post-Liquidation**
   - Check remaining vault health
   - Verify system stability
   - Notify vault owner

---

### 4.3 Bridge Failure

#### Symptoms
- Bridge transaction fails
- Settlement timeout
- Reserve verification fails on bridge

#### Immediate Actions
1. **Check Bridge Status**
   ```bash
   cast call $BRIDGE_REGISTRY "destinations(uint256)" $CHAIN_ID --rpc-url $RPC_URL
   ```

2. **Investigate Failure**
   - Check transaction logs
   - Verify destination chain status
   - Check reserve verification

3. **Initiate Refund** (if timeout)
   ```solidity
   bridgeEscrowVault.initiateRefund(refundRequest, hsmSigner);
   bridgeEscrowVault.executeRefund(transferId);
   ```

4. **Resume Operations**
   - Fix underlying issue
   - Re-enable bridge route
   - Resume normal operations

---

### 4.4 Oracle Failure

#### Symptoms
- Oracle staleness detected
- Quorum not met
- Price feed failure

#### Immediate Actions
1. **Check Oracle Status**
   ```bash
   cast call $XAU_ORACLE "isFrozen()" --rpc-url $RPC_URL
   cast call $RESERVE_ORACLE "isQuorumMet(address)" $TOKEN_ADDRESS --rpc-url $RPC_URL
   ```

2. **Freeze System** (if critical)
   ```solidity
   xauOracle.freeze();
   // Pause vault operations if needed
   ```

3. **Fix Oracle**
   - Add new oracle feeds
   - Remove stale reports
   - Restore quorum

4. **Resume Operations**
   ```solidity
   xauOracle.unfreeze();
   ```

---

### 4.5 Compliance Violation

#### Symptoms
- Money multiplier > 1.0 detected
- GRU conversion detected
- ISO-4217 violation

#### Immediate Actions
1. **Halt Operations**
   - Pause minting
   - Pause bridging
   - Freeze affected tokens

2. **Investigate**
   - Review transaction history
   - Identify violation source
   - Check compliance guard logs

3. **Remediation**
   - Fix violation
   - Restore compliance
   - Resume operations

4. **Post-Mortem**
   - Document violation
   - Update compliance rules
   - Prevent recurrence

---

## 5. Incident Response

### 5.1 Incident Classification

#### Severity Levels

**CRITICAL (P0)**:
- Reserve < Supply (money multiplier violation)
- System compromise
- Complete system failure

**HIGH (P1)**:
- Reserve ratio < 105%
- Bridge failures > 10%
- Oracle quorum failure

**MEDIUM (P2)**:
- Reserve ratio < 110%
- Bridge failures 5-10%
- Single oracle failure

**LOW (P3)**:
- Minor performance issues
- Non-critical alerts
- Documentation updates
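
The numeric triggers in the severity table can be applied mechanically during triage. The sketch below is a hypothetical helper covering only the reserve-ratio and bridge-failure criteria, with both inputs as whole-number percentages. Non-numeric triggers such as system compromise or oracle quorum failure still need a human decision, and anything matching no numeric trigger falls through to P3.

```bash
# Hypothetical helper: map reserve ratio and bridge failure rate to a severity.
#   $1 = reserve ratio in percent, $2 = bridge failure rate in percent.
classify_severity() {
  local reserve_ratio_pct="$1" bridge_failure_pct="$2"
  if [ "$reserve_ratio_pct" -lt 100 ]; then
    echo "P0"   # reserve < supply
  elif [ "$reserve_ratio_pct" -lt 105 ] || [ "$bridge_failure_pct" -gt 10 ]; then
    echo "P1"
  elif [ "$reserve_ratio_pct" -lt 110 ] || [ "$bridge_failure_pct" -ge 5 ]; then
    echo "P2"
  else
    echo "P3"   # no numeric trigger matched
  fi
}

# Example: healthy reserves but 7% of bridge transfers failing.
classify_severity 120 7
```
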

---

### 5.2 Incident Response Process

#### Step 1: Detection
- Monitor alerts
- Review logs
- User reports

#### Step 2: Assessment
- Classify severity
- Assess impact
- Identify root cause

#### Step 3: Containment
- Apply emergency procedures
- Halt affected operations
- Isolate issue

#### Step 4: Resolution
- Fix root cause
- Restore operations
- Verify fix

#### Step 5: Post-Mortem
- Document incident
- Identify improvements
- Update procedures

---

## 6. Backup & Recovery

### 6.1 Backup Procedures

#### Daily Backups
- Contract state snapshots
- Configuration backups
- Access control backups

#### Weekly Backups
- Complete system state
- Oracle configuration
- Compliance rules

#### Monthly Backups
- Full system archive
- Historical data
- Audit logs

---

### 6.2 Recovery Procedures

#### Contract State Recovery
1. Identify backup point
2. Restore contract state
3. Verify restoration
4. Resume operations

#### Configuration Recovery
1. Restore configuration files
2. Verify settings
3. Test functionality
4. Resume operations

---

## 7. Monitoring Setup

### 7.1 Key Metrics

#### Vault System Metrics
- Total vaults
- Total collateral (by asset)
- Total debt (by currency)
- Average health ratio
- Liquidation events

#### W Token Metrics
- Supply per token (USDW, EURW, etc.)
- Reserve balance per token
- Reserve ratio per token
- Mint/burn events
- Redemption events

#### Bridge Metrics
- Bridge success rate
- Average settlement time
- Reserve verification success rate
- Compliance check success rate
- Transfer volume

---

### 7.2 Alert Configuration

#### Critical Alerts
```yaml
- name: Reserve Shortfall
  condition: reserveRatio < 100%
  action: halt_minting

- name: Money Multiplier Violation
  condition: reserve < supply
  action: emergency_pause

- name: Bridge Failure Rate High
  condition: successRate < 90%
  action: alert_team
```

#### Warning Alerts
```yaml
- name: Reserve Ratio Low
  condition: reserveRatio < 105%
  action: alert_team

- name: Vault Health Low
  condition: healthRatio < 120%
  action: alert_team

- name: Oracle Staleness
  condition: reportAge > 1hour
  action: alert_team
```

---

## 8. Operational Checklists

### 8.1 Daily Checklist

- [ ] Check all reserve ratios (W tokens)
- [ ] Verify oracle quorum status
- [ ] Check vault health ratios
- [ ] Review bridge success rates
- [ ] Check for critical alerts
- [ ] Review error logs

### 8.2 Weekly Checklist

- [ ] Submit reserve attestations
- [ ] Review system metrics
- [ ] Check access control roles
- [ ] Review compliance status
- [ ] Generate weekly report
- [ ] Update documentation

### 8.3 Monthly Checklist

- [ ] Security review
- [ ] Compliance audit
- [ ] Performance review
- [ ] Backup verification
- [ ] Update procedures
- [ ] Team training

---

## 9. Contact Information

### Emergency Contacts
- **Operations Team**: [Contact Info]
- **Security Team**: [Contact Info]
- **Compliance Team**: [Contact Info]
- **On-Call Engineer**: [Contact Info]

### Escalation Path
1. Operations Team (First Response)
2. Security Team (Security Issues)
3. Compliance Team (Compliance Issues)
4. Management (Critical Issues)

---

**Last Updated**: Operations Runbook Complete