smom-dbis-138/docs/OPERATIONS_RUNBOOK.md
2026-03-02 12:14:09 -08:00


Operations Runbook - Complete System

Date: Operations Runbook
Status: COMPLETE


Overview

This runbook provides operational procedures for:

  1. Vault System Operations
  2. ISO-4217 W Token System Operations
  3. Bridge System Operations
  4. Emergency Procedures

1. Daily Operations

1.1 Vault System Monitoring

Health Check

# Check vault health ratios
cast call $LEDGER_ADDRESS "getVaultHealth(address)" $VAULT_ADDRESS --rpc-url $RPC_URL

# Check total collateral
cast call $LEDGER_ADDRESS "totalCollateral(address)" $ASSET_ADDRESS --rpc-url $RPC_URL

# Check total debt
cast call $LEDGER_ADDRESS "totalDebt(address)" $CURRENCY_ADDRESS --rpc-url $RPC_URL

Alert Thresholds

  • Health Ratio < 120%: Warning alert
  • Health Ratio < 110%: Critical alert (liquidation threshold)
  • Debt Ceiling > 90%: Warning alert
  • Oracle Staleness > 1 hour: Critical alert
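
The thresholds above can be turned into a small check that a monitoring job runs after reading the on-chain values. This is a minimal sketch, not the project's monitoring code: the function name, the argument names, and the assumption that the ratios have already been converted to percent (and oracle age to seconds) are all illustrative.

```python
from typing import List, Tuple


def vault_alerts(health_ratio_pct: float,
                 debt_ceiling_used_pct: float,
                 oracle_age_seconds: float) -> List[Tuple[str, str]]:
    """Return (level, message) pairs for every breached threshold."""
    alerts = []
    # Health ratio: below 110% is critical, otherwise below 120% is a warning.
    if health_ratio_pct < 110:
        alerts.append(("critical", "health ratio below 110% (liquidation threshold)"))
    elif health_ratio_pct < 120:
        alerts.append(("warning", "health ratio below 120%"))
    # Debt ceiling utilisation above 90% is a warning.
    if debt_ceiling_used_pct > 90:
        alerts.append(("warning", "debt ceiling utilisation above 90%"))
    # Oracle data older than one hour (3600 s) is critical.
    if oracle_age_seconds > 3600:
        alerts.append(("critical", "oracle data older than 1 hour"))
    return alerts
```

An empty list means no vault threshold is currently breached.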

1.2 ISO-4217 W Token Monitoring

Reserve Verification

# Check reserve sufficiency for USDW
cast call $USDW_ADDRESS "isReserveSufficient()" --rpc-url $RPC_URL

# Get reserve balance
cast call $USDW_ADDRESS "verifiedReserve()" --rpc-url $RPC_URL

# Get total supply
cast call $USDW_ADDRESS "totalSupply()" --rpc-url $RPC_URL

# Calculate reserve ratio
# Reserve Ratio = (verifiedReserve / totalSupply) * 100

Daily Reserve Check

  1. Check Reserve Oracle Reports

    cast call $RESERVE_ORACLE "getVerifiedReserve(address)" $USDW_ADDRESS --rpc-url $RPC_URL
    
  2. Verify Quorum

    cast call $RESERVE_ORACLE "isQuorumMet(address)" $USDW_ADDRESS --rpc-url $RPC_URL
    
  3. Check for Stale Reports

    • Reports older than 1 hour should be removed
    • If quorum not met, investigate oracle issues

Alert Thresholds

  • Reserve Ratio < 100%: CRITICAL - Minting must halt
  • Reserve Ratio < 105%: Warning alert
  • Oracle Quorum Not Met: Critical alert
  • Stale Reports Detected: Warning alert
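
The reserve ratio formula and the two ratio thresholds above can be sketched as follows. The function names are hypothetical, and the inputs are assumed to be the raw integer values returned by `verifiedReserve()` and `totalSupply()` in the same units. Integer comparisons are used for the thresholds so that rounding cannot move a value across a boundary.

```python
def reserve_ratio_pct(verified_reserve: int, total_supply: int) -> float:
    """Reserve Ratio = (verifiedReserve / totalSupply) * 100; supply must be non-zero."""
    return 100 * verified_reserve / total_supply


def reserve_status(verified_reserve: int, total_supply: int) -> str:
    """Classify the reserve ratio against the 100% and 105% thresholds."""
    if total_supply == 0:
        return "ok"  # nothing outstanding to back
    if verified_reserve < total_supply:
        return "critical"  # ratio below 100%: minting must halt
    if verified_reserve * 100 < total_supply * 105:
        return "warning"  # ratio below 105%
    return "ok"
```

For example, a reserve of 1,020,000 against a supply of 1,000,000 gives a ratio of 102% and a `"warning"` status.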

1.3 Bridge System Monitoring

Bridge Health Metrics

# Check bridge success rate
# Query bridge events for success/failure counts

# Check settlement times
# Monitor TransferStatusUpdated events

# Check reserve verification failures
# Monitor ReserveVerified events with sufficient=false

Alert Thresholds

  • Success Rate < 95%: Warning alert
  • Success Rate < 90%: Critical alert
  • Settlement Time > 1 hour: Warning alert
  • Reserve Verification Failures: Critical alert
  • Compliance Violations: Critical alert
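
Once success and failure counts have been collected from bridge events, the success-rate thresholds above reduce to a short calculation. The sketch below assumes the counts are already available as integers; how they are queried is not shown, and the function names are illustrative.

```python
from typing import Optional


def bridge_success_rate_pct(succeeded: int, failed: int) -> Optional[float]:
    """Success rate in percent, or None when no transfers were observed."""
    total = succeeded + failed
    if total == 0:
        return None
    return 100 * succeeded / total


def bridge_status(succeeded: int, failed: int) -> str:
    """Classify the success rate against the 90% and 95% thresholds."""
    rate = bridge_success_rate_pct(succeeded, failed)
    if rate is None:
        return "ok"  # no traffic in the window, nothing to judge
    if rate < 90:
        return "critical"
    if rate < 95:
        return "warning"
    return "ok"
```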

1.4 Reserve and Stabilization Policies (VAULT_SYSTEM_MASTER_TECHNICAL_PLAN)

The following formulas and checklists are from VAULT_SYSTEM_MASTER_TECHNICAL_PLAN. Use them for sizing and operational verification.

Reserve Sizing Model

  • Variables: PeakMinuteOutflow = P, StabilizationWindow = T (minutes).
  • Required reserve: Reserve ≥ P × T.
  • Recommended safety factor: 3-5× peak minute outflow.
  • Example: P = 10,000, T = 5 min → Reserve ≥ 50,000; with 3× safety → 150,000.
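
The sizing rule can be checked with a few lines of arithmetic. This sketch reproduces the example above; the function and parameter names are illustrative and do not come from the Master Plan.

```python
def required_reserve(peak_minute_outflow: float,
                     window_minutes: float,
                     safety_factor: float = 1) -> float:
    """Reserve >= P x T, optionally scaled by a safety factor."""
    return peak_minute_outflow * window_minutes * safety_factor


# Worked example from the text: P = 10,000 per minute, T = 5 minutes.
baseline = required_reserve(10_000, 5)        # 50,000
with_safety = required_reserve(10_000, 5, 3)  # 150,000
```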

Cantilever Stabilization Model

  • Condition: s × f ≥ Δ (s = micro trade size, f = micro trade frequency, Δ = net imbalance per minute).
  • Dynamic rule: If deviation > θ, set s = k × deviation (eliminates fixed frequency dependency).
  • Use: Size and frequency of stabilization trades so throughput offsets macro flow.
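
A minimal sketch of the two rules above, assuming deviation and θ are expressed in basis points and that `k` converts basis points of deviation into trade size. These unit choices, and all names in the code, are assumptions made for illustration.

```python
def throughput_offsets_imbalance(trade_size: float,
                                 trades_per_minute: float,
                                 net_imbalance_per_minute: float) -> bool:
    """Stabilization condition: s x f >= delta."""
    return trade_size * trades_per_minute >= net_imbalance_per_minute


def dynamic_trade_size(deviation_bps: int, theta_bps: int,
                       k: int, base_size: int) -> int:
    """If deviation > theta, size the trade as s = k x deviation; else keep the base size."""
    if deviation_bps > theta_bps:
        return k * deviation_bps
    return base_size
```

With s = 100 and f = 12 trades per minute, throughput is 1,200 per minute, which offsets a net imbalance of 1,000 per minute; at f = 8 it does not.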

Bridge Liquidity Buffer

  • Rule: BridgeReserve ≥ PeakBridgeOutflow × Latency (where Latency = bridge settlement time).
  • Use: Ensure cross-chain bridge buffers satisfy this so outflows do not exhaust reserves during settlement.

Cross-chain parity and bridge buffer

  • Objective: Maintain |Price138 − Price651940| < ArbitrageThreshold (see CROSS_CHAIN_ARBITRAGE_DESIGN). Cross-chain private arbitrage bots execute when deviation exceeds threshold; bridge reserve must be sized so outflows do not exhaust reserves during settlement.
  • Bridge buffer formula: BridgeReserve ≥ PeakBridgeOutflow × Latency.
    • PeakBridgeOutflow: Measure from bridge events (e.g. lock/release or TransferInitiated volume) over a rolling window (e.g. peak hourly or daily outflow in USD or token units).
    • Latency: Bridge settlement time (e.g. typical time from lock on source chain to release on destination, in minutes or blocks). Use historical median or P95.
  • Sizing steps: (1) Query bridge contract events for initiated/released amounts per time window; (2) compute peak outflow; (3) measure typical settlement latency; (4) set minimum reserve = Peak × Latency; (5) add safety factor (e.g. 1.5-2×) and document in runbook.
  • Alert when reserve below: If BridgeReserve < PeakBridgeOutflow × Latency (or below safety threshold), trigger Warning alert. If reserve is falling and may breach within one settlement window, escalate to Critical. Integrate with existing monitoring (e.g. Prometheus + PagerDuty when monitoring stack is deployed). See VAULT_SYSTEM_MASTER_TECHNICAL_PLAN §9.
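
The sizing steps and the alert rule can be sketched as below. Several things here are assumptions: outflow samples are per-minute totals, latency samples are in minutes, P95 is taken with the nearest-rank method, the safety factor defaults to 1.5, and "may breach within one settlement window" is read as "the reserve projected one latency period ahead falls below the floor". Event querying is out of scope, and all names are hypothetical.

```python
import math
from typing import Sequence, Tuple


def percentile_nearest_rank(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile of a non-empty sample."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def bridge_reserve_floor(outflow_per_minute: Sequence[float],
                         latencies_minutes: Sequence[float],
                         safety_factor: float = 1.5) -> Tuple[float, float]:
    """Return (minimum, target): Peak x Latency, and the same with a safety factor."""
    peak = max(outflow_per_minute)
    latency = percentile_nearest_rank(latencies_minutes, 95)
    minimum = peak * latency
    return minimum, minimum * safety_factor


def bridge_reserve_alert(reserve: float, minimum: float,
                         drain_per_minute: float, latency_minutes: float) -> str:
    """Critical if the projected reserve breaches the floor; warning if already below it."""
    projected = reserve - drain_per_minute * latency_minutes
    if drain_per_minute > 0 and projected < minimum:
        return "critical"
    if reserve < minimum:
        return "warning"
    return "ok"
```

Using the median instead of P95 for latency gives a smaller floor; P95 is the more conservative of the two options the text allows.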

Flash Loan Containment Checklist

  • Use TWAP deviation detection (not single-block price).
  • Ignore single-block imbalance for stabilizer triggers.
  • Require sustained deviation for N blocks before rebalancing.
  • Cap per-block stabilization volume to limit flash-driven execution.
  • Target: Flash drain recovery <3 blocks (per Master Plan §16).
  • On-chain: The Stabilizer (Phase 3 + 6) implements block delay, sustained-deviation buffer, per-block volume cap, and slippage/gas checks; deploy and configure per CONTRACT_DEPLOYMENT_RUNBOOK § Stabilizer.
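
To illustrate the checklist off-chain, the sketch below models the sustained-deviation requirement and the per-block volume cap. It is not the Stabilizer contract's logic; the class name, the basis-point units, and the parameters are hypothetical, and the TWAP deviation is assumed to be computed elsewhere and passed in once per block.

```python
class SustainedDeviationGate:
    """Allow rebalancing only after N consecutive blocks of TWAP deviation."""

    def __init__(self, threshold_bps: int, required_blocks: int,
                 max_volume_per_block: int) -> None:
        self.threshold_bps = threshold_bps
        self.required_blocks = required_blocks
        self.max_volume_per_block = max_volume_per_block
        self.streak = 0  # consecutive blocks above the threshold

    def observe(self, twap_deviation_bps: int) -> bool:
        """Record one block; return True once the deviation has been sustained."""
        if abs(twap_deviation_bps) > self.threshold_bps:
            self.streak += 1
        else:
            self.streak = 0  # a single calm block resets the count
        return self.streak >= self.required_blocks

    def capped_volume(self, requested_volume: int) -> int:
        """Limit stabilization volume executed in any one block."""
        return min(requested_volume, self.max_volume_per_block)
```

Because the streak resets on any block at or below the threshold, a one-block spike, which is the typical flash-loan signature, never triggers rebalancing on its own.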

2. Weekly Operations

2.1 Reserve Attestation

Weekly Reserve Report

  1. Collect Custodial Balances

    • USDW: Check USD custodial account
    • EURW: Check EUR custodial account
    • GBPW: Check GBP custodial account
  2. Submit Oracle Reports

    reserveOracle.submitReserveReport(
        tokenAddress,
        reserveBalance,
        block.timestamp
    );
    
  3. Verify Consensus

    • Ensure quorum is met
    • Verify consensus matches custodial balance
  4. Publish Proof-of-Reserves

    • Generate Merkle tree of reserves
    • Publish on-chain hash
    • Update public dashboard
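
Generating the Merkle tree of reserves in step 4 can be sketched as follows. This example uses SHA-256 from the Python standard library purely for illustration; an on-chain verifier would more likely expect keccak256 and a specific leaf encoding, neither of which this runbook specifies. The leaf format shown in the usage note is hypothetical.

```python
import hashlib
from typing import List


def _hash(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(leaves: List[bytes]) -> bytes:
    """Compute a Merkle root, duplicating the last node on odd-sized levels."""
    if not leaves:
        raise ValueError("at least one leaf is required")
    level = [_hash(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # pad odd levels with the last node
        level = [_hash(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]
```

A call such as `merkle_root([b"USDW:1000000", b"EURW:750000", b"GBPW:500000"]).hex()` yields the hash that would be published; the leaf order must be fixed and documented so that third parties can reproduce it.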

2.2 System Health Review

Review Metrics

  • Total vaults created
  • Total collateral locked
  • Total debt issued
  • W token supply per currency
  • Reserve ratios
  • Bridge operations count
  • Success rates

Generate Report

  • Weekly operations report
  • Reserve attestation report
  • Compliance status report

3. Monthly Operations

3.1 Security Review

Access Control Audit

  1. Review all role assignments
  2. Verify principle of least privilege
  3. Check for unused roles
  4. Review multi-sig configurations

Compliance Audit

  1. Verify money multiplier = 1.0 (all W tokens)
  2. Verify GRU isolation (no GRU conversions)
  3. Verify ISO-4217 compliance
  4. Review reserve attestations

Code Review

  1. Review recent changes
  2. Check for security updates
  3. Review dependency updates
  4. Verify test coverage

3.2 Performance Review

Gas Optimization

  • Review gas usage trends
  • Identify optimization opportunities
  • Test optimization proposals

System Performance

  • Review transaction throughput
  • Check oracle update frequency
  • Review bridge settlement times
  • Analyze user patterns

4. Emergency Procedures

4.1 Reserve Shortfall (W Tokens)

Symptoms

  • Reserve < Supply for any W token
  • Money multiplier > 1.0 (supply exceeds verified reserve)
  • Reserve verification fails

Immediate Actions

  1. Halt Minting

    // Disable mint controller
    mintController.revokeRole(keccak256("MINTER_ROLE"), minterAddress);
    
  2. Alert Team

    • Notify operations team
    • Notify compliance team
    • Prepare public statement
  3. Investigate

    • Check custodial account balance
    • Verify oracle reports
    • Check for accounting errors
  4. Remediation

    • If accounting error: Correct and resume
    • If actual shortfall: Add reserves or halt operations
    • If oracle issue: Fix oracle and resume

Recovery Steps

  1. Verify reserve restored
  2. Re-enable minting
  3. Resume normal operations
  4. Post-mortem review

4.2 Vault Liquidation Event

Symptoms

  • Vault health ratio < 110%
  • Liquidation triggered

Immediate Actions

  1. Verify Liquidation

    cast call $LIQUIDATION_ADDRESS "canLiquidate(address)" $VAULT_ADDRESS --rpc-url $RPC_URL
    
  2. Monitor Liquidation

    • Track liquidation events
    • Verify collateral seized
    • Verify debt repaid
  3. Post-Liquidation

    • Check remaining vault health
    • Verify system stability
    • Notify vault owner

4.3 Bridge Failure

Symptoms

  • Bridge transaction fails
  • Settlement timeout
  • Reserve verification fails on bridge

Immediate Actions

  1. Check Bridge Status

    cast call $BRIDGE_REGISTRY "destinations(uint256)" $CHAIN_ID --rpc-url $RPC_URL
    
  2. Investigate Failure

    • Check transaction logs
    • Verify destination chain status
    • Check reserve verification
  3. Initiate Refund (if timeout)

    bridgeEscrowVault.initiateRefund(refundRequest, hsmSigner);
    bridgeEscrowVault.executeRefund(transferId);
    
  4. Resume Operations

    • Fix underlying issue
    • Re-enable bridge route
    • Resume normal operations

4.4 Oracle Failure

Symptoms

  • Oracle staleness detected
  • Quorum not met
  • Price feed failure

Immediate Actions

  1. Check Oracle Status

    cast call $XAU_ORACLE "isFrozen()" --rpc-url $RPC_URL
    cast call $RESERVE_ORACLE "isQuorumMet(address)" $TOKEN_ADDRESS --rpc-url $RPC_URL
    
  2. Freeze System (if critical)

    xauOracle.freeze();
    // Pause vault operations if needed
    
  3. Fix Oracle

    • Add new oracle feeds
    • Remove stale reports
    • Restore quorum
  4. Resume Operations

    xauOracle.unfreeze();
    

4.5 Compliance Violation

Symptoms

  • Money multiplier > 1.0 detected
  • GRU conversion detected
  • ISO-4217 violation

Immediate Actions

  1. Halt Operations

    • Pause minting
    • Pause bridging
    • Freeze affected tokens
  2. Investigate

    • Review transaction history
    • Identify violation source
    • Check compliance guard logs
  3. Remediation

    • Fix violation
    • Restore compliance
    • Resume operations
  4. Post-Mortem

    • Document violation
    • Update compliance rules
    • Prevent recurrence

5. Incident Response

5.1 Incident Classification

Severity Levels

CRITICAL (P0):

  • Reserve < Supply (money multiplier violation)
  • System compromise
  • Complete system failure

HIGH (P1):

  • Reserve ratio < 105%
  • Bridge failures > 10%
  • Oracle quorum failure

MEDIUM (P2):

  • Reserve ratio < 110%
  • Bridge failures 5-10%
  • Single oracle failure

LOW (P3):

  • Minor performance issues
  • Non-critical alerts
  • Documentation updates
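
The measurable criteria above can be expressed as a first-pass triage function. This is a sketch under assumptions: inputs are already in percent, the "5-10%" band is treated as inclusive, and the function returns P3 when none of the higher criteria match. The function and parameter names are illustrative.

```python
def classify_incident(reserve_ratio_pct: float,
                      bridge_failure_pct: float,
                      quorum_met: bool,
                      failed_oracles: int = 0,
                      compromised: bool = False) -> str:
    """Return the highest severity level whose criteria are met."""
    # P0: reserve below supply, or a system compromise.
    if compromised or reserve_ratio_pct < 100:
        return "P0"
    # P1: reserve ratio below 105%, bridge failures above 10%, or quorum lost.
    if reserve_ratio_pct < 105 or bridge_failure_pct > 10 or not quorum_met:
        return "P1"
    # P2: reserve ratio below 110%, bridge failures 5-10%, or a single oracle down.
    if reserve_ratio_pct < 110 or bridge_failure_pct >= 5 or failed_oracles >= 1:
        return "P2"
    return "P3"
```

A human should still confirm the level, since criteria such as "system compromise" or "complete system failure" cannot be derived from metrics alone.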

5.2 Incident Response Process

Step 1: Detection

  • Monitor alerts
  • Review logs
  • User reports

Step 2: Assessment

  • Classify severity
  • Assess impact
  • Identify root cause

Step 3: Containment

  • Apply emergency procedures
  • Halt affected operations
  • Isolate issue

Step 4: Resolution

  • Fix root cause
  • Restore operations
  • Verify fix

Step 5: Post-Mortem

  • Document incident
  • Identify improvements
  • Update procedures

6. Backup & Recovery

6.1 Backup Procedures

Daily Backups

  • Contract state snapshots
  • Configuration backups
  • Access control backups

Weekly Backups

  • Complete system state
  • Oracle configuration
  • Compliance rules

Monthly Backups

  • Full system archive
  • Historical data
  • Audit logs

6.2 Recovery Procedures

Contract State Recovery

  1. Identify backup point
  2. Restore contract state
  3. Verify restoration
  4. Resume operations

Configuration Recovery

  1. Restore configuration files
  2. Verify settings
  3. Test functionality
  4. Resume operations

7. Monitoring Setup

7.1 Key Metrics

Vault System Metrics

  • Total vaults
  • Total collateral (by asset)
  • Total debt (by currency)
  • Average health ratio
  • Liquidation events

W Token Metrics

  • Supply per token (USDW, EURW, etc.)
  • Reserve balance per token
  • Reserve ratio per token
  • Mint/burn events
  • Redemption events

Bridge Metrics

  • Bridge success rate
  • Average settlement time
  • Reserve verification success rate
  • Compliance check success rate
  • Transfer volume

7.2 Alert Configuration

Critical Alerts

- name: Reserve Shortfall
  condition: reserveRatio < 100%
  action: halt_minting
  
- name: Money Multiplier Violation
  condition: reserve < supply
  action: emergency_pause
  
- name: Bridge Failure Rate High
  condition: successRate < 90%
  action: alert_team

Warning Alerts

- name: Reserve Ratio Low
  condition: reserveRatio < 105%
  action: alert_team
  
- name: Vault Health Low
  condition: healthRatio < 120%
  action: alert_team
  
- name: Oracle Staleness
  condition: reportAge > 1hour
  action: alert_team
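
How rules of this shape might be evaluated is sketched below. It is not tied to any particular alerting stack: the rule tuples mirror the single-metric conditions above, the metric keys and action strings are taken from the configuration, and `reportAge` is assumed to be in seconds. The two-metric rule (`reserve < supply`) is omitted because it is equivalent to `reserveRatio < 100%`.

```python
import operator
from typing import Dict, List, Tuple

# (name, metric, comparison, threshold, action)
RULES = [
    ("Reserve Shortfall", "reserveRatio", operator.lt, 100, "halt_minting"),
    ("Bridge Failure Rate High", "successRate", operator.lt, 90, "alert_team"),
    ("Reserve Ratio Low", "reserveRatio", operator.lt, 105, "alert_team"),
    ("Vault Health Low", "healthRatio", operator.lt, 120, "alert_team"),
    ("Oracle Staleness", "reportAge", operator.gt, 3600, "alert_team"),
]


def fired_alerts(metrics: Dict[str, float]) -> List[Tuple[str, str]]:
    """Return (rule name, action) for every rule whose condition holds."""
    return [
        (name, action)
        for name, metric, compare, threshold, action in RULES
        if metric in metrics and compare(metrics[metric], threshold)
    ]
```

Metrics missing from the input are skipped rather than treated as breaches, so a separate check for absent data is advisable.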

8. Operational Checklists

8.1 Daily Checklist

  • Check all reserve ratios (W tokens)
  • Verify oracle quorum status
  • Check vault health ratios
  • Review bridge success rates
  • Check for critical alerts
  • Review error logs

8.2 Weekly Checklist

  • Submit reserve attestations
  • Review system metrics
  • Check access control roles
  • Review compliance status
  • Generate weekly report
  • Update documentation

8.3 Monthly Checklist

  • Security review
  • Compliance audit
  • Performance review
  • Backup verification
  • Update procedures
  • Team training

9. Contact Information

Emergency Contacts

  • Operations Team: [Contact Info]
  • Security Team: [Contact Info]
  • Compliance Team: [Contact Info]
  • On-Call Engineer: [Contact Info]

Escalation Path

  1. Operations Team (First Response)
  2. Security Team (Security Issues)
  3. Compliance Team (Compliance Issues)
  4. Management (Critical Issues)

Last Updated: Operations Runbook Complete