# Bridge Operations Runbook

## Table of Contents

1. [Incident Response](#incident-response)
2. [Common Operations](#common-operations)
3. [Troubleshooting](#troubleshooting)
4. [Emergency Procedures](#emergency-procedures)
5. [Monitoring Checklist](#monitoring-checklist)
6. [Contact Information](#contact-information)

## Incident Response

### High Failure Rate

**Symptoms:**
- Success rate drops below 95%
- Multiple failed transfers in a short time

**Actions:**
1. Check Prometheus metrics: `bridge_success_rate < 95`
2. Review recent transfer logs for error patterns
3. Check destination chain status (RPC availability, finality issues)
4. Verify thirdweb API status
5. Check the XRPL connection if XRPL routes are affected
6. If the issue persists for more than 10 minutes, pause the affected route:
   ```bash
   forge script script/bridge/interop/PauseDestination.s.sol \
     --rpc-url $RPC_URL \
     --private-key $ADMIN_KEY \
     --broadcast \
     --sig "run(address,uint256)" $REGISTRY_ADDRESS $CHAIN_ID
   ```

### Liquidity Failure

**Symptoms:**
- Transfers failing with "insufficient liquidity" errors
- XRPL hot wallet balance low

**Actions:**
1. Check the XRPL hot wallet balance:
   ```bash
   # The JSON body is single-quoted, so the quote is closed around
   # $XRPL_ACCOUNT to let the shell expand it.
   curl -X POST "$XRPL_SERVER" \
     -H "Content-Type: application/json" \
     -d '{"method":"account_info","params":[{"account":"'"$XRPL_ACCOUNT"'"}]}'
   ```
2. Replenish the hot wallet if the balance is below the threshold
3. Check EVM destination liquidity pools
4. If critical, pause the affected token:
   ```bash
   forge script script/bridge/interop/PauseToken.s.sol \
     --rpc-url $RPC_URL \
     --private-key $ADMIN_KEY \
     --broadcast \
     --sig "run(address,address)" $REGISTRY_ADDRESS $TOKEN_ADDRESS
   ```

### High Settlement Time

**Symptoms:**
- Average settlement time > 10 minutes
- Users reporting slow transfers

**Actions:**
1. Check destination chain finality requirements
2. Verify the FireFly workflow engine is processing transfers
3. Check Cacti connector status
4. Review route health scores
5. Consider switching to an alternative route if one is available

### Bridge Pause

**Symptoms:**
- All transfers failing
- Bridge status shows "PAUSED"

**Actions:**
1. Identify the reason for the pause (check admin logs)
2. Resolve the underlying issue
3. Unpause the bridge:
   ```bash
   forge script script/bridge/interop/UnpauseBridge.s.sol \
     --rpc-url $RPC_URL \
     --private-key $ADMIN_KEY \
     --broadcast \
     --sig "run(address)" $VAULT_ADDRESS
   ```

## Common Operations

### Add New Destination

1. Register the destination in the registry:
   ```bash
   forge script script/bridge/interop/RegisterDestination.s.sol \
     --rpc-url $RPC_URL \
     --private-key $ADMIN_KEY \
     --broadcast \
     --sig "run(address,uint256,string,uint256,uint256,uint256,address)" \
     $REGISTRY_ADDRESS \
     $CHAIN_ID \
     "Chain Name" \
     $MIN_FINALITY_BLOCKS \
     $TIMEOUT_SECONDS \
     $BASE_FEE_BPS \
     $FEE_RECIPIENT
   ```
2. Update the FireFly configuration
3. Configure the Cacti connector if needed
4. Test with a small transfer

### Add New Token

1. Register the token in the registry:
   ```bash
   forge script script/bridge/interop/RegisterToken.s.sol \
     --rpc-url $RPC_URL \
     --private-key $ADMIN_KEY \
     --broadcast \
     --sig "run(address,address,uint256,uint256,uint256[],uint8,uint256)" \
     $REGISTRY_ADDRESS \
     $TOKEN_ADDRESS \
     $MIN_AMOUNT \
     $MAX_AMOUNT \
     "[137,10,8453]" \
     $RISK_LEVEL \
     $BRIDGE_FEE_BPS
   ```
2. Verify the token contract is valid
3. Test with a small transfer

### Process Refund

1. Verify the transfer is eligible for a refund:
   ```bash
   cast call $VAULT_ADDRESS \
     "isRefundable(bytes32)" \
     $TRANSFER_ID \
     --rpc-url $RPC_URL
   ```
2. Initiate the refund (requires an HSM signature):
   ```bash
   # Generate HSM signature first
   # Then call initiateRefund with signature
   ```
3. Execute the refund:
   ```bash
   forge script script/bridge/interop/ExecuteRefund.s.sol \
     --rpc-url $RPC_URL \
     --private-key $REFUND_OPERATOR_KEY \
     --broadcast \
     --sig "run(address,bytes32)" $VAULT_ADDRESS $TRANSFER_ID
   ```

### Update Route Health

After a successful or failed transfer, update route health:

```bash
curl -X POST $API_URL/api/admin/update-route-health \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "chainId": 137,
    "token": "0x...",
    "success": true,
    "settlementTime": 300
  }'
```

## Troubleshooting

### Transfer Stuck in EXECUTING

**Check:**
1. FireFly workflow status
2. Cacti connector logs
3. Destination chain transaction status

**Resolution:**
- If the destination tx is confirmed, update the status manually
- If the destination tx failed, mark the transfer as FAILED and initiate a refund

### HSM Signing Fails

**Check:**
1. HSM service health: `curl $HSM_ENDPOINT/health`
2. HSM API key validity
3. Key ID exists and is accessible

**Resolution:**
- Restart the HSM service if needed
- Verify the HSM key configuration
- Check HSM logs for errors

### XRPL Connection Issues

**Check:**
1. XRPL server connectivity: `ping xrpl-server`
2. XRPL account balance
3. XRPL network status

**Resolution:**
- Switch to a backup XRPL server if available
- Verify XRPL account credentials
- Check the XRPL network status page

### FireFly Not Processing

**Check:**
1. FireFly service status: `kubectl get pods -n firefly`
2. FireFly logs: `kubectl logs -f firefly-core -n firefly`
3. Database connectivity

**Resolution:**
- Restart the FireFly service if needed
- Check the database connection
- Verify the FireFly configuration

## Emergency Procedures

### Global Pause

If a critical security issue is detected:

```bash
# Pause all contracts
forge script script/bridge/interop/EmergencyPause.s.sol \
  --rpc-url $RPC_URL \
  --private-key $ADMIN_KEY \
  --broadcast
```

### Key Rotation

If the HSM key is compromised:

1. Generate a new HSM key
2. Update the HSM signer address in contracts:
   ```bash
   forge script script/bridge/interop/UpdateHSMSigner.s.sol \
     --rpc-url $RPC_URL \
     --private-key $ADMIN_KEY \
     --broadcast \
     --sig "run(address,address)" $CONTROLLER_ADDRESS $NEW_HSM_SIGNER
   ```
3. Revoke old key access
4. Test with a small operation

### Disaster Recovery

If bridge infrastructure fails:

1. **Immediate Actions:**
   - Pause all bridge operations
   - Notify users via the status page
   - Assess damage scope
2. **Recovery Steps:**
   - Restore from backups
   - Redeploy infrastructure
   - Verify contract states
   - Test with small transfers
   - Gradually resume operations
3. **Post-Incident:**
   - Document the incident
   - Review logs and metrics
   - Update runbooks
   - Conduct a post-mortem

## Monitoring Checklist

Daily:

- [ ] Review success rate metrics
- [ ] Check for failed transfers
- [ ] Verify XRPL hot wallet balance
- [ ] Review alert notifications

Weekly:

- [ ] Review route health scores
- [ ] Analyze settlement time trends
- [ ] Check HSM service health
- [ ] Review proof-of-reserves

Monthly:

- [ ] Security audit review
- [ ] Update documentation
- [ ] Review and update runbooks
- [ ] Capacity planning review

## Contact Information

- **On-Call Engineer:** oncall@chain138.example.com
- **Security Team:** security@chain138.example.com
- **DevOps:** devops@chain138.example.com

**Emergency Escalation:**
1. Page the on-call engineer
2. If there is no response within 15 minutes, escalate to the team lead
3. For security incidents, contact the security team immediately
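
The daily XRPL hot wallet balance check from the monitoring checklist could be scripted roughly as follows. This is a minimal sketch, not the team's actual tooling: the function names and the `MIN_BALANCE_XRP` variable are hypothetical, and it assumes the `account_info` response reports the balance in drops (1 XRP = 1,000,000 drops) under `result.account_data.Balance`. The balance is pulled out with `sed` to avoid extra dependencies; a JSON parser such as `jq` would be more robust.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for the daily hot wallet check (names are illustrative).

# Succeeds (exit 0) when the balance in drops is below the threshold in XRP.
balance_below_threshold() {
  local balance_drops="$1" threshold_xrp="$2"
  (( balance_drops < threshold_xrp * 1000000 ))
}

# Reads an account_info JSON response on stdin and prints the Balance in drops.
extract_balance_drops() {
  sed -n 's/.*"Balance"[[:space:]]*:[[:space:]]*"\([0-9]*\)".*/\1/p' | head -n 1
}

# Queries $XRPL_SERVER for $XRPL_ACCOUNT and compares against $MIN_BALANCE_XRP
# (an assumed variable, in whole XRP).
# Returns 0 if OK, 1 if the balance is low, 2 if the balance could not be read.
check_hot_wallet() {
  local response balance
  response=$(curl -sS -X POST "$XRPL_SERVER" \
    -H "Content-Type: application/json" \
    -d '{"method":"account_info","params":[{"account":"'"$XRPL_ACCOUNT"'"}]}') || return 2
  balance=$(printf '%s' "$response" | extract_balance_drops)
  if [ -z "$balance" ]; then
    echo "could not read balance from response" >&2
    return 2
  fi
  if balance_below_threshold "$balance" "$MIN_BALANCE_XRP"; then
    echo "LOW: $balance drops"
    return 1
  fi
  echo "OK: $balance drops"
}
```

After sourcing the file, a cron job or alert hook could call `check_hot_wallet` and treat a non-zero exit status as a signal to page on-call or replenish the wallet.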