# Solution: QBFT Quorum Loss - Network Stalled
**Last Updated:** 2026-01-31
**Document Version:** 1.0
**Status:** Active Documentation

---

**Date**: 2026-01-24
**Status**: 🔴 **CRITICAL - ROOT CAUSE IDENTIFIED**

---
## 🎯 Root Cause Found
**The network has stopped because we lost QBFT validator quorum.**
### The Numbers
- **Genesis configuration**: 5 validators (192.168.11.100-104)
- **Currently active**: Only 2 validators (VMIDs 1003, 1004)
- **Required for consensus**: Minimum 4 validators (ceiling(2 × 5 / 3) = 4)
- **Validators lost**: 3 out of 5 (60%)
### Why Network Stalled
From Besu QBFT documentation:
> "Configure your network to ensure you never lose more than 1/3 of your validators. If more than 1/3 of validators stop participating, the network stops creating new blocks and stalls."
**We lost 60% of validators, far exceeding the 33% threshold.**

---
## 📊 Current Network State
### Missing Validators
| IP | Status | Evidence |
|----|--------|----------|
| 192.168.11.100 | ❌ Not running | No RPC endpoint |
| 192.168.11.101 | ❌ Not running | No RPC endpoint |
| 192.168.11.102 | ❌ Not running | No RPC endpoint |
### Active Validators
| VMID | IP | Status |
|------|----|---------|
| 1003 | 192.168.11.103 | ✅ Running (stuck in sync) |
| 1004 | 192.168.11.104 | ✅ Running (stuck in sync) |
### What's Happening
1. Validators 1003 & 1004 are running but can't produce blocks
2. QBFT requires 4 out of 5 validators to reach consensus
3. With only 2 active, consensus is impossible
4. Validators are "stuck in sync" waiting for consensus
5. Network is deadlocked
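
The state above can be re-checked with a small probe. This is a minimal sketch, not a definitive diagnostic: it assumes each validator exposes HTTP JSON-RPC on port 8545 (Besu's default, not confirmed for this deployment) and treats a reachable RPC endpoint as a proxy for a running validator, the same evidence used in the tables above. The helper names (`rpc_call`, `summarize`, `probe_all`) are hypothetical.

```python
import json
import urllib.request

# The five genesis validator IPs come from this document.
VALIDATOR_IPS = [f"192.168.11.{host}" for host in range(100, 105)]
RPC_PORT = 8545  # assumption: Besu's default HTTP JSON-RPC port


def rpc_call(ip, method, params=None, timeout=3.0):
    """Return the JSON-RPC result from one node, or None if it cannot be reached."""
    body = json.dumps(
        {"jsonrpc": "2.0", "method": method, "params": params or [], "id": 1}
    ).encode()
    request = urllib.request.Request(
        f"http://{ip}:{RPC_PORT}",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return json.load(response).get("result")
    except (OSError, ValueError):
        return None


def summarize(heights):
    """heights maps validator IP -> hex block number, or None when unreachable."""
    reachable = sorted(ip for ip, height in heights.items() if height is not None)
    needed = (2 * len(heights) + 2) // 3  # integer form of ceiling(2N / 3)
    return {
        "reachable": reachable,
        "needed": needed,
        "quorum_possible": len(reachable) >= needed,
    }


def probe_all():
    """Query every validator once; run this from a host on the validator LAN."""
    return summarize({ip: rpc_call(ip, "eth_blockNumber") for ip in VALIDATOR_IPS})

# Usage (needs network access to the validators): print(probe_all())
```
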
---
## 🔧 Solution Options
### Option 1: Reduce Validator Count (RECOMMENDED - Fast)
Update genesis to only include the 2 working validators (1003, 1004).
**Pros**:
- Fast implementation
- Uses existing working validators
- Network can resume immediately
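**Cons**:
- No fault tolerance: both validators must stay online for blocks to be produced
- Less decentralized
- `extraData` is part of the genesis block, so changing it changes the genesis hash; Besu may reject the existing chain data, so confirm whether a data reset is needed before relying on this option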
**Steps**:
1. Stop validators 1003 & 1004
2. Update genesis extraData to only include validators 1003 & 1004
3. Update static-nodes.json and permissioned-nodes.json
4. Restart validators
5. Network should resume
### Option 2: Start Missing Validators (IDEAL - Slower)
Find and start validators 1000, 1001, 1002 to restore full quorum.
**Pros**:
- Maintains Byzantine fault tolerance
- Network continues as originally designed
- Can lose 1 validator and still operate
**Cons**:
- Need to locate where these validators are/were
- May need to redeploy them
- Takes more time
**Steps**:
1. Find if validators 1000-1002 exist on other Proxmox hosts
2. If not, deploy new validators with correct keys
3. Configure them with proper genesis
4. Start them
5. Network should resume once at least 4 of the 5 validators are online
---
## 🚀 Recommended Action: Option 1
Since we need to resume the network quickly for bridge operations, implement Option 1:
### Step 1: Create New Genesis ExtraData
Current extraData includes 5 validators. We need to generate new extraData with only 2:
- 192.168.11.103 (validator 1003)
- 192.168.11.104 (validator 1004)
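
As an illustration of what the new extraData contains, here is a minimal sketch that RLP-encodes it by hand. It assumes the QBFT layout described in the Besu documentation: `RLP([32-byte vanity, validator list, no vote, round 0, no seals])`. The two addresses are placeholders, not the real addresses of validators 1003 and 1004, and the function names are hypothetical. Treat the output as something to cross-check against `besu rlp encode --type=QBFT_EXTRA_DATA`, not as a replacement for it.

```python
def _length_prefix(length, offset):
    """RLP length prefix: short form up to 55 bytes, long form beyond that."""
    if length <= 55:
        return bytes([offset + length])
    length_bytes = length.to_bytes((length.bit_length() + 7) // 8, "big")
    return bytes([offset + 55 + len(length_bytes)]) + length_bytes


def rlp_encode(item):
    """Minimal RLP encoder for bytes and (nested) lists of bytes."""
    if isinstance(item, bytes):
        if len(item) == 1 and item[0] < 0x80:
            return item
        return _length_prefix(len(item), 0x80) + item
    payload = b"".join(rlp_encode(element) for element in item)
    return _length_prefix(len(payload), 0xC0) + payload


def qbft_extra_data(validator_addresses):
    """Build genesis extraData for the given validator addresses (hex strings)."""
    validators = []
    for address in validator_addresses:
        raw = bytes.fromhex(address[2:] if address.startswith("0x") else address)
        if len(raw) != 20:
            raise ValueError(f"not a 20-byte address: {address}")
        validators.append(raw)
    vanity = bytes(32)
    # Assumed layout: vanity, validators, empty vote, round 0, empty seal list.
    return "0x" + rlp_encode([vanity, validators, [], b"", []]).hex()


# Placeholder addresses only; substitute the real validator addresses.
PLACEHOLDER_VALIDATORS = ["0x" + "11" * 20, "0x" + "22" * 20]
print(qbft_extra_data(PLACEHOLDER_VALIDATORS))
```

The validator addresses are derived from each node's key, so they have to be read from the running validators rather than inferred from their IPs.
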
### Step 2: Update Static & Permissioned Nodes
Remove enodes for 192.168.11.100-102 from:
- `/etc/besu/static-nodes.json`
- `/etc/besu/permissioned-nodes.json`
Keep only:
- 192.168.11.103 (validator 1003)
- 192.168.11.104 (validator 1004)
- RPC and sentry nodes
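
The enode cleanup can be scripted. The sketch below assumes both files are plain JSON arrays of `enode://` URLs, which is the usual shape of `static-nodes.json` but is only an assumption for `permissioned-nodes.json`. The function names are hypothetical, and the result is written to a `.new` file next to the original so nothing is overwritten.

```python
import json
from urllib.parse import urlparse

# IPs of the three validators being dropped, taken from this document.
REMOVED_IPS = {"192.168.11.100", "192.168.11.101", "192.168.11.102"}


def filter_enodes(enodes, removed_ips=REMOVED_IPS):
    """Keep every enode URL whose host is not one of the removed IPs."""
    return [enode for enode in enodes if urlparse(enode).hostname not in removed_ips]


def rewrite_node_file(path):
    """Write the filtered list to '<path>.new' and return how many entries were dropped."""
    with open(path) as handle:
        enodes = json.load(handle)
    kept = filter_enodes(enodes)
    with open(path + ".new", "w") as handle:
        json.dump(kept, handle, indent=2)
    return len(enodes) - len(kept)
```

Calling `rewrite_node_file("/etc/besu/static-nodes.json")` on each validator leaves the original in place for comparison before the swap.
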
### Step 3: Restart Validators
With updated config, validators should:
- Skip full sync (already synced)
- Form quorum with 2/2 validators
- Resume block production
---
## 📝 Technical Details
### QBFT Quorum Math
```
Validators: N = 5
Byzantine Fault Tolerance: F = floor((N - 1) / 3) = floor(1.33) = 1
Quorum required for consensus: ceiling(2N / 3) = ceiling(3.33) = 4
(2F + 1 = 3 equals the quorum only when N = 3F + 1; with N = 5 the quorum is 4)
```
### Why 2 Validators Will Work
```
Validators: N = 2
Byzantine Fault Tolerance: F = floor((N - 1) / 3) = floor(0.33) = 0
Quorum required for consensus: ceiling(2N / 3) = ceiling(1.33) = 2
Both validators must be active, and both currently are.
```
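
The same arithmetic as a small sketch, useful for checking other validator counts before changing the set. The function names are hypothetical; the formulas are the ones used in the two blocks above.

```python
def bft_tolerance(total_validators):
    """Faulty validators the network can tolerate: floor((N - 1) / 3)."""
    return (total_validators - 1) // 3


def quorum(total_validators):
    """Validators that must participate: ceiling(2N / 3), in integer arithmetic."""
    return (2 * total_validators + 2) // 3


def can_produce_blocks(total_validators, active_validators):
    """True when enough validators are active to reach quorum."""
    return active_validators >= quorum(total_validators)


for n in (2, 4, 5):
    print(f"N={n}: tolerates {bft_tolerance(n)} faulty, quorum {quorum(n)}")
```

With `N = 5` and 2 active validators `can_produce_blocks(5, 2)` is false, which matches the stall; with `N = 2` and both active it is true.
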
### Limitation with 2 Validators
- Cannot tolerate ANY validator failure
- If one validator goes down, network stops
- Not Byzantine fault tolerant
- **But it will work for bridge operations**
---
## ⚠️ Important Notes
### After Resuming Network
1. **Test immediately**: Send a transaction to verify blocks produce
2. **Monitor closely**: Watch both validators
3. **Plan for redundancy**: Consider adding more validators later
4. **Document**: Note that network now has reduced fault tolerance
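
For the "test immediately" and "monitor closely" items, a minimal sketch that samples the block height a few times and reports whether it moved. The RPC URL passed in, the sample count, and the interval are assumptions to adjust to the chain's block period, and the function names are hypothetical.

```python
import json
import time
import urllib.request


def block_number(rpc_url, timeout=3.0):
    """Fetch the current block height via eth_blockNumber."""
    body = json.dumps(
        {"jsonrpc": "2.0", "method": "eth_blockNumber", "params": [], "id": 1}
    ).encode()
    request = urllib.request.Request(
        rpc_url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return int(json.load(response)["result"], 16)


def is_advancing(heights):
    """True when the newest sampled height is above the oldest one."""
    return len(heights) >= 2 and heights[-1] > heights[0]


def watch(rpc_url, samples=5, interval_seconds=5.0):
    """Sample the height several times and report whether blocks are being produced."""
    heights = []
    for index in range(samples):
        if index:
            time.sleep(interval_seconds)
        heights.append(block_number(rpc_url))
    return heights, is_advancing(heights)
```

For example, `watch("http://192.168.11.103:8545")` assumes validator 1003 serves JSON-RPC on port 8545; running it against both validators shows whether they advance together.
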
### Future Improvements
1. Deploy 3 more validators to reach 5 total
2. This restores tolerance for 1 Byzantine (faulty) validator
3. Network can then survive 1 validator failure
---
## 🎯 Next Steps
1. ✅ Root cause identified: Quorum loss
2. ⏳ Generate new genesis with 2 validators
3. ⏳ Update node lists
4. ⏳ Restart validators
5. ⏳ Verify blocks resume
6. ⏳ Test bridge transaction
---
## 📚 References
- [Besu QBFT Documentation](https://besu.hyperledger.org/23.10.0/private-networks/how-to/configure/consensus/qbft)
- QBFT requires: "Configure your network to ensure you never lose more than 1/3 of your validators"
- Minimum validators for Byzantine fault tolerance: 4
---
**Status**: Root cause confirmed, solution ready to implement
**Blocker**: Insufficient validator quorum (2/5 vs 4/5 required)
**Resolution**: Reduce validator count to 2, or start the missing validators (at least 2 of the 3 are needed to reach 4/5)
**Last Updated**: 2026-01-24 01:32 PST