# Solution: QBFT Quorum Loss - Network Stalled

**Last Updated:** 2026-01-31
**Document Version:** 1.0
**Status:** Active Documentation

---

**Date**: 2026-01-24
**Status**: 🔴 **CRITICAL - ROOT CAUSE IDENTIFIED**

---

## 🎯 Root Cause Found

**The network has stopped because we lost QBFT validator quorum.**

### The Numbers
- **Genesis configuration**: 5 validators (192.168.11.100-104)
- **Currently active**: Only 2 validators (VMIDs 1003, 1004)
- **Required for consensus**: Minimum 4 validators (ceiling(2/3 × 5) = 4)
- **Validators lost**: 3 out of 5 (60%)

### Why Network Stalled
From Besu QBFT documentation:
> "Configure your network to ensure you never lose more than 1/3 of your validators. If more than 1/3 of validators stop participating, the network stops creating new blocks and stalls."

**We lost 60% of validators, far exceeding the 33% threshold.**


---

## 📊 Current Network State

### Missing Validators
| IP | Status | Evidence |
|----|--------|----------|
| 192.168.11.100 | ❌ Not running | No RPC endpoint |
| 192.168.11.101 | ❌ Not running | No RPC endpoint |
| 192.168.11.102 | ❌ Not running | No RPC endpoint |

### Active Validators
| VMID | IP | Status |
|------|----|---------|
| 1003 | 192.168.11.103 | ✅ Running (stuck in sync) |
| 1004 | 192.168.11.104 | ✅ Running (stuck in sync) |

### What's Happening
1. Validators 1003 & 1004 are running but can't produce blocks
2. QBFT requires 4 out of 5 validators to reach consensus
3. With only 2 active, consensus is impossible
4. Validators are "stuck in sync" waiting for consensus
5. Network is deadlocked

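The state above can be re-checked at any time by asking each validator for its latest block number. The following is a minimal sketch, assuming every validator exposes HTTP JSON-RPC on port 8545 (Besu's default; this deployment may differ). `RPC_PORT`, `rpc_block_number`, `survey`, and `summarize` are illustrative names, not part of any existing tooling. To run it against the network, call `print(summarize(survey(VALIDATOR_IPS)))` from a host that can reach the validators.

```python
import json
import urllib.request
from typing import Callable, Dict, Iterable, Optional

RPC_PORT = 8545  # assumption: Besu's default HTTP JSON-RPC port
VALIDATOR_IPS = [f"192.168.11.{n}" for n in range(100, 105)]


def rpc_block_number(ip: str, timeout: float = 3.0) -> Optional[int]:
    """Return the node's latest block number, or None if RPC does not answer."""
    payload = json.dumps(
        {"jsonrpc": "2.0", "method": "eth_blockNumber", "params": [], "id": 1}
    ).encode()
    request = urllib.request.Request(
        f"http://{ip}:{RPC_PORT}",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return int(json.load(response)["result"], 16)
    except (OSError, KeyError, ValueError, TypeError):
        # Connection refused, timeout, malformed JSON, or missing "result".
        return None


def survey(
    ips: Iterable[str],
    probe: Callable[[str], Optional[int]] = rpc_block_number,
) -> Dict[str, Optional[int]]:
    """Map each IP to its block height (None means no RPC answer)."""
    return {ip: probe(ip) for ip in ips}


def summarize(heights: Dict[str, Optional[int]]) -> str:
    """One-line summary of which validators answered."""
    up = [ip for ip, height in heights.items() if height is not None]
    return f"{len(up)}/{len(heights)} validators answering RPC: {', '.join(up) or 'none'}"
```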

---

## 🔧 Solution Options

### Option 1: Reduce Validator Count (RECOMMENDED - Fast)

Update genesis to only include the 2 working validators (1003, 1004).

**Pros**:
- Fast implementation
- Uses existing working validators
- Network can resume immediately

**Cons**:
- No Byzantine fault tolerance (both validators must stay online)
- Less decentralized
- `extraData` is part of the genesis block, so editing it changes the genesis hash; check whether the existing chain data is still accepted before assuming the chain continues from its current height

**Steps**:
1. Stop validators 1003 & 1004
2. Update genesis extraData to only include validators 1003 & 1004
3. Update static-nodes.json and permissioned-nodes.json
4. Restart validators
5. Network should resume

### Option 2: Start Missing Validators (IDEAL - Slower)

Find and start validators 1000, 1001, 1002 to restore full quorum.

**Pros**:
- Maintains Byzantine fault tolerance
- Network continues as originally designed
- Can lose 1 validator and still operate

**Cons**:
- Need to locate where these validators are/were
- May need to redeploy them
- Takes more time

**Steps**:
1. Find if validators 1000-1002 exist on other Proxmox hosts
2. If not, deploy new validators with correct keys
3. Configure them with proper genesis
4. Start them
5. Network should resume when quorum is met


---

## 🚀 Recommended Action: Option 1

Since we need to resume the network quickly for bridge operations, implement Option 1:

### Step 1: Create New Genesis ExtraData

Current extraData includes 5 validators. We need to generate new extraData with only 2:
- 192.168.11.103 (validator 1003)
- 192.168.11.104 (validator 1004)

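One way to prepare Step 1 is to write the validator address list that Besu's RLP encoder takes as input. The sketch below only validates addresses and produces the JSON text; it does not compute the extraData itself. The addresses shown are placeholders, since the real validator addresses for 1003 and 1004 are not listed in this document, and `build_to_encode` is an illustrative name. The Besu QBFT documentation describes feeding such a file to `besu rlp encode --from=toEncode.json --type=QBFT_EXTRA_DATA`; confirm the exact invocation against the Besu version in use.

```python
import json
import re

# A validator address is 20 bytes, written as 0x followed by 40 hex digits.
ADDRESS_RE = re.compile(r"^0x[0-9a-fA-F]{40}$")


def build_to_encode(addresses):
    """Validate validator addresses and return the JSON text for toEncode.json."""
    cleaned = []
    for address in addresses:
        if not ADDRESS_RE.match(address):
            raise ValueError(f"not a 20-byte hex address: {address!r}")
        cleaned.append(address.lower())
    if len(set(cleaned)) != len(cleaned):
        raise ValueError("duplicate validator address")
    return json.dumps(cleaned, indent=2)


# Placeholder addresses only: substitute the real addresses of 1003 and 1004.
PLACEHOLDER_VALIDATORS = ["0x" + "11" * 20, "0x" + "22" * 20]
print(build_to_encode(PLACEHOLDER_VALIDATORS))
```

Rejecting malformed or duplicate addresses before encoding is a deliberate choice here, because a wrong validator list in the genesis would leave the network unable to reach quorum again.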
### Step 2: Update Static & Permissioned Nodes

Remove enodes for 192.168.11.100-102 from:
- `/etc/besu/static-nodes.json`
- `/etc/besu/permissioned-nodes.json`

Keep only:
- 192.168.11.103 (validator 1003)
- 192.168.11.104 (validator 1004)
- RPC and sentry nodes

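The node-list cleanup in Step 2 can be sketched as a small filter. This assumes both files are JSON arrays of `enode://<pubkey>@<ip>:<port>` URLs, which is the format of `static-nodes.json`; if `permissioned-nodes.json` uses a different layout in this deployment, the parsing needs adjusting. `REMOVED_HOSTS`, `keep_enode`, and `filter_node_list` are illustrative names.

```python
import json
from urllib.parse import urlparse

# Hosts of the three validators that are no longer running.
REMOVED_HOSTS = {"192.168.11.100", "192.168.11.101", "192.168.11.102"}


def keep_enode(enode: str, removed_hosts=REMOVED_HOSTS) -> bool:
    """True if the enode URL does not point at one of the removed hosts."""
    return urlparse(enode).hostname not in removed_hosts


def filter_node_list(json_text: str, removed_hosts=REMOVED_HOSTS) -> str:
    """Return the node list as JSON text with enodes on removed hosts dropped."""
    enodes = json.loads(json_text)
    kept = [enode for enode in enodes if keep_enode(enode, removed_hosts)]
    return json.dumps(kept, indent=2)
```

A possible usage is to read each file, pass its contents through `filter_node_list`, review the result, and only then write it back, keeping a backup copy of the original file.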
### Step 3: Restart Validators

With updated config, validators should:
- Skip full sync (already synced)
- Form quorum with 2/2 validators
- Resume block production

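Whether block production has resumed can be checked by sampling the chain height twice and comparing. The sketch below keeps the RPC call out of the logic so it can be exercised without a live node: `get_height` is any callable that returns the current block number, for example one built on an `eth_blockNumber` request. The 10-second default wait is an assumption and should exceed the network's configured block period. `blocks_advancing` and `parse_block_number` are illustrative names.

```python
import time
from typing import Callable


def parse_block_number(rpc_response: dict) -> int:
    """Convert an eth_blockNumber JSON-RPC response (hex string) to an int."""
    return int(rpc_response["result"], 16)


def blocks_advancing(
    get_height: Callable[[], int],
    wait_seconds: float = 10.0,  # assumption: longer than the block period
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Sample the chain height twice and report whether it increased."""
    first = get_height()
    sleep(wait_seconds)
    second = get_height()
    return second > first
```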

---

## 📝 Technical Details

### QBFT Quorum Math
```
Validators: N = 5
Tolerated faulty validators: F = floor((N - 1) / 3) = floor(1.33) = 1
Required for consensus: ceiling(2N / 3) = ceiling(3.33) = 4

2F + 1 = 3 is the quorum only when N = 3F + 1 = 4; with 5 validators the quorum is 4
```

### Why 2 Validators Will Work
```
Validators: N = 2
Tolerated faulty validators: F = floor((N - 1) / 3) = floor(0.33) = 0
Required for consensus: ceiling(2N / 3) = ceiling(1.33) = 2

Both validators must be active, but that's what we have!
```

### Limitation with 2 Validators
- Cannot tolerate ANY validator failure
- If one validator goes down, network stops
- Not Byzantine fault tolerant
- **But it will work for bridge operations**

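The quorum arithmetic above can be captured in a few lines for checking other validator counts. This is a minimal sketch of the ceiling(2N / 3) rule used in this document; the function names are illustrative. With these definitions, `can_produce_blocks(5, 2)` is false (the current stall) and `can_produce_blocks(2, 2)` is true (the state after Option 1).

```python
def qbft_quorum(total_validators: int) -> int:
    """Validators needed for consensus: ceiling(2N / 3), in integer arithmetic."""
    return (2 * total_validators + 2) // 3


def tolerated_faults(total_validators: int) -> int:
    """Validators that may fail while consensus continues: floor((N - 1) / 3)."""
    return (total_validators - 1) // 3


def can_produce_blocks(total_validators: int, active_validators: int) -> bool:
    """True if enough validators are active to reach quorum."""
    return active_validators >= qbft_quorum(total_validators)


# Print quorum and tolerated failures for a few validator-set sizes.
for n in (2, 4, 5, 7):
    print(f"N={n}: quorum={qbft_quorum(n)}, tolerated failures={tolerated_faults(n)}")
```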

---

## ⚠️ Important Notes

### After Resuming Network
1. **Test immediately**: Send a transaction to verify that blocks are being produced
2. **Monitor closely**: Watch both validators
3. **Plan for redundancy**: Consider adding more validators later
4. **Document**: Note that the network now has no fault tolerance

### Future Improvements
1. Deploy 3 more validators to reach 5 total
2. This tolerates 1 faulty validator (F = 1)
3. Network can survive 1 validator failure


---

## 🎯 Next Steps

1. ✅ Root cause identified: Quorum loss
2. ⏳ Generate new genesis with 2 validators
3. ⏳ Update node lists
4. ⏳ Restart validators
5. ⏳ Verify blocks resume
6. ⏳ Test bridge transaction


---

## 📚 References

- [Besu QBFT Documentation](https://besu.hyperledger.org/23.10.0/private-networks/how-to/configure/consensus/qbft)
- QBFT requires: "Configure your network to ensure you never lose more than 1/3 of your validators"
- Minimum validators for Byzantine fault tolerance: 4


---

**Status**: Root cause confirmed, solution ready to implement
**Blocker**: Insufficient validator quorum (2/5 vs 4/5 required)
**Resolution**: Reduce validator count to 2 or start 3 missing validators

**Last Updated**: 2026-01-24 01:32 PST