Solution: QBFT Quorum Loss - Network Stalled

Last Updated: 2026-01-31
Document Version: 1.0
Status: Active Documentation


Date: 2026-01-24
Status: 🔴 CRITICAL - ROOT CAUSE IDENTIFIED


🎯 Root Cause Found

The network has stopped because we lost QBFT validator quorum.

The Numbers

  • Genesis configuration: 5 validators (192.168.11.100-104)
  • Currently active: Only 2 validators (VMIDs 1003, 1004)
  • Required for consensus: Minimum 4 validators (QBFT quorum is ceiling(2N/3); for N = 5 that is 4)
  • Validators lost: 3 out of 5 (60%)

Why Network Stalled

From Besu QBFT documentation:

"Configure your network to ensure you never lose more than 1/3 of your validators. If more than 1/3 of validators stop participating, the network stops creating new blocks and stalls."

We lost 60% of validators, far exceeding the 33% threshold.
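The arithmetic behind this can be checked with a minimal sketch. It assumes the QBFT quorum rule of ceiling(2N/3); the function names are illustrative and not part of Besu.

```python
def qbft_quorum(total_validators: int) -> int:
    """Smallest number of validators that must agree: ceiling(2N/3)."""
    if total_validators < 1:
        raise ValueError("validator count must be positive")
    # Integer form of ceiling(2N/3), which avoids floating-point rounding.
    return (2 * total_validators + 2) // 3


def can_produce_blocks(total_validators: int, active_validators: int) -> bool:
    """True when enough validators are active to reach quorum."""
    return active_validators >= qbft_quorum(total_validators)


if __name__ == "__main__":
    print("quorum for 5 validators:", qbft_quorum(5))
    print("2 of 5 active can produce blocks:", can_produce_blocks(5, 2))
    print("4 of 5 active can produce blocks:", can_produce_blocks(5, 4))
```

With 5 configured validators and 2 active, the check fails, which matches the stall described above.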


📊 Current Network State

Missing Validators

| IP | Status | Evidence |
|----|--------|----------|
| 192.168.11.100 | Not running | No RPC endpoint |
| 192.168.11.101 | Not running | No RPC endpoint |
| 192.168.11.102 | Not running | No RPC endpoint |

Active Validators

| VMID | IP | Status |
|------|----|--------|
| 1003 | 192.168.11.103 | Running (stuck in sync) |
| 1004 | 192.168.11.104 | Running (stuck in sync) |

What's Happening

  1. Validators 1003 & 1004 are running but can't produce blocks
  2. QBFT requires 4 out of 5 validators to reach consensus
  3. With only 2 active, consensus is impossible
  4. Validators are "stuck in sync" waiting for consensus
  5. Network is deadlocked
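One way to confirm the state listed above is to query a running validator over JSON-RPC. The sketch below is illustrative only: it assumes the HTTP RPC endpoint listens on port 8545 and that the QBFT API is enabled on the node, neither of which this document confirms. The helper names and the 5-second sampling interval are hypothetical.

```python
import json
import time
import urllib.request
from typing import List, Optional


def rpc_payload(method: str, params: Optional[list] = None, request_id: int = 1) -> bytes:
    """Build a JSON-RPC 2.0 request body."""
    body = {"jsonrpc": "2.0", "method": method, "params": params or [], "id": request_id}
    return json.dumps(body).encode("utf-8")


def rpc_call(url: str, method: str, params: Optional[list] = None, timeout: float = 5.0):
    """Send one JSON-RPC request and return its result field."""
    request = urllib.request.Request(
        url,
        data=rpc_payload(method, params),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        reply = json.load(response)
    if "error" in reply:
        raise RuntimeError(reply["error"])
    return reply["result"]


def is_stalled(block_numbers_hex: List[str]) -> bool:
    """True if successive eth_blockNumber samples never advance."""
    heights = [int(value, 16) for value in block_numbers_hex]
    return len(heights) >= 2 and heights[-1] <= heights[0]


if __name__ == "__main__":
    url = "http://192.168.11.103:8545"  # assumption: RPC port 8545 on validator 1003
    samples = []
    for _ in range(3):
        samples.append(rpc_call(url, "eth_blockNumber"))
        time.sleep(5)  # assumption: longer than the configured block period
    print("peer count:", int(rpc_call(url, "net_peerCount"), 16))
    print("validators:", rpc_call(url, "qbft_getValidatorsByBlockNumber", ["latest"]))
    print("stalled:", is_stalled(samples))
```

A block height that does not change across samples, together with a validator list of five addresses, is consistent with the deadlock described above.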

🔧 Solution Options

Option 1: Reduce Validator Set to 2 (Faster)

Update genesis to include only the 2 working validators (1003, 1004).

Pros:

  • Fast implementation
  • Uses existing working validators
  • Network can resume immediately

Cons:

  • No fault tolerance (both validators must stay online for blocks to be produced)
  • Less decentralized

Steps:

  1. Stop validators 1003 & 1004
  2. Update genesis extraData to only include validators 1003 & 1004
  3. Update static-nodes.json and permissioned-nodes.json
  4. Restart validators
  5. Network should resume

Option 2: Start Missing Validators (IDEAL - Slower)

Find and start validators 1000, 1001, 1002 to restore full quorum.

Pros:

  • Maintains Byzantine fault tolerance
  • Network continues as originally designed
  • Can lose 1 validator and still operate

Cons:

  • Need to locate where these validators are/were
  • May need to redeploy them
  • Takes more time

Steps:

  1. Find if validators 1000-1002 exist on other Proxmox hosts
  2. If not, deploy new validators with correct keys
  3. Configure them with proper genesis
  4. Start them
  5. Network should resume when quorum is met

Since we need to resume the network quickly for bridge operations, implement Option 1:

Step 1: Create New Genesis ExtraData

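Current extraData includes 5 validators. We need to generate new extraData with only 2. Note that extraData is part of the genesis block, so editing it changes the genesis hash, and nodes that keep their existing chain data may refuse to start against the modified file. Besu also documents a validator override through a `transitions` entry in the genesis config, which applies a new validator list from a chosen block without altering block 0; confirm which approach fits before restarting. The 2 validators to keep are: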

  • 192.168.11.103 (validator 1003)
  • 192.168.11.104 (validator 1004)
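Generating the new extraData starts from the list of validator addresses. The sketch below prepares that list as a JSON file; it assumes the input format shown in the Besu documentation for `besu rlp encode` (a JSON array of addresses without the `0x` prefix). The two addresses are placeholders, since the real validator addresses do not appear in this document.

```python
import json
import re
from typing import List

ADDRESS_PATTERN = re.compile(r"^(0x)?[0-9a-fA-F]{40}$")


def validator_list_json(addresses: List[str]) -> str:
    """Return a JSON array of validated, lower-case addresses without 0x."""
    cleaned = []
    for address in addresses:
        if not ADDRESS_PATTERN.match(address):
            raise ValueError(f"not a 20-byte hex address: {address!r}")
        normalized = address.lower()
        if normalized.startswith("0x"):
            normalized = normalized[2:]
        if normalized in cleaned:
            raise ValueError(f"duplicate validator address: {address!r}")
        cleaned.append(normalized)
    return json.dumps(cleaned, indent=2)


if __name__ == "__main__":
    # Placeholder addresses: replace with the real addresses of 1003 and 1004.
    placeholder_validators = ["0x" + "11" * 20, "0x" + "22" * 20]
    print(validator_list_json(placeholder_validators))
    # Assumed follow-up step, to be checked against the installed Besu version:
    #   besu rlp encode --from=toEncode.json --to=extraData.txt --type=QBFT_EXTRA_DATA
```

Validating the addresses before encoding avoids producing extraData from a malformed or duplicated entry, which would be hard to diagnose after restart.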

Step 2: Update Static & Permissioned Nodes

Remove enodes for 192.168.11.100-102 from:

  • /etc/besu/static-nodes.json
  • /etc/besu/permissioned-nodes.json

Keep only:

  • 192.168.11.103 (validator 1003)
  • 192.168.11.104 (validator 1004)
  • RPC and sentry nodes
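The node-list cleanup can be sketched as a filter over enode URLs. This is a minimal example that assumes both files hold a JSON array of `enode://<public-key>@<ip>:<port>` strings; the function names and the sample public keys are hypothetical.

```python
import json
from typing import Iterable, List
from urllib.parse import urlsplit


def enode_host(enode_url: str) -> str:
    """Extract the IP or hostname from an enode URL."""
    parts = urlsplit(enode_url)
    if parts.scheme != "enode" or parts.hostname is None:
        raise ValueError(f"not an enode URL: {enode_url!r}")
    return parts.hostname


def drop_enodes(enodes: Iterable[str], removed_ips: Iterable[str]) -> List[str]:
    """Keep every enode whose host is not in removed_ips, preserving order."""
    removed = set(removed_ips)
    return [enode for enode in enodes if enode_host(enode) not in removed]


if __name__ == "__main__":
    # Sample data with made-up public keys; in practice the list would come
    # from json.load() on static-nodes.json or permissioned-nodes.json.
    sample = [
        "enode://" + "a" * 128 + "@192.168.11.100:30303",
        "enode://" + "b" * 128 + "@192.168.11.103:30303",
        "enode://" + "c" * 128 + "@192.168.11.104:30303",
    ]
    offline = ["192.168.11.100", "192.168.11.101", "192.168.11.102"]
    print(json.dumps(drop_enodes(sample, offline), indent=2))
```

Filtering by host rather than by list position keeps the RPC and sentry entries untouched, whatever order they appear in.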

Step 3: Restart Validators

With updated config, validators should:

  • Skip full sync (already synced)
  • Form quorum with 2/2 validators
  • Resume block production

📝 Technical Details

QBFT Quorum Math

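Validators: N = 5
Byzantine fault tolerance: F = floor((N - 1) / 3) = floor(1.33) = 1
Quorum required for consensus: ceiling(2N / 3) = ceiling(3.33) = 4

The familiar 2F + 1 = 3 matches the quorum only when N = 3F + 1 exactly (N = 4 for F = 1). For N = 5, QBFT requires ceiling(2N / 3) = 4.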

Why 2 Validators Will Work

Validators: N = 2
Byzantine fault tolerance: F = floor((N - 1) / 3) = floor(0.33) = 0
Quorum required for consensus: ceiling(2N / 3) = ceiling(1.33) = 2

Both validators must be active, which matches the two validators currently running.

Limitation with 2 Validators

  • Cannot tolerate ANY validator failure
  • If one validator goes down, network stops
  • Not Byzantine fault tolerant
  • But it will work for bridge operations
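To put the 2-validator limitation in context, the sketch below compares a few validator set sizes. It assumes the same ceiling(2N/3) quorum rule; the function names and the chosen sizes are illustrative.

```python
def quorum(total_validators: int) -> int:
    """QBFT quorum: ceiling(2N/3), computed with integer arithmetic."""
    return (2 * total_validators + 2) // 3


def offline_tolerance(total_validators: int) -> int:
    """Validators that can be offline while blocks are still produced."""
    return total_validators - quorum(total_validators)


def byzantine_tolerance(total_validators: int) -> int:
    """Faulty validators tolerated: floor((N - 1) / 3)."""
    return (total_validators - 1) // 3


if __name__ == "__main__":
    print("N  quorum  offline  byzantine")
    for n in (2, 4, 5, 7):
        print(f"{n}  {quorum(n):6}  {offline_tolerance(n):7}  {byzantine_tolerance(n):9}")
```

For N = 2 both tolerances are 0, while N = 4 and N = 5 each tolerate one failure.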

⚠️ Important Notes

After Resuming Network

  1. Test immediately: Send a transaction to verify blocks produce
  2. Monitor closely: Watch both validators
  3. Plan for redundancy: Consider adding more validators later
  4. Document: Note that network now has reduced fault tolerance

Future Improvements

  1. Deploy 3 more validators to reach 5 total
  2. This restores tolerance of 1 Byzantine fault
  3. Network can survive 1 validator failure

🎯 Next Steps

  1. Root cause identified: Quorum loss
  2. Generate new genesis with 2 validators
  3. Update node lists
  4. Restart validators
  5. Verify blocks resume
  6. Test bridge transaction

📚 References

  • Besu QBFT Documentation
  • QBFT requires: "Configure your network to ensure you never lose more than 1/3 of your validators"
  • Minimum validators for Byzantine fault tolerance: 4

Status: Root cause confirmed, solution ready to implement
Blocker: Insufficient validator quorum (2/5 vs 4/5 required)
Resolution: Reduce validator count to 2 or start 3 missing validators

Last Updated: 2026-01-24 01:32 PST