
SolaceScanScout Deep-Dive: All Fixes Needed & Proactive vs Reactive Timing

Last Updated: 2026-02-09
Purpose: Investigate all fixes needed for the explorer, and define correct timing so we can be proactive instead of reactive.
Related: SOLACESCANSCOUT_CONNECTIONS_FULL_TREE.md, SOLACESCANSCOUT_REVIEW.md, BLOCKSCOUT_FIX_RUNBOOK.md.


Quick reference: when to act

| Frequency | What to do | Script / location |
|---|---|---|
| One-time / after change | Fix RPC URL on 5000 if RPC VMID retired; SSL in NPMplus; migrate 5000 to thin5 if thin1 full | BLOCKSCOUT_FIX_RUNBOOK; NEXT_STEPS_OPERATOR |
| Daily 08:00 | Explorer HTTPS + API must pass; indexer lag (RPC block vs explorer block) < threshold; RPC 2201 up | daily-weekly-checks.sh (harden per §6.1) |
| Weekly (e.g. Sun) | Explorer logs review; thin pool usage on r630-02 (warn >85%) | O-4; new thin-pool check §6.2 |
| On deploy / NPMplus change | E2E routing; full explorer E2E from LAN; Blockscout migrations if needed | verify-end-to-end-routing.sh; e2e-test-explorer.sh; fix-blockscout-ssl-and-migrations.sh |

1. Executive Summary

| Category | Reactive (we discover when it breaks) | Proactive (we detect before users do) |
|---|---|---|
| Explorer sync stop | Users see stale blocks; 15-day lag happened Jan 2026 | Daily check: compare RPC block vs explorer block; alert if lag > N blocks |
| 502 / DB / migrations | Public 502 on explorer.d-bis.org | Daily: HTTPS + API reachability; weekly: logs; storage check before full |
| Thin pool full | "No space left on device"; Docker/Blockscout fail | Weekly (or before major deploys): thin pool % on r630-02 |
| RPC endpoint wrong/down | Indexer stops (e.g. VMID 2500 destroyed) | Daily: RPC 2201 health; dependency list reviewed on infra changes |
| SSL / NPMplus | "Connection isn't private" or 502 | E2E run (e.g. after NPMplus changes); optional cert expiry check |
| Frontend/API config | Wrong API URL or missing routes | After deploy: E2E + explorer E2E from LAN |

Key insight: The Jan 2026 “explorer 15 days behind” incident was reactive: we had no check comparing the chain-head block to the explorer's last indexed block. The daily cron only checks that the API returns 200 with total_blocks, and it does not fail when Blockscout is unreachable (it logs SKIP). So the checks stayed green until someone looked at the UI.


2. Complete Fix Inventory (All Known Issues & Fixes)

2.1 Critical (Explorer Unusable or Stale)

| # | Issue | Root Cause | Fix | Runbook / Script |
|---|---|---|---|---|
| C1 | Explorer stopped indexing (blocks stale) | RPC unreachable (wrong IP or VM down), or indexer/DB crash | Point ETHEREUM_JSONRPC_HTTP_URL to working RPC (e.g. 192.168.11.221:8545); restart Blockscout; fix DB if needed | SOLACESCANSCOUT_REVIEW.md; BLOCKSCOUT_FIX_RUNBOOK |
| C2 | 502 Bad Gateway on explorer.d-bis.org | Blockscout or Postgres down; or postgres nxdomain (Docker DNS); or thin pool full | Restart stack; fix Docker network/DB URL; or migrate VM 5000 to thin5 | BLOCKSCOUT_FIX_RUNBOOK; fix-blockscout-ssl-and-migrations.sh; fix-blockscout-1.sh |
| C3 | SSL/migrations (migrations_status, blocks table missing) | ECTO_USE_SSL=TRUE vs Postgres without SSL | Run migrations with ?sslmode=disable and ECTO_USE_SSL=false; persist in docker-compose/.env | fix-blockscout-ssl-and-migrations.sh |
| C4 | No space left on device (thin pool 100%) | thin1-r630-02 full; VM 5000 on thin1 | Migrate VMID 5000 to thin5 (vzdump → destroy → restore to thin5); or free thin1 by moving other VMs | BLOCKSCOUT_FIX_RUNBOOK; fix-blockscout-1.sh |
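
The C3 fix persists as two environment settings. A minimal sketch of the docker-compose/.env entries (the db host, user, and password here are placeholders, not the real values on VM 5000):

```shell
# Hypothetical Blockscout .env fragment on VM 5000 -- persists the C3 fix so
# both migrations and the indexer connect to Postgres without SSL.
ECTO_USE_SSL=false
# "db", "blockscout", and CHANGE_ME are placeholders; keep the real values.
DATABASE_URL=postgresql://blockscout:CHANGE_ME@db:5432/blockscout?sslmode=disable
```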

2.2 High (Degraded or One-Time Config)

| # | Issue | Root Cause | Fix | Runbook / Script |
|---|---|---|---|---|
| H1 | RPC endpoint pointed to destroyed VM (e.g. 2500) | VMID 2500 decommissioned; Blockscout env not updated | Set ETHEREUM_JSONRPC_HTTP_URL=http://192.168.11.221:8545 (and WS if used) in Blockscout env on VM 5000 | SOLACESCANSCOUT_REVIEW.md |
| H2 | Explorer SSL "connection isn't private" | No or invalid Let's Encrypt cert for explorer.d-bis.org in NPMplus | NPMplus UI: SSL Certificates → request for explorer.d-bis.org; assign to proxy host, Force SSL | NEXT_STEPS_OPERATOR.md § Explorer SSL |
| H3 | NPMplus proxy wrong for explorer | Proxy host points to wrong IP/port | Update explorer.d-bis.org proxy to http://192.168.11.140:80 (and :4000 if API separate) | update-npmplus-proxy-hosts-api.sh; RPC_ENDPOINTS_MASTER.md |
| H4 | Blockscout container or service exited | Crash or OOM; systemd "active (exited)" | Restart: pct exec 5000 -- systemctl restart blockscout or docker-compose up -d; check logs | SOLACESCANSCOUT_REVIEW.md; OPERATIONAL_RUNBOOKS [138] |

2.3 Medium (Operational / Optional)

| # | Issue | Root Cause | Fix | Runbook / Script |
|---|---|---|---|---|
| M1 | Forge verification fails (params module/action) | Blockscout API expects query params; Forge sends JSON | Use run-contract-verification-with-proxy.sh or manual verification at explorer UI | BLOCKSCOUT_FIX_RUNBOOK § Forge |
| M2 | Custom frontend not served (wrong index.html or nginx) | Nginx serves Blockscout at / instead of SolaceScanScout index.html | deploy-frontend-to-vmid5000.sh; fix-nginx-serve-custom-frontend.sh | deploy-frontend-to-vmid5000.sh |
| M3 | Token list stale | Token list not updated after new tokens | Bump version/timestamp in dbis-138.tokenlist.json; validate; update explorer/config API reference | OPERATIONAL_RUNBOOKS [139]; TOKEN_LIST_AUTHORING_GUIDE |
| M4 | Explorer logs full or errors unnoticed | No log review; disk full in container | Weekly log review; cleanup-blockscout-journal.sh if needed | OPERATIONAL_RUNBOOKS [138] (O-4) |

2.4 One-Time / After Change

| # | Issue | When | Fix |
|---|---|---|---|
| O1 | After destroying or changing RPC VMIDs | Any RPC VMID decommissioned or IP change | Update Blockscout env (and any script default RPC) to current RPC; update config/ip-addresses.conf and docs |
| O2 | After NPMplus restore or major config change | Restore from backup; new NPMplus instance | Re-verify proxy hosts (explorer.d-bis.org → 192.168.11.140:80); re-request SSL if needed |
| O3 | After Proxmox storage change | New thin pool; migration of VMs | Update BLOCKSCOUT_FIX_RUNBOOK and fix-blockscout-1.sh if default storage names change |

3. Reactive vs Proactive: When We Learn About Each Issue

| Issue | Reactive trigger (we find out when…) | Proactive detection (we could find out by…) |
|---|---|---|
| C1 Sync stop | User or operator notices blocks are old | Daily: Compare RPC eth_blockNumber to Blockscout /api/v2/stats (or indexer block). Alert if lag > e.g. 100 blocks or 10 min. |
| C2 502 / DB | User gets 502; or E2E fails | Daily: GET https://explorer.d-bis.org and https://explorer.d-bis.org/api/v2/stats; fail if non-2xx. |
| C3 SSL/migrations | Blockscout won't start or crashes on boot | On deploy/restart: Run migrations with correct flags; weekly: review logs for migration/DB errors. |
| C4 Thin pool full | Docker or pct fails with "no space left" | Weekly (or before big deploy): On r630-02 run lvs / pvesm status and check thin1 (and thin5) usage; alert if >85%. |
| H1 Wrong RPC | Indexer stops when that RPC is gone | When changing infra: checklist item “Update Blockscout RPC URL if any RPC VMID/IP changed.” Daily: RPC 2201 health (already in daily-weekly-checks). |
| H2 SSL | User sees certificate warning | E2E run after NPMplus changes; optional monthly cert expiry check. |
| H3 NPMplus proxy wrong | 502 or wrong site when opening explorer.d-bis.org | E2E: verify-end-to-end-routing.sh (DNS, SSL, HTTPS 200). |
| H4 Container exited | 502 or API down | Daily: same as C2 (HTTPS + API); weekly: logs (O-4). |

4. Current Monitoring vs What's Missing

4.1 What Exists Today

| Check | Frequency | Script / Cron | Limitation |
|---|---|---|---|
| Explorer indexer (API reachable) | Daily 08:00 | daily-weekly-checks.sh [135] | Does not fail when Blockscout unreachable (logs SKIP). |
| RPC 2201 health | Daily 08:00 | daily-weekly-checks.sh [136] | Good; fails if RPC down. |
| Config API | Weekly Sun 09:00 | daily-weekly-checks.sh [137] | Not explorer-specific. |
| Explorer logs | Weekly (manual) | OPERATIONAL_RUNBOOKS [138] | Reminder only; no automated parse. |
| E2E (DNS, SSL, HTTPS) | On-demand | verify-end-to-end-routing.sh | Optional Blockscout API; can skip off-LAN. |
| Explorer + block production | On-demand | verify-explorer-and-block-production.sh | Compares RPC block to chain; does not compare explorer block to RPC block (indexer lag). |
| Thin pool | On-demand | fix-blockscout-1.sh (when already broken); investigate-thin2-storage.sh | No scheduled thin pool check for r630-02 thin1. |

4.2 Gaps (Why We Were Reactive)

  1. No indexer lag check
    We never compare “latest block on RPC” vs “latest block in Blockscout,” so we don't detect “API is up but the indexer stopped” until someone looks at the UI or block count.

  2. Explorer check is soft
    If Blockscout is down, daily-weekly-checks.sh prints SKIP and does not increment FAILED. Cron stays “green” while the explorer is broken.

  3. No thin pool monitoring
    thin1-r630-02 can reach 100% with no alert. First sign is often “no space left on device” during a restart or pull.

  4. No automated alerting
    Cron only logs to a file. No email, PagerDuty, or dashboard that fails when explorer or RPC fails.

  5. RPC dependency not formalized
    When VMID 2500 was destroyed, Blockscout's RPC URL wasn't in a “dependency list” that's reviewed on infra changes.


5. Recommended Proactive Timing

5.1 One-Time (Do Once or After Change)

| Action | When | Owner |
|---|---|---|
| Fix RPC URL on VM 5000 | Already done (192.168.11.221). Re-do whenever an RPC VMID used by the explorer is retired or re-IP'd | Ops |
| Add explorer.d-bis.org to “infra dependency” list | When documenting RPC/explorer relationship | Ops |
| Request SSL for explorer.d-bis.org in NPMplus | Once (and after any NPMplus restore that loses certs) | Ops |
| Migrate VM 5000 to thin5 if thin1 is near full | Once (or when thin1 >85%) | Ops |

5.2 Daily (Catch Outages and Sync Stop)

| Action | When | Implementation |
|---|---|---|
| Explorer HTTPS 200 | Daily 08:00 (with existing cron) | Add to daily-weekly-checks: GET https://explorer.d-bis.org; fail if not 2xx (run from a host that can reach it, or use the public URL). |
| Explorer API 200 + body | Daily 08:00 | Same script: GET https://explorer.d-bis.org/api/v2/stats (or http://192.168.11.140:4000 from LAN); fail if not 200 or missing total_blocks/total_transactions. |
| Indexer lag | Daily 08:00 | New check: (1) RPC eth_blockNumber → chain_head. (2) Blockscout API → last indexed block (or total_blocks). (3) If chain_head − last_indexed > threshold (e.g. 100 blocks or 5 min), fail. |
| RPC 2201 health | Already daily 08:00 | Keep as-is (critical for indexer). |
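
The first two daily checks can be sketched as follows. The URLs come from this document; stats_body_ok is an illustrative helper, not an existing function in daily-weekly-checks.sh, and the live curl calls are commented out because they need network reach to the explorer:

```shell
# Pass only if the stats JSON contains both expected fields.
stats_body_ok() {
  grep -q '"total_blocks"' <<<"$1" && grep -q '"total_transactions"' <<<"$1"
}

# Live daily checks (commented; require reachability to explorer.d-bis.org):
# curl -fsS -o /dev/null https://explorer.d-bis.org || echo "FAILED: homepage"
# if body=$(curl -fsS https://explorer.d-bis.org/api/v2/stats); then
#   stats_body_ok "$body" || echo "FAILED: stats body missing fields"
# else
#   echo "FAILED: API unreachable"
# fi
```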

5.3 Weekly (Catch Slow Degradation)

| Action | When | Implementation |
|---|---|---|
| Review explorer logs | Weekly (e.g. Sun 09:00) | Keep O-4: pct exec 5000 -- journalctl -u blockscout -n 200 (or SSH); optional: grep for ERROR / nxdomain / ssl. |
| Thin pool usage r630-02 | Weekly (e.g. Sun) or before major deploy | New: SSH to r630-02, run pvesm status \| grep thin and/or lvs \| grep thin; warn if thin1 >85%; fail if 100%. |
| Config API | Already weekly | Keep [137]. |

5.4 On-Deploy / On-Change

| Action | When | Implementation |
|---|---|---|
| E2E routing | After NPMplus or DNS changes | Run verify-end-to-end-routing.sh (include explorer.d-bis.org). |
| Full explorer E2E (LAN) | After frontend or Blockscout deploy | Run explorer-monorepo/scripts/e2e-test-explorer.sh from LAN. |
| Blockscout migrations | Before/after Blockscout version or config change | fix-blockscout-ssl-and-migrations.sh or manual migration with sslmode=disable. |

6. Concrete Script and Cron Changes

6.1 Harden daily-weekly-checks.sh (Explorer)

  • Current: [135] Explorer indexer: curl to :4000; on failure print SKIP and do not increment FAILED.
  • Change:
    • Option A (minimal): When running from LAN (or when PUBLIC_EXPLORER_CHECK=1), also GET https://explorer.d-bis.org. If both API and homepage fail, increment FAILED.
    • Option B (recommended): Add an indexer lag check:
      • From LAN: get RPC block number (192.168.11.221:8545 eth_blockNumber).
      • Get Blockscout last block from /api/v2/stats or /api/v2/blocks (or indexer stats).
      • If RPC_block - explorer_block > 500 (or time-based, e.g. >10 min), increment FAILED and log “Explorer indexer lag > 500 blocks”.
    • Ensure at least one explorer check fails the daily run when the explorer is clearly broken (e.g. API unreachable from LAN).
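
Option B's lag computation can be sketched with two small helpers. The RPC and API endpoints are the ones named above; jq is assumed to be installed on the check host, and the live calls are commented out:

```shell
# Convert a JSON-RPC hex block number (e.g. "0x1b4") to decimal.
hex_to_dec() {
  printf '%d\n' "$1"
}

# Exit 0 if the explorer is within <threshold> blocks of the chain head.
lag_ok() {
  local chain_head=$1 explorer_block=$2 threshold=${3:-500}
  [ $(( chain_head - explorer_block )) -le "$threshold" ]
}

# Live usage from LAN (commented; endpoints per this runbook):
# chain_head=$(hex_to_dec "$(curl -s -X POST http://192.168.11.221:8545 \
#   -H 'Content-Type: application/json' \
#   -d '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}' \
#   | jq -r .result)")
# explorer_block=$(curl -s https://explorer.d-bis.org/api/v2/stats | jq -r .total_blocks)
# lag_ok "$chain_head" "$explorer_block" 500 \
#   || { echo "FAILED: Explorer indexer lag > 500 blocks"; FAILED=$((FAILED+1)); }
```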

6.2 Add Weekly Thin Pool Check

  • New script or block in weekly: On r630-02 (192.168.11.12), run:
    • ssh root@192.168.11.12 'pvesm status 2>/dev/null | grep -E "thin1|thin5"'
    • Parse usage (e.g. 5th column); if thin1-r630-02 > 85%, log warning; if 100%, fail.
  • Cron: Add to weekly branch of schedule-daily-weekly-cron.sh, or separate weekly script that runs Sunday.
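
The parse step above can be sketched like this. Assumption: the usage percentage is the last whitespace-separated field of each pvesm status line; verify the actual column position on r630-02 before wiring this into cron:

```shell
# Extract the integer usage percent from one pvesm-status-style line.
pool_usage_pct() {
  awk '{gsub(/%/, "", $NF); printf "%d\n", $NF}' <<<"$1"
}

# Exit 0 = OK, 1 = warn (>85%), 2 = fail (pool full).
check_pool() {
  local line=$1 pct
  pct=$(pool_usage_pct "$line")
  if [ "$pct" -ge 100 ]; then
    echo "FAIL: thin pool at ${pct}%"; return 2
  elif [ "$pct" -gt 85 ]; then
    echo "WARN: thin pool at ${pct}%"; return 1
  else
    echo "OK: thin pool at ${pct}%"; return 0
  fi
}

# Live weekly usage (commented; host per this runbook):
# ssh root@192.168.11.12 'pvesm status 2>/dev/null' | grep -E 'thin1|thin5' \
#   | while read -r l; do check_pool "$l"; done
```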

6.3 Optional: Alerting

  • Pipe daily/weekly check output to a log; have a wrapper that:
    • Sends email or Slack on FAILED > 0, or
    • Writes to a file that Prometheus/Grafana can scrape (e.g. “explorer_ok 0” vs “explorer_ok 1”).
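
A minimal sketch of the scrape-file option, assuming the check script exposes a FAILED count; the metric name, file path, and mail address are hypothetical:

```shell
# Write a Prometheus-textfile-style metric reflecting the daily run's result.
write_explorer_metric() {
  local failed=$1 out=$2
  if [ "$failed" -gt 0 ]; then
    echo "explorer_ok 0" > "$out"
  else
    echo "explorer_ok 1" > "$out"
  fi
}

# After daily-weekly-checks.sh sets FAILED (path and address are placeholders):
# write_explorer_metric "$FAILED" /var/lib/node_exporter/textfile/explorer.prom
# [ "$FAILED" -gt 0 ] && mail -s "Explorer daily check FAILED" ops@example.com </dev/null
```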

6.4 Dependency Checklist (Procedural)

  • In OPERATIONAL_RUNBOOKS or BLOCKSCOUT_FIX_RUNBOOK, add:
    • When decommissioning or changing RPC nodes: Check if Blockscout (VMID 5000) uses that RPC; if yes, update ETHEREUM_JSONRPC_HTTP_URL and restart Blockscout.
  • In SOLACESCANSCOUT_CONNECTIONS_FULL_TREE or a “dependency” section: list “Explorer (5000) depends on: RPC 2201 (192.168.11.221).”

7. Summary: From Reactive to Proactive

| Before (Reactive) | After (Proactive) |
|---|---|
| Discover sync stop when users report stale data | Daily: compare RPC block vs explorer block; fail if lag > threshold |
| Discover 502 when someone opens explorer | Daily: HTTPS + API check that fails the run if down |
| Discover thin pool full when Docker fails | Weekly: check thin1 (and thin5) usage on r630-02; warn at 85% |
| Update RPC URL only after indexer breaks | Checklist on infra change: “Update Blockscout RPC if RPC VMID/IP changed” |
| Explorer check never fails cron | Harden daily check so unreachable explorer or large indexer lag fails the job |

Implementing §5 (Recommended Proactive Timing) and §6 (Script and Cron Changes) will move SolaceScanScout operations from reactive to proactive, with clear timing for each fix category.


Last updated: 2026-02-09
References: SOLACESCANSCOUT_CONNECTIONS_FULL_TREE.md, SOLACESCANSCOUT_REVIEW.md, BLOCKSCOUT_FIX_RUNBOOK.md, OPERATIONAL_RUNBOOKS.md, daily-weekly-checks.sh, verify-explorer-and-block-production.sh