diff --git a/docs/04-configuration/R630_04_STORAGE_REMEDIATION_AND_MEV_PLAN.md b/docs/04-configuration/R630_04_STORAGE_REMEDIATION_AND_MEV_PLAN.md new file mode 100644 index 00000000..61d870bd --- /dev/null +++ b/docs/04-configuration/R630_04_STORAGE_REMEDIATION_AND_MEV_PLAN.md @@ -0,0 +1,352 @@ +# r630-04 Storage Remediation And MEV Plan + +**Last updated:** 2026-04-14 +**Purpose:** Surgical remediation plan for `r630-04` local storage after live read errors on `sda`, with exact next steps for either migrating MEV off the node or rebuilding a safe dedicated local MEV pool on the two hidden Samsung SSDs. + +## 1. Current facts + +### 1.1 Confirmed bad drive + +The failing drive is: + +- `megaraid,0` / `/dev/sda` +- Model: `ST9300653SS` +- Serial: `6XN7PB91` + +Observed failure evidence: + +- kernel `Medium Error` +- `Unrecovered read error` +- read failures on the old swap LV +- SMART still says `OK`, but: + - `Elements in grown defect list: 2804` + - `Non-medium error count: 1700390` + +Treat `sda` as **degraded / unsafe** for continued production locality. 
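The defect counters above can be re-checked at any time. A minimal sketch for pulling the grown-defect count back out of `smartctl -a` output — the helper name `grown_defects` is illustrative (not part of the plan), and `smartmontools` is assumed to be installed on `r630-04`:

```shell
# grown_defects: extract the grown-defect count from `smartctl -a` output.
# A SAS drive with thousands of grown defects (2804 here) is a strong
# replace signal even while the overall SMART status still reads OK.
grown_defects() {
  grep -oE 'Elements in grown defect list: *[0-9]+' | grep -oE '[0-9]+$'
}

# Live use against the flagged drive (megaraid,0 per section 1.1):
#   smartctl -a -d megaraid,0 /dev/sda | grown_defects
```

Re-running this periodically shows whether the defect list is still growing, which is the deciding factor between "degraded" and "failing now".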
+ +### 1.2 Other currently visible disks + +- `megaraid,1` / `/dev/sdb` — healthy 300G SAS, serial `PQHTWUVB`, currently part of VG `pve` +- `megaraid,4` / `/dev/sdc` — Crucial MX500 250G, serial `2202E5FB4CC9`, currently used by Ceph +- `megaraid,5` / `/dev/sdd` — Crucial MX500 250G, serial `2203E5FE0911`, currently used by Ceph +- `megaraid,6` / `/dev/sde` — Crucial MX500 250G, serial `2203E5FE0912`, currently used by Ceph +- `megaraid,7` / `/dev/sdf` — Crucial MX500 250G, serial `2202E5FB4CC2`, currently used by Ceph + +### 1.3 Hidden controller-visible SSDs + +The MegaRAID controller sees two additional healthy SSDs that Linux does not currently expose as `/dev/sd*` devices: + +- `megaraid,2` — Samsung SSD 860 EVO 250GB, serial `S3YHNB0K308072M` +- `megaraid,3` — Samsung SSD 860 EVO 250GB, serial `S3YJNB0K597631B` + +Health indicators for both: + +- SMART overall health: `PASSED` +- reallocated sectors: `0` +- no uncorrectables observed in the SMART summary we pulled + +These two drives are the best candidates for a dedicated local MEV storage pool on `r630-04`, but they are currently hidden behind the controller. + +## 2. Immediate operating posture + +Already applied live: + +- host swap disabled +- `/etc/fstab` swap line commented out +- `vm.swappiness=1` +- `vm.vfs_cache_pressure=50` +- CT `2421` now runs with: + - `memory: 49152` + - `swap: 0` + - `cpuunits: 4096` + +These changes reduce the blast radius, but they do **not** make `r630-04` local storage trustworthy while `sda` remains in path for: + +- `pve-root` +- thin-pool metadata +- part of `pve/data` + +## 3. Decision paths + +There are two valid paths. + +### Path A: Fastest risk reduction + +Move CT `2421` off `r630-04` to `r630-03`. 
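Before committing to Path A, it is worth confirming that `r630-03` can absorb the container's 48 GiB memory allocation with headroom. A hedged sketch — the helper name `can_host_ct` and the 20% margin are illustrative assumptions, not values from this plan:

```shell
# can_host_ct: succeed only if the target node's available memory (MB)
# covers the CT's configured memory plus a safety margin (default 20%).
can_host_ct() {
  local free_mb=$1 ct_mb=$2 margin=${3:-20}
  local need=$(( ct_mb + ct_mb * margin / 100 ))
  [ "$free_mb" -ge "$need" ]
}

# Live use (r630-03 target, CT 2421's `memory: 49152`):
#   free_mb=$(ssh root@192.168.11.13 free -m | awk '/^Mem:/{print $7}')
#   can_host_ct "$free_mb" 49152 && echo "r630-03 has headroom"
```

If the check fails, stay on the applied tuning and revisit target sizing before migrating.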
+ +Use this when: + +- you want MEV risk reduced immediately +- you do not want to touch controller config first +- you are willing to keep the `r630-04` storage redesign as a second phase + +### Path B: Keep MEV on r630-04, but move it off the degraded local pool + +Expose the two Samsung SSDs to Linux, build a dedicated thinpool on them, and move CT `2421` onto that new storage. + +Use this when: + +- you want MEV to stay on `r630-04` +- you are comfortable making controller-level storage changes +- you want a clean local storage class for MEV that is not mixed with the failing `sda` + +## 4. Recommended order + +The safest overall sequence is: + +1. keep MEV stable with the already-applied tuning +2. choose one of: + - Path A first, then redesign `r630-04` + - Path B directly if you want to keep `2421` local to `r630-04` +3. replace / retire `sda` +4. only after that, reuse `r630-04` broadly for new CT locality + +## 5. Path A: Migrate CT 2421 to r630-03 + +### 5.1 Why this is still the safest immediate option + +- `r630-03` has active `local-lvm` +- `r630-03` has large free thin capacity +- `r630-03` has enough available memory for CT `2421` +- this avoids controller work during a live application incident window + +### 5.2 Preflight + +```bash +ssh root@192.168.11.14 'pct status 2421' +ssh root@192.168.11.13 'pvecm status; pvesm status | egrep "^(data|local-lvm|local)"' +``` + +### 5.3 Preferred migration + +```bash +ssh root@192.168.11.14 'pct shutdown 2421 --timeout 120 || pct stop 2421 --skiplock' +ssh root@192.168.11.14 'pct migrate 2421 r630-03 --target-storage local-lvm --online 0' +``` + +### 5.4 Fallback if direct migrate is unhappy + +```bash +ssh root@192.168.11.14 'vzdump 2421 --mode stop --compress zstd --storage local' +scp root@192.168.11.14:/var/lib/vz/dump/vzdump-lxc-2421-*.tar.zst /tmp/ +scp /tmp/vzdump-lxc-2421-*.tar.zst root@192.168.11.13:/var/lib/vz/dump/ +ssh root@192.168.11.13 'pct restore 2421 /var/lib/vz/dump/vzdump-lxc-2421-*.tar.zst
--storage local-lvm' +``` + +### 5.5 Post-migration verification + +```bash +ssh root@192.168.11.13 'pct start 2421' +curl -fsS https://mev.defi-oracle.io/api/health +curl -fsS https://mev.defi-oracle.io/api/infra +API_KEY='cc49035e743863aba6a8bd4aa925fb59efb2f991ccab0898e61fa96cadfc951a' \ +bash scripts/verify/run-mev-roadmap-validation.sh --live-per-chain +``` + +## 6. Path B: Build a dedicated MEV thinpool on the hidden Samsung SSDs + +### 6.1 Preconditions + +You need controller-level access to the MegaRAID state for the two Samsung SSDs: + +- `S3YHNB0K308072M` +- `S3YJNB0K597631B` + +The host does **not** currently have `storcli`, `perccli`, `megacli`, or `omreport` installed from apt. That means one of these must be used: + +- Dell Lifecycle Controller / iDRAC storage UI +- vendor-provided `storcli` / `perccli` package copied in manually +- bootable maintenance environment with MegaRAID tooling + +### 6.2 What must be determined first + +Before changing anything, identify the controller slot / enclosure and current state for those two serial numbers. Possible states: + +- `UGood` +- `JBOD` +- `Hot Spare` +- `Foreign` +- `Offline` +- part of an old single-drive virtual disk + +### 6.3 If using storcli / perccli + +Typical discovery flow: + +```bash +storcli /c0 show +storcli /c0 /eall /sall show all +``` + +Find the rows matching: + +- `S3YHNB0K308072M` +- `S3YJNB0K597631B` + +Record: + +- enclosure id +- slot id +- state + +### 6.4 Controller actions by state + +These are the safe controller actions by scenario. Substitute the enclosure and slot ids recorded in 6.3 for `<EncID>` and `<SlotID>` below. + +If the drives are **global hot spares**: + +```bash +storcli /c0 /e<EncID> /s<SlotID> delete hotsparedrive +``` + +If the drives are **foreign**: + +```bash +storcli /c0 /fall delete +``` + +If the drives are **unconfigured-good** and the controller supports JBOD: + +```bash +storcli /c0 /e<EncID> /s<SlotID> set jbod +``` + +If the controller does **not** support JBOD for the chosen mode, create two single-drive RAID0 virtual disks instead.
In that case, Linux will see two new logical disks and they can still be used for a dedicated MEV VG. + +### 6.5 OS-level confirmation after exposure + +After the controller exposes the disks, verify new block devices appear: + +```bash +lsblk -d -o NAME,SIZE,MODEL,SERIAL,ROTA,TYPE,TRAN +``` + +You want to see the two Samsung serials appear as Linux devices. + +Then verify they are unused: + +```bash +blkid /dev/sdX +wipefs -n /dev/sdX +pvs +``` + +If they are clean, continue. + +### 6.6 Create a dedicated MEV VG and thinpool + +Use stable disk identifiers by serial, not guessed `/dev/sdX` names. + +Example (replace each placeholder with the real `/dev/disk/by-id/` entry whose name contains the matching Samsung serial): + +```bash +pvcreate /dev/disk/by-id/<id-S3YHNB0K308072M> /dev/disk/by-id/<id-S3YJNB0K597631B> +vgcreate pve-mev /dev/disk/by-id/<id-S3YHNB0K308072M> /dev/disk/by-id/<id-S3YJNB0K597631B> +lvcreate -l 95%VG -T -n data pve-mev +``` + +Verify: + +```bash +vgs pve-mev +lvs pve-mev +``` + +### 6.7 Add storage to Proxmox + +```bash +pvesm add lvmthin mev-local-lvm \ + --vgname pve-mev \ + --thinpool data \ + --content images,rootdir \ + --nodes r630-04 +``` + +Verify: + +```bash +pvesm status | egrep "mev-local-lvm|local-lvm|data" +``` + +### 6.8 Move CT 2421 onto the new storage + +The cleanest move is while the CT is stopped. + +Preflight backup: + +```bash +vzdump 2421 --mode stop --compress zstd --storage local +``` + +Stop the CT: + +```bash +pct shutdown 2421 --timeout 120 || pct stop 2421 --skiplock +``` + +Preferred move in place: + +```bash +pct move-volume 2421 rootfs mev-local-lvm --delete 1 +``` + +Then confirm: + +```bash +pct config 2421 | grep '^rootfs:' +``` + +Start the CT: + +```bash +pct start 2421 +``` + +### 6.9 Post-move verification + +```bash +pct status 2421 +curl -fsS https://mev.defi-oracle.io/api/health +curl -fsS https://mev.defi-oracle.io/api/infra +curl -fsS https://mev.defi-oracle.io/api/stats/freshness +API_KEY='cc49035e743863aba6a8bd4aa925fb59efb2f991ccab0898e61fa96cadfc951a' \ +bash scripts/verify/run-mev-roadmap-validation.sh --live-per-chain +``` + +## 7.
Rollback points + +### Path A rollback + +- if migration fails before cutover, keep `2421` stopped on `r630-04` and restart it there +- if restore-on-target fails, original CT still exists on source until explicitly destroyed + +### Path B rollback + +- if controller exposure step looks wrong, stop and do **not** create PVs +- if VG/thinpool creation fails, remove only the new VG and leave CT `2421` where it is +- if `pct move-volume` fails, the preflight `vzdump` is the safety net + +## 8. Recommendation + +If the priority is **lowest operational risk**, do: + +1. **Path A now** — move CT `2421` to `r630-03` +2. then repair `r630-04` storage at leisure + +If the priority is **keeping MEV on r630-04**, do: + +1. expose the two Samsung SSDs from the controller +2. build `pve-mev` +3. move CT `2421` to `mev-local-lvm` +4. then retire / replace `sda` + +## 9. Current practical recommendation + +Because the two Samsung SSDs are healthy and already identified by serial, `r630-04` does have a viable long-term local storage redesign path. + +But because `sda` is already erroring in production, the lowest-risk sequence remains: + +1. keep MEV stable with the applied hardening +2. migrate `2421` to `r630-03` if you want immediate risk removal +3. 
redesign `r630-04` local storage afterward + diff --git a/docs/MASTER_INDEX.md b/docs/MASTER_INDEX.md index 7843d3c3..588f5550 100644 --- a/docs/MASTER_INDEX.md +++ b/docs/MASTER_INDEX.md @@ -16,7 +16,7 @@ | **Agent / IDE instructions** | [AGENTS.md](../AGENTS.md) (repo root) | | **Local green-path tests** | Root `pnpm test` → [`scripts/verify/run-repo-green-test-path.sh`](../scripts/verify/run-repo-green-test-path.sh) | | **Git submodule hygiene + explorer remotes** | [00-meta/SUBMODULE_HYGIENE.md](00-meta/SUBMODULE_HYGIENE.md) — detached HEAD, push order, Gitea/GitHub, `submodules-clean.sh` | -| **MEV intel + public GUI (`mev.defi-oracle.io`)** | Framing: [../MEV_Bot/docs/framing/README.md](../MEV_Bot/docs/framing/README.md); deploy: [04-configuration/MEV_CONTROL_DEFI_ORACLE_IO_DEPLOYMENT.md](04-configuration/MEV_CONTROL_DEFI_ORACLE_IO_DEPLOYMENT.md); LAN bring-up: [04-configuration/MEV_CONTROL_LAN_BRINGUP_CHECKLIST.md](04-configuration/MEV_CONTROL_LAN_BRINGUP_CHECKLIST.md) (dedicated backend CT on `r630-04`); completion list: [04-configuration/MEV_CONTROL_COMPLETION_PUNCHLIST.md](04-configuration/MEV_CONTROL_COMPLETION_PUNCHLIST.md); execution values/readiness: [04-configuration/MEV_EXECUTION_VALUE_SOURCES_AND_READINESS.md](04-configuration/MEV_EXECUTION_VALUE_SOURCES_AND_READINESS.md); specs: [../MEV_Bot/specs/README.md](../MEV_Bot/specs/README.md) | +| **MEV intel + public GUI (`mev.defi-oracle.io`)** | Framing: [../MEV_Bot/docs/framing/README.md](../MEV_Bot/docs/framing/README.md); deploy: [04-configuration/MEV_CONTROL_DEFI_ORACLE_IO_DEPLOYMENT.md](04-configuration/MEV_CONTROL_DEFI_ORACLE_IO_DEPLOYMENT.md); LAN bring-up: [04-configuration/MEV_CONTROL_LAN_BRINGUP_CHECKLIST.md](04-configuration/MEV_CONTROL_LAN_BRINGUP_CHECKLIST.md) (dedicated backend CT on `r630-04`); completion list: [04-configuration/MEV_CONTROL_COMPLETION_PUNCHLIST.md](04-configuration/MEV_CONTROL_COMPLETION_PUNCHLIST.md); execution values/readiness: 
[04-configuration/MEV_EXECUTION_VALUE_SOURCES_AND_READINESS.md](04-configuration/MEV_EXECUTION_VALUE_SOURCES_AND_READINESS.md); `r630-04` storage repair / MEV pool redesign: [04-configuration/R630_04_STORAGE_REMEDIATION_AND_MEV_PLAN.md](04-configuration/R630_04_STORAGE_REMEDIATION_AND_MEV_PLAN.md); specs: [../MEV_Bot/specs/README.md](../MEV_Bot/specs/README.md) | | **What to do next** | [00-meta/NEXT_STEPS_INDEX.md](00-meta/NEXT_STEPS_INDEX.md) — ordered actions, by audience, execution plan | | **Recent cleanup / handoff summary** | [00-meta/OPERATOR_HANDOFF_2026-04-13_CLEANUP_AND_PLATFORM_SUMMARY.md](00-meta/OPERATOR_HANDOFF_2026-04-13_CLEANUP_AND_PLATFORM_SUMMARY.md) | | **Live verification evidence (dated)** | [00-meta/LIVE_VERIFICATION_LOG_2026-03-30.md](00-meta/LIVE_VERIFICATION_LOG_2026-03-30.md) |