feat(wormhole): AI docs mirror, MCP server, playbook, RAG, verify script

- Playbook + RAG doc; Cursor rule; sync script + manifest snapshot
- mcp-wormhole-docs: resources + wormhole_doc_search (read-only)
- verify-wormhole-ai-docs-setup.sh health check

Wire pnpm-workspace + lockfile + AGENTS/MCP_SETUP/MASTER_INDEX in a follow-up if not already committed.

Made-with: Cursor
2026-03-31 21:05:06 -07:00
parent 7f3dcf2513
commit 0f70fb6c90
10 changed files with 679 additions and 0 deletions
--- a/docs/04-configuration/WORMHOLE_AI_RESOURCES_LLM_PLAYBOOK.md
+++ b/docs/04-configuration/WORMHOLE_AI_RESOURCES_LLM_PLAYBOOK.md
@@ -0,0 +1,63 @@
+# Wormhole AI resources — LLM and agent playbook
+
+**Purpose:** How agents and humans should use Wormhole’s published documentation bundles for **Wormhole protocol** work, without mixing them up with **DBIS Chain 138** canonical facts in this repo.
+
+**Upstream hub:** [Wormhole AI Resources](https://wormhole.com/docs/ai-resources/ai-resources/)  
+**Source repo (docs):** [wormhole-foundation/wormhole-docs](https://github.com/wormhole-foundation/wormhole-docs)
+
+---
+
+## Canonical fetch URLs (verified)
+
+Use these URLs in automation and MCP. The path `https://wormhole.com/docs/docs/...` (double `docs`) returns **404**; prefer the paths below.
+
+| Tier | Artifact | URL |
+|------|-----------|-----|
+| 1 | `llms.txt` | `https://wormhole.com/docs/llms.txt` |
+| 2 | `site-index.json` | `https://wormhole.com/docs/ai/site-index.json` |
+| 3 | Category bundles | `https://wormhole.com/docs/ai/categories/<name>.md` |
+| 4 | Full corpus | `https://wormhole.com/docs/ai/llms-full.jsonl` |
+
+**Category file names:** `basics`, `ntt`, `connect`, `wtt`, `settlement`, `executor`, `multigov`, `queries`, `transfer`, `typescript-sdk`, `solidity-sdk`, `cctp`, `reference` (each as `<name>.md`).
+
+**Per-page markdown (optional):** entries in `site-index.json` include `resolved_md_url` / `raw_md_url` under `https://wormhole.com/docs/ai/pages/...` for single-page depth.
+
+---
+
+## Consumption ladder (smallest context first)
+
+1. **`llms.txt`** — Map of the doc site and links; use first to decide where to go next.
+2. **`site-index.json`** — Lightweight index (title, preview, categories, URLs); use for retrieval and “which page answers X”.
+3. **Category `.md`** — Focused implementation tasks (NTT, TypeScript SDK, reference, etc.).
+4. **`llms-full.jsonl`** — Full text + metadata; use only for large-context models or **indexed RAG** (see [WORMHOLE_AI_RESOURCES_RAG.md](WORMHOLE_AI_RESOURCES_RAG.md)). Do not paste the whole file into a small context window.
+
+Wormhole notes that these files are **informational only** (no embedded persona), so they are safe to combine with project [AGENTS.md](../../AGENTS.md) and Cursor rules.
+
+---
+
+## Boundary: Wormhole vs this repo
+
+| Topic | Source |
+|--------|--------|
+| Wormhole NTT, Connect, VAAs, Guardians, Executor, Wormhole CCTP integration, chain IDs **on Wormhole-supported networks** | Wormhole AI bundles + official Wormhole reference |
+| Chain **138** token addresses, PMM pools, DODOPMMIntegration, **CCIP** routes, deployer wallet, Blockscout labels | Repo canonical docs: [EXPLORER_TOKEN_LIST_CROSSCHECK.md](../11-references/EXPLORER_TOKEN_LIST_CROSSCHECK.md), [ADDRESS_MATRIX_AND_STATUS.md](../11-references/ADDRESS_MATRIX_AND_STATUS.md), [07-ccip/](../07-ccip/) runbooks |
+
+Do **not** treat Wormhole docs as authority for Chain 138 deployment facts. Do **not** treat this repo’s CCIP docs as authority for Wormhole core contracts on other chains.
+
+---
+
+## Local mirror and MCP
+
+- **Sync script:** `bash scripts/doc/sync-wormhole-ai-resources.sh` — downloads tiers 1–3 (and optionally tier 4) into `third-party/wormhole-ai-docs/` and writes `manifest.json` (SHA-256, timestamps).
+- **MCP server:** `mcp-wormhole-docs/` — read-only resources and `wormhole_doc_search`; see [MCP_SETUP.md](MCP_SETUP.md).
+- **Health check:** `bash scripts/verify/verify-wormhole-ai-docs-setup.sh` — checks that the mirror is present and runs `node --check` on the MCP entrypoint.
+
+---
+
+## Security and ops
+
+- Fetch only from `https://wormhole.com/docs/` (allowlisted in the MCP server when using live fetch).
+- `llms-full.jsonl` is large; mirror with `INCLUDE_FULL_JSONL=1` only when needed for RAG or offline use.
+- Re-sync when Wormhole ships breaking doc changes; keep `manifest.json` for audit (“which snapshot was used?”).
+
+**Licensing:** Wormhole Foundation material — use mirrors and RAG **consistent with their terms**; link to original URLs in answers when possible.
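
The URL table and the fetch allowlist in the playbook above can be sketched as a small helper. This is a minimal illustration only, not the `mcp-wormhole-docs` implementation: the names `TIER_URLS`, `category_url`, and `is_allowed` are hypothetical, and the exact-host and path-prefix checks are assumptions about what "allowlisted" should mean.

```python
from urllib.parse import urlsplit

DOCS_BASE = "https://wormhole.com/docs/"

# Tier 1, 2 and 4 artifacts, copied from the playbook's URL table.
TIER_URLS = {
    1: DOCS_BASE + "llms.txt",
    2: DOCS_BASE + "ai/site-index.json",
    4: DOCS_BASE + "ai/llms-full.jsonl",
}

# Tier 3 category names, copied from the playbook.
CATEGORIES = frozenset({
    "basics", "ntt", "connect", "wtt", "settlement", "executor", "multigov",
    "queries", "transfer", "typescript-sdk", "solidity-sdk", "cctp", "reference",
})


def category_url(name: str) -> str:
    """Return the tier 3 bundle URL for a known category name."""
    if name not in CATEGORIES:
        raise ValueError(f"unknown Wormhole category: {name!r}")
    return f"{DOCS_BASE}ai/categories/{name}.md"


def is_allowed(url: str) -> bool:
    """Accept only HTTPS URLs under wormhole.com/docs/ (assumed policy).

    Rejects the double-`docs` path that the playbook says returns 404,
    and any path containing a `..` segment.
    """
    parts = urlsplit(url)
    return (
        parts.scheme == "https"
        and parts.netloc == "wormhole.com"
        and parts.path.startswith("/docs/")
        and not parts.path.startswith("/docs/docs/")
        and ".." not in parts.path.split("/")
    )
```

A caller would typically check `is_allowed(url)` before any live fetch and build category URLs only through `category_url`, so a mistyped category name fails early instead of producing a 404.
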
--- a/docs/04-configuration/WORMHOLE_AI_RESOURCES_RAG.md
+++ b/docs/04-configuration/WORMHOLE_AI_RESOURCES_RAG.md
@@ -0,0 +1,64 @@
+# Wormhole `llms-full.jsonl` — RAG and chunking strategy
+
+**Purpose:** How to index Wormhole’s full documentation export for retrieval-augmented generation without blowing context limits or drowning out Chain 138 canonical facts.
+
+**Prerequisite:** Download the corpus with `INCLUDE_FULL_JSONL=1 bash scripts/doc/sync-wormhole-ai-resources.sh` (or `--full-jsonl`). File: `third-party/wormhole-ai-docs/llms-full.jsonl` (gitignored; large).
+
+**Playbook (tiers):** [WORMHOLE_AI_RESOURCES_LLM_PLAYBOOK.md](WORMHOLE_AI_RESOURCES_LLM_PLAYBOOK.md)
+
+---
+
+## Category-first retrieval (default policy)
+
+1. **Before** querying `llms-full.jsonl`, resolve intent:
+   - **Broad protocol** → start from mirrored `categories/basics.md` or `reference.md`.
+   - **Product-specific** → pick the matching category file (`ntt.md`, `cctp.md`, `typescript-sdk.md`, etc.) from the mirror or `https://wormhole.com/docs/ai/categories/<name>.md`.
+2. Use **`site-index.json`** (tier 2) to rank **page-level** `id` / `title` / `preview` / `categories` and obtain `html_url` / `resolved_md_url`.
+3. Only then ingest or search **full JSONL** lines that correspond to those pages (if your pipeline supports filtering by `id` or URL prefix).
+
+This keeps answers aligned with Wormhole’s own doc structure and reduces irrelevant hits.
+
+---
+
+## Chunking `llms-full.jsonl`
+
+The file is **JSON Lines**: each line is one JSON object (typically one doc page or chunk with metadata).
+
+**Recommended:**
+
+- **Parse line-by-line** (streaming); do not load the entire file into RAM for parsing.
+- **One line = one logical chunk** if each object already represents a single page; if objects are huge, split on `sections` or headings when present in the schema.
+- **Metadata to store per chunk:** at minimum `id`, `title`, `slug`, `html_url`, and any `categories` / `hash` fields present in that line. Prefer storing **source URL** for citation in agent answers.
+- **Embeddings:** embed `title + "\n\n" + body_or_preview` (or the equivalent text field in the object). Keep the URL in metadata only, not in the embedded text, so the retriever can return it to the user for citation.
+
+**Deduplication:** if the same `hash` or `id` appears across syncs, replace vectors for that id on re-index.
+
+---
+
+## Query flow (RAG)
+
+```mermaid
+flowchart TD
+  Q[User query] --> Intent{Wormhole product area?}
+  Intent -->|yes| Cat[Retrieve from category md slice]
+  Intent -->|unclear| Idx[Search site-index.json previews]
+  Cat --> NeedFull{Need deeper text?}
+  Idx --> NeedFull
+  NeedFull -->|no| Ans[Answer with citations]
+  NeedFull -->|yes| JSONL[Vector search over llms-full.jsonl filtered by category or id]
+  JSONL --> Ans
+```
+
+---
+
+## Boundaries
+
+- RAG over Wormhole docs improves **Wormhole** answers; it does **not** override [EXPLORER_TOKEN_LIST_CROSSCHECK.md](../11-references/EXPLORER_TOKEN_LIST_CROSSCHECK.md) or CCIP runbooks for **Chain 138** deployment truth.
+- If a user question mixes both (e.g. “bridge USDC to Chain 138 via Wormhole”), answer in **two explicit sections**: Wormhole mechanics vs this repo’s CCIP / 138 facts.
+
+---
+
+## Re-sync and audit
+
+- After `sync-wormhole-ai-resources.sh`, commit or archive **`third-party/wormhole-ai-docs/manifest.json`** when you want a recorded snapshot (hashes per file).
+- Rebuild or delta-update the vector index when `manifest.json` changes.
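
The chunking and deduplication guidance in the RAG doc above can be sketched as follows. This is a hedged illustration, not part of the commit: `iter_chunks` and `upsert` are hypothetical names, the in-memory `dict` stands in for a real vector index, and the text field names `body` and `preview` are assumptions, since the doc does not pin down the `llms-full.jsonl` schema.

```python
import json
from typing import Iterable, Iterator, Optional


def iter_chunks(
    lines: Iterable[str],
    wanted_ids: Optional[set] = None,
    text_keys: tuple = ("body", "preview"),  # assumed field names
) -> Iterator[dict]:
    """Yield one chunk per JSONL line, optionally filtered by page id."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        if wanted_ids is not None and record.get("id") not in wanted_ids:
            continue
        # First non-empty text field wins; fall back to an empty string.
        body = next((record[k] for k in text_keys if record.get(k)), "")
        yield {
            "id": record.get("id"),
            "embed_text": f"{record.get('title', '')}\n\n{body}",
            # The URL stays in metadata so the retriever can cite it.
            "metadata": {
                "html_url": record.get("html_url"),
                "categories": record.get("categories", []),
                "hash": record.get("hash"),
            },
        }


def upsert(index: dict, chunks: Iterable[dict]) -> dict:
    """Replace the stored entry for any id that reappears on re-index."""
    for chunk in chunks:
        index[chunk["id"]] = chunk
    return index
```

Passing an open file handle as `lines` (for example, `open("third-party/wormhole-ai-docs/llms-full.jsonl", encoding="utf-8")`) keeps parsing streaming, because Python iterates a text file one line at a time. The `wanted_ids` set would come from the category or `site-index.json` step, which keeps the category-first policy in force.
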