Add full monorepo: virtual-banker, backend, frontend, docs, scripts, deployment

Co-authored-by: Cursor <cursoragent@cursor.com>
defiQUG
2026-02-10 11:32:49 -08:00
parent 4d4f8cedad
commit 903c03c65b
815 changed files with 125522 additions and 264 deletions


@@ -0,0 +1,294 @@
# Data Lake Schema Specification
## Overview
This document specifies the data lake schema for long-term storage of blockchain data in S3-compatible object storage using Parquet format for analytics, ML, and compliance purposes.
## Storage Structure
### Directory Layout
```
s3://explorer-data-lake/
├── raw/
│ ├── chain_id=138/
│ │ ├── year=2024/
│ │ │ ├── month=01/
│ │ │ │ ├── day=01/
│ │ │ │ │ ├── blocks.parquet
│ │ │ │ │ ├── transactions.parquet
│ │ │ │ │ └── logs.parquet
│ │ │ │ └── ...
│ │ │ └── ...
│ │ └── ...
│ └── ...
├── processed/
│ ├── chain_id=138/
│ │ ├── daily_aggregates/
│ │ │ ├── year=2024/
│ │ │ │ └── month=01/
│ │ │ │ └── day=01.parquet
│ │ └── ...
│ └── ...
└── archived/
└── ...
```
### Partitioning Strategy
**Partition Keys**:
- `chain_id`: Chain identifier
- `year`: Year (YYYY)
- `month`: Month (MM)
- `day`: Day (DD)
**Benefits**:
- Efficient query pruning
- Parallel processing
- Easy data management (delete by partition)
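These `key=value` path segments follow the Hive partitioning convention, so engines like Athena and Spark can prune partitions automatically. A minimal sketch of building a partition prefix (the function name is illustrative):

```python
from datetime import date

def partition_prefix(chain_id: int, d: date, layer: str = "raw") -> str:
    """Build the Hive-style partition prefix used in the data lake layout.

    Zero-padded month/day keeps lexicographic and chronological order aligned.
    """
    return (
        f"{layer}/chain_id={chain_id}/"
        f"year={d.year:04d}/month={d.month:02d}/day={d.day:02d}/"
    )

# Example: the blocks file for 2024-01-01 on chain 138
key = partition_prefix(138, date(2024, 1, 1)) + "blocks.parquet"
```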
## Parquet Schema
### Blocks Parquet Schema
```json
{
"type": "struct",
"fields": [
{"name": "chain_id", "type": "integer", "nullable": false},
{"name": "number", "type": "long", "nullable": false},
{"name": "hash", "type": "string", "nullable": false},
{"name": "parent_hash", "type": "string", "nullable": false},
{"name": "timestamp", "type": "timestamp", "nullable": false},
{"name": "miner", "type": "string", "nullable": true},
{"name": "gas_used", "type": "long", "nullable": true},
{"name": "gas_limit", "type": "long", "nullable": true},
{"name": "transaction_count", "type": "integer", "nullable": true},
{"name": "size", "type": "integer", "nullable": true}
]
}
```
### Transactions Parquet Schema
```json
{
"type": "struct",
"fields": [
{"name": "chain_id", "type": "integer", "nullable": false},
{"name": "hash", "type": "string", "nullable": false},
{"name": "block_number", "type": "long", "nullable": false},
{"name": "transaction_index", "type": "integer", "nullable": false},
{"name": "from_address", "type": "string", "nullable": false},
{"name": "to_address", "type": "string", "nullable": true},
{"name": "value", "type": "string", "nullable": false}, // Decimal as string
{"name": "gas_price", "type": "long", "nullable": true},
{"name": "gas_used", "type": "long", "nullable": true},
{"name": "gas_limit", "type": "long", "nullable": false},
{"name": "status", "type": "integer", "nullable": true},
{"name": "timestamp", "type": "timestamp", "nullable": false}
]
}
```
### Logs Parquet Schema
```json
{
"type": "struct",
"fields": [
{"name": "chain_id", "type": "integer", "nullable": false},
{"name": "transaction_hash", "type": "string", "nullable": false},
{"name": "block_number", "type": "long", "nullable": false},
{"name": "log_index", "type": "integer", "nullable": false},
{"name": "address", "type": "string", "nullable": false},
{"name": "topic0", "type": "string", "nullable": true},
{"name": "topic1", "type": "string", "nullable": true},
{"name": "topic2", "type": "string", "nullable": true},
{"name": "topic3", "type": "string", "nullable": true},
{"name": "data", "type": "string", "nullable": true},
{"name": "timestamp", "type": "timestamp", "nullable": false}
]
}
```
### Token Transfers Parquet Schema
```json
{
"type": "struct",
"fields": [
{"name": "chain_id", "type": "integer", "nullable": false},
{"name": "transaction_hash", "type": "string", "nullable": false},
{"name": "block_number", "type": "long", "nullable": false},
{"name": "token_address", "type": "string", "nullable": false},
{"name": "token_type", "type": "string", "nullable": false},
{"name": "from_address", "type": "string", "nullable": false},
{"name": "to_address", "type": "string", "nullable": false},
{"name": "amount", "type": "string", "nullable": true},
{"name": "token_id", "type": "string", "nullable": true},
{"name": "timestamp", "type": "timestamp", "nullable": false}
]
}
```
## Data Ingestion
### ETL Pipeline
**Process**:
1. Extract: Query PostgreSQL for daily data
2. Transform: Convert to Parquet format
3. Load: Upload to S3 with partitioning
**Schedule**: Daily batch job that runs after the day ends
**Tools**: Apache Spark, AWS Glue, or custom ETL scripts
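The three steps can be wired together as a small skeleton. This is a sketch under stated assumptions: the stages are injected as callables, where a real job would wrap a PostgreSQL query, a Parquet encoder (e.g. pyarrow), and an S3 client.

```python
from datetime import date
from typing import Callable, Iterable, Mapping

def run_daily_etl(
    day: date,
    extract: Callable[[date], Iterable[Mapping]],     # e.g. a PostgreSQL query
    transform: Callable[[Iterable[Mapping]], bytes],  # e.g. a Parquet encoder
    load: Callable[[str, bytes], None],               # e.g. an S3 upload
    chain_id: int = 138,
) -> str:
    """Run one day's extract/transform/load and return the object key written."""
    rows = extract(day)
    payload = transform(rows)
    key = (
        f"raw/chain_id={chain_id}/year={day.year:04d}/"
        f"month={day.month:02d}/day={day.day:02d}/blocks.parquet"
    )
    load(key, payload)
    return key
```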
### Compression
**Format**: Snappy compression (good balance of speed and compression ratio)
**Alternative**: Gzip (better compression, slower)
### File Sizing
**Target Size**: 100-500 MB per Parquet file
- Smaller files: Better parallelism
- Larger files: Better compression
**Strategy**: Write files near the target size, splitting large days across multiple files or time ranges
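Picking the file count for a day is ceiling division against the target size; a small sketch (the 256 MB default is an assumption within the 100-500 MB range above):

```python
def plan_file_count(total_bytes: int, target_bytes: int = 256 * 1024 * 1024) -> int:
    """How many Parquet files keep each near the target size (ceiling division)."""
    return max(1, -(-total_bytes // target_bytes))

# A ~1.2 GB day at a 256 MB target comes out to 5 files.
files = plan_file_count(1_288_490_188)
```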
## Query Interface
### AWS Athena / Presto
**Table Definition**:
```sql
CREATE EXTERNAL TABLE blocks_138 (
  chain_id int,
  `number` bigint,
  hash string,
  parent_hash string,
  `timestamp` timestamp,
  miner string,
  gas_used bigint,
  gas_limit bigint,
  transaction_count int,
  size int
)
PARTITIONED BY (year int, month int, day int)
STORED AS PARQUET
LOCATION 's3://explorer-data-lake/raw/chain_id=138/'
TBLPROPERTIES (
  'projection.enabled' = 'true',
  'projection.year.type' = 'integer',
  'projection.year.range' = '2020,2030',
  'projection.month.type' = 'integer',
  'projection.month.range' = '1,12',
  'projection.month.digits' = '2',
  'projection.day.type' = 'integer',
  'projection.day.range' = '1,31',
  'projection.day.digits' = '2',
  'storage.location.template' = 's3://explorer-data-lake/raw/chain_id=138/year=${year}/month=${month}/day=${day}/'
);
```
### Query Examples
**Daily Transaction Count**:
```sql
SELECT
DATE(timestamp) as date,
COUNT(*) as transaction_count
FROM transactions_138
WHERE year = 2024 AND month = 1
GROUP BY DATE(timestamp)
ORDER BY date;
```
**Token Transfer Analytics**:
```sql
SELECT
token_address,
COUNT(*) as transfer_count,
SUM(CAST(amount AS DECIMAL(78, 0))) as total_volume
FROM token_transfers_138
WHERE year = 2024 AND month = 1
GROUP BY token_address
ORDER BY total_volume DESC
LIMIT 100;
```
## Data Retention
### Retention Policies
**Raw Data**: 7 years (compliance requirement)
**Processed Aggregates**: Indefinite
**Archived Data**: Move to Glacier after 1 year
### Lifecycle Policies
**S3 Lifecycle Rules**:
1. Move to Infrequent Access after 30 days
2. Move to Glacier after 1 year
3. Delete after 7 years (raw data)
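The three rules above map directly onto an S3 lifecycle configuration; a sketch of the boto3-style `put_bucket_lifecycle_configuration` payload (the rule ID is an assumption, and 7 years is approximated as 2555 days):

```python
# Lifecycle rules for the raw/ prefix: IA at 30 days, Glacier at 1 year,
# expiry at the 7-year retention boundary.
lifecycle = {
    "Rules": [
        {
            "ID": "raw-data-tiering",
            "Filter": {"Prefix": "raw/"},
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 365, "StorageClass": "GLACIER"},
            ],
            "Expiration": {"Days": 2555},
        }
    ]
}
```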
## Data Processing
### Aggregation Jobs
**Daily Aggregates**:
- Transaction counts by hour
- Gas usage statistics
- Token transfer volumes
- Address activity metrics
**Monthly Aggregates**:
- Network growth metrics
- Token distribution changes
- Protocol usage statistics
### ML/Analytics Workflows
**Use Cases**:
- Anomaly detection
- Fraud detection
- Market analysis
- Network health monitoring
**Tools**: Spark, Pandas, Jupyter notebooks
## Security and Access Control
### Access Control
**IAM Policies**: Restrict access to specific prefixes
**Encryption**: Server-side encryption (SSE-S3 or SSE-KMS)
**Audit Logging**: Enable S3 access logging
### Data Classification
**Public Data**: Blocks, transactions (public blockchain data)
**Sensitive Data**: User addresses, labels (requires authentication)
**Compliance Data**: Banking/transaction data (strict access control)
## Cost Optimization
### Storage Optimization
**Strategies**:
- Use appropriate storage classes (Standard, IA, Glacier)
- Compress data (Parquet + Snappy)
- Delete old data per retention policy
- Use intelligent tiering
### Query Optimization
**Strategies**:
- Partition pruning (query only relevant partitions)
- Column pruning (select only needed columns)
- Predicate pushdown (filter early)
## References
- Database Schema: See `postgres-schema.md`
- Analytics: See `../observability/metrics-monitoring.md`


@@ -0,0 +1,300 @@
# Graph Database Schema Specification
## Overview
This document specifies the Neo4j graph database schema for storing cross-chain entity relationships, address clustering, and protocol interactions.
## Schema Design
### Node Types
#### Address Node
**Labels**: `Address`, `Chain{chain_id}` (e.g., `Chain138`)
**Properties**:
```cypher
{
address: "0x...", // Unique identifier
chainId: 138, // Chain ID
label: "My Wallet", // Optional label
isContract: false, // Is contract address
firstSeen: timestamp, // First seen timestamp
lastSeen: timestamp, // Last seen timestamp
transactionCount: 100, // Transaction count
balance: "1.5" // Current balance (string for precision)
}
```
**Constraints**:
```cypher
CREATE CONSTRAINT address_address_chain_id FOR (a:Address)
REQUIRE (a.address, a.chainId) IS UNIQUE;
```
#### Contract Node
**Labels**: `Contract`, `Address`
**Properties**: Inherits from Address, plus:
```cypher
{
name: "MyToken",
verificationStatus: "verified",
compilerVersion: "0.8.19"
}
```
#### Token Node
**Labels**: `Token`, `Contract`
**Properties**: Inherits from Contract, plus:
```cypher
{
symbol: "MTK",
decimals: 18,
totalSupply: "1000000",
type: "ERC20" // ERC20, ERC721, ERC1155
}
```
#### Protocol Node
**Labels**: `Protocol`
**Properties**:
```cypher
{
name: "Uniswap V3",
category: "DEX",
website: "https://uniswap.org"
}
```
### Relationship Types
#### TRANSFERRED_TO
**Purpose**: Track token transfers between addresses.
**Properties**:
```cypher
{
amount: "1000000000000000000",
tokenAddress: "0x...",
transactionHash: "0x...",
blockNumber: 12345,
timestamp: timestamp
}
```
**Example**:
```cypher
(a1:Address {address: "0x..."})-[r:TRANSFERRED_TO {
amount: "1000000000000000000",
tokenAddress: "0x...",
transactionHash: "0x..."
}]->(a2:Address {address: "0x..."})
```
#### CALLED
**Purpose**: Track contract calls between addresses.
**Properties**:
```cypher
{
transactionHash: "0x...",
blockNumber: 12345,
timestamp: timestamp,
gasUsed: 21000,
method: "transfer"
}
```
#### OWNS
**Purpose**: Track token ownership (current balances).
**Properties**:
```cypher
{
balance: "1000000000000000000",
tokenId: "123", // For ERC-721/1155
updatedAt: timestamp
}
```
**Example**:
```cypher
(a:Address)-[r:OWNS {
balance: "1000000000000000000",
updatedAt: timestamp
}]->(t:Token)
```
#### INTERACTS_WITH
**Purpose**: Track protocol interactions.
**Properties**:
```cypher
{
interactionType: "swap", // swap, deposit, withdraw, etc.
transactionHash: "0x...",
timestamp: timestamp
}
```
**Example**:
```cypher
(a:Address)-[r:INTERACTS_WITH {
interactionType: "swap",
transactionHash: "0x..."
}]->(p:Protocol)
```
#### CLUSTERED_WITH
**Purpose**: Link addresses that belong to the same entity (address clustering).
**Properties**:
```cypher
{
confidence: 0.95, // Clustering confidence score
method: "heuristic", // Clustering method
createdAt: timestamp
}
```
#### CCIP_MESSAGE_LINK
**Purpose**: Link transactions across chains via CCIP messages.
**Properties**:
```cypher
{
messageId: "0x...",
sourceTxHash: "0x...",
destTxHash: "0x...",
status: "delivered",
timestamp: timestamp
}
```
**Example**:
```cypher
(srcTx:Transaction)-[r:CCIP_MESSAGE_LINK {
messageId: "0x...",
status: "delivered"
}]->(destTx:Transaction)
```
## Query Patterns
### Find Token Holders
```cypher
MATCH (a:Address)-[r:OWNS]->(t:Token {address: "0x...", chainId: 138})
WHERE toFloat(r.balance) > 0
RETURN a.address, r.balance
ORDER BY toFloat(r.balance) DESC
LIMIT 100;
```
### Find Transfer Path
```cypher
MATCH path = (a1:Address {address: "0x..."})-[:TRANSFERRED_TO*1..3]-(a2:Address {address: "0x..."})
WHERE ALL(r in relationships(path) WHERE r.tokenAddress = "0x...")
RETURN path
LIMIT 10;
```
### Find Protocol Users
```cypher
MATCH (a:Address)-[r:INTERACTS_WITH]->(p:Protocol {name: "Uniswap V3"})
RETURN a.address, count(r) as interactionCount
ORDER BY interactionCount DESC
LIMIT 100;
```
### Address Clustering
```cypher
MATCH (a1:Address)-[r:CLUSTERED_WITH]-(a2:Address)
WHERE a1.address = "0x..."
RETURN a2.address, r.confidence, r.method;
```
### Cross-Chain CCIP Links
```cypher
MATCH (srcTx:Transaction {hash: "0x..."})-[r:CCIP_MESSAGE_LINK]-(destTx:Transaction)
RETURN srcTx, r, destTx;
```
## Data Ingestion
### Transaction Ingestion
**Process**:
1. Process transaction from indexer
2. Create/update address nodes
3. Create TRANSFERRED_TO relationships for token transfers
4. Create CALLED relationships for contract calls
5. Update OWNS relationships for token balances
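Steps 2-3 reduce to building one parameter map per transfer for the MERGE statements shown earlier; a minimal sketch (the input field names mirror the indexer's transfer records and are assumptions):

```python
def transfer_params(chain_id: int, transfer: dict) -> dict:
    """Parameter map for a MERGE on the TRANSFERRED_TO relationship."""
    return {
        "chainId": chain_id,
        "from": transfer["from_address"].lower(),   # normalize address casing
        "to": transfer["to_address"].lower(),
        "txHash": transfer["transaction_hash"],
        "amount": str(transfer["amount"]),          # string preserves precision
    }
```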
### Batch Ingestion
**Strategy**:
- Use Neo4j Batch API for bulk inserts
- Batch size: 1000-10000 operations
- Use transactions for atomicity
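The batching strategy above can be sketched as a generator that groups operations into transaction-sized chunks:

```python
from typing import Iterable, Iterator, List

def batched(ops: Iterable[dict], size: int = 1000) -> Iterator[List[dict]]:
    """Yield operation batches sized for one Neo4j transaction apiece."""
    batch: List[dict] = []
    for op in ops:
        batch.append(op)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch
```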
### Incremental Updates
**Process**:
- Update relationships as new transactions are processed
- Maintain OWNS relationships (update balances)
- Add new relationships for new interactions
## Performance Optimization
### Indexing
**Indexes**:
```cypher
CREATE INDEX address_address FOR (a:Address) ON (a.address);
CREATE INDEX address_chain_id FOR (a:Address) ON (a.chainId);
CREATE INDEX transaction_hash FOR (t:Transaction) ON (t.hash);
```
### Relationship Constraints
**Uniqueness**: Use MERGE to avoid duplicate relationships
**Example**:
```cypher
MATCH (a1:Address {address: "0x...", chainId: 138})
MATCH (a2:Address {address: "0x...", chainId: 138})
MERGE (a1)-[r:TRANSFERRED_TO {
transactionHash: "0x..."
}]->(a2)
ON CREATE SET r.amount = "1000000", r.timestamp = timestamp();
```
## Data Retention
**Strategy**:
- Keep all current relationships
- Archive old relationships (older than 1 year) to separate database
- Keep aggregated statistics (interaction counts) instead of all relationships
## References
- Entity Graph: See `../multichain/entity-graph.md`
- CCIP Integration: See `../ccip/ccip-tracking.md`


@@ -0,0 +1,517 @@
# PostgreSQL Database Schema Specification
## Overview
This document specifies the complete PostgreSQL database schema for the explorer platform. The schema is designed to support multi-chain operation, high-performance queries, and data consistency.
## Schema Design Principles
1. **Multi-chain Support**: All tables include `chain_id` for chain isolation
2. **Normalization**: Normalized structure to avoid data duplication
3. **Performance**: Strategic indexing for common query patterns
4. **Consistency**: Foreign key constraints where appropriate
5. **Extensibility**: JSONB columns for flexible data storage
6. **Partitioning**: Large tables partitioned by `chain_id`
## Core Tables
### Blocks Table
See `../indexing/data-models.md` for detailed block schema.
**Partitioning**: Partition by `chain_id` for large deployments.
**Key Indexes**:
- Primary: `(chain_id, number)`
- Unique: `(chain_id, hash)`
- Index: `(chain_id, timestamp)` for time-range queries
### Transactions Table
See `../indexing/data-models.md` for detailed transaction schema.
**Key Indexes**:
- Primary: `(chain_id, hash)`
- Index: `(chain_id, block_number, transaction_index)` for block queries
- Index: `(chain_id, from_address)` for address queries
- Index: `(chain_id, to_address)` for address queries
- Index: `(chain_id, block_number, from_address)` for compound queries
### Logs Table
See `../indexing/data-models.md` for detailed log schema.
**Key Indexes**:
- Primary: `(chain_id, transaction_hash, log_index)`
- Index: `(chain_id, address)` for contract event queries
- Index: `(chain_id, topic0)` for event type queries
- Index: `(chain_id, address, topic0)` for filtered event queries
- Index: `(chain_id, block_number)` for block-based queries
### Traces Table
See `../indexing/data-models.md` for detailed trace schema.
**Key Indexes**:
- Primary: `(chain_id, transaction_hash, trace_address)`
- Index: `(chain_id, action_from)` for address queries
- Index: `(chain_id, action_to)` for address queries
- Index: `(chain_id, block_number)` for block queries
### Internal Transactions Table
See `../indexing/data-models.md` for detailed internal transaction schema.
**Key Indexes**:
- Primary: `(chain_id, transaction_hash, trace_address)`
- Index: `(chain_id, from_address)`
- Index: `(chain_id, to_address)`
- Index: `(chain_id, block_number)`
## Token Tables
### Tokens Table
```sql
CREATE TABLE tokens (
id BIGSERIAL,
chain_id INTEGER NOT NULL,
address VARCHAR(42) NOT NULL,
type VARCHAR(10) NOT NULL CHECK (type IN ('ERC20', 'ERC721', 'ERC1155')),
name VARCHAR(255),
symbol VARCHAR(50),
decimals INTEGER CHECK (decimals >= 0 AND decimals <= 255), -- ERC-20 decimals is a uint8
total_supply NUMERIC(78, 0),
holder_count INTEGER DEFAULT 0,
transfer_count INTEGER DEFAULT 0,
logo_url TEXT,
website_url TEXT,
description TEXT,
verified BOOLEAN DEFAULT false,
created_at TIMESTAMP DEFAULT NOW(),
updated_at TIMESTAMP DEFAULT NOW(),
PRIMARY KEY (chain_id, id), -- constraints on partitioned tables must include the partition key
UNIQUE (chain_id, address)
) PARTITION BY LIST (chain_id);
CREATE INDEX idx_tokens_chain_address ON tokens(chain_id, address);
CREATE INDEX idx_tokens_chain_type ON tokens(chain_id, type);
CREATE INDEX idx_tokens_chain_symbol ON tokens(chain_id, symbol);
```
### Token Transfers Table
```sql
CREATE TABLE token_transfers (
id BIGSERIAL,
chain_id INTEGER NOT NULL,
transaction_hash VARCHAR(66) NOT NULL,
block_number BIGINT NOT NULL,
log_index INTEGER NOT NULL,
token_address VARCHAR(42) NOT NULL,
token_type VARCHAR(10) NOT NULL CHECK (token_type IN ('ERC20', 'ERC721', 'ERC1155')),
from_address VARCHAR(42) NOT NULL,
to_address VARCHAR(42) NOT NULL,
amount NUMERIC(78, 0),
token_id VARCHAR(78),
operator VARCHAR(42),
created_at TIMESTAMP DEFAULT NOW(),
PRIMARY KEY (chain_id, id),
FOREIGN KEY (chain_id, transaction_hash) REFERENCES transactions(chain_id, hash),
FOREIGN KEY (chain_id, token_address) REFERENCES tokens(chain_id, address),
UNIQUE (chain_id, transaction_hash, log_index)
) PARTITION BY LIST (chain_id);
CREATE INDEX idx_token_transfers_chain_token ON token_transfers(chain_id, token_address);
CREATE INDEX idx_token_transfers_chain_from ON token_transfers(chain_id, from_address);
CREATE INDEX idx_token_transfers_chain_to ON token_transfers(chain_id, to_address);
CREATE INDEX idx_token_transfers_chain_tx ON token_transfers(chain_id, transaction_hash);
CREATE INDEX idx_token_transfers_chain_block ON token_transfers(chain_id, block_number);
CREATE INDEX idx_token_transfers_chain_token_from ON token_transfers(chain_id, token_address, from_address);
CREATE INDEX idx_token_transfers_chain_token_to ON token_transfers(chain_id, token_address, to_address);
```
### Token Holders Table (Optional)
**Purpose**: Maintain current token balances for efficient queries.
```sql
CREATE TABLE token_holders (
id BIGSERIAL,
chain_id INTEGER NOT NULL,
token_address VARCHAR(42) NOT NULL,
address VARCHAR(42) NOT NULL,
balance NUMERIC(78, 0) NOT NULL DEFAULT 0,
token_id VARCHAR(78), -- For ERC-721/1155
updated_at TIMESTAMP DEFAULT NOW(),
PRIMARY KEY (chain_id, id),
FOREIGN KEY (chain_id, token_address) REFERENCES tokens(chain_id, address)
) PARTITION BY LIST (chain_id);
-- Expressions are not allowed in table UNIQUE constraints, so enforce uniqueness with an index:
CREATE UNIQUE INDEX idx_token_holders_unique
ON token_holders(chain_id, token_address, address, COALESCE(token_id, ''));
CREATE INDEX idx_token_holders_chain_token ON token_holders(chain_id, token_address);
CREATE INDEX idx_token_holders_chain_address ON token_holders(chain_id, address);
```
## Contract Tables
### Contracts Table
```sql
CREATE TABLE contracts (
id BIGSERIAL,
chain_id INTEGER NOT NULL,
address VARCHAR(42) NOT NULL,
name VARCHAR(255),
compiler_version VARCHAR(50),
optimization_enabled BOOLEAN,
optimization_runs INTEGER,
evm_version VARCHAR(20),
source_code TEXT,
abi JSONB,
constructor_arguments TEXT,
verification_status VARCHAR(20) NOT NULL CHECK (verification_status IN ('pending', 'verified', 'failed')),
verified_at TIMESTAMP,
verification_method VARCHAR(50),
license VARCHAR(50),
created_at TIMESTAMP DEFAULT NOW(),
updated_at TIMESTAMP DEFAULT NOW(),
PRIMARY KEY (chain_id, id),
UNIQUE (chain_id, address)
) PARTITION BY LIST (chain_id);
CREATE INDEX idx_contracts_chain_address ON contracts(chain_id, address);
CREATE INDEX idx_contracts_chain_verified ON contracts(chain_id, verification_status);
CREATE INDEX idx_contracts_abi_gin ON contracts USING GIN (abi); -- For ABI queries
```
### Contract ABIs Table
```sql
CREATE TABLE contract_abis (
id BIGSERIAL,
chain_id INTEGER NOT NULL,
address VARCHAR(42) NOT NULL,
abi JSONB NOT NULL,
source VARCHAR(50) NOT NULL,
verified BOOLEAN DEFAULT false,
created_at TIMESTAMP DEFAULT NOW(),
updated_at TIMESTAMP DEFAULT NOW(),
PRIMARY KEY (chain_id, id),
UNIQUE (chain_id, address)
) PARTITION BY LIST (chain_id);
CREATE INDEX idx_abis_chain_address ON contract_abis(chain_id, address);
CREATE INDEX idx_abis_abi_gin ON contract_abis USING GIN (abi);
```
### Contract Verifications Table
```sql
CREATE TABLE contract_verifications (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
chain_id INTEGER NOT NULL,
address VARCHAR(42) NOT NULL,
status VARCHAR(20) NOT NULL CHECK (status IN ('pending', 'processing', 'verified', 'failed', 'partially_verified')),
compiler_version VARCHAR(50),
optimization_enabled BOOLEAN,
optimization_runs INTEGER,
evm_version VARCHAR(20),
source_code TEXT,
abi JSONB,
constructor_arguments TEXT,
verification_method VARCHAR(50),
error_message TEXT,
verified_at TIMESTAMP,
version INTEGER DEFAULT 1,
is_active BOOLEAN DEFAULT true,
created_at TIMESTAMP DEFAULT NOW(),
updated_at TIMESTAMP DEFAULT NOW(),
FOREIGN KEY (chain_id, address) REFERENCES contracts(chain_id, address)
);
CREATE INDEX idx_verifications_chain_address ON contract_verifications(chain_id, address);
CREATE INDEX idx_verifications_status ON contract_verifications(status);
```
## Address-Related Tables
### Address Labels Table
```sql
CREATE TABLE address_labels (
id BIGSERIAL,
chain_id INTEGER NOT NULL,
address VARCHAR(42) NOT NULL,
label VARCHAR(255) NOT NULL,
label_type VARCHAR(20) NOT NULL CHECK (label_type IN ('user', 'public', 'contract_name')),
user_id UUID,
source VARCHAR(50),
created_at TIMESTAMP DEFAULT NOW(),
updated_at TIMESTAMP DEFAULT NOW(),
PRIMARY KEY (id),
UNIQUE (chain_id, address, label_type, user_id),
FOREIGN KEY (user_id) REFERENCES users(id) ON DELETE CASCADE
);
CREATE INDEX idx_labels_chain_address ON address_labels(chain_id, address);
CREATE INDEX idx_labels_chain_user ON address_labels(chain_id, user_id);
```
### Address Tags Table
```sql
CREATE TABLE address_tags (
id BIGSERIAL,
chain_id INTEGER NOT NULL,
address VARCHAR(42) NOT NULL,
tag VARCHAR(50) NOT NULL,
tag_type VARCHAR(20) NOT NULL CHECK (tag_type IN ('category', 'risk', 'protocol')),
user_id UUID,
created_at TIMESTAMP DEFAULT NOW(),
PRIMARY KEY (id),
UNIQUE (chain_id, address, tag, user_id),
FOREIGN KEY (user_id) REFERENCES users(id) ON DELETE CASCADE
);
CREATE INDEX idx_tags_chain_address ON address_tags(chain_id, address);
CREATE INDEX idx_tags_chain_tag ON address_tags(chain_id, tag);
```
## User Tables
### Users Table
```sql
CREATE TABLE users (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
email VARCHAR(255) UNIQUE,
username VARCHAR(100) UNIQUE,
password_hash TEXT,
api_key_hash TEXT,
created_at TIMESTAMP DEFAULT NOW(),
updated_at TIMESTAMP DEFAULT NOW(),
last_login_at TIMESTAMP
);
CREATE INDEX idx_users_email ON users(email);
CREATE INDEX idx_users_username ON users(username);
```
### Watchlists Table
```sql
CREATE TABLE watchlists (
id BIGSERIAL,
user_id UUID NOT NULL,
chain_id INTEGER NOT NULL,
address VARCHAR(42) NOT NULL,
label VARCHAR(255),
created_at TIMESTAMP DEFAULT NOW(),
PRIMARY KEY (id),
UNIQUE (user_id, chain_id, address),
FOREIGN KEY (user_id) REFERENCES users(id) ON DELETE CASCADE
);
CREATE INDEX idx_watchlists_user ON watchlists(user_id);
CREATE INDEX idx_watchlists_chain_address ON watchlists(chain_id, address);
```
### API Keys Table
```sql
CREATE TABLE api_keys (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
user_id UUID NOT NULL,
key_hash TEXT NOT NULL UNIQUE,
name VARCHAR(255),
tier VARCHAR(20) NOT NULL CHECK (tier IN ('free', 'pro', 'enterprise')),
rate_limit_per_second INTEGER,
rate_limit_per_minute INTEGER,
ip_whitelist TEXT[], -- Array of CIDR blocks
last_used_at TIMESTAMP,
expires_at TIMESTAMP,
revoked BOOLEAN DEFAULT false,
created_at TIMESTAMP DEFAULT NOW(),
FOREIGN KEY (user_id) REFERENCES users(id) ON DELETE CASCADE
);
CREATE INDEX idx_api_keys_user ON api_keys(user_id);
CREATE INDEX idx_api_keys_hash ON api_keys(key_hash);
```
## Multi-Chain Partitioning
### Partitioning Strategy
**Large Tables**: Partition by `chain_id` using LIST partitioning.
**Tables to Partition**:
- `blocks`
- `transactions`
- `logs`
- `traces`
- `internal_transactions`
- `token_transfers`
- `tokens`
- `token_holders` (if used)
### Partition Creation
**Example for blocks table**:
```sql
-- Create parent table
CREATE TABLE blocks (
-- columns
) PARTITION BY LIST (chain_id);
-- Create partitions
CREATE TABLE blocks_chain_138 PARTITION OF blocks
FOR VALUES IN (138);
CREATE TABLE blocks_chain_1 PARTITION OF blocks
FOR VALUES IN (1);
-- Add indexes to partitions (inherited from parent)
```
**Benefits**:
- Faster queries (partition pruning)
- Easier maintenance (per-chain operations)
- Parallel processing
- Data isolation
## Indexing Strategy
### Index Types
1. **B-tree**: Default for most indexes (equality, range, sorting)
2. **Hash**: For exact match only (rarely used, B-tree usually better)
3. **GIN**: For JSONB columns (ABIs, decoded data)
4. **BRIN**: For large ordered columns (block numbers, timestamps)
5. **Partial**: For filtered indexes (e.g., verified contracts only)
### Index Maintenance
**Regular Maintenance**:
- `VACUUM ANALYZE` regularly (auto-vacuum enabled)
- `REINDEX` if needed (bloat, corruption)
- Monitor index usage (`pg_stat_user_indexes`)
**Index Monitoring**:
- Track index sizes
- Monitor index bloat
- Remove unused indexes
## Data Retention and Archiving
### Retention Policies
**Hot Data**: Recent data (last 1 year)
- Fast access required
- All indexes maintained
**Warm Data**: Older data (1-5 years)
- Archive to slower storage
- Reduced indexing
**Cold Data**: Very old data (5+ years)
- Archive to object storage
- Minimal indexing
### Archiving Strategy
**Approach**:
1. Partition tables by time ranges (monthly/yearly)
2. Move old partitions to archive storage
3. Query archive when needed (slower but available)
**Implementation**:
- Use PostgreSQL table partitioning by date range
- Move partitions to archive storage (S3, etc.)
- Query via foreign data wrappers if needed
## Migration Strategy
### Versioning
**Migration Tool**: Use a schema migration tool such as Flyway, Liquibase, or a custom runner.
**Versioning Format**: `YYYYMMDDHHMMSS_description.sql`
**Example**:
```
20240101000001_initial_schema.sql
20240115000001_add_token_holders.sql
20240201000001_add_partitioning.sql
```
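Generating the version prefix from a timestamp keeps filenames in chronological order when sorted lexicographically; a sketch:

```python
from datetime import datetime

def migration_filename(description: str, now: datetime) -> str:
    """Build a YYYYMMDDHHMMSS_description.sql migration name."""
    return f"{now:%Y%m%d%H%M%S}_{description}.sql"

name = migration_filename("add_token_holders", datetime(2024, 1, 15, 0, 0, 1))
```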
### Migration Best Practices
1. **Backward Compatible**: Additive changes preferred
2. **Reversible**: All migrations should be reversible
3. **Tested**: Test on staging before production
4. **Documented**: Document breaking changes
5. **Rollback Plan**: Have rollback strategy
### Schema Evolution
**Adding Columns**:
- Use `ALTER TABLE ADD COLUMN` with default values
- Avoid NOT NULL without defaults (use two-step migration)
**Removing Columns**:
- Mark as deprecated first
- Remove after migration period
**Changing Types**:
- Create new column
- Migrate data
- Drop old column
- Rename new column
## Performance Optimization
### Query Optimization
**Common Query Patterns**:
1. Get block by number: Use `(chain_id, number)` index
2. Get transaction by hash: Use `(chain_id, hash)` index
3. Get address transactions: Use `(chain_id, from_address)` or `(chain_id, to_address)` index
4. Filter logs by address and event: Use `(chain_id, address, topic0)` index
### Connection Pooling
**Configuration**:
- Use connection pooler (PgBouncer, pgpool-II)
- Pool size: 20-100 connections per application server
- Statement-level pooling for better concurrency
### Read Replicas
**Strategy**:
- Primary: Write operations
- Replicas: Read operations (load balanced)
- Async replication (small lag acceptable)
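A thin router can direct each statement to the right pool; a sketch assuming named connection pools (the pool names are placeholders):

```python
import itertools

class PoolRouter:
    """Route writes to the primary and reads round-robin across replicas."""

    def __init__(self, primary: str, replicas: list):
        self._primary = primary
        self._replicas = itertools.cycle(replicas)

    def for_query(self, is_write: bool) -> str:
        return self._primary if is_write else next(self._replicas)

router = PoolRouter("pg-primary", ["pg-replica-1", "pg-replica-2"])
```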
## Backup and Recovery
### Backup Strategy
**Full Backups**: Daily full database dumps
**Incremental Backups**: Continuous WAL archiving
**Point-in-Time Recovery**: Enabled via WAL archiving
### Recovery Procedures
**RTO Target**: 1 hour
**RPO Target**: 5 minutes (max data loss)
## References
- Data Models: See `../indexing/data-models.md`
- Indexer Architecture: See `../indexing/indexer-architecture.md`
- Search Index Schema: See `search-index-schema.md`
- Multi-chain Architecture: See `../multichain/multichain-indexing.md`


@@ -0,0 +1,458 @@
# Search Index Schema Specification
## Overview
This document specifies the Elasticsearch/OpenSearch index schema for full-text search and faceted querying across blocks, transactions, addresses, tokens, and contracts.
## Architecture
```mermaid
flowchart LR
PG[(PostgreSQL<br/>Canonical Data)]
Transform[Data Transformer]
ES[(Elasticsearch<br/>Search Index)]
PG --> Transform
Transform --> ES
Query[Search Query]
Query --> ES
ES --> Results[Search Results]
```
## Index Structure
### Blocks Index
**Index Name**: `blocks-{chain_id}` (e.g., `blocks-138`)
**Document Structure**:
```json
{
"block_number": 12345,
"hash": "0x...",
"timestamp": "2024-01-01T00:00:00Z",
"miner": "0x...",
"transaction_count": 100,
"gas_used": 15000000,
"gas_limit": 20000000,
"chain_id": 138,
"parent_hash": "0x...",
"size": 1024
}
```
**Field Mappings**:
- `block_number`: `long` (not analyzed, for sorting/filtering)
- `hash`: `keyword` (exact match)
- `timestamp`: `date`
- `miner`: `keyword` (exact match)
- `transaction_count`: `integer`
- `gas_used`: `long`
- `gas_limit`: `long`
- `chain_id`: `integer`
- `parent_hash`: `keyword`
**Searchable Fields**:
- Hash (exact match)
- Miner address (exact match)
### Transactions Index
**Index Name**: `transactions-{chain_id}`
**Document Structure**:
```json
{
"hash": "0x...",
"block_number": 12345,
"transaction_index": 5,
"from_address": "0x...",
"to_address": "0x...",
"value": "1000000000000000000",
"gas_price": "20000000000",
"gas_used": 21000,
"status": "success",
"timestamp": "2024-01-01T00:00:00Z",
"chain_id": 138,
"input_data_length": 100,
"is_contract_creation": false,
"contract_address": null
}
```
**Field Mappings**:
- `hash`: `keyword`
- `block_number`: `long`
- `transaction_index`: `integer`
- `from_address`: `keyword`
- `to_address`: `keyword`
- `value`: `keyword` (decimal string; 256-bit values exceed `long` range)
- `value_numeric`: `long` (for range queries)
- `gas_price`: `long`
- `gas_used`: `long`
- `status`: `keyword`
- `timestamp`: `date`
- `chain_id`: `integer`
- `input_data_length`: `integer`
- `is_contract_creation`: `boolean`
- `contract_address`: `keyword`
**Searchable Fields**:
- Hash (exact match)
- From/to addresses (exact match)
- Value (range queries)
### Addresses Index
**Index Name**: `addresses-{chain_id}`
**Document Structure**:
```json
{
"address": "0x...",
"chain_id": 138,
"label": "My Wallet",
"tags": ["wallet", "exchange"],
"token_count": 10,
"transaction_count": 500,
"first_seen": "2024-01-01T00:00:00Z",
"last_seen": "2024-01-15T00:00:00Z",
"is_contract": true,
"contract_name": "MyToken",
"balance_eth": "1.5",
"balance_usd": "3000"
}
```
**Field Mappings**:
- `address`: `keyword`
- `chain_id`: `integer`
- `label`: `text` (analyzed) + `keyword` (exact match)
- `tags`: `keyword` (array)
- `token_count`: `integer`
- `transaction_count`: `long`
- `first_seen`: `date`
- `last_seen`: `date`
- `is_contract`: `boolean`
- `contract_name`: `text` + `keyword`
- `balance_eth`: `double`
- `balance_usd`: `double`
**Searchable Fields**:
- Address (exact match, prefix match)
- Label (full-text search)
- Contract name (full-text search)
- Tags (facet filter)
### Tokens Index
**Index Name**: `tokens-{chain_id}`
**Document Structure**:
```json
{
"address": "0x...",
"chain_id": 138,
"name": "My Token",
"symbol": "MTK",
"type": "ERC20",
"decimals": 18,
"total_supply": "1000000000000000000000000",
"holder_count": 1000,
"transfer_count": 50000,
"logo_url": "https://...",
"verified": true,
"description": "A token description"
}
```
**Field Mappings**:
- `address`: `keyword`
- `chain_id`: `integer`
- `name`: `text` (analyzed) + `keyword` (exact match)
- `symbol`: `keyword` (uppercase normalized)
- `type`: `keyword`
- `decimals`: `integer`
- `total_supply`: `text` (for large numbers)
- `total_supply_numeric`: `double` (for sorting)
- `holder_count`: `integer`
- `transfer_count`: `long`
- `logo_url`: `keyword`
- `verified`: `boolean`
- `description`: `text` (analyzed)
**Searchable Fields**:
- Name (full-text search)
- Symbol (exact match, prefix match)
- Address (exact match)
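The "uppercase normalized" `symbol` mapping can be expressed with a keyword normalizer. An illustrative settings-and-mappings fragment (the normalizer name is an assumption, not part of this spec):

```json
{
  "settings": {
    "analysis": {
      "normalizer": {
        "symbol_normalizer": {
          "type": "custom",
          "filter": ["uppercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "symbol": {
        "type": "keyword",
        "normalizer": "symbol_normalizer"
      }
    }
  }
}
```

With this mapping, `mtk` and `MTK` index and match as the same term while the field remains exact-match and sortable.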
### Contracts Index
**Index Name**: `contracts-{chain_id}`
**Document Structure**:
```json
{
"address": "0x...",
"chain_id": 138,
"name": "MyContract",
"verification_status": "verified",
"compiler_version": "0.8.19",
"source_code": "contract MyContract {...}",
"abi": [...],
"verified_at": "2024-01-01T00:00:00Z",
"transaction_count": 1000,
"created_at": "2024-01-01T00:00:00Z"
}
```
**Field Mappings**:
- `address`: `keyword`
- `chain_id`: `integer`
- `name`: `text` + `keyword`
- `verification_status`: `keyword`
- `compiler_version`: `keyword`
- `source_code`: `text` (analyzed, indexed but not stored in full for large contracts)
- `abi`: `object` (nested, for structured queries)
- `verified_at`: `date`
- `transaction_count`: `long`
- `created_at`: `date`
**Searchable Fields**:
- Name (full-text search)
- Address (exact match)
- Source code (full-text search, limited)
## Indexing Pipeline
### Data Transformation
**Purpose**: Transform canonical PostgreSQL data into search-optimized documents.
**Transformation Steps**:
1. **Fetch Data**: Query PostgreSQL for entities to index
2. **Enrich Data**: Add computed fields (balances, counts, etc.)
3. **Normalize Data**: Normalize addresses, format values
4. **Index Document**: Send to Elasticsearch/OpenSearch
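The normalize step can be sketched as follows (pure Python; function names are illustrative and the row dict mirrors the transaction columns described above — this is a sketch, not the actual pipeline code):

```python
def normalize_address(addr):
    """Lowercase a hex address so keyword fields match exactly."""
    return addr.lower() if addr else None

def to_document(row):
    """Transform a raw transaction row into a search-ready document."""
    to_addr = normalize_address(row.get("to_address"))
    return {
        "hash": row["hash"].lower(),
        "chain_id": row["chain_id"],
        "from_address": normalize_address(row["from_address"]),
        "to_address": to_addr,
        # Contract creations have no recipient, so the flag is derived.
        "is_contract_creation": to_addr is None,
    }

doc = to_document({
    "hash": "0xABC123",
    "chain_id": 138,
    "from_address": "0xDEF456",
    "to_address": None,
})
```

Normalizing addresses to lowercase at index time keeps `keyword` exact-match queries insensitive to checksum casing.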
### Indexing Strategy
**Initial Indexing**:
- Bulk index existing data
- Process in batches (1000 documents per batch)
- Use bulk API for efficiency
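The bulk API consumes newline-delimited action/document pairs, so batching amounts to grouping pairs into payloads. A standard-library sketch (in practice an official client's bulk helper would handle this; the index-naming and `_id` choice here are assumptions):

```python
import json

def bulk_payloads(docs, index_name, batch_size=1000):
    """Yield NDJSON payloads of at most batch_size documents for the _bulk API."""
    batch = []
    for doc in docs:
        # Each document is preceded by an index action naming the target index.
        batch.append(json.dumps({"index": {"_index": index_name, "_id": doc["hash"]}}))
        batch.append(json.dumps(doc))
        if len(batch) >= 2 * batch_size:  # two lines per document
            yield "\n".join(batch) + "\n"
            batch = []
    if batch:
        yield "\n".join(batch) + "\n"

docs = [{"hash": f"0x{i:x}", "chain_id": 138} for i in range(2500)]
payloads = list(bulk_payloads(docs, "transactions-138", batch_size=1000))
# 2500 documents at 1000 per batch -> 3 payloads (1000, 1000, 500)
```

Using the document hash as `_id` makes re-indexing idempotent: replaying a batch overwrites rather than duplicates.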
**Incremental Indexing**:
- Index new entities as they're created
- Update entities when changed
- Delete entities when removed
**Update Frequency**:
- Real-time: Index immediately after database insert/update
- Batch: Bulk update every N minutes for efficiency
### Index Aliases
**Purpose**: Enable zero-downtime index updates.
**Strategy**:
- Write to new index (e.g., `blocks-138-v2`)
- Build index in background
- Switch alias when ready
- Delete old index after switch
**Alias Names**:
- `blocks-{chain_id}` → points to latest version
- `transactions-{chain_id}` → points to latest version
- etc.
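Concretely, the switch is one atomic `_aliases` request (request body sketched below; index names follow the versioning scheme above):

```json
{
  "actions": [
    { "remove": { "index": "blocks-138-v1", "alias": "blocks-138" } },
    { "add": { "index": "blocks-138-v2", "alias": "blocks-138" } }
  ]
}
```

Because both actions apply atomically, queries against `blocks-138` always resolve to exactly one index version.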
## Query Patterns
### Full-Text Search
**Blocks Search** (hashes are `keyword` fields, so an exact lookup uses a `term` query rather than `match`):
```json
{
  "query": {
    "term": {
      "hash": "0x123..."
}
}
}
```
**Address Search**:
```json
{
"query": {
"bool": {
"should": [
{ "match": { "label": "wallet" } },
{ "prefix": { "address": "0x123" } }
]
}
}
}
```
**Token Search**:
```json
{
"query": {
"bool": {
"should": [
{ "match": { "name": "My Token" } },
{ "match": { "symbol": "MTK" } }
]
}
}
}
```
### Faceted Search
**Filter by Multiple Criteria**:
```json
{
"query": {
"bool": {
"must": [
{ "term": { "chain_id": 138 } },
{ "term": { "type": "ERC20" } },
{ "range": { "holder_count": { "gte": 100 } } }
]
}
},
"aggs": {
"by_type": {
"terms": { "field": "type" }
}
}
}
```
### Unified Search
**Cross-Entity Search**:
- Search across blocks, transactions, addresses, tokens
- Use `_index` field to filter by entity type
- Combine results with relevance scoring
**Multi-Index Query**:
```json
{
"query": {
"multi_match": {
"query": "0x123",
"fields": ["hash", "address", "from_address", "to_address"],
"type": "best_fields"
}
}
}
```
## Index Configuration
### Analysis Settings
**Custom Analyzer**:
- Address analyzer: Lowercase, no tokenization
- Symbol analyzer: Uppercase, no tokenization
- Text analyzer: Standard analyzer with lowercase
**Example Configuration**:
```json
{
"settings": {
"analysis": {
"analyzer": {
"address_analyzer": {
"type": "custom",
"tokenizer": "keyword",
"filter": ["lowercase"]
}
}
}
}
}
```
### Sharding and Replication
**Sharding**:
- Number of shards: Based on index size
- Large indices (> 50GB): Multiple shards
- Small indices: Single shard
**Replication**:
- Replica count: 1-2 (for high availability)
- Increase replicas for read-heavy workloads
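Both knobs are fixed at index creation (replica count can be changed later). An illustrative creation-time settings fragment; the counts shown are deployment-dependent assumptions:

```json
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  }
}
```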
## Performance Optimization
### Index Optimization
**Refresh Interval**:
- Default: 1 second
- For bulk indexing: Increase to 30 seconds, then reset
**Bulk Indexing**:
- Batch size: 1000-5000 documents
- Use bulk API
- Disable refresh during bulk indexing
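Disabling refresh for the duration of a bulk load is a single dynamic settings update; an illustrative request body:

```json
{
  "index": {
    "refresh_interval": "-1"
  }
}
```

Setting it back to `"1s"` (or the chosen steady-state interval) after the load restores near-real-time search.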
### Query Optimization
**Query Caching**:
- Enable query cache for repeated queries
- Cache filter results
**Field Data**:
- Use `doc_values` for sorting/aggregations
- Avoid `fielddata` for text fields
## Maintenance
### Index Monitoring
**Metrics**:
- Index size
- Document count
- Query performance (p50, p95, p99)
- Index lag (time behind database)
### Index Cleanup
**Strategy**:
- Delete old indices (after alias switch)
- Archive old indices to cold storage
- Compress indices for storage efficiency
## Integration with PostgreSQL
### Data Sync
**Sync Strategy**:
- Real-time: Listen to database changes (CDC, triggers, or polling)
- Batch: Periodic sync jobs
- Hybrid: Real-time for recent data, batch for historical
**Change Detection**:
- Use `updated_at` timestamp
- Use database triggers to queue changes
- Use CDC (Change Data Capture) if available
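A trigger-based queue could look like the following PostgreSQL sketch (the queue table, function, and trigger names are illustrative, and it assumes the `addresses` table from `postgres-schema.md`; the indexer drains `search_index_queue`):

```sql
CREATE TABLE search_index_queue (
    id BIGSERIAL PRIMARY KEY,
    entity_type TEXT NOT NULL,   -- 'block', 'transaction', 'address', ...
    entity_id TEXT NOT NULL,     -- hash or address
    chain_id INTEGER NOT NULL,
    queued_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE OR REPLACE FUNCTION enqueue_address_change() RETURNS TRIGGER AS $$
BEGIN
    INSERT INTO search_index_queue (entity_type, entity_id, chain_id)
    VALUES ('address', NEW.address, NEW.chain_id);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER addresses_search_sync
AFTER INSERT OR UPDATE ON addresses
FOR EACH ROW EXECUTE FUNCTION enqueue_address_change();
```

Draining the queue in order gives the indexer at-least-once delivery without coupling indexing latency to the write path.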
### Consistency
**Eventual Consistency**:
- Search index is eventually consistent with database
- Small lag acceptable (< 1 minute)
- Critical queries can fall back to database
## References
- Database Schema: See `postgres-schema.md`
- Indexer Architecture: See `../indexing/indexer-architecture.md`
- Unified Search: See `../multichain/unified-search.md`

# Time-Series Database Schema Specification
## Overview
This document specifies the time-series database schema using ClickHouse or TimescaleDB for storing mempool data, metrics, and analytics time-series data.
## Technology Choice
**Option 1: TimescaleDB** (PostgreSQL extension)
- Pros: PostgreSQL compatibility, SQL interface, easier integration
- Cons: Less optimized for very high throughput
**Option 2: ClickHouse**
- Pros: Very high performance, columnar storage, excellent compression
- Cons: Different SQL dialect, separate infrastructure
**Recommendation**: Start with TimescaleDB for easier integration, migrate to ClickHouse if needed for scale.
## TimescaleDB Schema
### Mempool Transactions Table
**Table**: `mempool_transactions`
```sql
CREATE TABLE mempool_transactions (
time TIMESTAMPTZ NOT NULL,
chain_id INTEGER NOT NULL,
hash VARCHAR(66) NOT NULL,
from_address VARCHAR(42) NOT NULL,
to_address VARCHAR(42),
value NUMERIC(78, 0),
gas_price BIGINT,
max_fee_per_gas BIGINT,
max_priority_fee_per_gas BIGINT,
gas_limit BIGINT,
nonce BIGINT,
input_data_length INTEGER,
first_seen TIMESTAMPTZ NOT NULL,
status VARCHAR(20) DEFAULT 'pending', -- 'pending', 'confirmed', 'dropped'
confirmed_block_number BIGINT,
confirmed_at TIMESTAMPTZ,
PRIMARY KEY (time, chain_id, hash)
);
SELECT create_hypertable('mempool_transactions', 'time');
CREATE INDEX idx_mempool_chain_hash ON mempool_transactions(chain_id, hash);
CREATE INDEX idx_mempool_chain_from ON mempool_transactions(chain_id, from_address);
CREATE INDEX idx_mempool_chain_status ON mempool_transactions(chain_id, status, time);
```
**Retention Policy**: 7 days for detailed data, aggregates for longer periods
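With TimescaleDB, this retention can be enforced automatically via the built-in policy function, which drops whole chunks past the cutoff:

```sql
-- Drop chunks older than 7 days automatically.
SELECT add_retention_policy('mempool_transactions', INTERVAL '7 days');
```

The other hypertables below would get analogous policies matching their stated retention windows.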
### Network Metrics Table
**Table**: `network_metrics`
```sql
CREATE TABLE network_metrics (
time TIMESTAMPTZ NOT NULL,
chain_id INTEGER NOT NULL,
block_number BIGINT,
tps DOUBLE PRECISION, -- Transactions per second
gps DOUBLE PRECISION, -- Gas per second
avg_gas_price BIGINT,
pending_transactions INTEGER,
block_time_seconds DOUBLE PRECISION,
PRIMARY KEY (time, chain_id)
);
SELECT create_hypertable('network_metrics', 'time');
CREATE INDEX idx_network_metrics_chain_time ON network_metrics(chain_id, time DESC);
```
**Aggregation**: Pre-aggregate to 1-minute, 5-minute, 1-hour intervals
### Gas Price History Table
**Table**: `gas_price_history`
```sql
CREATE TABLE gas_price_history (
time TIMESTAMPTZ NOT NULL,
chain_id INTEGER NOT NULL,
block_number BIGINT,
min_gas_price BIGINT,
max_gas_price BIGINT,
avg_gas_price BIGINT,
p25_gas_price BIGINT, -- 25th percentile
p50_gas_price BIGINT, -- 50th percentile (median)
p75_gas_price BIGINT, -- 75th percentile
p95_gas_price BIGINT, -- 95th percentile
p99_gas_price BIGINT, -- 99th percentile
PRIMARY KEY (time, chain_id)
);
SELECT create_hypertable('gas_price_history', 'time');
```
### Address Activity Metrics Table
**Table**: `address_activity_metrics`
```sql
CREATE TABLE address_activity_metrics (
time TIMESTAMPTZ NOT NULL,
chain_id INTEGER NOT NULL,
address VARCHAR(42) NOT NULL,
transaction_count INTEGER,
received_count INTEGER,
sent_count INTEGER,
total_received NUMERIC(78, 0),
total_sent NUMERIC(78, 0),
PRIMARY KEY (time, chain_id, address)
);
SELECT create_hypertable('address_activity_metrics', 'time',
chunk_time_interval => INTERVAL '1 day');
CREATE INDEX idx_address_activity_chain_address ON address_activity_metrics(chain_id, address, time DESC);
```
**Aggregation**: Pre-aggregate to hourly/daily for addresses
## ClickHouse Schema (Alternative)
### Mempool Transactions Table
```sql
CREATE TABLE mempool_transactions (
time DateTime('UTC') NOT NULL,
chain_id UInt32 NOT NULL,
hash String NOT NULL,
from_address String NOT NULL,
to_address Nullable(String),
value Decimal128(0),
gas_price UInt64,
max_fee_per_gas Nullable(UInt64),
max_priority_fee_per_gas Nullable(UInt64),
gas_limit UInt64,
nonce UInt64,
input_data_length UInt32,
first_seen DateTime('UTC') NOT NULL,
status String DEFAULT 'pending',
confirmed_block_number Nullable(UInt64),
confirmed_at Nullable(DateTime('UTC'))
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(time)
ORDER BY (chain_id, time, hash)
TTL time + INTERVAL 7 DAY; -- Auto-delete after 7 days
```
## Data Retention and Aggregation
### Retention Policies
**Raw Data**:
- Mempool transactions: 7 days
- Network metrics: 30 days
- Gas price history: 90 days
- Address activity: 30 days
**Aggregated Data**:
- 1-minute aggregates: 90 days
- 5-minute aggregates: 1 year
- 1-hour aggregates: 5 years
- Daily aggregates: Indefinite
### Continuous Aggregates (TimescaleDB)
```sql
-- 1-minute network metrics aggregate
CREATE MATERIALIZED VIEW network_metrics_1m
WITH (timescaledb.continuous) AS
SELECT
time_bucket('1 minute', time) AS bucket,
chain_id,
AVG(tps) AS avg_tps,
AVG(gps) AS avg_gps,
AVG(avg_gas_price) AS avg_gas_price,
AVG(pending_transactions) AS avg_pending_tx
FROM network_metrics
GROUP BY bucket, chain_id;
-- Add refresh policy
SELECT add_continuous_aggregate_policy('network_metrics_1m',
start_offset => INTERVAL '1 hour',
end_offset => INTERVAL '1 minute',
schedule_interval => INTERVAL '1 minute');
```
## Query Patterns
### Recent Mempool Transactions
```sql
SELECT * FROM mempool_transactions
WHERE chain_id = 138
AND time > NOW() - INTERVAL '1 hour'
AND status = 'pending'
ORDER BY time DESC
LIMIT 100;
```
### Gas Price Statistics
```sql
SELECT
time_bucket('5 minutes', time) AS bucket,
AVG(avg_gas_price) AS avg_gas_price,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY avg_gas_price) AS median_gas_price
FROM gas_price_history
WHERE chain_id = 138
AND time > NOW() - INTERVAL '24 hours'
GROUP BY bucket
ORDER BY bucket DESC;
```
### Network Throughput
```sql
SELECT
time_bucket('1 minute', time) AS bucket,
AVG(tps) AS avg_tps,
MAX(tps) AS max_tps
FROM network_metrics
WHERE chain_id = 138
AND time > NOW() - INTERVAL '1 hour'
GROUP BY bucket
ORDER BY bucket DESC;
```
## References
- Mempool Service: See `../mempool/mempool-service.md`
- Observability: See `../observability/metrics-monitoring.md`