# Search Index Schema Specification
## Overview
This document specifies the Elasticsearch/OpenSearch index schema for full-text search and faceted querying across blocks, transactions, addresses, tokens, and contracts.
## Architecture
```mermaid
flowchart LR
PG[(PostgreSQL<br/>Canonical Data)]
Transform[Data Transformer]
ES[(Elasticsearch<br/>Search Index)]
PG --> Transform
Transform --> ES
Query[Search Query]
Query --> ES
ES --> Results[Search Results]
```
## Index Structure
### Blocks Index
**Index Name**: `blocks-{chain_id}` (e.g., `blocks-138`)
**Document Structure**:
```json
{
"block_number": 12345,
"hash": "0x...",
"timestamp": "2024-01-01T00:00:00Z",
"miner": "0x...",
"transaction_count": 100,
"gas_used": 15000000,
"gas_limit": 20000000,
"chain_id": 138,
"parent_hash": "0x...",
"size": 1024
}
```
**Field Mappings**:
- `block_number`: `long` (not analyzed, for sorting/filtering)
- `hash`: `keyword` (exact match)
- `timestamp`: `date`
- `miner`: `keyword` (exact match)
- `transaction_count`: `integer`
- `gas_used`: `long`
- `gas_limit`: `long`
- `chain_id`: `integer`
- `parent_hash`: `keyword`
**Searchable Fields**:
- Hash (exact match)
- Miner address (exact match)
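
The field list above can be written as a mapping body. The following is a minimal sketch in Python; `BLOCKS_MAPPING` and `blocks_index_name` are illustrative names rather than part of this spec, and the dictionary is meant to be sent as the body of an index-creation request.

```python
# Mapping body for a blocks index, mirroring the field list above.
BLOCKS_MAPPING = {
    "mappings": {
        "properties": {
            "block_number": {"type": "long"},
            "hash": {"type": "keyword"},
            "timestamp": {"type": "date"},
            "miner": {"type": "keyword"},
            "transaction_count": {"type": "integer"},
            "gas_used": {"type": "long"},
            "gas_limit": {"type": "long"},
            "chain_id": {"type": "integer"},
            "parent_hash": {"type": "keyword"},
        }
    }
}


def blocks_index_name(chain_id: int) -> str:
    """Return the per-chain index name, e.g. blocks-138."""
    return f"blocks-{chain_id}"
```
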
### Transactions Index
**Index Name**: `transactions-{chain_id}`
**Document Structure**:
```json
{
"hash": "0x...",
"block_number": 12345,
"transaction_index": 5,
"from_address": "0x...",
"to_address": "0x...",
"value": "1000000000000000000",
"gas_price": "20000000000",
"gas_used": 21000,
"status": "success",
"timestamp": "2024-01-01T00:00:00Z",
"chain_id": 138,
"input_data_length": 100,
"is_contract_creation": false,
"contract_address": null
}
```
**Field Mappings**:
- `hash`: `keyword`
- `block_number`: `long`
- `transaction_index`: `integer`
- `from_address`: `keyword`
- `to_address`: `keyword`
- `value`: `keyword` (exact decimal string; wei amounts can exceed the signed 64-bit integer range)
- `value_numeric`: `double` (approximate, for range queries and sorting)
- `gas_price`: `long`
- `gas_used`: `long`
- `status`: `keyword`
- `timestamp`: `date`
- `chain_id`: `integer`
- `input_data_length`: `integer`
- `is_contract_creation`: `boolean`
- `contract_address`: `keyword`
**Searchable Fields**:
- Hash (exact match)
- From/to addresses (exact match)
- Value (range queries)
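
Wei amounts are stored as strings because they can exceed the signed 64-bit range (2^63 - 1 wei is roughly 9.22 ETH). The sketch below shows one way to derive a numeric companion field for range queries. The helper name `value_fields` and the choice of an approximate floating-point value are assumptions, not something this spec prescribes.

```python
LONG_MAX = 2**63 - 1  # largest value a signed 64-bit integer can hold


def value_fields(value_wei: str) -> dict:
    """Split a wei amount into an exact string and an approximate number."""
    exact = int(value_wei)  # Python integers have arbitrary precision
    if exact < 0:
        raise ValueError("value must be non-negative")
    return {
        "value": str(exact),            # exact, for display and exact match
        "value_numeric": float(exact),  # approximate, for range queries
    }


def fits_in_long(value_wei: str) -> bool:
    """Report whether the amount could be stored in a 64-bit integer field."""
    return int(value_wei) <= LONG_MAX
```
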
### Addresses Index
**Index Name**: `addresses-{chain_id}`
**Document Structure**:
```json
{
"address": "0x...",
"chain_id": 138,
"label": "My Wallet",
"tags": ["wallet", "exchange"],
"token_count": 10,
"transaction_count": 500,
"first_seen": "2024-01-01T00:00:00Z",
"last_seen": "2024-01-15T00:00:00Z",
"is_contract": true,
"contract_name": "MyToken",
"balance_eth": "1.5",
"balance_usd": "3000"
}
```
**Field Mappings**:
- `address`: `keyword`
- `chain_id`: `integer`
- `label`: `text` (analyzed) + `keyword` (exact match)
- `tags`: `keyword` (array)
- `token_count`: `integer`
- `transaction_count`: `long`
- `first_seen`: `date`
- `last_seen`: `date`
- `is_contract`: `boolean`
- `contract_name`: `text` + `keyword`
- `balance_eth`: `double`
- `balance_usd`: `double`
**Searchable Fields**:
- Address (exact match, prefix match)
- Label (full-text search)
- Contract name (full-text search)
- Tags (facet filter)
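
Fields listed as `text` + `keyword` are typically modeled as a multi-field: the main field is analyzed for full-text search and a sub-field holds the exact value. A minimal sketch follows; the sub-field name `keyword`, the `ignore_above` limit, and the helper name are assumptions.

```python
def text_with_keyword(ignore_above: int = 256) -> dict:
    """Mapping for a field searchable as full text and as an exact value."""
    return {
        "type": "text",
        "fields": {
            # Queried as "<field>.keyword" for exact match, sorting, and facets.
            "keyword": {"type": "keyword", "ignore_above": ignore_above}
        },
    }


ADDRESS_TEXT_FIELDS = {
    "label": text_with_keyword(),
    "contract_name": text_with_keyword(),
    "tags": {"type": "keyword"},  # arrays need no special mapping
}
```
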
### Tokens Index
**Index Name**: `tokens-{chain_id}`
**Document Structure**:
```json
{
"address": "0x...",
"chain_id": 138,
"name": "My Token",
"symbol": "MTK",
"type": "ERC20",
"decimals": 18,
"total_supply": "1000000000000000000000000",
"holder_count": 1000,
"transfer_count": 50000,
"logo_url": "https://...",
"verified": true,
"description": "A token description"
}
```
**Field Mappings**:
- `address`: `keyword`
- `chain_id`: `integer`
- `name`: `text` (analyzed) + `keyword` (exact match)
- `symbol`: `keyword` (uppercase normalized)
- `type`: `keyword`
- `decimals`: `integer`
- `total_supply`: `keyword` (exact decimal string, for large numbers)
- `total_supply_numeric`: `double` (for sorting)
- `holder_count`: `integer`
- `transfer_count`: `long`
- `logo_url`: `keyword`
- `verified`: `boolean`
- `description`: `text` (analyzed)
**Searchable Fields**:
- Name (full-text search)
- Symbol (exact match, prefix match)
- Address (exact match)
### Contracts Index
**Index Name**: `contracts-{chain_id}`
**Document Structure**:
```json
{
"address": "0x...",
"chain_id": 138,
"name": "MyContract",
"verification_status": "verified",
"compiler_version": "0.8.19",
"source_code": "contract MyContract {...}",
"abi": [...],
"verified_at": "2024-01-01T00:00:00Z",
"transaction_count": 1000,
"created_at": "2024-01-01T00:00:00Z"
}
```
**Field Mappings**:
- `address`: `keyword`
- `chain_id`: `integer`
- `name`: `text` + `keyword`
- `verification_status`: `keyword`
- `compiler_version`: `keyword`
- `source_code`: `text` (analyzed, indexed but not stored in full for large contracts)
- `abi`: `nested` (array of ABI entries, for structured queries)
- `verified_at`: `date`
- `transaction_count`: `long`
- `created_at`: `date`
**Searchable Fields**:
- Name (full-text search)
- Address (exact match)
- Source code (full-text search, limited)
## Indexing Pipeline
### Data Transformation
**Purpose**: Transform canonical PostgreSQL data into search-optimized documents.
**Transformation Steps**:
1. **Fetch Data**: Query PostgreSQL for entities to index
2. **Enrich Data**: Add computed fields (balances, counts, etc.)
3. **Normalize Data**: Normalize addresses, format values
4. **Index Document**: Send to Elasticsearch/OpenSearch
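
The steps above can be sketched for a single block row. The column names (`number`, `hash`, `timestamp`, `miner`, `transaction_count`) and both function names are hypothetical; the real PostgreSQL schema may differ.

```python
from datetime import datetime, timezone
from typing import Any, Dict


def to_iso_utc(ts: datetime) -> str:
    """Format a datetime as an ISO 8601 UTC string (naive values are assumed UTC)."""
    if ts.tzinfo is None:
        ts = ts.replace(tzinfo=timezone.utc)
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def transform_block(row: Dict[str, Any], chain_id: int) -> Dict[str, Any]:
    """Build a blocks-index document from one database row."""
    return {
        "block_number": row["number"],
        "hash": row["hash"].lower(),                # normalize: lowercase hex
        "timestamp": to_iso_utc(row["timestamp"]),  # normalize: ISO 8601 UTC
        "miner": row["miner"].lower(),
        "transaction_count": row["transaction_count"],
        "chain_id": chain_id,                       # enrich: add chain context
    }
```
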
### Indexing Strategy
**Initial Indexing**:
- Bulk index existing data
- Process in batches (1000 documents per batch)
- Use bulk API for efficiency
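
As an illustration of batched bulk indexing, the sketch below splits documents into batches and builds the newline-delimited body that the bulk API expects (an action line followed by a source line per document). The helper names and the `id_field` parameter are assumptions; sending the body over HTTP is left out.

```python
import json
from typing import Any, Dict, Iterable, Iterator, List


def batches(docs: Iterable[Dict[str, Any]], size: int = 1000) -> Iterator[List[Dict[str, Any]]]:
    """Yield lists of at most `size` documents."""
    batch: List[Dict[str, Any]] = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def bulk_body(index: str, docs: List[Dict[str, Any]], id_field: str) -> str:
    """Build an NDJSON bulk request body that indexes each document by its id."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index, "_id": str(doc[id_field])}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # the body must end with a newline
```
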
**Incremental Indexing**:
- Index new entities as they're created
- Update entities when changed
- Delete entities when removed
**Update Frequency**:
- Real-time: Index immediately after database insert/update
- Batch: Bulk update every N minutes for efficiency
### Index Aliases
**Purpose**: Enable zero-downtime index updates.
**Strategy**:
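- Create a new versioned index (e.g., `blocks-138-v2`)
- Build it in the background while the alias still serves the old index
- Switch the alias once the new index is ready
- Delete the old index after the switch
**Alias Names**:
- `blocks-{chain_id}` → points to the latest versioned index (e.g., `blocks-138-v2`)
- `transactions-{chain_id}` → points to the latest versioned index
- `addresses-{chain_id}`, `tokens-{chain_id}`, `contracts-{chain_id}` → same pattern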
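
The alias switch can be done in a single request to the `_aliases` endpoint, so readers never see a moment with no index behind the alias. A minimal sketch of that request body follows; the function name and the `-v1`/`-v2` suffixes in the usage are illustrative.

```python
def alias_swap_actions(alias: str, old_index: str, new_index: str) -> dict:
    """Body for POST /_aliases that moves an alias from one index to another."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }
```

For example, `alias_swap_actions("blocks-138", "blocks-138-v1", "blocks-138-v2")` produces the two actions that repoint `blocks-138`.
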
## Query Patterns
### Full-Text Search
**Blocks Search**:
```json
{
"query": {
"match": {
"hash": "0x123..."
}
}
}
```
**Address Search**:
```json
{
"query": {
"bool": {
"should": [
{ "match": { "label": "wallet" } },
{ "prefix": { "address": "0x123" } }
]
}
}
}
```
**Token Search**:
```json
{
"query": {
"bool": {
"should": [
{ "match": { "name": "My Token" } },
{ "match": { "symbol": "MTK" } }
]
}
}
}
```
### Faceted Search
**Filter by Multiple Criteria**:
```json
{
"query": {
"bool": {
"must": [
{ "term": { "chain_id": 138 } },
{ "term": { "type": "ERC20" } },
{ "range": { "holder_count": { "gte": 100 } } }
]
}
},
"aggs": {
"by_type": {
"terms": { "field": "type" }
}
}
}
```
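
A faceted query like the one above is usually assembled from optional criteria. The sketch below is one way to do that; the function name and parameters are hypothetical. The clauses go in a `filter` context, which matches the same documents as `must` but skips relevance scoring.

```python
from typing import Optional


def faceted_token_query(chain_id: int,
                        token_type: Optional[str] = None,
                        min_holders: Optional[int] = None) -> dict:
    """Build a token query with optional filters and a facet on token type."""
    filters = [{"term": {"chain_id": chain_id}}]
    if token_type is not None:
        filters.append({"term": {"type": token_type}})
    if min_holders is not None:
        filters.append({"range": {"holder_count": {"gte": min_holders}}})
    return {
        "query": {"bool": {"filter": filters}},
        "aggs": {"by_type": {"terms": {"field": "type"}}},
    }
```
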
### Unified Search
**Cross-Entity Search**:
- Search across blocks, transactions, addresses, tokens, and contracts
- Use `_index` field to filter by entity type
- Combine results with relevance scoring
**Multi-Index Query**:
```json
{
"query": {
"multi_match": {
"query": "0x123",
"fields": ["hash", "address", "from_address", "to_address"],
"type": "best_fields"
}
}
}
```
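
Before issuing a cross-entity query, a client can narrow the target indices by looking at the shape of the input. The routing rules below are assumptions for illustration: a 32-byte hex string is treated as a block or transaction hash, a 20-byte hex string as an address, digits as a block number, and anything else as free text.

```python
import re
from typing import List

HASH_RE = re.compile(r"^0x[0-9a-fA-F]{64}$")     # 32-byte hash
ADDRESS_RE = re.compile(r"^0x[0-9a-fA-F]{40}$")  # 20-byte address


def target_indices(query: str, chain_id: int) -> List[str]:
    """Pick the per-chain indices worth searching for a given input."""
    q = query.strip()
    if HASH_RE.match(q):
        kinds = ["blocks", "transactions"]
    elif ADDRESS_RE.match(q):
        kinds = ["addresses", "tokens", "contracts"]
    elif q.isdigit():
        kinds = ["blocks"]
    else:
        kinds = ["addresses", "tokens", "contracts"]  # free-text names and labels
    return [f"{kind}-{chain_id}" for kind in kinds]
```
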
## Index Configuration
### Analysis Settings
**Custom Analyzer**:
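- Address analyzer: lowercase, no tokenization (on `keyword` fields, configure this as a `normalizer`)
- Symbol analyzer: uppercase, no tokenization (likewise a `normalizer` on `keyword` fields)
- Text analyzer: standard analyzer, which lowercases by default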
**Example Configuration**:
```json
{
"settings": {
"analysis": {
"analyzer": {
"address_analyzer": {
"type": "custom",
"tokenizer": "keyword",
"filter": ["lowercase"]
}
}
}
}
}
```
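
Analyzers apply to `text` fields only; fields mapped as `keyword` take a `normalizer` instead. The sketch below shows how the lowercase and uppercase rules could be declared that way. The normalizer names and the helper are assumptions, not part of this spec.

```python
NORMALIZER_SETTINGS = {
    "settings": {
        "analysis": {
            "normalizer": {
                # Lowercase hex addresses so lookups ignore checksum casing.
                "address_normalizer": {"type": "custom", "filter": ["lowercase"]},
                # Uppercase token symbols so "mtk" and "MTK" match.
                "symbol_normalizer": {"type": "custom", "filter": ["uppercase"]},
            }
        }
    }
}


def keyword_with(normalizer: str) -> dict:
    """Mapping for a keyword field that is normalized at index and query time."""
    return {"type": "keyword", "normalizer": normalizer}
```
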
### Sharding and Replication
**Sharding**:
- Number of shards: Based on index size
- Large indices (> 50 GB): Multiple shards
- Small indices: Single shard
**Replication**:
- Replica count: 1-2 (for high availability)
- Increase replicas for read-heavy workloads
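
As a rough illustration of size-based sharding, the sketch below derives a primary shard count from an estimated index size. Treating the 50 GB threshold as a per-shard target is an assumption; real sizing also depends on query load and hardware.

```python
import math


def shard_count(index_size_gb: float, max_shard_gb: float = 50.0) -> int:
    """Estimate primary shards so that no shard exceeds max_shard_gb."""
    if index_size_gb < 0 or max_shard_gb <= 0:
        raise ValueError("sizes must be positive")
    return max(1, math.ceil(index_size_gb / max_shard_gb))
```
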
## Performance Optimization
### Index Optimization
**Refresh Interval**:
- Default: 1 second
- For bulk indexing: Increase to 30 seconds, then reset
**Bulk Indexing**:
- Batch size: 1000-5000 documents
- Use bulk API
- Relax or disable refresh during bulk indexing (`refresh_interval` of `30s` or `-1`), then restore it
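
One way to make sure the refresh interval is always restored after a bulk load is a context manager. In this sketch `put_settings` is a hypothetical callable standing in for whatever client call updates index settings; it receives the index name and a settings body.

```python
from contextlib import contextmanager
from typing import Callable


@contextmanager
def relaxed_refresh(put_settings: Callable[[str, dict], None],
                    index: str,
                    bulk_interval: str = "30s",
                    normal_interval: str = "1s"):
    """Raise refresh_interval for the duration of a bulk load, then reset it."""
    put_settings(index, {"index": {"refresh_interval": bulk_interval}})
    try:
        yield
    finally:
        # Runs even if the bulk load raises, so the index is never left relaxed.
        put_settings(index, {"index": {"refresh_interval": normal_interval}})
```

Passing `bulk_interval="-1"` disables refresh entirely while the block runs.
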
### Query Optimization
**Query Caching**:
- Enable query cache for repeated queries
- Cache filter results
**Field Data**:
- Use `doc_values` for sorting/aggregations
- Avoid `fielddata` for text fields
## Maintenance
### Index Monitoring
**Metrics**:
- Index size
- Document count
- Query performance (p50, p95, p99)
- Index lag (time behind database)
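
The latency percentiles listed above can be computed from recorded query durations. The sketch below uses the nearest-rank method, which is one of several percentile definitions; the function name is illustrative.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], p: int) -> float:
    """Nearest-rank percentile: smallest sample with at least p% at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]
```
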
### Index Cleanup
**Strategy**:
- Delete old indices (after alias switch)
- Archive old indices to cold storage
- Compress indices for storage efficiency
## Integration with PostgreSQL
### Data Sync
**Sync Strategy**:
- Real-time: Listen to database changes (CDC, triggers, or polling)
- Batch: Periodic sync jobs
- Hybrid: Real-time for recent data, batch for historical
**Change Detection**:
- Use `updated_at` timestamp
- Use database triggers to queue changes
- Use CDC (Change Data Capture) if available
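
Timestamp-based change detection amounts to keeping a watermark and selecting rows updated after it. The sketch below works on in-memory rows to stay self-contained; against PostgreSQL the same selection would be a query filtering on `updated_at`. The function name and row shape are assumptions.

```python
from datetime import datetime
from typing import Any, Dict, List, Tuple


def changed_since(rows: List[Dict[str, Any]],
                  watermark: datetime) -> Tuple[List[Dict[str, Any]], datetime]:
    """Return rows updated after the watermark and the advanced watermark."""
    changed = [row for row in rows if row["updated_at"] > watermark]
    new_watermark = max((row["updated_at"] for row in changed), default=watermark)
    return changed, new_watermark
```

Rows that share the watermark's exact timestamp are skipped on the next poll, so a production version may need a tie-breaker such as a primary key.
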
### Consistency
**Eventual Consistency**:
- Search index is eventually consistent with database
- Small lag acceptable (< 1 minute)
- Critical queries can fall back to database
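
The fallback rule can be sketched as a small decision function. The rule used here, that only critical queries fall back and only when the lag exceeds the one-minute budget, is one possible reading of the points above; the names are hypothetical.

```python
from datetime import datetime, timedelta

MAX_LAG = timedelta(minutes=1)  # acceptable lag from the consistency notes above


def choose_backend(db_latest: datetime,
                   index_latest: datetime,
                   critical: bool,
                   max_lag: timedelta = MAX_LAG) -> str:
    """Return "database" when a critical query would see a stale index."""
    lag = db_latest - index_latest
    if critical and lag > max_lag:
        return "database"
    return "search"
```
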
## References
- Database Schema: See `postgres-schema.md`
- Indexer Architecture: See `../indexing/indexer-architecture.md`
- Unified Search: See `../multichain/unified-search.md`