docs/specs/database/search-index-schema.md

# Search Index Schema Specification

## Overview

This document specifies the Elasticsearch/OpenSearch index schema for full-text search and faceted querying across blocks, transactions, addresses, tokens, and contracts.

## Architecture

```mermaid
flowchart LR
    PG[(PostgreSQL<br/>Canonical Data)]
    Transform[Data Transformer]
    ES[(Elasticsearch<br/>Search Index)]
    
    PG --> Transform
    Transform --> ES
    
    Query[Search Query]
    Query --> ES
    ES --> Results[Search Results]
```

## Index Structure

### Blocks Index

**Index Name**: `blocks-{chain_id}` (e.g., `blocks-138`)

**Document Structure**:
```json
{
  "block_number": 12345,
  "hash": "0x...",
  "timestamp": "2024-01-01T00:00:00Z",
  "miner": "0x...",
  "transaction_count": 100,
  "gas_used": 15000000,
  "gas_limit": 20000000,
  "chain_id": 138,
  "parent_hash": "0x...",
  "size": 1024
}
```

**Field Mappings**:
- `block_number`: `long` (not analyzed, for sorting/filtering)
- `hash`: `keyword` (exact match)
- `timestamp`: `date`
- `miner`: `keyword` (exact match)
- `transaction_count`: `integer`
- `gas_used`: `long`
- `gas_limit`: `long`
- `chain_id`: `integer`
- `parent_hash`: `keyword`

**Searchable Fields**:
- Hash (exact match)
- Miner address (exact match)

### Transactions Index

**Index Name**: `transactions-{chain_id}`

**Document Structure**:
```json
{
  "hash": "0x...",
  "block_number": 12345,
  "transaction_index": 5,
  "from_address": "0x...",
  "to_address": "0x...",
  "value": "1000000000000000000",
  "gas_price": "20000000000",
  "gas_used": 21000,
  "status": "success",
  "timestamp": "2024-01-01T00:00:00Z",
  "chain_id": 138,
  "input_data_length": 100,
  "is_contract_creation": false,
  "contract_address": null
}
```

**Field Mappings**:
- `hash`: `keyword`
- `block_number`: `long`
- `transaction_index`: `integer`
- `from_address`: `keyword`
- `to_address`: `keyword`
- `value`: `text` (for full-text search on large numbers)
- `value_numeric`: `long` (for range queries)
- `gas_price`: `long`
- `gas_used`: `long`
- `status`: `keyword`
- `timestamp`: `date`
- `chain_id`: `integer`
- `input_data_length`: `integer`
- `is_contract_creation`: `boolean`
- `contract_address`: `keyword`

**Searchable Fields**:
- Hash (exact match)
- From/to addresses (exact match)
- Value (range queries)

### Addresses Index

**Index Name**: `addresses-{chain_id}`

**Document Structure**:
```json
{
  "address": "0x...",
  "chain_id": 138,
  "label": "My Wallet",
  "tags": ["wallet", "exchange"],
  "token_count": 10,
  "transaction_count": 500,
  "first_seen": "2024-01-01T00:00:00Z",
  "last_seen": "2024-01-15T00:00:00Z",
  "is_contract": true,
  "contract_name": "MyToken",
  "balance_eth": "1.5",
  "balance_usd": "3000"
}
```

**Field Mappings**:
- `address`: `keyword`
- `chain_id`: `integer`
- `label`: `text` (analyzed) + `keyword` (exact match)
- `tags`: `keyword` (array)
- `token_count`: `integer`
- `transaction_count`: `long`
- `first_seen`: `date`
- `last_seen`: `date`
- `is_contract`: `boolean`
- `contract_name`: `text` + `keyword`
- `balance_eth`: `double`
- `balance_usd`: `double`

**Searchable Fields**:
- Address (exact match, prefix match)
- Label (full-text search)
- Contract name (full-text search)
- Tags (facet filter)

### Tokens Index

**Index Name**: `tokens-{chain_id}`

**Document Structure**:
```json
{
  "address": "0x...",
  "chain_id": 138,
  "name": "My Token",
  "symbol": "MTK",
  "type": "ERC20",
  "decimals": 18,
  "total_supply": "1000000000000000000000000",
  "holder_count": 1000,
  "transfer_count": 50000,
  "logo_url": "https://...",
  "verified": true,
  "description": "A token description"
}
```

**Field Mappings**:
- `address`: `keyword`
- `chain_id`: `integer`
- `name`: `text` (analyzed) + `keyword` (exact match)
- `symbol`: `keyword` (uppercase normalized)
- `type`: `keyword`
- `decimals`: `integer`
- `total_supply`: `text` (for large numbers)
- `total_supply_numeric`: `double` (for sorting)
- `holder_count`: `integer`
- `transfer_count`: `long`
- `logo_url`: `keyword`
- `verified`: `boolean`
- `description`: `text` (analyzed)

**Searchable Fields**:
- Name (full-text search)
- Symbol (exact match, prefix match)
- Address (exact match)

### Contracts Index

**Index Name**: `contracts-{chain_id}`

**Document Structure**:
```json
{
  "address": "0x...",
  "chain_id": 138,
  "name": "MyContract",
  "verification_status": "verified",
  "compiler_version": "0.8.19",
  "source_code": "contract MyContract {...}",
  "abi": [...],
  "verified_at": "2024-01-01T00:00:00Z",
  "transaction_count": 1000,
  "created_at": "2024-01-01T00:00:00Z"
}
```

**Field Mappings**:
- `address`: `keyword`
- `chain_id`: `integer`
- `name`: `text` + `keyword`
- `verification_status`: `keyword`
- `compiler_version`: `keyword`
- `source_code`: `text` (analyzed, indexed but not stored in full for large contracts)
- `abi`: `object` (nested, for structured queries)
- `verified_at`: `date`
- `transaction_count`: `long`
- `created_at`: `date`

**Searchable Fields**:
- Name (full-text search)
- Address (exact match)
- Source code (full-text search, limited)

## Indexing Pipeline

### Data Transformation

**Purpose**: Transform canonical PostgreSQL data into search-optimized documents.

**Transformation Steps**:
1. **Fetch Data**: Query PostgreSQL for entities to index
2. **Enrich Data**: Add computed fields (balances, counts, etc.)
3. **Normalize Data**: Normalize addresses, format values
4. **Index Document**: Send to Elasticsearch/OpenSearch

### Indexing Strategy

**Initial Indexing**:
- Bulk index existing data
- Process in batches (1000 documents per batch)
- Use bulk API for efficiency

**Incremental Indexing**:
- Index new entities as they're created
- Update entities when changed
- Delete entities when removed

**Update Frequency**:
- Real-time: Index immediately after database insert/update
- Batch: Bulk update every N minutes for efficiency

### Index Aliases

**Purpose**: Enable zero-downtime index updates.

**Strategy**:
- Write to new index (e.g., `blocks-138-v2`)
- Build index in background
- Switch alias when ready
- Delete old index after switch

**Alias Names**:
- `blocks-{chain_id}` → points to latest version
- `transactions-{chain_id}` → points to latest version
- etc.

## Query Patterns

### Full-Text Search

**Blocks Search**:
```json
{
  "query": {
    "match": {
      "hash": "0x123..."
    }
  }
}
```

**Address Search**:
```json
{
  "query": {
    "bool": {
      "should": [
        { "match": { "label": "wallet" } },
        { "prefix": { "address": "0x123" } }
      ]
    }
  }
}
```

**Token Search**:
```json
{
  "query": {
    "bool": {
      "should": [
        { "match": { "name": "My Token" } },
        { "match": { "symbol": "MTK" } }
      ]
    }
  }
}
```

### Faceted Search

**Filter by Multiple Criteria**:
```json
{
  "query": {
    "bool": {
      "must": [
        { "term": { "chain_id": 138 } },
        { "term": { "type": "ERC20" } },
        { "range": { "holder_count": { "gte": 100 } } }
      ]
    }
  },
  "aggs": {
    "by_type": {
      "terms": { "field": "type" }
    }
  }
}
```

### Unified Search

**Cross-Entity Search**:
- Search across blocks, transactions, addresses, tokens
- Use `_index` field to filter by entity type
- Combine results with relevance scoring

**Multi-Index Query**:
```json
{
  "query": {
    "multi_match": {
      "query": "0x123",
      "fields": ["hash", "address", "from_address", "to_address"],
      "type": "best_fields"
    }
  }
}
```

## Index Configuration

### Analysis Settings

**Custom Analyzer**:
- Address analyzer: Lowercase, no tokenization
- Symbol analyzer: Uppercase, no tokenization
- Text analyzer: Standard analyzer with lowercase

**Example Configuration**:
```json
{
  "settings": {
    "analysis": {
      "analyzer": {
        "address_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["lowercase"]
        }
      }
    }
  }
}
```

### Sharding and Replication

**Sharding**:
- Number of shards: Based on index size
- Large indices (> 50GB): Multiple shards
- Small indices: Single shard

**Replication**:
- Replica count: 1-2 (for high availability)
- Increase replicas for read-heavy workloads

## Performance Optimization

### Index Optimization

**Refresh Interval**:
- Default: 1 second
- For bulk indexing: Increase to 30 seconds, then reset

**Bulk Indexing**:
- Batch size: 1000-5000 documents
- Use bulk API
- Disable refresh during bulk indexing

### Query Optimization

**Query Caching**:
- Enable query cache for repeated queries
- Cache filter results

**Field Data**:
- Use `doc_values` for sorting/aggregations
- Avoid `fielddata` for text fields

## Maintenance

### Index Monitoring

**Metrics**:
- Index size
- Document count
- Query performance (p50, p95, p99)
- Index lag (time behind database)

### Index Cleanup

**Strategy**:
- Delete old indices (after alias switch)
- Archive old indices to cold storage
- Compress indices for storage efficiency

## Integration with PostgreSQL

### Data Sync

**Sync Strategy**:
- Real-time: Listen to database changes (CDC, triggers, or polling)
- Batch: Periodic sync jobs
- Hybrid: Real-time for recent data, batch for historical

**Change Detection**:
- Use `updated_at` timestamp
- Use database triggers to queue changes
- Use CDC (Change Data Capture) if available

### Consistency

**Eventual Consistency**:
- Search index is eventually consistent with database
- Small lag acceptable (< 1 minute)
- Critical queries can fall back to database

## References

- Database Schema: See `postgres-schema.md`
- Indexer Architecture: See `../indexing/indexer-architecture.md`
- Unified Search: See `../multichain/unified-search.md`
Add full monorepo: virtual-banker, backend, frontend, docs, scripts, deployment Co-authored-by: Cursor <cursoragent@cursor.com> 2026-02-10 11:32:49 -08:00			`# Search Index Schema Specification`

			`## Overview`

			`This document specifies the Elasticsearch/OpenSearch index schema for full-text search and faceted querying across blocks, transactions, addresses, tokens, and contracts.`

			`## Architecture`

			```mermaid
			`flowchart LR`
			`PG[(PostgreSQL<br/>Canonical Data)]`
			`Transform[Data Transformer]`
			`ES[(Elasticsearch<br/>Search Index)]`

			`PG --> Transform`
			`Transform --> ES`

			`Query[Search Query]`
			`Query --> ES`
			`ES --> Results[Search Results]`
			```

			`## Index Structure`

			`### Blocks Index`

			Index Name: `blocks-{chain_id}` (e.g., `blocks-138`)

			`Document Structure:`
			```json
			`{`
			`"block_number": 12345,`
			`"hash": "0x...",`
			`"timestamp": "2024-01-01T00:00:00Z",`
			`"miner": "0x...",`
			`"transaction_count": 100,`
			`"gas_used": 15000000,`
			`"gas_limit": 20000000,`
			`"chain_id": 138,`
			`"parent_hash": "0x...",`
			`"size": 1024`
			`}`
			```

			`Field Mappings:`
			- `block_number`: `long` (not analyzed, for sorting/filtering)
			- `hash`: `keyword` (exact match)
			- `timestamp`: `date`
			- `miner`: `keyword` (exact match)
			- `transaction_count`: `integer`
			- `gas_used`: `long`
			- `gas_limit`: `long`
			- `chain_id`: `integer`
			- `parent_hash`: `keyword`

			`Searchable Fields:`
			`- Hash (exact match)`
			`- Miner address (exact match)`

			`### Transactions Index`

			Index Name: `transactions-{chain_id}`

			`Document Structure:`
			```json
			`{`
			`"hash": "0x...",`
			`"block_number": 12345,`
			`"transaction_index": 5,`
			`"from_address": "0x...",`
			`"to_address": "0x...",`
			`"value": "1000000000000000000",`
			`"gas_price": "20000000000",`
			`"gas_used": 21000,`
			`"status": "success",`
			`"timestamp": "2024-01-01T00:00:00Z",`
			`"chain_id": 138,`
			`"input_data_length": 100,`
			`"is_contract_creation": false,`
			`"contract_address": null`
			`}`
			```

			`Field Mappings:`
			- `hash`: `keyword`
			- `block_number`: `long`
			- `transaction_index`: `integer`
			- `from_address`: `keyword`
			- `to_address`: `keyword`
			- `value`: `text` (for full-text search on large numbers)
			- `value_numeric`: `long` (for range queries)
			- `gas_price`: `long`
			- `gas_used`: `long`
			- `status`: `keyword`
			- `timestamp`: `date`
			- `chain_id`: `integer`
			- `input_data_length`: `integer`
			- `is_contract_creation`: `boolean`
			- `contract_address`: `keyword`

			`Searchable Fields:`
			`- Hash (exact match)`
			`- From/to addresses (exact match)`
			`- Value (range queries)`

			`### Addresses Index`

			Index Name: `addresses-{chain_id}`

			`Document Structure:`
			```json
			`{`
			`"address": "0x...",`
			`"chain_id": 138,`
			`"label": "My Wallet",`
			`"tags": ["wallet", "exchange"],`
			`"token_count": 10,`
			`"transaction_count": 500,`
			`"first_seen": "2024-01-01T00:00:00Z",`
			`"last_seen": "2024-01-15T00:00:00Z",`
			`"is_contract": true,`
			`"contract_name": "MyToken",`
			`"balance_eth": "1.5",`
			`"balance_usd": "3000"`
			`}`
			```

			`Field Mappings:`
			- `address`: `keyword`
			- `chain_id`: `integer`
			- `label`: `text` (analyzed) + `keyword` (exact match)
			- `tags`: `keyword` (array)
			- `token_count`: `integer`
			- `transaction_count`: `long`
			- `first_seen`: `date`
			- `last_seen`: `date`
			- `is_contract`: `boolean`
			- `contract_name`: `text` + `keyword`
			- `balance_eth`: `double`
			- `balance_usd`: `double`

			`Searchable Fields:`
			`- Address (exact match, prefix match)`
			`- Label (full-text search)`
			`- Contract name (full-text search)`
			`- Tags (facet filter)`

			`### Tokens Index`

			Index Name: `tokens-{chain_id}`

			`Document Structure:`
			```json
			`{`
			`"address": "0x...",`
			`"chain_id": 138,`
			`"name": "My Token",`
			`"symbol": "MTK",`
			`"type": "ERC20",`
			`"decimals": 18,`
			`"total_supply": "1000000000000000000000000",`
			`"holder_count": 1000,`
			`"transfer_count": 50000,`
			`"logo_url": "https://...",`
			`"verified": true,`
			`"description": "A token description"`
			`}`
			```

			`Field Mappings:`
			- `address`: `keyword`
			- `chain_id`: `integer`
			- `name`: `text` (analyzed) + `keyword` (exact match)
			- `symbol`: `keyword` (uppercase normalized)
			- `type`: `keyword`
			- `decimals`: `integer`
			- `total_supply`: `text` (for large numbers)
			- `total_supply_numeric`: `double` (for sorting)
			- `holder_count`: `integer`
			- `transfer_count`: `long`
			- `logo_url`: `keyword`
			- `verified`: `boolean`
			- `description`: `text` (analyzed)

			`Searchable Fields:`
			`- Name (full-text search)`
			`- Symbol (exact match, prefix match)`
			`- Address (exact match)`

			`### Contracts Index`

			Index Name: `contracts-{chain_id}`

			`Document Structure:`
			```json
			`{`
			`"address": "0x...",`
			`"chain_id": 138,`
			`"name": "MyContract",`
			`"verification_status": "verified",`
			`"compiler_version": "0.8.19",`
			`"source_code": "contract MyContract {...}",`
			`"abi": [...],`
			`"verified_at": "2024-01-01T00:00:00Z",`
			`"transaction_count": 1000,`
			`"created_at": "2024-01-01T00:00:00Z"`
			`}`
			```

			`Field Mappings:`
			- `address`: `keyword`
			- `chain_id`: `integer`
			- `name`: `text` + `keyword`
			- `verification_status`: `keyword`
			- `compiler_version`: `keyword`
			- `source_code`: `text` (analyzed, indexed but not stored in full for large contracts)
			- `abi`: `object` (nested, for structured queries)
			- `verified_at`: `date`
			- `transaction_count`: `long`
			- `created_at`: `date`

			`Searchable Fields:`
			`- Name (full-text search)`
			`- Address (exact match)`
			`- Source code (full-text search, limited)`

			`## Indexing Pipeline`

			`### Data Transformation`

			`Purpose: Transform canonical PostgreSQL data into search-optimized documents.`

			`Transformation Steps:`
			`1. Fetch Data: Query PostgreSQL for entities to index`
			`2. Enrich Data: Add computed fields (balances, counts, etc.)`
			`3. Normalize Data: Normalize addresses, format values`
			`4. Index Document: Send to Elasticsearch/OpenSearch`

			`### Indexing Strategy`

			`Initial Indexing:`
			`- Bulk index existing data`
			`- Process in batches (1000 documents per batch)`
			`- Use bulk API for efficiency`

			`Incremental Indexing:`
			`- Index new entities as they're created`
			`- Update entities when changed`
			`- Delete entities when removed`

			`Update Frequency:`
			`- Real-time: Index immediately after database insert/update`
			`- Batch: Bulk update every N minutes for efficiency`

			`### Index Aliases`

			`Purpose: Enable zero-downtime index updates.`

			`Strategy:`
			- Write to new index (e.g., `blocks-138-v2`)
			`- Build index in background`
			`- Switch alias when ready`
			`- Delete old index after switch`

			`Alias Names:`
			- `blocks-{chain_id}` → points to latest version
			- `transactions-{chain_id}` → points to latest version
			`- etc.`

			`## Query Patterns`

			`### Full-Text Search`

			`Blocks Search:`
			```json
			`{`
			`"query": {`
			`"match": {`
			`"hash": "0x123..."`
			`}`
			`}`
			`}`
			```

			`Address Search:`
			```json
			`{`
			`"query": {`
			`"bool": {`
			`"should": [`
			`{ "match": { "label": "wallet" } },`
			`{ "prefix": { "address": "0x123" } }`
			`]`
			`}`
			`}`
			`}`
			```

			`Token Search:`
			```json
			`{`
			`"query": {`
			`"bool": {`
			`"should": [`
			`{ "match": { "name": "My Token" } },`
			`{ "match": { "symbol": "MTK" } }`
			`]`
			`}`
			`}`
			`}`
			```

			`### Faceted Search`

			`Filter by Multiple Criteria:`
			```json
			`{`
			`"query": {`
			`"bool": {`
			`"must": [`
			`{ "term": { "chain_id": 138 } },`
			`{ "term": { "type": "ERC20" } },`
			`{ "range": { "holder_count": { "gte": 100 } } }`
			`]`
			`}`
			`},`
			`"aggs": {`
			`"by_type": {`
			`"terms": { "field": "type" }`
			`}`
			`}`
			`}`
			```

			`### Unified Search`

			`Cross-Entity Search:`
			`- Search across blocks, transactions, addresses, tokens`
			- Use `_index` field to filter by entity type
			`- Combine results with relevance scoring`

			`Multi-Index Query:`
			```json
			`{`
			`"query": {`
			`"multi_match": {`
			`"query": "0x123",`
			`"fields": ["hash", "address", "from_address", "to_address"],`
			`"type": "best_fields"`
			`}`
			`}`
			`}`
			```

			`## Index Configuration`

			`### Analysis Settings`

			`Custom Analyzer:`
			`- Address analyzer: Lowercase, no tokenization`
			`- Symbol analyzer: Uppercase, no tokenization`
			`- Text analyzer: Standard analyzer with lowercase`

			`Example Configuration:`
			```json
			`{`
			`"settings": {`
			`"analysis": {`
			`"analyzer": {`
			`"address_analyzer": {`
			`"type": "custom",`
			`"tokenizer": "keyword",`
			`"filter": ["lowercase"]`
			`}`
			`}`
			`}`
			`}`
			`}`
			```

			`### Sharding and Replication`

			`Sharding:`
			`- Number of shards: Based on index size`
			`- Large indices (> 50GB): Multiple shards`
			`- Small indices: Single shard`

			`Replication:`
			`- Replica count: 1-2 (for high availability)`
			`- Increase replicas for read-heavy workloads`

			`## Performance Optimization`

			`### Index Optimization`

			`Refresh Interval:`
			`- Default: 1 second`
			`- For bulk indexing: Increase to 30 seconds, then reset`

			`Bulk Indexing:`
			`- Batch size: 1000-5000 documents`
			`- Use bulk API`
			`- Disable refresh during bulk indexing`

			`### Query Optimization`

			`Query Caching:`
			`- Enable query cache for repeated queries`
			`- Cache filter results`

			`Field Data:`
			- Use `doc_values` for sorting/aggregations
			- Avoid `fielddata` for text fields

			`## Maintenance`

			`### Index Monitoring`

			`Metrics:`
			`- Index size`
			`- Document count`
			`- Query performance (p50, p95, p99)`
			`- Index lag (time behind database)`

			`### Index Cleanup`

			`Strategy:`
			`- Delete old indices (after alias switch)`
			`- Archive old indices to cold storage`
			`- Compress indices for storage efficiency`

			`## Integration with PostgreSQL`

			`### Data Sync`

			`Sync Strategy:`
			`- Real-time: Listen to database changes (CDC, triggers, or polling)`
			`- Batch: Periodic sync jobs`
			`- Hybrid: Real-time for recent data, batch for historical`

			`Change Detection:`
			- Use `updated_at` timestamp
			`- Use database triggers to queue changes`
			`- Use CDC (Change Data Capture) if available`

			`### Consistency`

			`Eventual Consistency:`
			`- Search index is eventually consistent with database`
			`- Small lag acceptable (< 1 minute)`
			`- Critical queries can fall back to database`

			`## References`

			- Database Schema: See `postgres-schema.md`
			- Indexer Architecture: See `../indexing/indexer-architecture.md`
			- Unified Search: See `../multichain/unified-search.md`