# Indexer Architecture Specification
## Overview
This document specifies the architecture for the blockchain indexing pipeline that ingests, processes, and stores blockchain data from ChainID 138 and other supported chains. The indexer is responsible for maintaining a complete, queryable database of blocks, transactions, logs, traces, and token transfers.
## Architecture
```mermaid
flowchart TB
subgraph Input[Input Layer]
Node[RPC Node<br/>ChainID 138]
WS[WebSocket<br/>New Block Events]
end
subgraph Ingest[Ingestion Layer]
BL[Block Listener<br/>Real-time]
BW[Backfill Worker<br/>Historical]
Q[Message Queue<br/>Kafka/RabbitMQ]
end
subgraph Process[Processing Layer]
BP[Block Processor]
TP[Transaction Processor]
LP[Log Processor]
TrP[Trace Processor]
TokenP[Token Transfer Processor]
end
subgraph Decode[Decoding Layer]
ABI[ABI Registry]
SigDB[Signature Database]
Decoder[Event Decoder]
end
subgraph Persist[Persistence Layer]
PG[(PostgreSQL<br/>Canonical Data)]
ES[(Elasticsearch<br/>Search Index)]
TS[(TimescaleDB<br/>Metrics)]
end
subgraph Materialize[Materialization Layer]
Agg[Aggregator<br/>TPS, Gas Stats]
Cache[Cache Layer<br/>Redis]
end
Node --> BL
Node --> BW
WS --> BL
BL --> Q
BW --> Q
Q --> BP
BP --> TP
BP --> LP
BP --> TrP
TP --> TokenP
LP --> Decoder
Decoder --> ABI
Decoder --> SigDB
BP --> PG
TP --> PG
LP --> PG
TrP --> PG
TokenP --> PG
BP --> ES
TP --> ES
LP --> ES
BP --> TS
TP --> TS
PG --> Agg
Agg --> Cache
```
## Block Ingestion Pipeline
### Block Listener (Real-time)
**Purpose**: Monitor blockchain for new blocks and ingest them immediately.
**Implementation**:
- Subscribe to `newHeads` via WebSocket
- Poll `eth_blockNumber` as fallback (every 2 seconds)
- Handle WebSocket reconnection automatically
**Flow**:
1. Receive block header event
2. Fetch full block data via `eth_getBlockByNumber`
3. Enqueue block to processing queue
4. Acknowledge receipt
**Error Handling**:
- Retry on network errors (exponential backoff)
- Handle reorgs (see reorg handling section)
- Log errors for monitoring
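The per-event flow above can be sketched as a small handler. This is a minimal sketch with injected `fetch_block` and `enqueue` callables (illustrative names, not a real client API); the hash check is a cheap first line of defense before the dedicated reorg handling kicks in.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class BlockHeader:
    number: int
    hash: str

def handle_new_head(header: BlockHeader,
                    fetch_block: Callable[[int], dict],
                    enqueue: Callable[[dict], None]) -> None:
    """Steps 1-3 above: header event -> full block -> processing queue."""
    block = fetch_block(header.number)      # eth_getBlockByNumber
    if block["hash"] != header.hash:
        # Hash mismatch between header and fetched block hints at a
        # reorg; defer to the reorg handling path.
        raise ValueError(f"hash mismatch at block {header.number}")
    enqueue(block)                          # hand off to processing queue
```

In production this would run inside the WebSocket subscription callback, with the polling fallback feeding the same handler.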
### Backfill Worker (Historical)
**Purpose**: Index historical blocks from genesis or a specific starting point.
**Implementation**:
- Parallel workers for faster indexing
- Configurable batch size (e.g., 100 blocks per batch)
- Rate limiting to avoid overloading RPC node
- Checkpoint system for resuming interrupted backfills
**Flow**:
1. Determine starting block (checkpoint or genesis)
2. Fetch batch of blocks
3. Enqueue each block to processing queue
4. Update checkpoint
5. Repeat until caught up with chain head
**Optimization Strategies**:
- Parallel workers process different block ranges
- Skip blocks already indexed (idempotent processing)
- Batch RPC requests where possible
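The checkpointed loop above can be expressed compactly. This is a sketch with injected callables (`fetch_batch`, `enqueue`, `save_checkpoint` are illustrative names); a real worker would also apply rate limiting between batches and skip already-indexed blocks.

```python
from typing import Callable, Iterable

def backfill(start: int, head: int, batch_size: int,
             fetch_batch: Callable[[int, int], Iterable[dict]],
             enqueue: Callable[[dict], None],
             save_checkpoint: Callable[[int], None]) -> int:
    """Steps 1-5 above; returns the last block processed."""
    current = start
    while current <= head:
        end = min(current + batch_size - 1, head)
        for block in fetch_batch(current, end):   # batched RPC fetch
            enqueue(block)
        save_checkpoint(end)   # resume point if the worker restarts
        current = end + 1
    return current - 1
```

Because the checkpoint is written only after a full batch is enqueued, a crash replays at most one batch, which idempotent processing absorbs.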
### Message Queue
**Purpose**: Decouple ingestion from processing, enable scaling, ensure durability.
**Technology**: Kafka or RabbitMQ
**Topics/Queues**:
- `blocks`: New blocks to process
- `transactions`: Transactions to decode
- `traces`: Traces to process (async)
**Configuration**:
- Durability: Persistent storage
- Replication: 3 replicas for high availability
- Partitioning: By chain_id and block number (for ordering)
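Since queues like Kafka only guarantee ordering within a partition, the simplest key that preserves per-chain block order is the chain ID alone. A minimal sketch (spreading further by block-number range is possible during backfill, at the cost of ordering that must then be reconstructed downstream):

```python
def partition_for(chain_id: int, num_partitions: int) -> int:
    """Route every block of one chain to the same partition so the
    queue preserves that chain's block order."""
    return chain_id % num_partitions
```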
## Transaction Processing Flow
### Block Processing
**Steps**:
1. **Validate Block**: Verify block hash, parent hash, block number
2. **Extract Transactions**: Get transaction list from block
3. **Fetch Receipts**: Get transaction receipts for all transactions
4. **Process Each Transaction**:
- Store transaction data
- Process receipt (logs, status)
- Extract token transfers (ERC-20/721/1155)
- Link to contract interactions
**Data Extracted**:
- Transaction fields (hash, from, to, value, gas, etc.)
- Receipt fields (status, gasUsed, logs, etc.)
- Contract creation detection
- Token transfer events
### Transaction Decoding
**Purpose**: Decode event logs and transaction data using ABIs.
**Process**:
1. Identify contract address (to field or created address)
2. Look up ABI in registry (verified contracts)
3. Decode function calls and events
4. Store decoded data for search and filtering
**Fallback Strategies**:
- Signature database for unknown functions/events (4-byte signatures)
- Heuristic detection for common patterns (Transfer events)
- Store raw data when decoding fails
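The fallback chain above (full ABI, then 4-byte signature database, then raw passthrough) can be sketched as a single function. The registries are plain dicts here for illustration; a real implementation would consult the Postgres/Redis-backed ABI registry and signature database described below.

```python
def decode_call(to_addr: str, calldata: str,
                abi_registry: dict, sig_db: dict) -> dict:
    """Decode a transaction's input with graceful degradation."""
    selector = calldata[:10]                       # '0x' + 4-byte selector
    fns = abi_registry.get(to_addr, {})
    if selector in fns:
        return {"decoded": "full", "function": fns[selector]}
    if selector in sig_db:
        # Name known but no ABI: partial decoding only.
        return {"decoded": "partial", "function": sig_db[selector]}
    return {"decoded": "raw", "data": calldata}    # store undecoded
```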
### ABI Registry
**Purpose**: Store contract ABIs for decoding transactions and events.
**Data Sources**:
- Contract verification submissions
- Sourcify integration
- Public ABI repositories (4byte.directory, etc.)
**Storage**:
- Database table: `contract_abis`
- Cache layer: Redis for frequently accessed ABIs
- Versioning: Support multiple ABI versions per contract
**Schema**:
```sql
contract_abis (
id UUID PRIMARY KEY,
chain_id INTEGER NOT NULL,
address VARCHAR(42) NOT NULL,
abi JSONB NOT NULL,
verified BOOLEAN DEFAULT false,
source VARCHAR(50), -- 'verification', 'sourcify', 'public'
created_at TIMESTAMP,
updated_at TIMESTAMP,
UNIQUE(chain_id, address)
)
```
### Signature Database
**Purpose**: Map 4-byte function signatures and 32-byte event signatures to function/event names.
**Data Sources**:
- Public signature databases (4byte.directory)
- User submissions
- Automatic extraction from verified contracts
**Usage**:
- Lookup function name from signature (e.g., `0x095ea7b3` → `approve(address,uint256)`)
- Lookup event name from topic[0] (e.g., `0xddf252...` → `Transfer(address,address,uint256)`)
- Partial decoding when full ABI unavailable
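At its core the signature database is a keccak-hash-to-name map. A minimal sketch seeded with the well-known ERC-20 `Transfer` topic hash (a real deployment would load millions of entries from 4byte.directory-style dumps into Postgres with a Redis cache in front):

```python
from typing import Optional

EVENT_SIGS = {
    # topic[0] (keccak-256 of the canonical signature) -> event signature
    "0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef":
        "Transfer(address,address,uint256)",
}

def event_name(topic0: str) -> Optional[str]:
    """Partial decoding: recover the event name even without a full ABI."""
    return EVENT_SIGS.get(topic0.lower())
```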
## Event Log Indexing
### Log Processing
**Purpose**: Index event logs for efficient querying and filtering.
**Process**:
1. Extract logs from transaction receipts
2. Decode log topics and data using ABI
3. Index by:
- Contract address
- Event signature (topic[0])
- Indexed parameters (topic[1..3])
- Block number and transaction hash
- Log index
**Indexing Strategy**:
- PostgreSQL table: `logs` with indexes on (address, topic0, block_number)
- Elasticsearch index: Full-text search on decoded event data
- Time-series: Aggregate log counts per contract/event
### Event Decoding
**Decoding Flow**:
1. Identify event signature from topic[0]
2. Look up event definition in ABI registry
3. Decode indexed parameters (topics 1-3)
4. Decode non-indexed parameters (data field)
5. Store decoded parameters as JSONB
**Common Events to Index**:
- ERC-20: `Transfer(address,address,uint256)` (`value` is non-indexed)
- ERC-721: `Transfer(address,address,uint256)` (same signature, but `tokenId` is indexed, so the log has four topics)
- ERC-1155: `TransferSingle`, `TransferBatch`
- Approval events: `Approval(address,address,uint256)`
## Trace Processing
### Call Trace Extraction
**Purpose**: Extract detailed call traces for transaction debugging and internal transaction tracking.
**Trace Types**:
- `call`: Contract calls
- `create`: Contract creation
- `suicide`: Contract self-destruct (the `SELFDESTRUCT` opcode; `suicide` is the legacy trace name)
- `delegatecall`: Delegate calls
**Process**:
1. Request trace via `trace_transaction` or `trace_block`
2. Parse trace result structure
3. Extract:
- Call hierarchy (parent-child relationships)
- Internal transactions (value transfers)
- Gas usage per call
- Revert information
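Step 3's call hierarchy can be flattened into rows by a depth-first walk that tags each frame with its `trace_address` path (the index of each ancestor call). A sketch, assuming nested-call trace output with `calls` children; exact field names vary by tracer:

```python
from typing import List, Optional

def flatten_trace(call: dict,
                  trace_address: Optional[List[int]] = None) -> List[dict]:
    """Depth-first flatten of a call tree into storable rows."""
    if trace_address is None:
        trace_address = []                       # root frame
    rows = [{
        "trace_address": trace_address,
        "type": call.get("type", "call"),
        "from": call.get("from"),
        "to": call.get("to"),
        "value": int(call.get("value", "0x0"), 16),
    }]
    for i, sub in enumerate(call.get("calls", [])):
        rows.extend(flatten_trace(sub, trace_address + [i]))
    return rows
```

Rows with a non-zero `value` become the internal transactions tracked in the next section.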
### Internal Transaction Tracking
**Purpose**: Track value transfers that occur inside transactions (not just top-level).
**Data Extracted**:
- From address (caller)
- To address (callee)
- Value transferred
- Call type (call, delegatecall, etc.)
- Success/failure status
- Gas used
**Storage**:
- Separate table: `internal_transactions`
- Link to parent transaction via `transaction_hash`
- Link to parent call via `trace_address` array
## Token Transfer Extraction
### ERC-20 Transfer Detection
**Detection Method**:
1. Look for `Transfer(address,address,uint256)` event
2. Decode event parameters (from, to, value)
3. Store in `token_transfers` table
4. Update token holder balances
**Data Stored**:
- Token contract address
- From address
- To address
- Amount (with decimals)
- Block number
- Transaction hash
- Log index
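On chain the amount is a raw integer; applying the token's `decimals` (18 for most ERC-20 tokens) yields the display value. Using `Decimal` avoids float rounding on 256-bit amounts:

```python
from decimal import Decimal

def normalize_amount(raw: int, decimals: int) -> Decimal:
    """Convert a raw on-chain integer amount to its display value."""
    return Decimal(raw) / Decimal(10) ** decimals
```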
### ERC-721 Transfer Detection
**Similar to ERC-20 but**:
- Token ID is tracked (unique NFT)
- Transfer can be from zero address (mint) or to zero address (burn)
### ERC-1155 Transfer Detection
**Events**:
- `TransferSingle`: Single token transfer
- `TransferBatch`: Batch token transfer
**Challenges**:
- Multiple token IDs and amounts per transfer
- Batch operations require array decoding
### Token Holder Tracking
**Purpose**: Maintain list of addresses holding each token.
**Strategy**:
- Real-time updates: Update on each transfer
- Periodic reconciliation: Verify balances via RPC
- Balance snapshots: Store balance at each block (for historical queries)
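The real-time update rule above reduces to debiting the sender and crediting the receiver, with the zero address modeling mints and burns. A minimal in-memory sketch; a real system does this as a database upsert and relies on the periodic RPC reconciliation to catch drift:

```python
ZERO_ADDRESS = "0x" + "00" * 20

def apply_transfer(balances: dict, from_addr: str, to_addr: str,
                   value: int) -> None:
    """Update holder balances for one transfer event."""
    if from_addr != ZERO_ADDRESS:                 # not a mint
        balances[from_addr] = balances.get(from_addr, 0) - value
    if to_addr != ZERO_ADDRESS:                   # not a burn
        balances[to_addr] = balances.get(to_addr, 0) + value
```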
## Indexer Worker Scaling and Partitioning
### Horizontal Scaling
**Strategy**: Multiple indexer workers processing different blocks/chains.
**Partitioning Methods**:
1. **By Chain**: Each worker handles one chain
2. **By Block Range**: Workers split block ranges (for backfill)
3. **By Processing Stage**: Separate workers for blocks, traces, token transfers
### Worker Coordination
**Mechanisms**:
- Message queue: Workers consume from shared queue
- Database locks: Prevent duplicate processing
- Leader election: For single-worker tasks (reorg handling)
### Load Balancing
**Distribution**:
- Round-robin for backfill workers
- Sticky sessions for chain-specific workers
- Priority queuing: Real-time blocks before historical blocks
### Performance Targets
**Throughput**:
- Process 100 blocks/minute per worker
- Process 1000 transactions/minute per worker
- Process 100 traces/minute per worker (trace operations are slower)
**Latency**:
- Real-time blocks: Indexed within 5 seconds of block production
- Historical blocks: Catch up to chain head at the backfill throughput targets above (wall-clock time depends on chain length and worker count)
## Data Consistency
### Transaction Isolation
**Strategy**: Process blocks atomically (all or nothing).
**Implementation**:
- Database transactions for block-level operations
- Idempotent processing (can safely retry)
- Checkpoint system to track last processed block
### Idempotency
**Requirements**:
- Processing same block multiple times should not create duplicates
- Use unique constraints in database
- Upsert operations where applicable
## Error Handling and Retry Logic
### Error Types
1. **Transient Errors**: Network issues, temporary RPC failures
- Retry with exponential backoff
- Max retries: 10
- Max backoff: 5 minutes
2. **Permanent Errors**: Invalid data, unsupported features
- Log error and skip
- Alert for investigation
3. **Reorg Errors**: Block replaced by different block
- Handle via reorg detection (see reorg handling spec)
### Retry Strategy
**Exponential Backoff**:
- Initial delay: 1 second
- Multiplier: 2x
- Max delay: 5 minutes
- Jitter: Random ±20% to avoid thundering herd
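With the parameters above (1 s initial delay, 2x multiplier, 5-minute cap, ±20% jitter), the delay for a given attempt is:

```python
import random

def backoff_delay(attempt: int, base: float = 1.0,
                  max_delay: float = 300.0, jitter: float = 0.2) -> float:
    """Delay in seconds before retry number `attempt` (0-based)."""
    delay = min(base * (2 ** attempt), max_delay)   # cap at 5 minutes
    return delay * (1 + random.uniform(-jitter, jitter))
```

The jitter spreads retries out so that many workers failing at once do not hammer the RPC node in lockstep.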
## Monitoring and Observability
### Key Metrics
**Throughput**:
- Blocks processed per minute
- Transactions processed per minute
- Logs indexed per minute
**Latency**:
- Time from block production to index completion
- Time to process block (p50, p95, p99)
**Lag**:
- Block height lag (current block - last indexed block)
- Time lag (current time - last indexed block time)
**Errors**:
- Error rate by type
- Retry count
- Failed blocks
### Alerting Rules
- Block lag > 10 blocks: Warning
- Block lag > 100 blocks: Critical
- Error rate > 1%: Warning
- Error rate > 5%: Critical
- Worker down: Critical
## Integration Points
### RPC Node Integration
- See `../infrastructure/node-rpc-architecture.md`
- Connection pooling
- Rate limiting awareness
- Failover handling
### Database Integration
- See `../database/postgres-schema.md`
- Connection pooling
- Batch inserts for performance
- Transaction management
### Search Integration
- See `../database/search-index-schema.md`
- Async indexing to Elasticsearch
- Bulk indexing for efficiency
## Implementation Guidelines
### Technology Stack
**Recommended**:
- **Language**: Go, Rust, or Python (performance considerations)
- **Queue**: Kafka (high throughput) or RabbitMQ (simpler setup)
- **Database**: PostgreSQL with connection pooling
- **Caching**: Redis for frequently accessed data
### Code Structure
```
indexer/
├── cmd/
│ ├── block-listener/ # Real-time block listener
│ ├── backfill-worker/ # Historical indexing worker
│ └── processor/ # Block/transaction processor
├── internal/
│ ├── ingestion/ # Ingestion logic
│ ├── processing/ # Processing logic
│ ├── decoding/ # ABI/signature decoding
│ └── persistence/ # Database operations
└── pkg/
├── abi/ # ABI registry
└── rpc/ # RPC client
```
### Testing Strategy
**Unit Tests**:
- Decoding logic
- Data transformation
- Error handling
**Integration Tests**:
- End-to-end block processing
- Database operations
- Queue integration
**Load Tests**:
- Process historical blocks
- Simulate high block production rate
- Test worker scaling
## References
- Data Models: See `data-models.md`
- Reorg Handling: See `reorg-handling.md`
- Database Schema: See `../database/postgres-schema.md`
- RPC Architecture: See `../infrastructure/node-rpc-architecture.md`