Indexer Architecture Specification

Overview

This document specifies the architecture for the blockchain indexing pipeline that ingests, processes, and stores blockchain data from ChainID 138 and other supported chains. The indexer is responsible for maintaining a complete, queryable database of blocks, transactions, logs, traces, and token transfers.

Architecture

flowchart TB
    subgraph Input[Input Layer]
        Node[RPC Node<br/>ChainID 138]
        WS[WebSocket<br/>New Block Events]
    end
    
    subgraph Ingest[Ingestion Layer]
        BL[Block Listener<br/>Real-time]
        BW[Backfill Worker<br/>Historical]
        Q[Message Queue<br/>Kafka/RabbitMQ]
    end
    
    subgraph Process[Processing Layer]
        BP[Block Processor]
        TP[Transaction Processor]
        LP[Log Processor]
        TrP[Trace Processor]
        TokenP[Token Transfer Processor]
    end
    
    subgraph Decode[Decoding Layer]
        ABI[ABI Registry]
        SigDB[Signature Database]
        Decoder[Event Decoder]
    end
    
    subgraph Persist[Persistence Layer]
        PG[(PostgreSQL<br/>Canonical Data)]
        ES[(Elasticsearch<br/>Search Index)]
        TS[(TimescaleDB<br/>Metrics)]
    end
    
    subgraph Materialize[Materialization Layer]
        Agg[Aggregator<br/>TPS, Gas Stats]
        Cache[Cache Layer<br/>Redis]
    end
    
    Node --> BL
    Node --> BW
    WS --> BL
    
    BL --> Q
    BW --> Q
    
    Q --> BP
    BP --> TP
    BP --> LP
    BP --> TrP
    
    TP --> TokenP
    LP --> Decoder
    Decoder --> ABI
    Decoder --> SigDB
    
    BP --> PG
    TP --> PG
    LP --> PG
    TrP --> PG
    TokenP --> PG
    
    BP --> ES
    TP --> ES
    LP --> ES
    
    BP --> TS
    TP --> TS
    
    PG --> Agg
    Agg --> Cache

Block Ingestion Pipeline

Block Listener (Real-time)

Purpose: Monitor blockchain for new blocks and ingest them immediately.

Implementation:

  • Subscribe to newHeads via WebSocket
  • Poll eth_blockNumber as fallback (every 2 seconds)
  • Handle WebSocket reconnection automatically

Flow:

  1. Receive block header event
  2. Fetch full block data via eth_getBlockByNumber
  3. Enqueue block to processing queue
  4. Acknowledge receipt

Error Handling:

  • Retry on network errors (exponential backoff)
  • Handle reorgs (see reorg handling section)
  • Log errors for monitoring
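
The polling fallback and ingestion flow above can be sketched as follows. This is a minimal sketch: `rpc_get_latest`, `rpc_get_block`, and `enqueue` are placeholder callables standing in for the real RPC client and queue producer, and the WebSocket path, retries, and reorg handling are omitted.

```python
import time

def run_listener(rpc_get_latest, rpc_get_block, enqueue,
                 poll_interval=2.0, start_block=None, max_iterations=None):
    """Poll-based fallback loop: fetch every block past the last seen head and enqueue it."""
    last_seen = start_block if start_block is not None else rpc_get_latest()
    iterations = 0
    while max_iterations is None or iterations < max_iterations:
        head = rpc_get_latest()
        # Catch up block-by-block so no block is skipped between polls
        while last_seen < head:
            last_seen += 1
            enqueue(rpc_get_block(last_seen))
        iterations += 1
        if max_iterations is None or iterations < max_iterations:
            time.sleep(poll_interval)
```

In production this loop would run alongside the WebSocket subscription and only drive ingestion when the subscription is down.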

Backfill Worker (Historical)

Purpose: Index historical blocks from genesis or a specific starting point.

Implementation:

  • Parallel workers for faster indexing
  • Configurable batch size (e.g., 100 blocks per batch)
  • Rate limiting to avoid overloading RPC node
  • Checkpoint system for resuming interrupted backfills

Flow:

  1. Determine starting block (checkpoint or genesis)
  2. Fetch batch of blocks
  3. Enqueue each block to processing queue
  4. Update checkpoint
  5. Repeat until caught up with chain head

Optimization Strategies:

  • Parallel workers process different block ranges
  • Skip blocks already indexed (idempotent processing)
  • Batch RPC requests where possible
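
The checkpointed backfill flow above can be sketched as follows. This is a simplified single-worker version; `fetch_batch`, `enqueue`, `load_checkpoint`, and `save_checkpoint` are placeholder callables, and parallelism and rate limiting are left out.

```python
def backfill(fetch_batch, enqueue, load_checkpoint, save_checkpoint,
             head, batch_size=100):
    """Resume from the checkpoint, fetch blocks in batches, enqueue them, advance the checkpoint."""
    next_block = load_checkpoint() + 1
    while next_block <= head:
        end = min(next_block + batch_size - 1, head)
        for block in fetch_batch(next_block, end):
            enqueue(block)
        # Checkpoint only after the whole batch is enqueued, so an
        # interrupted run resumes at the last fully enqueued batch
        save_checkpoint(end)
        next_block = end + 1
```

Because downstream processing is idempotent, re-enqueuing a partially processed batch after a crash is safe.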

Message Queue

Purpose: Decouple ingestion from processing, enable scaling, ensure durability.

Technology: Kafka or RabbitMQ

Topics/Queues:

  • blocks: New blocks to process
  • transactions: Transactions to decode
  • traces: Traces to process (async)

Configuration:

  • Durability: Persistent storage
  • Replication: 3 replicas for high availability
  • Partitioning: Keyed by chain_id so that all of a chain's blocks land in the same partition and retain block-number order (queue systems like Kafka only guarantee ordering within a single partition)

Transaction Processing Flow

Block Processing

Steps:

  1. Validate Block: Verify block hash, parent hash, block number
  2. Extract Transactions: Get transaction list from block
  3. Fetch Receipts: Get transaction receipts for all transactions
  4. Process Each Transaction:
    • Store transaction data
    • Process receipt (logs, status)
    • Extract token transfers (ERC-20/721/1155)
    • Link to contract interactions

Data Extracted:

  • Transaction fields (hash, from, to, value, gas, etc.)
  • Receipt fields (status, gasUsed, logs, etc.)
  • Contract creation detection
  • Token transfer events

Transaction Decoding

Purpose: Decode event logs and transaction data using ABIs.

Process:

  1. Identify contract address (to field or created address)
  2. Look up ABI in registry (verified contracts)
  3. Decode function calls and events
  4. Store decoded data for search and filtering

Fallback Strategies:

  • Signature database for unknown functions/events (4-byte signatures)
  • Heuristic detection for common patterns (Transfer events)
  • Store raw data when decoding fails
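
The fallback chain above can be sketched as a single lookup function. The dict-shaped `abi_registry` and `sig_db` are stand-ins for the real registry and signature database described below; the actual parameter decoding is omitted.

```python
def decode_log(log, abi_registry, sig_db):
    """Try full ABI decode, then signature-only partial decode, else keep the raw log."""
    topic0 = log["topics"][0]
    # 1. Full decode: verified ABI known for this contract + event
    abi_event = abi_registry.get((log["chain_id"], log["address"], topic0))
    if abi_event is not None:
        return {"status": "decoded", "event": abi_event["name"], "log": log}
    # 2. Partial decode: event name known from the signature database
    name = sig_db.get(topic0)
    if name is not None:
        return {"status": "partial", "event": name, "log": log}
    # 3. Fallback: store raw, undecoded
    return {"status": "raw", "event": None, "log": log}
```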

ABI Registry

Purpose: Store contract ABIs for decoding transactions and events.

Data Sources:

  • Contract verification submissions
  • Sourcify integration
  • Public ABI repositories (4byte.directory, etc.)

Storage:

  • Database table: contract_abis
  • Cache layer: Redis for frequently accessed ABIs
  • Versioning: Support multiple ABI versions per contract

Schema:

contract_abis (
    id UUID PRIMARY KEY,
    chain_id INTEGER NOT NULL,
    address VARCHAR(42) NOT NULL,
    abi JSONB NOT NULL,
    verified BOOLEAN DEFAULT false,
    source VARCHAR(50), -- 'verification', 'sourcify', 'public'
    created_at TIMESTAMP,
    updated_at TIMESTAMP,
    UNIQUE(chain_id, address)
)

Signature Database

Purpose: Map 4-byte function signatures and 32-byte event signatures to function/event names.

Data Sources:

  • Public signature databases (4byte.directory)
  • User submissions
  • Automatic extraction from verified contracts

Usage:

  • Lookup function name from selector (e.g., 0x095ea7b3 → approve(address,uint256))
  • Lookup event name from topic[0] (e.g., 0xddf252... → Transfer(address,address,uint256))
  • Partial decoding when full ABI unavailable
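
Selector lookup reduces to slicing the first 4 bytes of calldata and querying the signature database (shown here as a plain dict stand-in):

```python
def lookup_selector(calldata_hex, sig_db):
    """The first 4 bytes of calldata ("0x" + 8 hex chars) are the function selector."""
    selector = calldata_hex[:10]
    return sig_db.get(selector, None)
```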

Event Log Indexing

Log Processing

Purpose: Index event logs for efficient querying and filtering.

Process:

  1. Extract logs from transaction receipts
  2. Decode log topics and data using ABI
  3. Index by:
    • Contract address
    • Event signature (topic[0])
    • Indexed parameters (topic[1..3])
    • Block number and transaction hash
    • Log index

Indexing Strategy:

  • PostgreSQL table: logs with indexes on (address, topic0, block_number)
  • Elasticsearch index: Full-text search on decoded event data
  • Time-series: Aggregate log counts per contract/event

Event Decoding

Decoding Flow:

  1. Identify event signature from topic[0]
  2. Look up event definition in ABI registry
  3. Decode indexed parameters (topics 1-3)
  4. Decode non-indexed parameters (data field)
  5. Store decoded parameters as JSONB
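
As a concrete instance of this flow, an ERC-20 Transfer log can be decoded with plain hex parsing: the two indexed parameters arrive right-aligned in 32-byte topics, and the non-indexed value sits in the data field. The log shape assumed here matches standard JSON-RPC receipt logs.

```python
# keccak256("Transfer(address,address,uint256)") -- the ERC-20 Transfer topic0
TRANSFER_TOPIC = "0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef"

def decode_erc20_transfer(log):
    """Decode an ERC-20 Transfer log; returns None for non-matching logs."""
    if log["topics"][0] != TRANSFER_TOPIC or len(log["topics"]) != 3:
        return None  # ERC-721 shares the signature but indexes tokenId (4 topics)
    # Addresses are right-aligned in 32-byte topics: keep the last 40 hex chars
    frm = "0x" + log["topics"][1][-40:]
    to = "0x" + log["topics"][2][-40:]
    value = int(log["data"], 16)  # non-indexed uint256 amount
    return {"token": log["address"], "from": frm, "to": to, "value": value}
```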

Common Events to Index:

  • ERC-20: Transfer(address,address,uint256)
  • ERC-721: Transfer(address,address,uint256)
  • ERC-1155: TransferSingle, TransferBatch
  • Approval events: Approval(address,address,uint256)

Trace Processing

Call Trace Extraction

Purpose: Extract detailed call traces for transaction debugging and internal transaction tracking.

Trace Types:

  • call: Contract calls
  • create: Contract creation
  • suicide: Contract self-destruct
  • delegatecall: Delegate calls

Process:

  1. Request trace via trace_transaction or trace_block
  2. Parse trace result structure
  3. Extract:
    • Call hierarchy (parent-child relationships)
    • Internal transactions (value transfers)
    • Gas usage per call
    • Revert information
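
Extracting the call hierarchy amounts to a depth-first walk over the nested trace result, emitting one row per call tagged with its trace_address path. The nested `calls` shape assumed here follows Geth's callTracer-style output; field names may differ for other tracers.

```python
def flatten_trace(call, trace_address=()):
    """Depth-first walk producing one flat row per call with its trace_address path."""
    rows = [{
        "trace_address": list(trace_address),   # position in the call tree
        "type": call.get("type", "call"),
        "from": call.get("from"),
        "to": call.get("to"),
        "value": int(call.get("value", "0x0"), 16),
        "error": call.get("error"),             # revert information, if any
    }]
    for i, child in enumerate(call.get("calls", [])):
        rows.extend(flatten_trace(child, trace_address + (i,)))
    return rows
```

The resulting rows map directly onto the internal_transactions table described below, with trace_address linking each row to its parent call.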

Internal Transaction Tracking

Purpose: Track value transfers that occur inside transactions (not just top-level).

Data Extracted:

  • From address (caller)
  • To address (callee)
  • Value transferred
  • Call type (call, delegatecall, etc.)
  • Success/failure status
  • Gas used

Storage:

  • Separate table: internal_transactions
  • Link to parent transaction via transaction_hash
  • Link to parent call via trace_address array

Token Transfer Extraction

ERC-20 Transfer Detection

Detection Method:

  1. Look for Transfer(address,address,uint256) event
  2. Decode event parameters (from, to, value)
  3. Store in token_transfers table
  4. Update token holder balances

Data Stored:

  • Token contract address
  • From address
  • To address
  • Amount (with decimals)
  • Block number
  • Transaction hash
  • Log index

ERC-721 Transfer Detection

Similar to ERC-20 but:

  • Token ID is tracked (each NFT is unique)
  • The event signature is identical to ERC-20's; the two are distinguished by topic count (ERC-721 indexes the token ID, so its Transfer log has four topics instead of three)
  • Transfer can be from the zero address (mint) or to the zero address (burn)

ERC-1155 Transfer Detection

Events:

  • TransferSingle: Single token transfer
  • TransferBatch: Batch token transfer

Challenges:

  • Multiple token IDs and amounts per transfer
  • Batch operations require array decoding

Token Holder Tracking

Purpose: Maintain list of addresses holding each token.

Strategy:

  • Real-time updates: Update on each transfer
  • Periodic reconciliation: Verify balances via RPC
  • Balance snapshots: Store balance at each block (for historical queries)
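
The real-time update path can be sketched as a pure balance-delta function. The in-memory dict keyed by (token, holder) is an illustration stand-in for the holder-balances table; reconciliation and snapshots are separate concerns.

```python
ZERO_ADDRESS = "0x" + "0" * 40

def apply_transfer(balances, transfer):
    """Apply one token transfer to holder balances; zero address means mint/burn."""
    if transfer["from"] != ZERO_ADDRESS:  # not a mint: debit sender
        key = (transfer["token"], transfer["from"])
        balances[key] = balances.get(key, 0) - transfer["value"]
    if transfer["to"] != ZERO_ADDRESS:    # not a burn: credit receiver
        key = (transfer["token"], transfer["to"])
        balances[key] = balances.get(key, 0) + transfer["value"]
    return balances
```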

Indexer Worker Scaling and Partitioning

Horizontal Scaling

Strategy: Multiple indexer workers processing different blocks/chains.

Partitioning Methods:

  1. By Chain: Each worker handles one chain
  2. By Block Range: Workers split block ranges (for backfill)
  3. By Processing Stage: Separate workers for blocks, traces, token transfers

Worker Coordination

Mechanisms:

  • Message queue: Workers consume from shared queue
  • Database locks: Prevent duplicate processing
  • Leader election: For single-worker tasks (reorg handling)

Load Balancing

Distribution:

  • Round-robin for backfill workers
  • Sticky sessions for chain-specific workers
  • Priority queuing: Real-time blocks before historical blocks

Performance Targets

Throughput:

  • Process 100 blocks/minute per worker
  • Process 1000 transactions/minute per worker
  • Process 100 traces/minute per worker (trace operations are slower)

Latency:

  • Real-time blocks: Indexed within 5 seconds of block production
  • Historical blocks: Backfill throughput must exceed the chain's block production rate so the indexer converges on the chain head rather than falling further behind

Data Consistency

Transaction Isolation

Strategy: Process blocks atomically (all or nothing).

Implementation:

  • Database transactions for block-level operations
  • Idempotent processing (can safely retry)
  • Checkpoint system to track last processed block

Idempotency

Requirements:

  • Processing same block multiple times should not create duplicates
  • Use unique constraints in database
  • Upsert operations where applicable

Error Handling and Retry Logic

Error Types

  1. Transient Errors: Network issues, temporary RPC failures

    • Retry with exponential backoff
    • Max retries: 10
    • Max backoff: 5 minutes
  2. Permanent Errors: Invalid data, unsupported features

    • Log error and skip
    • Alert for investigation
  3. Reorg Errors: Block replaced by different block

    • Handle via reorg detection (see reorg handling spec)

Retry Strategy

Exponential Backoff:

  • Initial delay: 1 second
  • Multiplier: 2x
  • Max delay: 5 minutes
  • Jitter: Random ±20% to avoid thundering herd
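
The parameters above yield the following delay schedule (shown here as a pure function returning the delays rather than sleeping, with an injectable random source for testability):

```python
import random

def backoff_delays(attempts, base=1.0, multiplier=2.0, cap=300.0,
                   jitter=0.2, rng=random):
    """Exponential backoff schedule: base * multiplier**n, capped, with ±jitter randomization."""
    delays = []
    for n in range(attempts):
        d = min(base * (multiplier ** n), cap)       # cap at 5 minutes
        d *= 1 + rng.uniform(-jitter, jitter)        # ±20% jitter vs. thundering herd
        delays.append(d)
    return delays
```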

Monitoring and Observability

Key Metrics

Throughput:

  • Blocks processed per minute
  • Transactions processed per minute
  • Logs indexed per minute

Latency:

  • Time from block production to index completion
  • Time to process block (p50, p95, p99)

Lag:

  • Block height lag (current block - last indexed block)
  • Time lag (current time - last indexed block time)

Errors:

  • Error rate by type
  • Retry count
  • Failed blocks

Alerting Rules

  • Block lag > 10 blocks: Warning
  • Block lag > 100 blocks: Critical
  • Error rate > 1%: Warning
  • Error rate > 5%: Critical
  • Worker down: Critical
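
The rules above translate directly into an evaluation function. This is a sketch of the rule logic only; in practice these thresholds would live in the alerting system's configuration.

```python
def classify_alerts(block_lag, error_rate, workers_up):
    """Map raw metrics to alert levels per the thresholds above (error_rate as a fraction)."""
    alerts = []
    if block_lag > 100:
        alerts.append(("block_lag", "critical"))
    elif block_lag > 10:
        alerts.append(("block_lag", "warning"))
    if error_rate > 0.05:
        alerts.append(("error_rate", "critical"))
    elif error_rate > 0.01:
        alerts.append(("error_rate", "warning"))
    if not workers_up:
        alerts.append(("worker_down", "critical"))
    return alerts
```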

Integration Points

RPC Node Integration

  • See ../infrastructure/node-rpc-architecture.md
  • Connection pooling
  • Rate limiting awareness
  • Failover handling

Database Integration

  • See ../database/postgres-schema.md
  • Connection pooling
  • Batch inserts for performance
  • Transaction management

Search Integration

  • See ../database/search-index-schema.md
  • Async indexing to Elasticsearch
  • Bulk indexing for efficiency

Implementation Guidelines

Technology Stack

Recommended:

  • Language: Go, Rust, or Python (performance considerations)
  • Queue: Kafka (high throughput) or RabbitMQ (simpler setup)
  • Database: PostgreSQL with connection pooling
  • Caching: Redis for frequently accessed data

Code Structure

indexer/
├── cmd/
│   ├── block-listener/      # Real-time block listener
│   ├── backfill-worker/     # Historical indexing worker
│   └── processor/           # Block/transaction processor
├── internal/
│   ├── ingestion/           # Ingestion logic
│   ├── processing/          # Processing logic
│   ├── decoding/            # ABI/signature decoding
│   └── persistence/         # Database operations
└── pkg/
    ├── abi/                 # ABI registry
    └── rpc/                 # RPC client

Testing Strategy

Unit Tests:

  • Decoding logic
  • Data transformation
  • Error handling

Integration Tests:

  • End-to-end block processing
  • Database operations
  • Queue integration

Load Tests:

  • Process historical blocks
  • Simulate high block production rate
  • Test worker scaling

References

  • Data Models: See data-models.md
  • Reorg Handling: See reorg-handling.md
  • Database Schema: See ../database/postgres-schema.md
  • RPC Architecture: See ../infrastructure/node-rpc-architecture.md