# Data Lake Schema Specification
## Overview
This document specifies the data lake schema for long-term storage of blockchain data in S3-compatible object storage, using the Parquet format for analytics, ML, and compliance workloads.
## Storage Structure
### Directory Layout
```
s3://explorer-data-lake/
├── raw/
│   ├── chain_id=138/
│   │   ├── year=2024/
│   │   │   ├── month=01/
│   │   │   │   ├── day=01/
│   │   │   │   │   ├── blocks.parquet
│   │   │   │   │   ├── transactions.parquet
│   │   │   │   │   └── logs.parquet
│   │   │   │   └── ...
│   │   │   └── ...
│   │   └── ...
│   └── ...
├── processed/
│   ├── chain_id=138/
│   │   ├── daily_aggregates/
│   │   │   ├── year=2024/
│   │   │   │   └── month=01/
│   │   │   │       └── day=01.parquet
│   │   └── ...
│   └── ...
└── archived/
    └── ...
```
### Partitioning Strategy
**Partition Keys**:
- `chain_id`: Chain identifier
- `year`: Year (YYYY)
- `month`: Month (MM)
- `day`: Day (DD)

**Benefits**:
- Efficient query pruning
- Parallel processing
- Easy data management (delete by partition)
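
As a small sketch (function and argument names are illustrative), the object-key prefix for one chain/day partition can be derived directly from these keys, with months and days zero-padded to match the directory layout:

```python
from datetime import date

def partition_prefix(chain_id: int, d: date, layer: str = "raw") -> str:
    """Build the S3 key prefix for one chain/day partition.

    Zero-padding month and day (month=01, day=01) keeps lexicographic
    listing in date order and matches the layout above.
    """
    return (
        f"{layer}/chain_id={chain_id}/"
        f"year={d.year}/month={d.month:02d}/day={d.day:02d}/"
    )

# Example: the prefix holding blocks/transactions/logs for 2024-01-01
print(partition_prefix(138, date(2024, 1, 1)))
# raw/chain_id=138/year=2024/month=01/day=01/
```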
## Parquet Schema
### Blocks Parquet Schema
```json
{
  "type": "struct",
  "fields": [
    {"name": "chain_id", "type": "integer", "nullable": false},
    {"name": "number", "type": "long", "nullable": false},
    {"name": "hash", "type": "string", "nullable": false},
    {"name": "parent_hash", "type": "string", "nullable": false},
    {"name": "timestamp", "type": "timestamp", "nullable": false},
    {"name": "miner", "type": "string", "nullable": true},
    {"name": "gas_used", "type": "long", "nullable": true},
    {"name": "gas_limit", "type": "long", "nullable": true},
    {"name": "transaction_count", "type": "integer", "nullable": true},
    {"name": "size", "type": "integer", "nullable": true}
  ]
}
```
### Transactions Parquet Schema
```json
{
  "type": "struct",
  "fields": [
    {"name": "chain_id", "type": "integer", "nullable": false},
    {"name": "hash", "type": "string", "nullable": false},
    {"name": "block_number", "type": "long", "nullable": false},
    {"name": "transaction_index", "type": "integer", "nullable": false},
    {"name": "from_address", "type": "string", "nullable": false},
    {"name": "to_address", "type": "string", "nullable": true},
    {"name": "value", "type": "string", "nullable": false},
    {"name": "gas_price", "type": "long", "nullable": true},
    {"name": "gas_used", "type": "long", "nullable": true},
    {"name": "gas_limit", "type": "long", "nullable": false},
    {"name": "status", "type": "integer", "nullable": true},
    {"name": "timestamp", "type": "timestamp", "nullable": false}
  ]
}
```
`value` is stored as a decimal string because uint256 amounts exceed 64-bit integer range.
### Logs Parquet Schema
```json
{
  "type": "struct",
  "fields": [
    {"name": "chain_id", "type": "integer", "nullable": false},
    {"name": "transaction_hash", "type": "string", "nullable": false},
    {"name": "block_number", "type": "long", "nullable": false},
    {"name": "log_index", "type": "integer", "nullable": false},
    {"name": "address", "type": "string", "nullable": false},
    {"name": "topic0", "type": "string", "nullable": true},
    {"name": "topic1", "type": "string", "nullable": true},
    {"name": "topic2", "type": "string", "nullable": true},
    {"name": "topic3", "type": "string", "nullable": true},
    {"name": "data", "type": "string", "nullable": true},
    {"name": "timestamp", "type": "timestamp", "nullable": false}
  ]
}
```
### Token Transfers Parquet Schema
```json
{
  "type": "struct",
  "fields": [
    {"name": "chain_id", "type": "integer", "nullable": false},
    {"name": "transaction_hash", "type": "string", "nullable": false},
    {"name": "block_number", "type": "long", "nullable": false},
    {"name": "token_address", "type": "string", "nullable": false},
    {"name": "token_type", "type": "string", "nullable": false},
    {"name": "from_address", "type": "string", "nullable": false},
    {"name": "to_address", "type": "string", "nullable": false},
    {"name": "amount", "type": "string", "nullable": true},
    {"name": "token_id", "type": "string", "nullable": true},
    {"name": "timestamp", "type": "timestamp", "nullable": false}
  ]
}
```
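
A minimal sanity check of records against these nullability constraints can be sketched in pure Python (the mapping below is hand-derived from the blocks schema above; real pipelines rely on the Parquet writer, e.g. pyarrow, to enforce this at write time):

```python
# Field name -> nullable, derived from the blocks struct schema.
BLOCKS_SCHEMA = {
    "chain_id": False, "number": False, "hash": False, "parent_hash": False,
    "timestamp": False, "miner": True, "gas_used": True, "gas_limit": True,
    "transaction_count": True, "size": True,
}

def validate_record(record: dict, schema: dict) -> list[str]:
    """Return violations: unknown fields, or missing/None non-nullable fields."""
    errors = [f"unknown field: {k}" for k in record if k not in schema]
    for field, nullable in schema.items():
        if record.get(field) is None and not nullable:
            errors.append(f"non-nullable field missing or null: {field}")
    return errors

# A record missing required fields produces a non-empty error list.
print(validate_record({"chain_id": 138, "number": 1}, BLOCKS_SCHEMA))
```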
## Data Ingestion
### ETL Pipeline
**Process**:
1. Extract: Query PostgreSQL for daily data
2. Transform: Convert to Parquet format
3. Load: Upload to S3 with partitioning

**Schedule**: Daily batch job after the day ends
**Tools**: Apache Spark, AWS Glue, or custom ETL scripts
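
The three steps can be sketched as a runnable skeleton (the extract/transform/load bodies are stubs; a real job would query PostgreSQL, write Snappy-compressed Parquet with pyarrow, and upload with boto3, as the comments note):

```python
import json
from datetime import date

def extract(day: date) -> list[dict]:
    """Stub: in production, query PostgreSQL for all rows of the given day."""
    return [{"chain_id": 138, "number": 1, "hash": "0xabc"}]

def transform(rows: list[dict]) -> bytes:
    """Stub: in production, encode rows as a Snappy-compressed Parquet file
    (e.g. pyarrow); here we just serialize to JSON bytes."""
    return json.dumps(rows).encode()

def load(payload: bytes, key: str) -> str:
    """Stub: in production, upload to S3 (e.g. boto3 put_object)."""
    return key  # pretend the upload succeeded and report the key

def run_daily_job(chain_id: int, day: date) -> str:
    """Run extract -> transform -> load for one chain/day partition."""
    key = (f"raw/chain_id={chain_id}/year={day.year}/"
           f"month={day.month:02d}/day={day.day:02d}/blocks.parquet")
    return load(transform(extract(day)), key)

print(run_daily_job(138, date(2024, 1, 1)))
```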
### Compression
**Format**: Snappy compression (good balance of speed and compression ratio)
**Alternative**: Gzip (better compression, slower)
### File Sizing
**Target Size**: 100-500 MB per Parquet file
- Smaller files: Better parallelism
- Larger files: Better compression

**Strategy**: Write files of target size, or split by time ranges
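
Assuming a 256 MB midpoint within the 100-500 MB target, planning the per-day file count can be sketched as:

```python
import math

def plan_file_count(day_bytes: int, target_bytes: int = 256 * 1024**2) -> int:
    """Estimate how many Parquet files to write for one day so each file
    lands near the target size. Always writes at least one file."""
    return max(1, math.ceil(day_bytes / target_bytes))

print(plan_file_count(3 * 1024**3))  # a 3 GiB day -> 12 files of ~256 MB
```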
## Query Interface
### AWS Athena / Presto
**Table Definition**:
```sql
CREATE EXTERNAL TABLE blocks_138 (
  chain_id int,
  number bigint,
  hash string,
  parent_hash string,
  `timestamp` timestamp,
  miner string,
  gas_used bigint,
  gas_limit bigint,
  transaction_count int,
  size int
)
PARTITIONED BY (year int, month int, day int)
STORED AS PARQUET
LOCATION 's3://explorer-data-lake/raw/chain_id=138/'
TBLPROPERTIES (
  'projection.enabled' = 'true',
  'projection.year.type' = 'integer',
  'projection.year.range' = '2020,2030',
  'projection.month.type' = 'integer',
  'projection.month.range' = '1,12',
  'projection.month.digits' = '2',
  'projection.day.type' = 'integer',
  'projection.day.range' = '1,31',
  'projection.day.digits' = '2'
);
```
Partition projection only applies to columns declared in `PARTITIONED BY`, and the `digits` properties are needed because the layout zero-pads month and day (`month=01`). Note also that the layout above stores `blocks.parquet`, `transactions.parquet`, and `logs.parquet` under the same leaf directories; per-type external tables assume each type gets its own prefix (e.g. a `table=blocks` level), which should be decided explicitly.
### Query Examples
**Daily Transaction Count**:
```sql
SELECT
  DATE(timestamp) AS date,
  COUNT(*) AS transaction_count
FROM transactions_138
WHERE year = 2024 AND month = 1
GROUP BY DATE(timestamp)
ORDER BY date;
```
**Token Transfer Analytics**:
```sql
SELECT
  token_address,
  COUNT(*) AS transfer_count,
  -- Athena/Presto DECIMAL caps at 38 digits of precision, so uint256
  -- amounts (up to 78 digits) are summed as DOUBLE; totals are approximate.
  SUM(CAST(amount AS DOUBLE)) AS total_volume
FROM token_transfers_138
WHERE year = 2024 AND month = 1
GROUP BY token_address
ORDER BY total_volume DESC
LIMIT 100;
```
## Data Retention
### Retention Policies
**Raw Data**: 7 years (compliance requirement)
**Processed Aggregates**: Indefinite
**Archived Data**: Move to Glacier after 1 year
### Lifecycle Policies
**S3 Lifecycle Rules**:
1. Move to Infrequent Access after 30 days
2. Move to Glacier after 1 year
3. Delete after 7 years (raw data)
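
The three rules above can be expressed as a single lifecycle configuration; the dict below follows the shape accepted by boto3's `put_bucket_lifecycle_configuration` (the bucket name and `raw/` prefix are taken from this spec; 2555 days = 7 x 365):

```python
# IA at 30 days, Glacier at 1 year, expire at 7 years, scoped to raw/.
LIFECYCLE_CONFIG = {
    "Rules": [
        {
            "ID": "raw-data-tiering",
            "Filter": {"Prefix": "raw/"},
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 365, "StorageClass": "GLACIER"},
            ],
            "Expiration": {"Days": 2555},
        }
    ]
}

# Applying it would look like (requires AWS credentials):
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="explorer-data-lake", LifecycleConfiguration=LIFECYCLE_CONFIG)
print(LIFECYCLE_CONFIG["Rules"][0]["Expiration"]["Days"])
```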
## Data Processing
### Aggregation Jobs
**Daily Aggregates**:
- Transaction counts by hour
- Gas usage statistics
- Token transfer volumes
- Address activity metrics
**Monthly Aggregates**:
- Network growth metrics
- Token distribution changes
- Protocol usage statistics
### ML/Analytics Workflows
**Use Cases**:
- Anomaly detection
- Fraud detection
- Market analysis
- Network health monitoring
**Tools**: Spark, Pandas, Jupyter notebooks
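
As one illustrative anomaly-detection pass over daily transaction counts (a pure-Python z-score sketch; production jobs would run the equivalent in Spark or Pandas, and the threshold is an assumption):

```python
from statistics import mean, stdev

def flag_anomalies(daily_counts: list[int], z_threshold: float = 2.0) -> list[int]:
    """Return indices of days whose count deviates more than z_threshold
    standard deviations from the mean. Empty if counts are constant."""
    mu, sigma = mean(daily_counts), stdev(daily_counts)
    if sigma == 0:
        return []
    return [i for i, c in enumerate(daily_counts)
            if abs(c - mu) / sigma > z_threshold]

counts = [1000, 1020, 980, 1010, 995, 5000, 1005]
print(flag_anomalies(counts))  # -> [5]: the 5000-count day is flagged
```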
## Security and Access Control
### Access Control
**IAM Policies**: Restrict access to specific prefixes
**Encryption**: Server-side encryption (SSE-S3 or SSE-KMS)
**Audit Logging**: Enable S3 access logging
### Data Classification
**Public Data**: Blocks, transactions (public blockchain data)
**Sensitive Data**: User addresses, labels (requires authentication)
**Compliance Data**: Banking/transaction data (strict access control)
## Cost Optimization
### Storage Optimization
**Strategies**:
- Use appropriate storage classes (Standard, IA, Glacier)
- Compress data (Parquet + Snappy)
- Delete old data per retention policy
- Use intelligent tiering
### Query Optimization
**Strategies**:
- Partition pruning (query only relevant partitions)
- Column pruning (select only needed columns)
- Predicate pushdown (filter early)
## References
- Database Schema: See `postgres-schema.md`
- Analytics: See `../observability/metrics-monitoring.md`