# Data Lake Schema Specification

## Overview

This document specifies the data lake schema for long-term storage of blockchain data in S3-compatible object storage, using the Parquet format for analytics, ML, and compliance purposes.

## Storage Structure

### Directory Layout

```
s3://explorer-data-lake/
├── raw/
│   ├── chain_id=138/
│   │   ├── year=2024/
│   │   │   ├── month=01/
│   │   │   │   ├── day=01/
│   │   │   │   │   ├── blocks.parquet
│   │   │   │   │   ├── transactions.parquet
│   │   │   │   │   └── logs.parquet
│   │   │   │   └── ...
│   │   │   └── ...
│   │   └── ...
│   └── ...
├── processed/
│   ├── chain_id=138/
│   │   ├── daily_aggregates/
│   │   │   ├── year=2024/
│   │   │   │   └── month=01/
│   │   │   │       └── day=01.parquet
│   │   └── ...
│   └── ...
└── archived/
    └── ...
```

### Partitioning Strategy

**Partition Keys**:

- `chain_id`: Chain identifier
- `year`: Year (YYYY)
- `month`: Month (MM)
- `day`: Day (DD)

**Benefits**:

- Efficient query pruning
- Parallel processing
- Easy data management (delete by partition)

## Parquet Schema

### Blocks Parquet Schema

```json
{
  "type": "struct",
  "fields": [
    {"name": "chain_id", "type": "integer", "nullable": false},
    {"name": "number", "type": "long", "nullable": false},
    {"name": "hash", "type": "string", "nullable": false},
    {"name": "parent_hash", "type": "string", "nullable": false},
    {"name": "timestamp", "type": "timestamp", "nullable": false},
    {"name": "miner", "type": "string", "nullable": true},
    {"name": "gas_used", "type": "long", "nullable": true},
    {"name": "gas_limit", "type": "long", "nullable": true},
    {"name": "transaction_count", "type": "integer", "nullable": true},
    {"name": "size", "type": "integer", "nullable": true}
  ]
}
```

### Transactions Parquet Schema

```json
{
  "type": "struct",
  "fields": [
    {"name": "chain_id", "type": "integer", "nullable": false},
    {"name": "hash", "type": "string", "nullable": false},
    {"name": "block_number", "type": "long", "nullable": false},
    {"name": "transaction_index", "type": "integer", "nullable": false},
    {"name": "from_address", "type": "string", "nullable": false},
    {"name": "to_address", "type": "string", "nullable": true},
    {"name": "value", "type": "string", "nullable": false},
    {"name": "gas_price", "type": "long", "nullable": true},
    {"name": "gas_used", "type": "long", "nullable": true},
    {"name": "gas_limit", "type": "long", "nullable": false},
    {"name": "status", "type": "integer", "nullable": true},
    {"name": "timestamp", "type": "timestamp", "nullable": false}
  ]
}
```

Note: `value` is stored as a decimal string because uint256 amounts can exceed the 64-bit integer range; the `amount` field in the token transfers schema is handled the same way.

### Logs Parquet Schema

```json
{
  "type": "struct",
  "fields": [
    {"name": "chain_id", "type": "integer", "nullable": false},
    {"name": "transaction_hash", "type": "string", "nullable": false},
    {"name": "block_number", "type": "long", "nullable": false},
    {"name": "log_index", "type": "integer", "nullable": false},
    {"name": "address", "type": "string", "nullable": false},
    {"name": "topic0", "type": "string", "nullable": true},
    {"name": "topic1", "type": "string", "nullable": true},
    {"name": "topic2", "type": "string", "nullable": true},
    {"name": "topic3", "type": "string", "nullable": true},
    {"name": "data", "type": "string", "nullable": true},
    {"name": "timestamp", "type": "timestamp", "nullable": false}
  ]
}
```

### Token Transfers Parquet Schema

```json
{
  "type": "struct",
  "fields": [
    {"name": "chain_id", "type": "integer", "nullable": false},
    {"name": "transaction_hash", "type": "string", "nullable": false},
    {"name": "block_number", "type": "long", "nullable": false},
    {"name": "token_address", "type": "string", "nullable": false},
    {"name": "token_type", "type": "string", "nullable": false},
    {"name": "from_address", "type": "string", "nullable": false},
    {"name": "to_address", "type": "string", "nullable": false},
    {"name": "amount", "type": "string", "nullable": true},
    {"name": "token_id", "type": "string", "nullable": true},
    {"name": "timestamp", "type": "timestamp", "nullable": false}
  ]
}
```

## Data Ingestion

### ETL Pipeline

**Process**:

1. Extract: Query PostgreSQL for the day's data
2. Transform: Convert to Parquet format
3. Load: Upload to S3 with partitioning

**Schedule**: Daily batch job after the day ends

**Tools**: Apache Spark, AWS Glue, or custom ETL scripts

### Compression

**Format**: Snappy (good balance of speed and compression ratio)

**Alternative**: Gzip (better compression ratio, slower)

### File Sizing

**Target Size**: 100-500 MB per Parquet file

- Smaller files: better parallelism
- Larger files: better compression

**Strategy**: Write files at the target size, or split by time range

## Query Interface

### AWS Athena / Presto

**Table Definition**:

```sql
CREATE EXTERNAL TABLE blocks_138 (
  chain_id int,
  number bigint,
  hash string,
  parent_hash string,
  `timestamp` timestamp,
  miner string,
  gas_used bigint,
  gas_limit bigint,
  transaction_count int,
  size int
)
PARTITIONED BY (year int, month int, day int)
STORED AS PARQUET
LOCATION 's3://explorer-data-lake/raw/chain_id=138/'
TBLPROPERTIES (
  'projection.enabled' = 'true',
  'projection.year.type' = 'integer',
  'projection.year.range' = '2020,2030',
  'projection.month.type' = 'integer',
  'projection.month.range' = '1,12',
  'projection.day.type' = 'integer',
  'projection.day.range' = '1,31'
);
```

Note: partition projection requires the partition columns to be declared in `PARTITIONED BY`, and `timestamp` must be backtick-escaped because it is a reserved word in Athena DDL.

### Query Examples

**Daily Transaction Count**:

```sql
SELECT DATE(`timestamp`) AS date,
       COUNT(*) AS transaction_count
FROM transactions_138
WHERE year = 2024 AND month = 1
GROUP BY DATE(`timestamp`)
ORDER BY date;
```

**Token Transfer Analytics**:

```sql
SELECT token_address,
       COUNT(*) AS transfer_count,
       -- Presto/Athena DECIMAL precision is capped at 38 digits; sums of
       -- full uint256 amounts can overflow and may need post-processing.
       SUM(CAST(amount AS DECIMAL(38, 0))) AS total_volume
FROM token_transfers_138
WHERE year = 2024 AND month = 1
GROUP BY token_address
ORDER BY total_volume DESC
LIMIT 100;
```

## Data Retention

### Retention Policies

- **Raw Data**: 7 years (compliance requirement)
- **Processed Aggregates**: Indefinite
- **Archived Data**: Move to Glacier after 1 year

### Lifecycle Policies

**S3 Lifecycle Rules**:

1. Move to Infrequent Access after 30 days
2. Move to Glacier after 1 year
3. Delete after 7 years (raw data)

## Data Processing

### Aggregation Jobs

**Daily Aggregates**:

- Transaction counts by hour
- Gas usage statistics
- Token transfer volumes
- Address activity metrics

**Monthly Aggregates**:

- Network growth metrics
- Token distribution changes
- Protocol usage statistics

### ML/Analytics Workflows

**Use Cases**:

- Anomaly detection
- Fraud detection
- Market analysis
- Network health monitoring

**Tools**: Spark, Pandas, Jupyter notebooks

## Security and Access Control

### Access Control

- **IAM Policies**: Restrict access to specific prefixes
- **Encryption**: Server-side encryption (SSE-S3 or SSE-KMS)
- **Audit Logging**: Enable S3 access logging

### Data Classification

- **Public Data**: Blocks, transactions (public blockchain data)
- **Sensitive Data**: User addresses, labels (requires authentication)
- **Compliance Data**: Banking/transaction data (strict access control)

## Cost Optimization

### Storage Optimization

**Strategies**:

- Use appropriate storage classes (Standard, IA, Glacier)
- Compress data (Parquet + Snappy)
- Delete old data per the retention policy
- Use intelligent tiering

### Query Optimization

**Strategies**:

- Partition pruning (query only relevant partitions)
- Column pruning (select only needed columns)
- Predicate pushdown (filter early)

## References

- Database Schema: See `postgres-schema.md`
- Analytics: See `../observability/metrics-monitoring.md`
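The partitioning strategy above can be sketched as a small helper that maps a chain and calendar day to the Hive-style key prefix used by the directory layout; the function name and its use in an ETL job are illustrative, not part of the spec.

```python
from datetime import date


def partition_prefix(chain_id: int, d: date, dataset: str = "raw") -> str:
    """Return the object-key prefix for one day's files, matching the
    chain_id=N/year=YYYY/month=MM/day=DD layout (month/day zero-padded)."""
    return (
        f"{dataset}/chain_id={chain_id}/"
        f"year={d.year}/month={d.month:02d}/day={d.day:02d}/"
    )


# Example: the prefix an ETL job would write blocks.parquet under.
prefix = partition_prefix(138, date(2024, 1, 1))
# → "raw/chain_id=138/year=2024/month=01/day=01/"
```

Keeping the zero-padding consistent between the writer and the Athena projection ranges is what makes partition pruning work.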
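Because `value` and `amount` are stored as decimal strings, consumers that aggregate outside the warehouse should parse them with arbitrary-precision integers rather than 64-bit types. A minimal sketch, assuming Python's unbounded `int` (the helper name and sample figures are invented for illustration):

```python
def total_volume(amounts: list[str]) -> int:
    """Sum decimal-string token amounts exactly using Python's big ints,
    avoiding the 64-bit overflow that motivates the string encoding."""
    return sum(int(a) for a in amounts)


# 10 + 5 tokens at 18 decimals: the sum (1.5e19) already exceeds the
# signed 64-bit maximum (~9.2e18), so a `long` column could not hold it.
volume = total_volume(["10000000000000000000", "5000000000000000000"])
# → 15000000000000000000
```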
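The file-sizing strategy can be sketched as greedy packing of per-batch byte estimates into files near a target size. This is a simplified sketch: the batch-size estimates and the 256 MB default are illustrative assumptions, and a real job would measure encoded Parquet output rather than raw estimates.

```python
def pack_batches(batch_sizes: list[int],
                 target_bytes: int = 256 * 1024 * 1024) -> list[list[int]]:
    """Greedily group batch byte-size estimates into files, starting a new
    file once adding the next batch would exceed the target size."""
    files: list[list[int]] = []
    current: list[int] = []
    current_size = 0
    for size in batch_sizes:
        if current and current_size + size > target_bytes:
            files.append(current)          # close the file at ~target size
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:                            # flush the final partial file
        files.append(current)
    return files
```

A single oversized batch still gets its own file; splitting within a batch (e.g. by time range, as the spec suggests) would be handled upstream.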