# Data Lake Schema Specification
## Overview
This document specifies the data lake schema for long-term storage of blockchain data in S3-compatible object storage, using the Parquet format for analytics, ML, and compliance workloads.
## Storage Structure
### Directory Layout
```
s3://explorer-data-lake/
├── raw/
│   ├── chain_id=138/
│   │   ├── year=2024/
│   │   │   ├── month=01/
│   │   │   │   ├── day=01/
│   │   │   │   │   ├── blocks.parquet
│   │   │   │   │   ├── transactions.parquet
│   │   │   │   │   └── logs.parquet
│   │   │   │   └── ...
│   │   │   └── ...
│   │   └── ...
│   └── ...
├── processed/
│   ├── chain_id=138/
│   │   ├── daily_aggregates/
│   │   │   ├── year=2024/
│   │   │   │   └── month=01/
│   │   │   │       └── day=01.parquet
│   │   └── ...
│   └── ...
└── archived/
    └── ...
```
### Partitioning Strategy
**Partition Keys**:
- `chain_id`: Chain identifier
- `year`: Year (YYYY)
- `month`: Month (MM)
- `day`: Day (DD)

**Benefits**:
- Efficient query pruning
- Parallel processing
- Easy data management (delete by partition)
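
As a small sketch (function and argument names are illustrative), the object-key prefix for one chain/day partition can be derived directly from these keys, with months and days zero-padded to match the directory layout:

```python
from datetime import date

def partition_prefix(chain_id: int, d: date, layer: str = "raw") -> str:
    """Build the S3 key prefix for one chain/day partition.

    Zero-padding month and day (month=01, day=01) keeps lexicographic
    listing in date order and matches the layout above.
    """
    return (
        f"{layer}/chain_id={chain_id}/"
        f"year={d.year}/month={d.month:02d}/day={d.day:02d}/"
    )

# Example: the prefix holding blocks/transactions/logs for 2024-01-01
print(partition_prefix(138, date(2024, 1, 1)))
# raw/chain_id=138/year=2024/month=01/day=01/
```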
## Parquet Schema
### Blocks Parquet Schema
```json
{
  "type": "struct",
  "fields": [
    {"name": "chain_id", "type": "integer", "nullable": false},
    {"name": "number", "type": "long", "nullable": false},
    {"name": "hash", "type": "string", "nullable": false},
    {"name": "parent_hash", "type": "string", "nullable": false},
    {"name": "timestamp", "type": "timestamp", "nullable": false},
    {"name": "miner", "type": "string", "nullable": true},
    {"name": "gas_used", "type": "long", "nullable": true},
    {"name": "gas_limit", "type": "long", "nullable": true},
    {"name": "transaction_count", "type": "integer", "nullable": true},
    {"name": "size", "type": "integer", "nullable": true}
  ]
}
```
### Transactions Parquet Schema
```json
{
  "type": "struct",
  "fields": [
    {"name": "chain_id", "type": "integer", "nullable": false},
    {"name": "hash", "type": "string", "nullable": false},
    {"name": "block_number", "type": "long", "nullable": false},
    {"name": "transaction_index", "type": "integer", "nullable": false},
    {"name": "from_address", "type": "string", "nullable": false},
    {"name": "to_address", "type": "string", "nullable": true},
    {"name": "value", "type": "string", "nullable": false},
    {"name": "gas_price", "type": "long", "nullable": true},
    {"name": "gas_used", "type": "long", "nullable": true},
    {"name": "gas_limit", "type": "long", "nullable": false},
    {"name": "status", "type": "integer", "nullable": true},
    {"name": "timestamp", "type": "timestamp", "nullable": false}
  ]
}
```
`value` is stored as a decimal string because uint256 amounts exceed 64-bit integer range.
### Logs Parquet Schema
```json
{
  "type": "struct",
  "fields": [
    {"name": "chain_id", "type": "integer", "nullable": false},
    {"name": "transaction_hash", "type": "string", "nullable": false},
    {"name": "block_number", "type": "long", "nullable": false},
    {"name": "log_index", "type": "integer", "nullable": false},
    {"name": "address", "type": "string", "nullable": false},
    {"name": "topic0", "type": "string", "nullable": true},
    {"name": "topic1", "type": "string", "nullable": true},
    {"name": "topic2", "type": "string", "nullable": true},
    {"name": "topic3", "type": "string", "nullable": true},
    {"name": "data", "type": "string", "nullable": true},
    {"name": "timestamp", "type": "timestamp", "nullable": false}
  ]
}
```
### Token Transfers Parquet Schema
```json
{
  "type": "struct",
  "fields": [
    {"name": "chain_id", "type": "integer", "nullable": false},
    {"name": "transaction_hash", "type": "string", "nullable": false},
    {"name": "block_number", "type": "long", "nullable": false},
    {"name": "token_address", "type": "string", "nullable": false},
    {"name": "token_type", "type": "string", "nullable": false},
    {"name": "from_address", "type": "string", "nullable": false},
    {"name": "to_address", "type": "string", "nullable": false},
    {"name": "amount", "type": "string", "nullable": true},
    {"name": "token_id", "type": "string", "nullable": true},
    {"name": "timestamp", "type": "timestamp", "nullable": false}
  ]
}
```
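
A minimal sanity check of records against these nullability constraints can be sketched in pure Python (the mapping below is hand-derived from the blocks schema above; real pipelines rely on the Parquet writer, e.g. pyarrow, to enforce this at write time):

```python
# Field name -> nullable, derived from the blocks struct schema.
BLOCKS_SCHEMA = {
    "chain_id": False, "number": False, "hash": False, "parent_hash": False,
    "timestamp": False, "miner": True, "gas_used": True, "gas_limit": True,
    "transaction_count": True, "size": True,
}

def validate_record(record: dict, schema: dict) -> list[str]:
    """Return violations: unknown fields, or missing/None non-nullable fields."""
    errors = [f"unknown field: {k}" for k in record if k not in schema]
    for field, nullable in schema.items():
        if record.get(field) is None and not nullable:
            errors.append(f"non-nullable field missing or null: {field}")
    return errors

# A record missing required fields produces a non-empty error list.
print(validate_record({"chain_id": 138, "number": 1}, BLOCKS_SCHEMA))
```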
## Data Ingestion
### ETL Pipeline
**Process**:
1. Extract: Query PostgreSQL for daily data
2. Transform: Convert to Parquet format
3. Load: Upload to S3 with partitioning

**Schedule**: Daily batch job after the day ends
**Tools**: Apache Spark, AWS Glue, or custom ETL scripts
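
The three steps can be sketched as a runnable skeleton (the extract/transform/load bodies are stubs; a real job would query PostgreSQL, write Snappy-compressed Parquet with pyarrow, and upload with boto3, as the comments note):

```python
import json
from datetime import date

def extract(day: date) -> list[dict]:
    """Stub: in production, query PostgreSQL for all rows of the given day."""
    return [{"chain_id": 138, "number": 1, "hash": "0xabc"}]

def transform(rows: list[dict]) -> bytes:
    """Stub: in production, encode rows as a Snappy-compressed Parquet file
    (e.g. pyarrow); here we just serialize to JSON bytes."""
    return json.dumps(rows).encode()

def load(payload: bytes, key: str) -> str:
    """Stub: in production, upload to S3 (e.g. boto3 put_object)."""
    return key  # pretend the upload succeeded and report the key

def run_daily_job(chain_id: int, day: date) -> str:
    """Run extract -> transform -> load for one chain/day partition."""
    key = (f"raw/chain_id={chain_id}/year={day.year}/"
           f"month={day.month:02d}/day={day.day:02d}/blocks.parquet")
    return load(transform(extract(day)), key)

print(run_daily_job(138, date(2024, 1, 1)))
```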
### Compression
**Format**: Snappy compression (good balance of speed and compression ratio)
**Alternative**: Gzip (better compression, slower)
### File Sizing
**Target Size**: 100-500 MB per Parquet file
- Smaller files: Better parallelism
- Larger files: Better compression

**Strategy**: Write files of target size, or split by time ranges
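
Assuming a 256 MB midpoint within the 100-500 MB target, planning the per-day file count can be sketched as:

```python
import math

def plan_file_count(day_bytes: int, target_bytes: int = 256 * 1024**2) -> int:
    """Estimate how many Parquet files to write for one day so each file
    lands near the target size. Always writes at least one file."""
    return max(1, math.ceil(day_bytes / target_bytes))

print(plan_file_count(3 * 1024**3))  # a 3 GiB day -> 12 files of ~256 MB
```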
## Query Interface
### AWS Athena / Presto
**Table Definition**:
```sql
CREATE EXTERNAL TABLE blocks_138 (
  chain_id int,
  number bigint,
  hash string,
  parent_hash string,
  `timestamp` timestamp,
  miner string,
  gas_used bigint,
  gas_limit bigint,
  transaction_count int,
  size int
)
PARTITIONED BY (year int, month int, day int)
STORED AS PARQUET
LOCATION 's3://explorer-data-lake/raw/chain_id=138/'
TBLPROPERTIES (
  'projection.enabled' = 'true',
  'projection.year.type' = 'integer',
  'projection.year.range' = '2020,2030',
  'projection.month.type' = 'integer',
  'projection.month.range' = '1,12',
  'projection.month.digits' = '2',
  'projection.day.type' = 'integer',
  'projection.day.range' = '1,31',
  'projection.day.digits' = '2'
);
```
Partition projection only applies to columns declared in `PARTITIONED BY`, and the `digits` properties are needed because the layout zero-pads month and day (`month=01`). Note also that the layout above stores `blocks.parquet`, `transactions.parquet`, and `logs.parquet` under the same leaf directories; per-type external tables assume each type gets its own prefix (e.g. a `table=blocks` level), which should be decided explicitly.
### Query Examples
**Daily Transaction Count**:
```sql
SELECT
  DATE(timestamp) AS date,
  COUNT(*) AS transaction_count
FROM transactions_138
WHERE year = 2024 AND month = 1
GROUP BY DATE(timestamp)
ORDER BY date;
```
**Token Transfer Analytics**:
```sql
SELECT
  token_address,
  COUNT(*) AS transfer_count,
  -- Athena/Presto DECIMAL caps at 38 digits of precision, so uint256
  -- amounts (up to 78 digits) are summed as DOUBLE; totals are approximate.
  SUM(CAST(amount AS DOUBLE)) AS total_volume
FROM token_transfers_138
WHERE year = 2024 AND month = 1
GROUP BY token_address
ORDER BY total_volume DESC
LIMIT 100;
```
## Data Retention
### Retention Policies
**Raw Data**: 7 years (compliance requirement)
**Processed Aggregates**: Indefinite
**Archived Data**: Move to Glacier after 1 year
### Lifecycle Policies
**S3 Lifecycle Rules**:
1. Move to Infrequent Access after 30 days
2. Move to Glacier after 1 year
3. Delete after 7 years (raw data)
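
The three rules above can be expressed as a single lifecycle configuration; the dict below follows the shape accepted by boto3's `put_bucket_lifecycle_configuration` (the bucket name and `raw/` prefix are taken from this spec; 2555 days = 7 x 365):

```python
# IA at 30 days, Glacier at 1 year, expire at 7 years, scoped to raw/.
LIFECYCLE_CONFIG = {
    "Rules": [
        {
            "ID": "raw-data-tiering",
            "Filter": {"Prefix": "raw/"},
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 365, "StorageClass": "GLACIER"},
            ],
            "Expiration": {"Days": 2555},
        }
    ]
}

# Applying it would look like (requires AWS credentials):
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="explorer-data-lake", LifecycleConfiguration=LIFECYCLE_CONFIG)
print(LIFECYCLE_CONFIG["Rules"][0]["Expiration"]["Days"])
```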
## Data Processing
### Aggregation Jobs
**Daily Aggregates**:
- Transaction counts by hour
- Gas usage statistics
- Token transfer volumes
- Address activity metrics
**Monthly Aggregates**:
- Network growth metrics
- Token distribution changes
- Protocol usage statistics
### ML/Analytics Workflows
**Use Cases**:
- Anomaly detection
- Fraud detection
- Market analysis
- Network health monitoring
**Tools**: Spark, Pandas, Jupyter notebooks
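
As one illustrative anomaly-detection pass over daily transaction counts (a pure-Python z-score sketch; production jobs would run the equivalent in Spark or Pandas, and the threshold is an assumption):

```python
from statistics import mean, stdev

def flag_anomalies(daily_counts: list[int], z_threshold: float = 2.0) -> list[int]:
    """Return indices of days whose count deviates more than z_threshold
    standard deviations from the mean. Empty if counts are constant."""
    mu, sigma = mean(daily_counts), stdev(daily_counts)
    if sigma == 0:
        return []
    return [i for i, c in enumerate(daily_counts)
            if abs(c - mu) / sigma > z_threshold]

counts = [1000, 1020, 980, 1010, 995, 5000, 1005]
print(flag_anomalies(counts))  # -> [5]: the 5000-count day is flagged
```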
## Security and Access Control
### Access Control
**IAM Policies**: Restrict access to specific prefixes
**Encryption**: Server-side encryption (SSE-S3 or SSE-KMS)
**Audit Logging**: Enable S3 access logging
### Data Classification
**Public Data**: Blocks, transactions (public blockchain data)
**Sensitive Data**: User addresses, labels (requires authentication)
**Compliance Data**: Banking/transaction data (strict access control)
## Cost Optimization
### Storage Optimization
**Strategies**:
- Use appropriate storage classes (Standard, IA, Glacier)
- Compress data (Parquet + Snappy)
- Delete old data per retention policy
- Use intelligent tiering
### Query Optimization
**Strategies**:
- Partition pruning (query only relevant partitions)
- Column pruning (select only needed columns)
- Predicate pushdown (filter early)
## References
- Database Schema: See `postgres-schema.md`
- Analytics: See `../observability/metrics-monitoring.md`