# Data Lake Schema Specification

## Overview

This document specifies the data lake schema for long-term storage of blockchain data in S3-compatible object storage, using the Parquet format for analytics, ML, and compliance purposes.

## Storage Structure

### Directory Layout

```
s3://explorer-data-lake/
├── raw/
│   ├── chain_id=138/
│   │   ├── year=2024/
│   │   │   ├── month=01/
│   │   │   │   ├── day=01/
│   │   │   │   │   ├── blocks.parquet
│   │   │   │   │   ├── transactions.parquet
│   │   │   │   │   └── logs.parquet
│   │   │   │   └── ...
│   │   │   └── ...
│   │   └── ...
│   └── ...
├── processed/
│   ├── chain_id=138/
│   │   ├── daily_aggregates/
│   │   │   ├── year=2024/
│   │   │   │   └── month=01/
│   │   │   │       └── day=01.parquet
│   │   └── ...
│   └── ...
└── archived/
    └── ...
```

### Partitioning Strategy

**Partition Keys**:
- `chain_id`: Chain identifier
- `year`: Year (YYYY)
- `month`: Month (MM)
- `day`: Day (DD)

**Benefits**:
- Efficient query pruning
- Parallel processing
- Easy data management (delete by partition)
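
Under these partition keys, all of a day's objects share a deterministic Hive-style prefix. A minimal sketch of building that prefix in Python (the helper name and zero-padding convention are assumptions based on the layout above):

```python
from datetime import date

def partition_prefix(chain_id: int, d: date) -> str:
    """Build the Hive-style partition prefix used in the directory layout."""
    return (
        f"raw/chain_id={chain_id}/"
        f"year={d.year}/month={d.month:02d}/day={d.day:02d}/"
    )

# All Parquet files for chain 138 on 2024-01-01 live under this prefix.
prefix = partition_prefix(138, date(2024, 1, 1))
print(prefix)  # raw/chain_id=138/year=2024/month=01/day=01/
```

Zero-padding `month` and `day` keeps lexicographic order of keys identical to chronological order, which makes prefix listings predictable.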

## Parquet Schema

### Blocks Parquet Schema

```json
{
  "type": "struct",
  "fields": [
    {"name": "chain_id", "type": "integer", "nullable": false},
    {"name": "number", "type": "long", "nullable": false},
    {"name": "hash", "type": "string", "nullable": false},
    {"name": "parent_hash", "type": "string", "nullable": false},
    {"name": "timestamp", "type": "timestamp", "nullable": false},
    {"name": "miner", "type": "string", "nullable": true},
    {"name": "gas_used", "type": "long", "nullable": true},
    {"name": "gas_limit", "type": "long", "nullable": true},
    {"name": "transaction_count", "type": "integer", "nullable": true},
    {"name": "size", "type": "integer", "nullable": true}
  ]
}
```

### Transactions Parquet Schema

```json
{
  "type": "struct",
  "fields": [
    {"name": "chain_id", "type": "integer", "nullable": false},
    {"name": "hash", "type": "string", "nullable": false},
    {"name": "block_number", "type": "long", "nullable": false},
    {"name": "transaction_index", "type": "integer", "nullable": false},
    {"name": "from_address", "type": "string", "nullable": false},
    {"name": "to_address", "type": "string", "nullable": true},
    {"name": "value", "type": "string", "nullable": false},
    {"name": "gas_price", "type": "long", "nullable": true},
    {"name": "gas_used", "type": "long", "nullable": true},
    {"name": "gas_limit", "type": "long", "nullable": false},
    {"name": "status", "type": "integer", "nullable": true},
    {"name": "timestamp", "type": "timestamp", "nullable": false}
  ]
}
```

Note: `value` is stored as a decimal string to preserve full precision.

### Logs Parquet Schema

```json
{
  "type": "struct",
  "fields": [
    {"name": "chain_id", "type": "integer", "nullable": false},
    {"name": "transaction_hash", "type": "string", "nullable": false},
    {"name": "block_number", "type": "long", "nullable": false},
    {"name": "log_index", "type": "integer", "nullable": false},
    {"name": "address", "type": "string", "nullable": false},
    {"name": "topic0", "type": "string", "nullable": true},
    {"name": "topic1", "type": "string", "nullable": true},
    {"name": "topic2", "type": "string", "nullable": true},
    {"name": "topic3", "type": "string", "nullable": true},
    {"name": "data", "type": "string", "nullable": true},
    {"name": "timestamp", "type": "timestamp", "nullable": false}
  ]
}
```

### Token Transfers Parquet Schema

```json
{
  "type": "struct",
  "fields": [
    {"name": "chain_id", "type": "integer", "nullable": false},
    {"name": "transaction_hash", "type": "string", "nullable": false},
    {"name": "block_number", "type": "long", "nullable": false},
    {"name": "token_address", "type": "string", "nullable": false},
    {"name": "token_type", "type": "string", "nullable": false},
    {"name": "from_address", "type": "string", "nullable": false},
    {"name": "to_address", "type": "string", "nullable": false},
    {"name": "amount", "type": "string", "nullable": true},
    {"name": "token_id", "type": "string", "nullable": true},
    {"name": "timestamp", "type": "timestamp", "nullable": false}
  ]
}
```
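
A quick way to catch bad rows before they reach Parquet is to check them against the declared nullability. A minimal sketch in Python (the field list mirrors the token-transfer schema above; the function name is an assumption):

```python
# Fields declared nullable=false in the token-transfer schema.
REQUIRED_FIELDS = [
    "chain_id", "transaction_hash", "block_number",
    "token_address", "token_type", "from_address",
    "to_address", "timestamp",
]

def validate_transfer(row: dict) -> list[str]:
    """Return the non-nullable fields that are missing or None (empty = valid)."""
    return [f for f in REQUIRED_FIELDS if row.get(f) is None]

row = {
    "chain_id": 138,
    "transaction_hash": "0xabc",
    "block_number": 19000000,
    "token_address": "0xdef",
    "token_type": "ERC20",
    "from_address": "0x111",
    "to_address": None,       # violates nullable=false
    "timestamp": 1704067200,
}
print(validate_transfer(row))  # ['to_address']
```

`amount` and `token_id` are intentionally absent from the required list: `amount` is null for ERC-721 transfers and `token_id` is null for ERC-20 transfers.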

## Data Ingestion

### ETL Pipeline

**Process**:
1. Extract: Query PostgreSQL for the day's data
2. Transform: Convert to Parquet format
3. Load: Upload to S3 with partitioning

**Schedule**: Daily batch job after the day ends

**Tools**: Apache Spark, AWS Glue, or custom ETL scripts
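
The transform and load steps can be sketched as a small batch job. This is a structural sketch only: a list of dicts stands in for the PostgreSQL result set, an in-memory dict stands in for S3, and the actual Parquet serialization (e.g. via `pyarrow`) is elided; all names are illustrative.

```python
from datetime import date

def run_daily_etl(rows: list[dict], chain_id: int, d: date, bucket: dict) -> str:
    """Transform + load one day's extracted rows; returns the object key."""
    # Transform: order rows deterministically before writing.
    rows = sorted(rows, key=lambda r: (r["block_number"], r["transaction_index"]))
    # Load: write under the day's Hive-style partition key.
    key = (f"raw/chain_id={chain_id}/year={d.year}/"
           f"month={d.month:02d}/day={d.day:02d}/transactions.parquet")
    bucket[key] = rows  # real job: serialize to Parquet and upload to S3
    return key

fake_s3 = {}
rows = [
    {"block_number": 2, "transaction_index": 0, "hash": "0xb"},
    {"block_number": 1, "transaction_index": 0, "hash": "0xa"},
]
key = run_daily_etl(rows, 138, date(2024, 1, 1), fake_s3)
print(key)
```

Writing one object per table per day keeps the job idempotent: re-running it simply overwrites the same key.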

### Compression

**Format**: Snappy compression (good balance of speed and compression ratio)

**Alternative**: Gzip (better compression, slower)

### File Sizing

**Target Size**: 100-500 MB per Parquet file
- Smaller files: Better parallelism
- Larger files: Better compression

**Strategy**: Write files of the target size, or split by time ranges
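
Given an estimated average row size, the target file size translates directly into a rows-per-file chunk size. A minimal sketch (function names and the 256 MB midpoint target are assumptions):

```python
def rows_per_file(avg_row_bytes: int, target_file_mb: int = 256) -> int:
    """How many rows to buffer before rolling over to a new Parquet file."""
    return max(1, (target_file_mb * 1024 * 1024) // avg_row_bytes)

def file_count(total_rows: int, avg_row_bytes: int) -> int:
    """Number of files a dataset would produce at the target size."""
    chunk = rows_per_file(avg_row_bytes)
    return -(-total_rows // chunk)  # ceiling division

# e.g. 50M transactions at ~300 bytes/row on disk
print(file_count(50_000_000, 300))  # 56
```

In practice the average row size is best measured from already-written files, since compression ratio varies with the data.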

## Query Interface

### AWS Athena / Presto

**Table Definition**:
```sql
CREATE EXTERNAL TABLE blocks_138 (
  chain_id int,
  number bigint,
  hash string,
  parent_hash string,
  `timestamp` timestamp,
  miner string,
  gas_used bigint,
  gas_limit bigint,
  transaction_count int,
  size int
)
PARTITIONED BY (
  year int,
  month int,
  day int
)
STORED AS PARQUET
LOCATION 's3://explorer-data-lake/raw/chain_id=138/'
TBLPROPERTIES (
  'projection.enabled' = 'true',
  'projection.year.type' = 'integer',
  'projection.year.range' = '2020,2030',
  'projection.month.type' = 'integer',
  'projection.month.range' = '1,12',
  'projection.month.digits' = '2',
  'projection.day.type' = 'integer',
  'projection.day.range' = '1,31',
  'projection.day.digits' = '2'
);
```

Note: the partition columns must be declared in `PARTITIONED BY` for partition projection to apply, and the `digits` properties match the zero-padded `month=01`/`day=01` directory names.

### Query Examples

**Daily Transaction Count**:
```sql
SELECT
  DATE(timestamp) AS date,
  COUNT(*) AS transaction_count
FROM transactions_138
WHERE year = 2024 AND month = 1
GROUP BY DATE(timestamp)
ORDER BY date;
```

**Token Transfer Analytics**:
```sql
SELECT
  token_address,
  COUNT(*) AS transfer_count,
  SUM(CAST(amount AS DECIMAL(38, 0))) AS total_volume
FROM token_transfers_138
WHERE year = 2024 AND month = 1
GROUP BY token_address
ORDER BY total_volume DESC
LIMIT 100;
```

Note: Athena/Presto caps `DECIMAL` precision at 38 digits, while raw 256-bit token amounts can reach 78 digits; scale amounts down (e.g. to whole tokens) before aggregating, or accept approximate sums via `DOUBLE`.

## Data Retention

### Retention Policies

**Raw Data**: 7 years (compliance requirement)

**Processed Aggregates**: Indefinite

**Archived Data**: Move to Glacier after 1 year

### Lifecycle Policies

**S3 Lifecycle Rules**:
1. Move to Infrequent Access after 30 days
2. Move to Glacier after 1 year
3. Delete after 7 years (raw data)
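
These rules map directly onto an S3 lifecycle configuration. A sketch of the corresponding JSON for the `raw/` prefix (the rule ID and the 2,555-day ≈ 7-year expiration are illustrative):

```json
{
  "Rules": [
    {
      "ID": "raw-data-tiering",
      "Filter": {"Prefix": "raw/"},
      "Status": "Enabled",
      "Transitions": [
        {"Days": 30, "StorageClass": "STANDARD_IA"},
        {"Days": 365, "StorageClass": "GLACIER"}
      ],
      "Expiration": {"Days": 2555}
    }
  ]
}
```

A separate rule without `Expiration` would cover `processed/`, since aggregates are retained indefinitely.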

## Data Processing

### Aggregation Jobs

**Daily Aggregates**:
- Transaction counts by hour
- Gas usage statistics
- Token transfer volumes
- Address activity metrics

**Monthly Aggregates**:
- Network growth metrics
- Token distribution changes
- Protocol usage statistics
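
The hourly transaction count, for example, is a straightforward group-by over a day's rows. A minimal pure-Python sketch (in production this would run in Spark or SQL; the function name is illustrative):

```python
from collections import Counter
from datetime import datetime, timezone

def hourly_tx_counts(timestamps: list[int]) -> Counter:
    """Count transactions per UTC hour from Unix timestamps."""
    return Counter(
        datetime.fromtimestamp(ts, tz=timezone.utc).hour for ts in timestamps
    )

# Three transactions on 2024-01-01 UTC: two in hour 0, one in hour 1
counts = hourly_tx_counts([1704067200, 1704067260, 1704070800])
print(counts)  # Counter({0: 2, 1: 1})
```

Bucketing in UTC matters here: the partition keys are date-based, so aggregating in a local timezone would leak rows across partition boundaries.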

### ML/Analytics Workflows

**Use Cases**:
- Anomaly detection
- Fraud detection
- Market analysis
- Network health monitoring

**Tools**: Spark, Pandas, Jupyter notebooks

## Security and Access Control

### Access Control

**IAM Policies**: Restrict access to specific prefixes

**Encryption**: Server-side encryption (SSE-S3 or SSE-KMS)

**Audit Logging**: Enable S3 access logging
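
Prefix-scoped access is expressed in IAM with a resource path for object reads and an `s3:prefix` condition for listing. A sketch for a role limited to `processed/` data (the prefix choice and role scope are illustrative):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": "arn:aws:s3:::explorer-data-lake/processed/*"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::explorer-data-lake",
      "Condition": {"StringLike": {"s3:prefix": ["processed/*"]}}
    }
  ]
}
```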

### Data Classification

**Public Data**: Blocks, transactions (public blockchain data)

**Sensitive Data**: User addresses, labels (requires authentication)

**Compliance Data**: Banking/transaction data (strict access control)

## Cost Optimization

### Storage Optimization

**Strategies**:
- Use appropriate storage classes (Standard, IA, Glacier)
- Compress data (Parquet + Snappy)
- Delete old data per retention policy
- Use intelligent tiering

### Query Optimization

**Strategies**:
- Partition pruning (query only relevant partitions)
- Column pruning (select only needed columns)
- Predicate pushdown (filter early)

## References

- Database Schema: See `postgres-schema.md`
- Analytics: See `../observability/metrics-monitoring.md`