RPC Service Level Objectives (SLO)
Service level objectives for RPC endpoints on ChainID 138.
Overview
This document defines the service level objectives for RPC endpoints serving ChainID 138 (DeFi Oracle Meta Mainnet).
RPC Endpoints
Primary Endpoint
- URL: https://rpc.d-bis.org
- Protocol: HTTPS
- WebSocket: wss://rpc.d-bis.org
- Location: Azure (Primary region)
- Infrastructure: Azure Application Gateway + AKS RPC nodes
Secondary Endpoint
- URL: https://rpc2.d-bis.org
- Protocol: HTTPS
- WebSocket: wss://rpc2.d-bis.org
- Location: Azure (Secondary region)
- Infrastructure: Azure Application Gateway + AKS RPC nodes
Service Level Objectives
Availability
- Target: ≥99.9% monthly uptime
- Measurement: Percentage of time RPC endpoints are accessible
- Monitoring: Azure Monitor, Prometheus, Status page
- Alerting: Alert on <99.9% uptime
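To make the availability target concrete, the sketch below converts a monthly uptime percentage into an allowed-downtime budget. The function name and the 30-day default are assumptions made for illustration; they are not part of any existing tooling for these endpoints.

```python
def downtime_budget_minutes(slo_percent: float, days_in_month: int = 30) -> float:
    """Return the minutes of downtime a month can absorb under an uptime SLO.

    Assumes the SLO is measured over calendar time, as in the
    "percentage of time RPC endpoints are accessible" measurement.
    """
    total_minutes = days_in_month * 24 * 60
    return total_minutes * (1 - slo_percent / 100)


if __name__ == "__main__":
    # For a 99.9% target over a 30-day month this is roughly 43.2 minutes.
    print(f"{downtime_budget_minutes(99.9):.1f} minutes")
```

Under these assumptions, a single outage of about 45 minutes would exhaust the budget for a 30-day month.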
Latency
- Target: <200ms p95 latency
- Measurement: 95th percentile response time
- Monitoring: Azure Application Insights, Prometheus
- Alerting: Alert on >200ms p95 latency
Throughput
- Target: 1000+ requests/second
- Measurement: Requests per second (RPS)
- Monitoring: Azure Monitor, Prometheus
- Alerting: Alert when throughput exceeds 90% of capacity
Error Rate
- Target: <0.1% error rate
- Measurement: Percentage of requests that result in errors
- Monitoring: Azure Monitor, Prometheus
- Alerting: Alert on >0.1% error rate
Service Level Indicators (SLI)
Uptime SLI
Uptime SLI = (Successful requests / Total requests) * 100
Latency SLI
Latency SLI = p95 response time
Throughput SLI
Throughput SLI = Requests per second
Error Rate SLI
Error Rate SLI = (Error requests / Total requests) * 100
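The four SLI formulas above can be sketched as one function over a window of request samples. This is a minimal illustration, not the production pipeline: the `RequestSample` type, the `compute_slis` name, and the nearest-rank percentile method are all assumptions. Prometheus and Application Insights compute percentiles differently, so values may not match dashboards exactly.

```python
import math
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass
class RequestSample:
    """One observed RPC request (hypothetical record shape)."""
    latency_ms: float
    ok: bool


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("percentile of an empty sequence is undefined")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def compute_slis(samples: Sequence[RequestSample], window_seconds: float) -> Dict[str, float]:
    """Compute the uptime, latency, throughput, and error rate SLIs for one window."""
    if not samples:
        raise ValueError("at least one sample is required")
    total = len(samples)
    errors = sum(1 for s in samples if not s.ok)
    return {
        "uptime_pct": 100 * (total - errors) / total,
        "latency_p95_ms": percentile([s.latency_ms for s in samples], 95),
        "throughput_rps": total / window_seconds,
        "error_rate_pct": 100 * errors / total,
    }
```

With the request-based formulas, the uptime and error rate SLIs always sum to 100, so in practice only one of the two needs to be stored.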
Monitoring
Metrics
- Uptime: Percentage of time endpoints are up
- Latency: Response time percentiles (p50, p95, p99)
- Throughput: Requests per second
- Error Rate: Percentage of errors
- Availability: Endpoint availability status
Tools
- Azure Monitor: Cloud monitoring
- Prometheus: Metrics collection
- Grafana: Metrics visualization
- Application Insights: Application performance monitoring
- Status Page: Public status page
Alerting
Alerts
- Uptime < 99.9%: Critical alert
- Latency > 200ms p95: Warning alert
- Throughput > 90% capacity: Warning alert
- Error Rate > 0.1%: Critical alert
- Endpoint down: Critical alert
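The alert rules above can be expressed as a small evaluation function. This is a sketch under stated assumptions: the `Alert` type, the `evaluate_alerts` name, and the `capacity_rps` parameter are hypothetical, and a real deployment would more likely encode these thresholds as Prometheus or Azure Monitor alert rules.

```python
from typing import List, NamedTuple


class Alert(NamedTuple):
    severity: str  # "critical" or "warning", as in the list above
    message: str


def evaluate_alerts(
    uptime_pct: float,
    latency_p95_ms: float,
    throughput_rps: float,
    capacity_rps: float,  # assumed known capacity of the endpoint
    error_rate_pct: float,
    endpoint_up: bool,
) -> List[Alert]:
    """Return the alerts triggered by the current metric values."""
    alerts: List[Alert] = []
    if not endpoint_up:
        alerts.append(Alert("critical", "Endpoint down"))
    if uptime_pct < 99.9:
        alerts.append(Alert("critical", f"Uptime {uptime_pct:.2f}% is below 99.9%"))
    if latency_p95_ms > 200:
        alerts.append(Alert("warning", f"p95 latency {latency_p95_ms:.0f}ms exceeds 200ms"))
    if throughput_rps > 0.9 * capacity_rps:
        alerts.append(Alert("warning", "Throughput exceeds 90% of capacity"))
    if error_rate_pct > 0.1:
        alerts.append(Alert("critical", f"Error rate {error_rate_pct:.2f}% exceeds 0.1%"))
    return alerts
```

Returning a list rather than the first match lets a single evaluation raise several alerts at once, for example a latency warning alongside an error rate alert.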
Notification Channels
- Email: Operations team
- Slack: Operations channel
- PagerDuty: On-call rotation
- SMS: Critical alerts only
Status Page
Public Status Page
- URL: https://status.d-bis.org (to be created)
- Updates: Real-time status updates
- Incidents: Incident reporting
- Maintenance: Maintenance windows
Status Indicators
- Operational: All systems operational
- Degraded: Some issues, but service available
- Outage: Service unavailable
- Maintenance: Scheduled maintenance
Incident Response
Severity Levels
- Critical: Service completely down
- High: Significant degradation
- Medium: Minor issues
- Low: Informational
Response Times
- Critical: 15 minutes
- High: 1 hour
- Medium: 4 hours
- Low: 24 hours
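As an illustration of how the response times might be applied, the following sketch maps a severity level to a response deadline. The `RESPONSE_TARGETS` table mirrors the list above; the function name and the use of UTC timestamps are assumptions.

```python
from datetime import datetime, timedelta, timezone

# Response targets per severity level, taken from the list above.
RESPONSE_TARGETS = {
    "critical": timedelta(minutes=15),
    "high": timedelta(hours=1),
    "medium": timedelta(hours=4),
    "low": timedelta(hours=24),
}


def response_deadline(severity: str, opened_at: datetime) -> datetime:
    """Return the time by which an incident of this severity needs a response."""
    try:
        return opened_at + RESPONSE_TARGETS[severity.lower()]
    except KeyError:
        raise ValueError(f"unknown severity: {severity!r}") from None


if __name__ == "__main__":
    opened = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
    print(response_deadline("Critical", opened).isoformat())
```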
Escalation
- Level 1: On-call engineer
- Level 2: Senior engineer
- Level 3: Engineering manager
- Level 4: CTO
Disaster Recovery
Backup Endpoints
- Primary: https://rpc.d-bis.org
- Secondary: https://rpc2.d-bis.org
- Tertiary: [To be configured]
Failover
- Automatic: DNS-based failover
- Manual: Manual failover procedures
- Testing: Quarterly failover tests
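DNS-based failover happens outside the client, but a client can also fall back between the two endpoints on its own. The sketch below shows that client-side approach, which is a complement to the failover mechanism described above and not an implementation of it. The helper names (`http_post_json`, `call_with_failover`) are hypothetical, and the code assumes the endpoints accept standard JSON-RPC 2.0 requests over HTTPS POST.

```python
import json
import urllib.request
from typing import Callable, Optional, Sequence

# Primary and secondary endpoints from this document, in priority order.
ENDPOINTS = ("https://rpc.d-bis.org", "https://rpc2.d-bis.org")


def http_post_json(url: str, payload: dict, timeout: float = 5.0) -> dict:
    """POST a JSON body and decode the JSON reply (standard library only)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())


def call_with_failover(
    method: str,
    params: Sequence = (),
    endpoints: Sequence[str] = ENDPOINTS,
    post: Callable[[str, dict], dict] = http_post_json,
):
    """Try each endpoint in order and return the first successful JSON-RPC result."""
    payload = {"jsonrpc": "2.0", "id": 1, "method": method, "params": list(params)}
    last_error: Optional[Exception] = None
    for url in endpoints:
        try:
            reply = post(url, payload)
        except (OSError, ValueError) as exc:
            # OSError covers network failures and timeouts; ValueError covers bad JSON.
            last_error = exc
            continue
        if "result" in reply:
            return reply["result"]
        last_error = RuntimeError(f"{url} returned an error: {reply.get('error')}")
    raise RuntimeError("all RPC endpoints failed") from last_error
```

For example, `call_with_failover("eth_chainId")` would be expected to return `"0x8a"` (138 in hexadecimal) from whichever endpoint answers first. The `post` parameter is injectable so the fallback logic can be exercised in a quarterly failover test without touching the network.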
Capacity Planning
Current Capacity
- RPS: 1000+ requests/second
- Concurrent Connections: 10,000+
- Bandwidth: 1 Gbps+
Scaling
- Horizontal: Add more RPC nodes
- Vertical: Increase node resources
- Auto-scaling: Kubernetes auto-scaling
- Load Balancing: Application Gateway load balancing
Reporting
Monthly Reports
- Uptime: Monthly uptime percentage
- Latency: Average and p95 latency
- Throughput: Average and peak throughput
- Error Rate: Error rate percentage
- Incidents: Number and duration of incidents
Quarterly Reviews
- SLO Performance: Review SLO performance
- Improvements: Identify improvements
- Capacity Planning: Plan for capacity increases
- Disaster Recovery: Review disaster recovery procedures