# Service State Machine

**Last Updated:** 2025-01-20  
**Document Version:** 1.0  
**Status:** Active Documentation

---

## Overview

This document defines the state machine for services in the infrastructure, including valid states, transitions, and recovery actions.

---

## Service State Diagram

```mermaid
stateDiagram-v2
    [*] --> Stopped
    Stopped --> Starting: start()
    Starting --> Running: initialized successfully
    Starting --> Error: initialization failed
    Running --> Stopping: stop()
    Running --> Error: runtime error
    Stopping --> Stopped: stopped successfully
    Stopping --> Error: stop failed
    Error --> Stopped: reset()
    Error --> Starting: restart()
    Running --> Restarting: restart()
    Restarting --> Starting: restart initiated
```

---

## State Definitions

### Stopped

**Description:** Service is not running

**Characteristics:**

- No processes active
- No resources allocated
- Configuration may be present

**Entry Conditions:**

- Initial state
- After successful stop
- After reset from error

**Exit Conditions:**

- Service started (`start()`)

---

### Starting

**Description:** Service is initializing

**Characteristics:**

- Process starting
- Configuration loading
- Resources being allocated
- Network connections being established

**Entry Conditions:**

- Service start requested
- Restart initiated

**Exit Conditions:**

- Initialization successful → Running
- Initialization failed → Error

**Typical Duration:**

- 10-60 seconds (depending on service)

---

### Running

**Description:** Service is operational

**Characteristics:**

- Process active
- Handling requests
- Monitoring active
- Health checks passing

**Entry Conditions:**

- Successful initialization
- Service started successfully

**Exit Conditions:**

- Stop requested → Stopping
- Runtime error → Error
- Restart requested → Restarting

**Verification:**

- Health check endpoint responding
- Service logs showing normal operation
- Metrics indicating activity

---

### Stopping

**Description:** Service is shutting down

**Characteristics:**

- Graceful shutdown in progress
- Finishing current requests
- Releasing resources
- Closing connections

**Entry Conditions:**

- Stop requested
- Service shutdown initiated

**Exit Conditions:**

- Shutdown successful → Stopped
- Shutdown failed → Error

**Typical Duration:**

- 5-30 seconds (graceful shutdown)

---

### Error

**Description:** Service is in error state

**Characteristics:**

- Service not functioning correctly
- Error logs present
- May be partially running
- Requires intervention

**Entry Conditions:**

- Initialization failed
- Runtime error occurred
- Stop operation failed

**Exit Conditions:**

- Reset requested → Stopped
- Restart requested → Starting

**Recovery Actions:**

- Check error logs
- Verify configuration
- Check dependencies
- Restart service

---

### Restarting

**Description:** Service restart in progress

**Characteristics:**

- Stop operation initiated
- Will transition to Starting after stop

**Entry Conditions:**

- Restart requested while Running

**Exit Conditions:**

- Stop complete → Starting

---

## State Transitions

### Transition: start()

**From:** Stopped  
**To:** Starting  
**Action:** Start service process  
**Verification:** Process started, logs show initialization

---

### Transition: initialized successfully

**From:** Starting  
**To:** Running  
**Condition:** All initialization steps completed  
**Verification:** Health check passes, service responding

---

### Transition: initialization failed

**From:** Starting  
**To:** Error  
**Condition:** Initialization error occurred  
**Action:** Log error, stop process  
**Recovery:** Check logs, fix configuration, restart

---

### Transition: stop()

**From:** Running  
**To:** Stopping  
**Action:** Initiate graceful shutdown  
**Verification:** Shutdown process started

---

### Transition: stopped successfully

**From:** Stopping  
**To:** Stopped  
**Condition:** Shutdown completed  
**Verification:** Process terminated, resources released

---

### Transition: stop failed

**From:** Stopping  
**To:** Error  
**Condition:** Shutdown error occurred  
**Action:** Force stop if needed  
**Recovery:** Manual intervention may be required

---

### Transition: runtime error

**From:** Running  
**To:** Error  
**Condition:** Runtime error detected  
**Action:** Log error, attempt recovery  
**Recovery:** Check logs, fix issue, restart

---

### Transition: reset()

**From:** Error  
**To:** Stopped  
**Action:** Reset service to clean state  
**Verification:** Service stopped, error state cleared

---

### Transition: restart()

**From:** Error  
**To:** Starting  
**Action:** Restart service from error state  
**Verification:** Service starting, initialization in progress

---

### Transition: restart() from Running

**From:** Running  
**To:** Restarting  
**Action:** Initiate the stop operation that precedes the restart

---

### Transition: restart initiated

**From:** Restarting  
**To:** Starting  
**Condition:** Stop complete

---

## Service-Specific State Machines

### Besu Node States

**Additional States:**

- **Syncing:** Blockchain synchronization in progress
- **Synced:** Blockchain fully synchronized
- **Consensus:** Participating in consensus (validators)

**State Flow:**

```
Starting → Syncing → Synced → Running (with Consensus if validator)
```

---

### Cloudflare Tunnel States

**Additional States:**

- **Connecting:** Establishing tunnel connection
- **Connected:** Tunnel connected to Cloudflare
- **Reconnecting:** Reconnecting after disconnection

**State Flow:**

```
Starting → Connecting → Connected → Running
Running → Reconnecting → Connected → Running
```

---

## Monitoring and Alerts

### State Monitoring

**Metrics to Track:**

- Current state
- State transition frequency
- Time in each state
- Error state occurrences

**Alerts:**

- Service in Error state > 5 minutes
- Frequent state transitions (thrashing)
- Service stuck in Starting > 10 minutes
- Service in Stopping > 2 minutes

---

## Recovery Procedures

### From Error State

**Step 1: Diagnose**

```bash
# Replace <service-name> with the systemd unit name of the affected service.

# Check service logs
journalctl -u <service-name> -n 100

# Check service status
systemctl status <service-name>

# Check error messages
journalctl -u <service-name> | grep -i error
```

**Step 2: Fix Issue**

- Fix configuration errors
- Resolve dependency issues
- Address resource constraints
- Fix network problems

**Step 3: Recover**

```bash
# Option 1: Restart
systemctl restart <service-name>

# Option 2: Reset and start
systemctl stop <service-name>
# Fix issues
systemctl start <service-name>
```

---

## Related Documentation

- **[OPERATIONAL_RUNBOOKS.md](../03-deployment/OPERATIONAL_RUNBOOKS.md)** ⭐⭐ - Operational procedures
- **[TROUBLESHOOTING_FAQ.md](/docs/09-troubleshooting/TROUBLESHOOTING_FAQ.md)** ⭐⭐⭐ - Troubleshooting guide
- **[BESU_ALLOWLIST_RUNBOOK.md](../06-besu/BESU_ALLOWLIST_RUNBOOK.md)** ⭐ - Besu allowlist and node operations

---

**Last Updated:** 2025-01-20  
**Review Cycle:** Quarterly
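
---

The arrows in the Service State Diagram can also be written as a lookup function, which is one way a script could reject transitions the diagram does not allow. The following is a minimal sketch in POSIX shell and is not part of any existing tooling: the function name `next_state` and the event labels (`initialized`, `init_failed`, `stopped`, `stop_failed`, `runtime_error`, `restart_initiated`) are hypothetical names chosen for illustration. Only the states and the arrows between them come from the diagram.

```bash
# Hypothetical helper: print the state that follows "$1" (current state)
# when event "$2" occurs, or return 1 if the diagram has no such arrow.
# Event labels are illustrative names for the diagram's arrow labels.
next_state() {
  case "$1:$2" in
    Stopped:start)                echo Starting ;;
    Starting:initialized)         echo Running ;;
    Starting:init_failed)         echo Error ;;
    Running:stop)                 echo Stopping ;;
    Running:runtime_error)        echo Error ;;
    Running:restart)              echo Restarting ;;
    Restarting:restart_initiated) echo Starting ;;
    Stopping:stopped)             echo Stopped ;;
    Stopping:stop_failed)         echo Error ;;
    Error:reset)                  echo Stopped ;;
    Error:restart)                echo Starting ;;
    *)                            return 1 ;;
  esac
}

# Usage: walk a sequence of events, stopping at the first invalid one.
state=Stopped
for event in start initialized restart restart_initiated; do
  if ! state=$(next_state "$state" "$event"); then
    echo "invalid transition on event: $event" >&2
    break
  fi
done
echo "final state: $state"
```

A single `case` statement keeps every allowed transition in one place, so adding a service-specific state (for example the Besu or Cloudflare Tunnel states) would only mean adding entries to the table.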