Lesson 12: Health Probes & Lifecycle - Building Self-Healing Systems
What We’re Building Today
In this lesson, you’ll implement a production-grade health monitoring system for a distributed log analytics platform:
Three-probe health strategy: Configure liveness, readiness, and startup probes with appropriate thresholds for different failure scenarios
Lifecycle hooks: Implement preStop and postStart hooks for graceful connection draining and warm-up sequences
Self-healing demonstration: Observe Kubernetes automatically restart failing containers and remove unhealthy pods from service
Circuit breaker integration: Connect health probes with application-level circuit breakers for cascading failure prevention
Why This Matters
Netflix discovered that 70% of their production incidents stemmed from health check misconfigurations—either probes that were too aggressive (causing unnecessary restarts during GC pauses) or too lenient (leaving zombie processes in rotation). Spotify’s migration to Kubernetes initially suffered from a 3x increase in user-facing errors because their readiness probes didn’t account for cache warm-up time.
The distinction between liveness, readiness, and startup probes isn’t academic—it’s the difference between a system that gracefully handles load spikes and one that death-spirals under pressure. A misconfigured liveness probe that kills pods during high load creates a cascading failure: fewer pods means more load per pod, which means slower responses, which means more probes fail, which means fewer pods.
Production systems require understanding probe interactions with JVM garbage collection, connection pool exhaustion, downstream dependency failures, and graceful shutdown sequences. Today’s implementation demonstrates these patterns in a realistic microservices context.
Kubernetes Health Architecture Deep Dive
The Three-Probe Model
Kubernetes provides three distinct probe types, each serving a specific purpose in the pod lifecycle:
Liveness Probe: Answers “Is this process fundamentally broken?” When this fails, Kubernetes kills the container and restarts it. Use this for detecting deadlocks, infinite loops, or corrupted state that can only be fixed by restarting. The critical mistake is making liveness probes check external dependencies—if your database is down, restarting your application won’t help.
Readiness Probe: Answers “Can this instance serve traffic right now?” When this fails, Kubernetes removes the pod from Service endpoints but doesn’t restart it. Use this for temporary conditions: cache warming, connection pool initialization, or when a downstream dependency is unavailable. Airbnb uses readiness probes to implement graceful degradation—pods mark themselves not-ready when their ML model hasn’t loaded, preventing users from seeing empty recommendations.
Startup Probe: Answers “Has this application finished initializing?” This probe runs only during startup, and once it succeeds, liveness and readiness probes take over. Critical for applications with slow startup times—without this, aggressive liveness probes would kill containers before they finish initializing.
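The three probes can coexist on one container; the startup probe runs first, and the other two begin only after it succeeds. A minimal sketch combining them (the endpoint paths, ports, and thresholds are illustrative assumptions, not the lesson's actual manifests):

```yaml
containers:
  - name: app
    image: example/app:1.0
    startupProbe:            # runs first; liveness/readiness wait for it
      httpGet:
        path: /health/startup
        port: 8080
      periodSeconds: 5
      failureThreshold: 24   # allows up to 24 x 5s = 120s to initialize
    livenessProbe:           # "is the process fundamentally broken?"
      httpGet:
        path: /health/live
        port: 8080
      periodSeconds: 10
      failureThreshold: 3
    readinessProbe:          # "can this instance serve traffic right now?"
      httpGet:
        path: /health/ready
        port: 8080
      periodSeconds: 5
      failureThreshold: 2
```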
Probe Configuration Trade-offs
```yaml
livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
  initialDelaySeconds: 30   # Wait for app to start
  periodSeconds: 10         # Check every 10s
  timeoutSeconds: 5         # Fail if response > 5s
  failureThreshold: 3       # Kill after 3 failures
  successThreshold: 1       # Must be 1 for liveness probes
```
The interaction between these parameters determines system behavior under stress:
initialDelaySeconds + failureThreshold × periodSeconds = maximum time before restart. With the values above, that is 30 + 3 × 10 = 60 seconds. Set this higher than your worst-case startup time plus some buffer.
timeoutSeconds should be set relative to your P99 latency. If your endpoint normally responds in 100ms but occasionally hits 2s during GC, a 1s timeout causes unnecessary restarts.
failureThreshold provides tolerance for transient failures. Network blips, brief GC pauses, or momentary load spikes shouldn't trigger restarts.
LinkedIn learned this lesson when their 1-second timeout with failureThreshold: 1 caused cascading restarts during minor network congestion.
Lifecycle Hooks: Graceful Transitions
Lifecycle hooks execute at critical moments in a container’s life:
```yaml
lifecycle:
  postStart:
    exec:
      command: ["/bin/sh", "-c", "warm-cache.sh"]
  preStop:
    httpGet:
      path: /shutdown
      port: 8080
```
postStart runs in parallel with the container's entrypoint (there is no guarantee it executes before the main process starts), but the container is not marked Running until the hook completes. Use it for initialization that must finish before liveness and readiness probes begin, such as pre-populating caches or establishing connection pools.
preStop is your graceful shutdown hook. When Kubernetes decides to terminate a pod, it runs the preStop hook first; SIGTERM is sent to the container only after the hook completes. The hook and the shutdown it triggers must together finish within terminationGracePeriodSeconds (default 30s), which gives you time to:
Stop accepting new connections
Drain existing requests
Close database connections
Flush buffers
The pattern used by Pinterest and Stripe: preStop hook tells load balancers to stop sending traffic, waits for in-flight requests to complete, then exits cleanly.
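A hedged sketch of that pattern; the sleep duration, container name, and grace period are assumptions chosen for illustration:

```yaml
spec:
  terminationGracePeriodSeconds: 45   # must cover preStop sleep plus drain time
  containers:
    - name: api
      lifecycle:
        preStop:
          exec:
            # Sleep while the endpoint removal propagates to load balancers;
            # the app's SIGTERM handler then drains in-flight requests.
            command: ["/bin/sh", "-c", "sleep 10"]
```

The sleep bridges the gap between the pod being removed from Service endpoints and load balancers actually stopping traffic; SIGTERM arrives only after the sleep finishes.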
Common Anti-Patterns
Anti-pattern 1: Liveness probes that check dependencies
```yaml
# DON'T DO THIS
livenessProbe:
  httpGet:
    path: /health   # Checks database connectivity
    port: 8080
```
If your database is down, restarting all your application pods makes things worse. Liveness should only check “is my process working?”
Anti-pattern 2: Same endpoint for liveness and readiness
```yaml
# DON'T DO THIS
livenessProbe:
  httpGet:
    path: /health
    port: 8080
readinessProbe:
  httpGet:
    path: /health   # Same endpoint!
    port: 8080
```
These probes have different purposes. Readiness can fail when downstream services are unavailable; liveness should only fail when the process itself is broken.
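A sketch of the corrected shape, which also fixes anti-pattern 1 (the endpoint paths are assumptions):

```yaml
# DO THIS: liveness stays shallow, readiness checks dependencies.
livenessProbe:
  httpGet:
    path: /health/live    # process-only check, never touches the database
    port: 8080
readinessProbe:
  httpGet:
    path: /health/ready   # may check Kafka, DB pool, cache warm-up
    port: 8080
```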
Anti-pattern 3: No startup probe for slow applications
```yaml
# DON'T DO THIS
livenessProbe:
  initialDelaySeconds: 300   # 5 minute delay!
```
A long initialDelaySeconds means no liveness checking at all for five minutes after every restart. Use a startup probe instead.
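The startup-probe replacement might look like this (paths and thresholds are illustrative assumptions):

```yaml
# DO THIS: the startup probe tolerates slow boots,
# then hands off to a fast liveness probe.
startupProbe:
  httpGet:
    path: /health/startup
    port: 8080
  periodSeconds: 10
  failureThreshold: 30     # allows up to 30 x 10s = 300s to start
livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
  periodSeconds: 10        # normal checking begins as soon as startup succeeds
  failureThreshold: 3
```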
Implementation Walkthrough
GitHub Link: https://github.com/sysdr/k8s_course/tree/main/lesson12/k8s-health-probes-system
Our log analytics system demonstrates health probes across three service types:
Log Collector Service (Fast Startup)
The collector starts quickly but needs graceful shutdown to flush buffered logs:
Liveness: Check internal state only
Readiness: Check Kafka connectivity
preStop: Flush buffers and close Kafka producers
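A sketch of the collector's probe section; the script name and endpoints are hypothetical placeholders, not the repository's actual files:

```yaml
# Collector: fast startup, graceful flush on shutdown.
livenessProbe:
  httpGet:
    path: /health/live    # internal state only
    port: 8080
readinessProbe:
  httpGet:
    path: /health/ready   # fails while Kafka is unreachable
    port: 8080
lifecycle:
  preStop:
    exec:
      # hypothetical flush script: drain buffers, close Kafka producers
      command: ["/bin/sh", "-c", "/app/flush-and-close.sh"]
```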
Log Processor Service (Slow Startup with Caching)
The processor loads ML models and warms caches:
Startup probe: Wait for model loading (up to 2 minutes)
Liveness: Check for deadlocks in processing threads
Readiness: Check cache warm-up status
postStart: Trigger async cache warm-up
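A corresponding sketch for the processor; the warm-up endpoint and thresholds are assumptions for illustration:

```yaml
# Processor: slow startup (model load + cache warm-up).
startupProbe:
  httpGet:
    path: /health/startup
    port: 8080
  periodSeconds: 10
  failureThreshold: 12    # 12 x 10s = 120s for model loading
readinessProbe:
  httpGet:
    path: /health/ready   # reports cache warm-up status
    port: 8080
lifecycle:
  postStart:
    exec:
      # hypothetical trigger for async cache warm-up
      command: ["/bin/sh", "-c", "curl -sf http://localhost:8080/warmup"]
```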
Analytics API Service (High Availability)
The API serves user-facing traffic with strict latency requirements:
Liveness: Endpoint that never checks dependencies
Readiness: Check database connection pool health
preStop: 10-second sleep for connection draining
Each service exposes dedicated endpoints: /health/live, /health/ready, and /health/startup with different implementations based on service characteristics.
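At the application level, the key design point is that the liveness and readiness handlers consult different state. A minimal Python sketch of that separation (not the repository's actual code; the thresholds and field names are assumptions):

```python
# Liveness looks only at in-process state; readiness additionally consults
# dependency status that background checkers keep up to date.
import time

class HealthState:
    def __init__(self):
        self.last_heartbeat = time.monotonic()  # worker threads refresh this
        self.cache_warm = False                 # set once warm-up finishes
        self.kafka_ok = False                   # set by a dependency checker

    def live(self) -> bool:
        # Process-only check: a worker heartbeat older than 30s
        # suggests a deadlock worth restarting for.
        return time.monotonic() - self.last_heartbeat < 30

    def ready(self) -> bool:
        # Traffic-worthiness: everything liveness checks, plus dependencies.
        return self.live() and self.cache_warm and self.kafka_ok

state = HealthState()
print(state.live(), state.ready())   # True False: alive but not yet ready
state.cache_warm = True
state.kafka_ok = True
print(state.live(), state.ready())   # True True
```

Wiring these two methods to /health/live and /health/ready keeps a database outage from ever failing the liveness check.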
Production Considerations
Monitoring Probe Failures
Prometheus metrics for probe behavior are essential:
kube_pod_container_status_restarts_total: restart count indicates liveness failures
kube_pod_status_ready: ready-status changes indicate readiness failures
Custom application metrics for probe endpoint latency
Alert on unusual restart patterns—a pod restarting more than twice per hour usually indicates probe misconfiguration or resource starvation.
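That heuristic translates into a Prometheus alerting rule along these lines (a sketch; the group name, window, and labels are assumptions):

```yaml
groups:
  - name: probe-health
    rules:
      - alert: PodRestartingFrequently
        # fires when a container restarts more than twice in an hour
        expr: increase(kube_pod_container_status_restarts_total[1h]) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.pod }} restarted >2 times in the last hour"
```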
Failure Scenarios to Test
Database outage: Readiness should fail, liveness should pass
Memory leak: Liveness should eventually fail (OOM or deadlock)
High load: Probes should tolerate latency spikes
Rolling update: New pods shouldn’t receive traffic until cache is warm
Resource Interaction
Probes consume resources. An HTTP probe creating a new connection every 10 seconds across 100 pods generates significant load. Consider:
Using TCP probes instead of HTTP for simple port checks
Adjusting probe frequency based on service criticality
Using gRPC health checking protocol for gRPC services
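The lighter-weight alternatives look like this (ports are assumptions; native gRPC probes require Kubernetes 1.24+ and went GA in 1.27):

```yaml
readinessProbe:
  tcpSocket:          # only checks that the port accepts connections
    port: 8080
  periodSeconds: 10
livenessProbe:
  grpc:               # uses the standard gRPC Health Checking Protocol
    port: 9090
  periodSeconds: 10
```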
Scale Connection
At FAANG scale, health probes become critical infrastructure. Google’s Borg (Kubernetes’ predecessor) processes millions of health checks per second. Their key patterns:
Hierarchical health: Pods report to node-level aggregators, which report to cluster-level dashboards
Predictive health: ML models predict failures before probes catch them
Probe load shedding: During incidents, reduce probe frequency to free resources
Airbnb runs probes at different frequencies based on service tier—critical path services check every 5 seconds while batch processors check every 30 seconds.
Next Steps
Tomorrow’s lesson on Resource Management builds directly on health probes. You’ll learn how resource requests and limits interact with probe behavior—specifically, how a container approaching its memory limit responds to health checks, and how to set probe timeouts relative to CPU throttling behavior.
“A system without health checks is like driving without a dashboard—you won’t know you’re in trouble until you’ve already crashed.” — SRE Proverb

