How to Write Fault-Tolerant Code

Writing fault-tolerant code means designing systems that continue to function — gracefully — even when parts fail.

In cloud and distributed systems (especially if you’re working with serverless or AWS), this is critical.


Core Principles of Fault-Tolerant Code

1. Assume Everything Can Fail

Networks fail. APIs time out. Databases go down.
Write code assuming:

  • External services may not respond
  • Messages may be duplicated
  • Data may arrive out of order
  • Requests may partially succeed

2. Implement Retries (Correctly)

Transient failures (timeouts, throttling, 5xx errors such as 503) should be retried. Permanent failures (most 4xx errors) should not.

Best Practices:

  • Use exponential backoff
  • Add jitter (random delay)
  • Limit retry count

Example (conceptual):

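import random
import time

# call_api and TemporaryError are placeholders for your own client and error type
for attempt in range(5):
    try:
        call_api()
        break
    except TemporaryError:
        # Exponential backoff with full jitter, capped at 30 seconds (illustrative values)
        time.sleep(random.uniform(0, min(30, 0.5 * 2 ** attempt)))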

⚠ Never retry blindly — especially for non-idempotent operations.


3. Make Operations Idempotent

An operation is idempotent if running it multiple times has the same result as running it once.

Example:

  • Charging a credit card ❌ (not idempotent unless protected by an idempotency key)
  • Updating a user’s status to “active” ✅

Use:

  • Idempotency keys
  • Unique transaction IDs
  • Database constraints

This is crucial in event-driven systems like:

  • AWS Lambda
  • Amazon SQS

These services can deliver the same message or event more than once.
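
A minimal sketch of an idempotency key in practice, assuming a toy in-memory store. The `PaymentProcessor` class, its `charge` method, and the `order-1234` key are hypothetical names chosen for illustration, not part of any real payment API.

```python
import uuid


class PaymentProcessor:
    """Toy in-memory processor that deduplicates charges by idempotency key."""

    def __init__(self):
        self._results = {}      # idempotency key -> stored result
        self.charges_made = 0   # counts real charges, for demonstration

    def charge(self, idempotency_key, amount_cents):
        # A repeated key returns the stored result instead of charging again.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges_made += 1
        result = {"charge_id": str(uuid.uuid4()), "amount_cents": amount_cents}
        self._results[idempotency_key] = result
        return result


processor = PaymentProcessor()
key = "order-1234"                    # hypothetical key derived from the order
first = processor.charge(key, 500)
second = processor.charge(key, 500)   # simulated duplicate delivery
```

In a real system the check-and-store step would need to be atomic, for example by relying on a unique constraint in the database rather than an in-process dictionary.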


4. Use Circuit Breakers

Prevent cascading failures.

If a downstream service keeps failing:

  • Stop calling it temporarily
  • Fail fast
  • Retry after a cooldown period

This protects your system from meltdown.
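
The behavior described above can be sketched as a small class. This is a simplified, single-threaded illustration, not a production implementation; `CircuitBreaker`, `CircuitOpenError`, and the default threshold and cooldown values are assumptions made for the example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is skipped."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, cooldown_seconds=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.cooldown_seconds:
                # Still cooling down: fail fast without touching the service.
                raise CircuitOpenError("circuit open; failing fast")
            # Cooldown elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # Open (or re-open after a failed trial) and restart the cooldown.
            if (self._opened_at is not None
                    or self._failures >= self.failure_threshold):
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Wrapping each downstream dependency in its own breaker keeps one failing service from tripping calls to the others.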


5. Set Timeouts Everywhere

Never allow infinite waits.

Always define:

  • Connection timeout
  • Read timeout
  • Execution timeout

Without timeouts, threads can hang and exhaust resources.
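
As a rough illustration of connection and read timeouts, here is a sketch using Python's standard `socket` module. The function name `fetch_with_timeouts` and the default timeout values are assumptions for the example; an execution timeout for the whole operation would need a separate mechanism and is not shown.

```python
import socket


def fetch_with_timeouts(host, port, payload,
                        connect_timeout=1.0, read_timeout=2.0):
    """Send payload and return the reply, or None if any step fails or times out."""
    try:
        # Connection timeout: give up if the TCP handshake takes too long.
        sock = socket.create_connection((host, port), timeout=connect_timeout)
    except OSError:  # covers refused connections and connect timeouts
        return None
    with sock:
        # Read timeout: give up if the peer stops responding mid-request.
        sock.settimeout(read_timeout)
        try:
            sock.sendall(payload)
            return sock.recv(4096)
        except OSError:  # socket.timeout is a subclass of OSError
            return None
```

Returning `None` keeps the sketch short; real code would more likely raise a specific exception so callers can decide whether to retry.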


6. Graceful Degradation

If a feature fails, don’t crash the whole system.

Example:

  • Recommendation engine fails → Show default recommendations.
  • Analytics service fails → Skip the analytics event, but complete the user’s checkout.
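
The recommendation fallback might look like the sketch below. `get_recommendations`, `fetch_personalized`, and the default item IDs are hypothetical names used only for illustration.

```python
import logging

logger = logging.getLogger("recommendations")

# Hypothetical static fallback shown when personalization is unavailable.
DEFAULT_RECOMMENDATIONS = ["bestseller-1", "bestseller-2", "bestseller-3"]


def get_recommendations(user_id, fetch_personalized):
    """Return personalized items, falling back to defaults on any failure."""
    try:
        return fetch_personalized(user_id)
    except Exception:
        # Record the failure, but keep the page working with default content.
        logger.warning("recommendation engine failed for user %s; using defaults",
                       user_id, exc_info=True)
        return list(DEFAULT_RECOMMENDATIONS)
```

The key design choice is that the failure is logged rather than swallowed silently, so the degraded state still shows up in monitoring.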

7. Bulkheads (Isolation)

Isolate components so failures don’t spread.

Example:

  • Separate thread pools
  • Separate queues
  • Separate databases per service

In the cloud:

  • Separate microservices
  • Separate autoscaling groups
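
Inside a single process, one way to approximate a bulkhead is to cap concurrent calls per dependency with a semaphore. This is a sketch under that assumption; the `Bulkhead` class, `BulkheadFullError`, and the limits shown are illustrative names and values.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when a dependency's concurrency budget is exhausted."""


class Bulkhead:
    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, func, *args, **kwargs):
        # Reject immediately instead of queueing, so a slow dependency
        # cannot tie up every worker in the process.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("no free slots; rejecting call")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()


# One bulkhead per downstream dependency (hypothetical limits).
payments_bulkhead = Bulkhead(max_concurrent=2)
search_bulkhead = Bulkhead(max_concurrent=10)
```

With separate budgets, a stalled search backend can exhaust only its own slots and leaves payment calls unaffected.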

8. Validate Inputs Strictly

Many outages are caused by unexpected data.

  • Validate schemas
  • Reject malformed requests early
  • Fail fast and clearly
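
A minimal sketch of strict validation at the edge of a service, using only the standard library. The `CheckoutRequest` shape, its field names, and `parse_checkout_request` are assumptions made for the example.

```python
from dataclasses import dataclass


class ValidationError(ValueError):
    """Raised when a request does not match the expected schema."""


@dataclass(frozen=True)
class CheckoutRequest:
    user_id: str
    amount_cents: int


def parse_checkout_request(payload):
    """Validate a decoded JSON payload and return a typed request."""
    if not isinstance(payload, dict):
        raise ValidationError("payload must be a JSON object")
    unknown = set(payload) - {"user_id", "amount_cents"}
    if unknown:
        raise ValidationError(f"unexpected fields: {sorted(map(str, unknown))}")
    user_id = payload.get("user_id")
    if not isinstance(user_id, str) or not user_id:
        raise ValidationError("user_id must be a non-empty string")
    amount = payload.get("amount_cents")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(amount, bool) or not isinstance(amount, int) or amount <= 0:
        raise ValidationError("amount_cents must be a positive integer")
    return CheckoutRequest(user_id=user_id, amount_cents=amount)
```

Rejecting unknown fields is a deliberate choice here; some services prefer to ignore them for forward compatibility.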

9. Monitor & Log Properly

Fault tolerance isn’t just code — it’s observability.

Use:

  • Structured logging
  • Distributed tracing
  • Alerts on error rates
  • Health checks
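
Structured logging can be sketched with the standard `logging` module by emitting one JSON object per line. `JsonFormatter`, `build_logger`, and the `context` key passed through `extra` are conventions assumed for this example, not a standard API.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record):
        entry = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Merge structured fields passed via extra={"context": {...}}.
        entry.update(getattr(record, "context", {}))
        return json.dumps(entry, default=str)


def build_logger(name, stream=None):
    """Create a logger that writes JSON lines to stream (stderr by default)."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A call such as `build_logger("checkout").error("payment failed", extra={"context": {"order_id": "o-1"}})` then produces a line that log tooling can filter by field, which makes alerting on error rates much easier than parsing free text.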

10. Design for Redundancy

At the system level:

  • Multi-AZ deployments
  • Load balancers
  • Replicated databases

Example cloud services:

  • Multi-AZ deployments on Amazon Web Services
  • Auto-scaling groups
  • Managed databases with replicas
