Reliability Patterns

A set of defensive patterns — Circuit Breaker, Retry, Bulkhead, Timeout, and Rate Limiter — that prevent a single service failure from cascading through a distributed system, implemented in Java/Spring Boot with Resilience4j.

What Problem Does It Solve?

In a distributed system, remote calls fail. Network timeouts, dependency service restarts, database connection exhaustion — any call across a network boundary can fail or hang. Without protection, a slow downstream service causes threads to pile up waiting for responses, exhausting the thread pool, which then causes the calling service to time out for its callers, and so on up the dependency chain. This cascading failure can bring down an entire system because of one slow microservice.

Reliability patterns build the protection into the calling code itself, so a service doesn't have to trust that its dependencies are always available. They keep individual services resilient in the face of unreliable dependencies.

The Five Patterns

Circuit Breaker

A circuit breaker monitors calls to a dependency. When the failure rate crosses a threshold, the breaker opens — subsequent calls are rejected immediately (fast-fail) without even attempting the downstream call. After a wait window, the breaker moves to half-open: a few probe requests are let through. If they succeed, the breaker closes again.

Caption: Circuit breaker state machine — the breaker protects downstream dependencies by fast-failing in the Open state, then cautiously re-testing in Half-Open.

Key benefit: prevents threads from blocking on a known-failing dependency. The fast-fail response is returned in microseconds instead of waiting for a network timeout.
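
The state machine described above can be sketched in plain Java. This is a minimal illustration, not Resilience4j's implementation: the class name `SimpleCircuitBreaker`, its constructor parameters, and the injectable clock are assumptions made for the example, and it evaluates the failure rate once per fixed batch of calls rather than over a true sliding window.

```java
import java.time.Duration;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Illustrative count-based circuit breaker; not Resilience4j's implementation. */
final class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int windowSize;          // calls per evaluation window
    private final double failureThreshold; // 0.5 means "open at 50% failures"
    private final long openNanos;          // time to stay OPEN before probing
    private final int probeCalls;          // successful probes needed to close
    private final LongSupplier nanoClock;  // injectable so the wait can be simulated

    private State state = State.CLOSED;
    private int calls, failures, probeSuccesses;
    private long openedAt;

    SimpleCircuitBreaker(int windowSize, double failureThreshold,
                         Duration openFor, int probeCalls, LongSupplier nanoClock) {
        this.windowSize = windowSize;
        this.failureThreshold = failureThreshold;
        this.openNanos = openFor.toNanos();
        this.probeCalls = probeCalls;
        this.nanoClock = nanoClock;
    }

    /** Current state; moves OPEN to HALF_OPEN once the wait window has elapsed. */
    synchronized State state() {
        if (state == State.OPEN && nanoClock.getAsLong() - openedAt >= openNanos) {
            state = State.HALF_OPEN;
            probeSuccesses = 0;
        }
        return state;
    }

    <T> T call(Supplier<T> action, Supplier<T> fallback) {
        if (state() == State.OPEN) {
            return fallback.get();          // fast-fail: the dependency is never called
        }
        try {
            T result = action.get();
            record(false);
            return result;
        } catch (RuntimeException e) {
            record(true);
            return fallback.get();
        }
    }

    private synchronized void record(boolean failed) {
        if (state == State.HALF_OPEN) {
            if (failed) {
                open();                     // a failed probe re-opens the breaker
            } else if (++probeSuccesses >= probeCalls) {
                close();                    // enough successful probes: close again
            }
            return;
        }
        if (state == State.OPEN) {
            return;                         // result of a call that raced with opening
        }
        calls++;
        if (failed) failures++;
        if (calls >= windowSize) {
            if ((double) failures / calls >= failureThreshold) open();
            else close();                   // healthy window: start counting afresh
        }
    }

    private void open()  { state = State.OPEN; openedAt = nanoClock.getAsLong(); calls = 0; failures = 0; }
    private void close() { state = State.CLOSED; calls = 0; failures = 0; }
}
```

In production the clock would typically be `System::nanoTime`. The sketch also lets any number of concurrent probes through in the half-open state, which a real implementation would cap.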

Retry

Retries handle transient failures — network blips, brief service restarts, database connection pool exhaustion that resolves in milliseconds. A retry policy specifies how many times to retry, with what delay, and for which exception types.

warning

Retry without a Circuit Breaker can amplify the load on a failing service. If 100 requests fail and each is retried 3 times, the failing service receives 400 requests (100 original calls plus 300 retries). Always combine Retry with a Circuit Breaker.
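
A retry policy with exponential backoff, and the load amplification it can cause, can be sketched in a few lines of plain Java. The class `RetrySketch` and its method names are hypothetical, and the helper retries on any `RuntimeException`, whereas a real policy would filter by exception type.

```java
import java.util.function.Supplier;

/** Illustrative retry helper; names and signatures are assumptions for this sketch. */
final class RetrySketch {

    /** Wait before retry number {@code retry} (1-based): initial * multiplier^(retry - 1). */
    static long backoffMillis(long initialMillis, double multiplier, int retry) {
        return (long) (initialMillis * Math.pow(multiplier, retry - 1));
    }

    /** Calls {@code action} up to {@code maxAttempts} times, sleeping between attempts. */
    static <T> T withRetry(Supplier<T> action, int maxAttempts,
                           long initialMillis, double multiplier) {
        if (maxAttempts < 1) throw new IllegalArgumentException("maxAttempts must be >= 1");
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.get();
            } catch (RuntimeException e) {
                last = e;
                if (attempt == maxAttempts) break;      // out of attempts
                try {
                    Thread.sleep(backoffMillis(initialMillis, multiplier, attempt));
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt(); // preserve the interrupt and stop
                    break;
                }
            }
        }
        throw last;
    }

    /** Calls hitting the dependency when every one of {@code requests} fails and is retried. */
    static int totalCalls(int requests, int retriesPerRequest) {
        return requests * (1 + retriesPerRequest);
    }
}
```

With an initial wait of 500 ms and a multiplier of 2, `backoffMillis` yields 500 ms before the first retry and 1000 ms before the second. `totalCalls(100, 3)` gives the 400 requests mentioned in the warning.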

Bulkhead

A bulkhead limits the number of concurrent calls to a dependency. Named after ship bulkheads that partition the hull into watertight compartments — if one compartment floods, others stay dry.

Two implementations:

Semaphore Bulkhead (shared thread pool): limits concurrent calls with a semaphore; the call runs on the caller's thread
Thread Pool Bulkhead: a dedicated, bounded thread pool per downstream service; the call runs on a pool thread, so the calling thread is freed immediately

Caption: Bulkhead isolation — a slow Payment Service consuming all 5 of its bulkhead slots doesn't starve User Service calls, which have their own independent slot pool.
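
The semaphore variant is small enough to sketch with `java.util.concurrent.Semaphore`. The class name `SemaphoreBulkhead` and the fallback parameter are assumptions for illustration. The point is that each dependency gets its own permit pool, and a full pool rejects instead of queueing.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

/** Illustrative semaphore bulkhead: one instance per downstream dependency. */
final class SemaphoreBulkhead {
    private final Semaphore permits;

    SemaphoreBulkhead(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    /** Runs the action if a slot is free; otherwise returns the fallback without waiting. */
    <T> T call(Supplier<T> action, Supplier<T> fallback) {
        if (!permits.tryAcquire()) {
            return fallback.get();      // compartment is full: reject, do not queue
        }
        try {
            return action.get();
        } finally {
            permits.release();          // always give the slot back
        }
    }

    int availableSlots() {
        return permits.availablePermits();
    }
}
```

A service would hold one bulkhead per dependency, for example one for the Payment Service and one for the User Service, so exhausting the first leaves the second untouched.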

Timeout

Set a maximum duration for a call. If the downstream service doesn't respond in time, cancel the call and return a fallback. Timeouts are foundational — without them, a hung dependency occupies a thread indefinitely.

tip

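Set timeouts at every point in the call chain, and make the budget shrink as calls go deeper. If Service A calls B, which calls C, then A might allow 10s for B while B allows 5s for C. If an inner timeout is longer than the outer one (say B waits 20s for C while A gives up on B after 10s), B keeps working on a request whose caller has already given up.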
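
A timeout with a fallback can be sketched using `CompletableFuture.completeOnTimeout` (Java 9 or later). The helper name `callWithTimeout` is hypothetical. Note the limitation of this approach: the caller stops waiting, but the underlying task is not cancelled and keeps its worker thread until it returns.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Illustrative timeout wrapper; runs the action on the common ForkJoinPool. */
final class TimeoutSketch {

    /** Returns the action's result, or the fallback if it fails or takes too long. */
    static <T> T callWithTimeout(Supplier<T> action, long timeoutMillis, T fallback) {
        return CompletableFuture.supplyAsync(action)
            .completeOnTimeout(fallback, timeoutMillis, TimeUnit.MILLISECONDS) // give up waiting
            .exceptionally(ex -> fallback)                                     // action threw
            .join();
    }
}
```

For blocking HTTP clients, connect and read timeouts on the client itself are still needed, because only those release the thread that is stuck on the socket.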

Rate Limiter

Controls the rate at which your service calls a downstream dependency (or the rate at which clients call your service). Protects the downstream from being overwhelmed by your service during retries or traffic spikes.
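
One simple way to picture a rate limiter is a fixed window: a number of permits per period, reset when the period elapses. The sketch below is an illustration under that assumption. The class name `FixedWindowRateLimiter` and the injectable clock are hypothetical, and it is not a description of how Resilience4j implements its limiter.

```java
import java.util.function.LongSupplier;

/** Illustrative fixed-window rate limiter: N permits per period, rejecting when exhausted. */
final class FixedWindowRateLimiter {
    private final int limitForPeriod;
    private final long periodNanos;
    private final LongSupplier nanoClock;  // injectable so time can be simulated

    private long windowStart;
    private int used;

    FixedWindowRateLimiter(int limitForPeriod, long periodNanos, LongSupplier nanoClock) {
        this.limitForPeriod = limitForPeriod;
        this.periodNanos = periodNanos;
        this.nanoClock = nanoClock;
        this.windowStart = nanoClock.getAsLong();
    }

    /** Takes a permit if one is left in the current window; never blocks. */
    synchronized boolean tryAcquire() {
        long now = nanoClock.getAsLong();
        if (now - windowStart >= periodNanos) {
            windowStart = now;             // period elapsed: refresh the permits
            used = 0;
        }
        if (used < limitForPeriod) {
            used++;
            return true;
        }
        return false;                      // over the limit: reject immediately
    }
}
```

A caller checks `tryAcquire()` before each downstream call and returns a fallback or an error when it yields `false`.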

Resilience4j in Spring Boot

Resilience4j is the commonly recommended resilience library for Spring Boot. It replaces Netflix Hystrix, which has been in maintenance mode since 2018.

Setup (Spring Boot 3)

<dependency>
    <groupId>io.github.resilience4j</groupId>
    <artifactId>resilience4j-spring-boot3</artifactId>
    <version>2.2.0</version>
</dependency>
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-aop</artifactId>  <!-- ← required for annotations -->
</dependency>
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>  <!-- for health/metrics -->
</dependency>

Configuration (application.yml)

resilience4j:
  circuitbreaker:
    instances:
      userService:
        sliding-window-type: COUNT_BASED          # ← COUNT_BASED or TIME_BASED
        sliding-window-size: 10                   # ← observe last 10 calls
        failure-rate-threshold: 50                # ← open if 50%+ of calls fail
        wait-duration-in-open-state: 30s          # ← stay open for 30 seconds
        permitted-number-of-calls-in-half-open-state: 3  # ← probe 3 calls in HALF-OPEN
        automatic-transition-from-open-to-half-open-enabled: true

  retry:
    instances:
      userService:
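        max-attempts: 3                           # ← try 3 times total (1 call + 2 retries)
        wait-duration: 500ms                      # ← initial wait before the first retry
        enable-exponential-backoff: true          # ← required for the multiplier to take effect
        exponential-backoff-multiplier: 2         # ← double the wait on each retry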
        retry-exceptions:
          - java.io.IOException
          - java.util.concurrent.TimeoutException
        ignore-exceptions:
          - com.example.exceptions.UserNotFoundException  # ← don't retry on 404

  bulkhead:
    instances:
      userService:
        max-concurrent-calls: 10                  # ← max 10 parallel calls
        max-wait-duration: 0ms                    # ← don't queue; fail fast

  timelimiter:
    instances:
      userService:
        timeout-duration: 2s                      # ← cancel if no response in 2s

  ratelimiter:
    instances:
      userService:
        limit-for-period: 100                     # ← 100 calls per refresh period
        limit-refresh-period: 1s                  # ← refresh every second
        timeout-duration: 0ms                     # ← don't queue; reject immediately

Annotation-Driven Usage

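import io.github.resilience4j.bulkhead.annotation.Bulkhead;
import io.github.resilience4j.circuitbreaker.annotation.CircuitBreaker;
import io.github.resilience4j.retry.annotation.Retry;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.stereotype.Service;
import org.springframework.web.client.RestClient;

@Service
public class UserServiceClient {

    private static final Logger log = LoggerFactory.getLogger(UserServiceClient.class);

    private final RestClient restClient;

    // Assumes a RestClient bean (with the user service base URL) is defined elsewhere
    public UserServiceClient(RestClient restClient) {
        this.restClient = restClient;
    }

    // The order of the annotations in source does not matter. Resilience4j's default
    // aspect order, outermost first: Retry → CircuitBreaker → RateLimiter → TimeLimiter → Bulkhead.
    // Because Retry is outermost, a fallback declared on @CircuitBreaker returns before
    // Retry sees the exception; declare it on @Retry if retries should run first.
    @CircuitBreaker(name = "userService", fallbackMethod = "getUserFallback")
    @Retry(name = "userService")
    @Bulkhead(name = "userService")
    public UserDto getUser(Long userId) {
        return restClient.get()
            .uri("/users/{id}", userId)
            .retrieve()
            .body(UserDto.class);
    }

    // ✅ Fallback: same parameters and return type, plus a trailing exception parameter
    public UserDto getUserFallback(Long userId, Exception ex) {
        log.warn("Fallback for user {}, returning stub. Cause: {}", userId, ex.getMessage());
        return UserDto.stub(userId);  // ← return a safe default, not null
    }
}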

Programmatic Usage (when annotations aren't flexible enough)

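import io.github.resilience4j.circuitbreaker.CallNotPermittedException;
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;
import io.github.resilience4j.decorators.Decorators;  // from the resilience4j-all module
import io.github.resilience4j.retry.Retry;
import io.github.resilience4j.retry.RetryRegistry;
import org.springframework.stereotype.Service;

import java.io.IOException;
import java.util.List;
import java.util.function.Supplier;

@Service
public class InventoryServiceClient {

    private final CircuitBreaker circuitBreaker;
    private final Retry retry;

    public InventoryServiceClient(CircuitBreakerRegistry cbRegistry,
                                   RetryRegistry retryRegistry) {
        this.circuitBreaker = cbRegistry.circuitBreaker("inventoryService");
        this.retry          = retryRegistry.retry("inventoryService");
    }

    public int getStock(Long productId) {
        // Each with* call wraps the ones before it, so Retry wraps the CircuitBreaker here.
        // callInventoryApi (not shown) performs the remote call.
        Supplier<Integer> decorated = Decorators
            .ofSupplier(() -> callInventoryApi(productId)) // ← the actual call
            .withCircuitBreaker(circuitBreaker)
            .withRetry(retry)
            .withFallback(List.of(CallNotPermittedException.class,
                                  IOException.class),
                          ex -> 0) // ← fallback: return 0 stock on failure
            .decorate();

        return decorated.get();
    }
}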

Monitoring with Actuator

Add to application.yml:

management:
  endpoints:
    web:
      exposure:
        include: health,metrics,circuitbreakers
  health:
    circuitbreakers:
      enabled: true

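GET /actuator/health includes circuit breaker state for instances configured with register-health-indicator: true (health details must also be exposed, for example with management.endpoint.health.show-details: always):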

{
  "components": {
    "circuitBreakers": {
      "status": "UP",
      "details": {
        "userService": {
          "status": "CIRCUIT_CLOSED",
          "details": {
            "failureRate": "10.0%",
            "bufferedCalls": 10,
            "state": "CLOSED"
          }
        }
      }
    }
  }
}

Pattern Interaction

The patterns compose and reinforce each other:

Caption: Resilience4j layered pattern stack — requests pass through Rate Limiter → Timeout → Bulkhead → Circuit Breaker → Retry before reaching the downstream service. Any layer can deflect to the fallback.
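
The layering in the caption can be illustrated with plain `Supplier` wrappers: the decorator applied last is the outermost layer and is entered first. The class `LayeringSketch` and its tracing layers are hypothetical stand-ins that only record the order of entry. Note that with the Spring annotations, Resilience4j's default aspect order differs from this figure (Retry is outermost and Bulkhead innermost), so the order should be checked against the configuration actually in use.

```java
import java.util.List;
import java.util.function.Supplier;

/** Illustrative decorator stack; each layer only records that the call passed through it. */
final class LayeringSketch {

    /** Wraps {@code inner} in a named layer that logs when a call enters it. */
    static <T> Supplier<T> layer(String name, List<String> trace, Supplier<T> inner) {
        return () -> {
            trace.add(name);
            return inner.get();
        };
    }

    /** Builds the stack from the figure: the last wrapper applied is the outermost one. */
    static <T> Supplier<T> stack(List<String> trace, Supplier<T> downstreamCall) {
        Supplier<T> s = layer("Retry", trace, downstreamCall); // innermost, closest to downstream
        s = layer("CircuitBreaker", trace, s);
        s = layer("Bulkhead", trace, s);
        s = layer("Timeout", trace, s);
        s = layer("RateLimiter", trace, s);                    // outermost, entered first
        return s;
    }
}
```

Calling `stack(trace, call).get()` fills `trace` in the order Rate Limiter, Timeout, Bulkhead, Circuit Breaker, Retry, which matches the request path in the figure.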

Trade-offs & When To Use / Avoid

Pattern	Use when	Avoid when
Circuit Breaker	Calling external/downstream services in microservices	Calling your own in-memory components
Retry	Transient failures (network blips, brief restarts)	Non-idempotent operations without idempotency keys; permanent errors (404, 400)
Bulkhead	You have multiple downstream dependencies sharing a thread pool	Single-dependency services where isolation isn't needed
Timeout	Always: every remote call should have a timeout	Never (no exceptions)
Rate Limiter	Calling third-party APIs with quotas; protecting your service	Latency-sensitive internal calls between well-known services

Common Pitfalls

Retrying non-idempotent operations: retrying a POST /payments three times can create three payment charges. Guard with idempotency keys, or restrict retries to idempotent operations such as GET and PUT.
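
How an idempotency key makes a retried charge safe can be sketched from the receiving side: the first request with a given key executes the charge, and repeats with the same key get the stored result. The class `IdempotentChargeHandler` and the in-memory map are assumptions for illustration. A real service would persist keys durably and expire them.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

/** Illustrative receiver-side deduplication by idempotency key (in-memory only). */
final class IdempotentChargeHandler {
    // key -> result of the first successful charge made with that key
    private final Map<String, String> completed = new ConcurrentHashMap<>();

    /** Executes the charge once per key; later calls with the same key reuse the result. */
    String charge(String idempotencyKey, Supplier<String> doCharge) {
        return completed.computeIfAbsent(idempotencyKey, key -> doCharge.get());
    }
}
```

If `doCharge` throws, nothing is stored for the key, so a retry after a failed attempt executes the charge again, while a retry after a success does not.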

Fallback hiding real errors: silently returning a zero-stock fallback when the inventory service is down means customers can order out-of-stock items. Log and alert on fallback activations; don't treat them as normal flow.

Open circuit staying open too long: wait-duration-in-open-state: 300s means a 5-minute blackout even after a recovered service. Use shorter wait durations with automatic transition to half-open.

Too few observations: sliding-window-size: 2 means two failed calls open the circuit, even if they're normal transient failures. Use at least 10–20 observations.

Not testing fallbacks: fallback paths are rarely exercised in integration tests. Fetch the breaker from the CircuitBreakerRegistry, force it open with transitionToOpenState(), and verify the fallback behavior.

Interview Questions

Beginner

Q: What is a Circuit Breaker pattern? A: A circuit breaker monitors calls to a dependency. When failures exceed a threshold, it opens and subsequent calls fail immediately without hitting the dependency. After a wait period, it moves to half-open and allows a few probe requests. If they succeed, it closes again. This prevents cascading failures when a downstream service is degraded.

Q: Why is Timeout the most fundamental reliability pattern? A: Without a timeout, a thread waiting on a hung downstream service is blocked indefinitely. If enough requests pile up, the thread pool is exhausted and the calling service becomes unresponsive too. Every remote call must have a max wait duration, making Timeout the foundational safety net that all other patterns build on.

Intermediate

Q: What is the difference between a Bulkhead and a Circuit Breaker? A: A Circuit Breaker monitors failure rate and opens when the service is degraded — it prevents calls to a failing dependency. A Bulkhead limits concurrent calls — it prevents a slow dependency from consuming all available threads even if it's technically responding (just slowly). They solve different failure modes and are typically used together.

Q: When should you NOT retry a request? A: Don't retry: non-idempotent operations without idempotency keys (creates duplicates); client errors (4xx, since a bad request won't succeed on retry); when the Circuit Breaker is open (the call is rejected before it reaches the dependency, so retrying only adds latency). Only retry on transient errors (5xx, network timeouts, IOException) for idempotent or idempotency-keyed operations.

Q: How does Resilience4j differ from Hystrix? A: Hystrix used a thread-pool-per-dependency model, which improved isolation but added thread overhead. Resilience4j uses a semaphore model by default (no extra threads), is lighter weight, functional/decorator-based, more configurable, and actively maintained. Hystrix entered maintenance mode in 2018. Resilience4j is the current recommendation for Spring Boot.

Advanced

Q: How would you design a circuit breaker strategy for a payment service that must never double-charge? A: First, make the payment call idempotent using an Idempotency-Key header — the payment provider deduplicates on their side. Then configure the retry to only retry on network-level errors (IOException, TimeoutException), not on application-level errors. Set a conservative circuit breaker threshold (e.g., 30% failure rate over the last 20 calls) and a short open duration (15s) to auto-recover. The fallback should enqueue the payment for manual review, not silently discard it. Monitor fallback activations with an alert — a consistently open circuit means a systemic problem, not transient noise.

Q: How do you test circuit breaker behavior in a Spring Boot integration test? A: Use Resilience4j's CircuitBreakerRegistry programmatically in the test to transition the circuit breaker to the OPEN state before making the call, then verify the fallback method was invoked. For component tests, MockServer or WireMock can simulate a downstream service returning 500 errors — run enough calls to trip the circuit breaker threshold, then assert that subsequent calls fast-fail and the fallback is returned.
