Ever had one of those days? You push a small change, everything looks fine, and then you watch in slow-motion horror as one tiny service hiccup starts a domino effect that brings down the entire system.

Yeah, me too. It’s a terrible feeling.

And more often than not, the culprit is something we build into our systems to be helpful: retries. We tell our services, "If something fails, just try again!" It sounds logical, but without the right safeguards, it’s like pouring gasoline on a fire. You end up with what’s called a cascading failure, where one struggling service takes everyone else down with it.

Today, I want to pull back the curtain on this. We're going to get our hands dirty and build two simple systems to see exactly how this chaos unfolds. We’ll compare the classic, direct approach (RPC) with the more decoupled, asynchronous world of event-driven architecture. This isn't just theory; it's a practical look at why systems break and how we can build them to be more resilient.

Let's Build a Service That's Designed to Fail

Before we can see how things break, we need something to break. Let’s imagine a simple downstream service. Maybe it’s a user profile service, an inventory checker, or a payment processor.

In the real world, services don't just work perfectly or fail completely. They get slow. They get overloaded. So, we're going to simulate that. Our fake service will have:

Base Latency: A normal response time, plus a little random "jitter" to keep things realistic.
Failure Probability: Sometimes, it will just fail for no good reason. It happens.
Overload Behavior: This is the key. When the service gets too many requests at once, it gets slower and starts failing a lot more often.

Think of it like a barista at a coffee shop. When there are only a few customers, they're fast and accurate. But when a busload of tourists shows up (a traffic spike!), they get flustered, orders take forever, and they start getting them wrong. That’s our service.
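
To make that concrete, here is a minimal sketch of such a service in Python. The class name `FlakyService`, every default value, and the linear overload formula are assumptions chosen for illustration; they are not the numbers from the actual experiment.

```python
import random
from dataclasses import dataclass, field
from typing import Tuple


@dataclass
class FlakyService:
    """Toy downstream service; every default below is an illustrative assumption."""
    base_latency: float = 0.05   # seconds for a normal response
    jitter: float = 0.02         # maximum random extra latency, in seconds
    failure_prob: float = 0.02   # chance of a random failure when healthy
    capacity: int = 20           # in-flight requests it handles comfortably
    rng: random.Random = field(default_factory=random.Random)

    def simulate(self, in_flight: int) -> Tuple[float, bool]:
        """Return (latency_seconds, succeeded) for one request at the given load."""
        # Overload is 0 at or below capacity and grows linearly above it.
        overload = max(0, in_flight - self.capacity) / self.capacity
        latency = (self.base_latency + self.rng.uniform(0, self.jitter)) * (1 + 4 * overload)
        fail_p = min(1.0, self.failure_prob + 0.5 * overload)
        return latency, self.rng.random() >= fail_p


service = FlakyService()
print("quiet shop:", service.simulate(in_flight=5))
print("tourist bus:", service.simulate(in_flight=60))
```

The method only computes what would happen rather than actually sleeping, which keeps the sketch fast to run. At 60 in-flight requests against a capacity of 20, this particular formula pushes the failure chance to 100% and multiplies latency by nine: the flustered barista.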

Our Toolkit for Taming the Chaos

Okay, so we know our service is going to struggle under pressure. How do we protect the rest of our system from its meltdowns? We have a few classic tools in our engineering toolbox.

Exponential Backoff: This is the simplest and most important one. Instead of retrying immediately, you wait a bit. If it fails again, you wait longer before the next retry, and then even longer. It’s the polite way of saying, "Hey, you seem busy. I'll check back in a bit." This gives the struggling service time to recover.
Bulkheads: Imagine the hull of a ship. It's divided into watertight compartments (bulkheads) so that if one section floods, it doesn't sink the whole ship. In software, a bulkhead limits how many concurrent requests can be made to a specific service. If the limit is 20 and all 20 slots are busy, the 21st concurrent request doesn't even try; it fails fast, preventing the caller from getting stuck and the downstream service from getting even more overwhelmed.
Circuit Breakers: Just like the circuit breaker in your house, this pattern monitors for failures. If it sees too many errors in a short period, it "trips" and stops sending requests to the failing service altogether for a little while. After that cool-down, it lets requests through again to check whether the service has recovered. This gives the service a real chance to recover and prevents the calling services from wasting time on requests that are doomed to fail anyway.
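
Here is a rough sketch of all three tools in Python. The names (`backoff_delay`, `Bulkhead`, `CircuitBreaker`) and the default numbers are assumptions for illustration. One simplification to flag: this breaker counts consecutive failures rather than errors within a time window, which is easier to read but cruder than what a production library would do.

```python
import threading
import time


def backoff_delay(attempt: int, base: float = 0.1, cap: float = 5.0) -> float:
    """Seconds to wait before retry number `attempt` (0-based): base, 2*base, 4*base, ..."""
    return min(cap, base * (2 ** attempt))


class Bulkhead:
    """Caps concurrent calls; callers over the limit are rejected immediately."""

    def __init__(self, limit: int = 20):
        self._slots = threading.BoundedSemaphore(limit)

    def try_enter(self) -> bool:
        # Non-blocking acquire: False means "fail fast", not "wait in line".
        return self._slots.acquire(blocking=False)

    def leave(self) -> None:
        self._slots.release()


class CircuitBreaker:
    """Opens after `threshold` consecutive failures, then rejects calls for `cooldown` seconds."""

    def __init__(self, threshold: int = 5, cooldown: float = 10.0, clock=time.monotonic):
        self.threshold, self.cooldown, self.clock = threshold, cooldown, clock
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        # Once the cooldown has passed, let traffic through again as a trial.
        return self.clock() - self.opened_at >= self.cooldown

    def record(self, success: bool) -> None:
        if success:
            self.failures, self.opened_at = 0, None
            return
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = self.clock()  # (re)start the cooldown
```

A caller would check `allow()` and `try_enter()` before each request, call `record(...)` with the outcome, and call `leave()` when the request finishes. Real-world backoff usually adds random jitter to the delay as well, so that many callers don't all retry at the same instant.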

With our breakable service and our safety tools ready, let's see how our two different architectural styles handle the pressure.

Approach #1: The RPC Method (The Direct Phone Call)

RPC, or Remote Procedure Call, is the most straightforward way for services to talk. Service A needs something from Service B, so it makes a direct call and waits for the response. It’s like picking up the phone and calling someone. Simple, direct, and you get an immediate answer.

In our experiment, we'll send a burst of traffic to our downstream service using this RPC model. We’ll give it a timeout (we can't wait forever) and allow it to retry up to 3 times with exponential backoff. We’ll also put it behind a bulkhead and a circuit breaker.
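
The retry part of that setup might look something like the sketch below. The function name `call_with_retries`, the `rpc(timeout)` calling convention, and the default timeout and delay values are all assumptions; only the "up to 3 retries with exponential backoff" comes from the experiment description. The bulkhead and circuit breaker are left out to keep it short.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    rpc: Callable[[float], T],
    timeout: float = 0.5,
    max_retries: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `rpc(timeout)`; on failure, retry up to `max_retries` times with doubling waits."""
    for attempt in range(max_retries + 1):
        try:
            return rpc(timeout)
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: surface the last error to the caller
            sleep(base_delay * (2 ** attempt))
    raise AssertionError("unreachable")
```

Notice that the caller is blocked for the whole loop: up to four timeouts plus the waits in between. That blocking is exactly what ties the caller's fate to the downstream service.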

So, what happens when the traffic spike hits?

Initially, things work fine. But as our downstream service gets overloaded, its response times shoot up. Requests start timing out. The retries kick in, but since the service is already drowning, these retries just add more requests to the pile, making the problem worse.
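
A quick back-of-the-envelope calculation shows how much extra load retries can add. This sketch assumes each attempt fails independently with the same probability, which is a simplification (in a real overload, failures feed on each other), and the function name is made up for illustration.

```python
def expected_attempts(failure_prob: float, max_retries: int = 3) -> float:
    """Expected calls per request when each failed attempt is retried, up to max_retries."""
    # Attempt k (0-based) only happens if the previous k attempts all failed.
    return sum(failure_prob ** k for k in range(max_retries + 1))


for p in (0.05, 0.5, 0.9):
    print(f"failure rate {p:.0%}: {expected_attempts(p):.2f} calls per request")
```

With a healthy 5% failure rate, retries add about 5% more traffic. At a 90% failure rate, each incoming request turns into roughly 3.4 calls, so the service gets hit hardest precisely when it can least afford it.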

This is the classic failure cascade. The bulkhead helps by limiting the damage, and the circuit breaker eventually trips, giving the system a breather. But you can see how tightly coupled everything is. The caller's fate is directly tied to the downstream service's health. If the downstream service is having a bad day, so is everyone who calls it.

Approach #2: The Event-Driven Method (The Mailbox)

Now let's try a different approach. Instead of a direct phone call, we'll use an event-driven model. This is more like sending a letter or an email. Service A doesn't call Service B directly. Instead, it publishes an "event" (a message saying what it needs) to a message queue, or an "event bus."

Then, a pool of independent "consumers" picks up messages from that queue and processes them whenever they're ready.

The beauty of this is decoupling. The service publishing the event doesn't have to wait. It just drops the message in the mailbox and moves on. The queue acts as a giant shock absorber. When our traffic burst hits, the queue just fills up. The consumers can then process the messages at a steady pace that our downstream service can actually handle.
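
The shock-absorber effect is easy to see with a tiny tick-by-tick simulation. The function name and the numbers here are made up for illustration; they are not the traffic pattern from the experiment.

```python
from typing import List


def queue_depths(arrivals: List[int], drain_per_tick: int) -> List[int]:
    """Queue depth after each tick, given arrivals per tick and a fixed consumer rate."""
    depth, history = 0, []
    for arrived in arrivals:
        depth += arrived
        depth -= min(depth, drain_per_tick)  # consumers never take more than is queued
        history.append(depth)
    return history


# A burst of 100 events in the first tick, then quiet; consumers handle 25 per tick.
print(queue_depths([100, 0, 0, 0, 0], drain_per_tick=25))  # [75, 50, 25, 0, 0]
```

The downstream service never sees more than 25 events per tick, no matter how spiky the arrivals are. The price is that the last events in the burst wait a few ticks before anyone looks at them.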

Of course, it’s not magic. What happens if a message fails to process? We handle it the same way: retry with exponential backoff. But here's the difference: the retry is handled by the consumer. It can place the event back in the queue to be tried again later. If it fails too many times, we give up and move it to a "Dead-Letter Queue" (DLQ). The DLQ is basically a pile of "return to sender" mail that we can inspect later to figure out what went wrong, without blocking the whole system.
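
Here is one way that consumer loop could look, as a sketch on a simulated clock rather than a real message broker. The function name `consume`, the `handler` callback that returns True on success, and the default limits are all assumptions for illustration.

```python
import heapq
from typing import Callable, List, Tuple


def consume(events: List[str], handler: Callable[[str], bool],
            max_attempts: int = 4, base_delay: float = 1.0) -> Tuple[List[str], List[str]]:
    """Process events on a simulated clock; return (processed, dead_lettered)."""
    # Heap entries are (ready_at, sequence, attempts_so_far, event).
    queue = [(0.0, i, 0, e) for i, e in enumerate(events)]
    heapq.heapify(queue)
    seq = len(events)
    done, dlq = [], []
    while queue:
        ready_at, _, attempts, event = heapq.heappop(queue)
        if handler(event):
            done.append(event)
        elif attempts + 1 >= max_attempts:
            dlq.append(event)  # give up: park it for later inspection
        else:
            # Requeue with exponential backoff instead of blocking the consumer.
            delay = base_delay * (2 ** attempts)
            heapq.heappush(queue, (ready_at + delay, seq, attempts + 1, event))
            seq += 1
    return done, dlq
```

Because a failed event goes back into the queue with a later "ready at" time, the consumer moves straight on to the next event. One stubborn message can't hold up everything behind it, and the publisher never hears about any of this.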

The Showdown: Let's Look at the Results

So we ran the same bursty traffic pattern against both systems. What did we find?

The results are pretty stark.

RPC System: During the traffic burst, its success rate plummeted, and its p99 latency (the response time that 99% of requests come in under, so effectively the experience of your unluckiest 1% of users) went through the roof. The circuit breaker was tripping like crazy, and a ton of requests failed outright because of timeouts and bulkhead limits. It survived, but it was a rough ride for users.
Event-Driven System: This system weathered the storm beautifully. The queue absorbed the initial spike, and while the processing latency for some events was higher (since they had to wait in the queue), every single request was eventually processed successfully. There were retries, but they happened in the background and didn't impact the service that originally sent the request. No cascading failure.
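
If you want to compute these two metrics for your own runs, a minimal sketch could look like this. The helper names are hypothetical, and the percentile uses the simple nearest-rank method; other tools may interpolate and give slightly different numbers.

```python
import math
from typing import List


def success_rate(outcomes: List[bool]) -> float:
    """Fraction of requests that succeeded."""
    return sum(outcomes) / len(outcomes)


def percentile(latencies: List[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(latencies)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up sample data, purely to show the calls.
print(success_rate([True, True, False, True]))    # 0.75
print(percentile([10.0, 20.0, 30.0, 40.0], 50))   # 20.0
```

For p99, call `percentile(latencies, 99)` on the latencies you recorded for each system.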

So, What's the Real Takeaway Here?

It’s tempting to look at this and say, "Event-driven is better! Let's use it for everything!" But it's not that simple. Nothing in engineering ever is.

The RPC approach gave us much lower latency when the system was healthy. It's direct and simple to reason about. For requests where you need an immediate answer (like "Is this user logged in?"), it's often the right choice.

The event-driven approach gave us far better resilience under the burst, but it came at the cost of complexity. You have to manage queues, worry about message ordering, and monitor your DLQ. It introduces eventual consistency, meaning the work gets done, but not necessarily right away.

The real lesson isn't about choosing one architecture over the other. It's about understanding the trade-offs and recognizing that resilience is a feature you have to build intentionally.

Whether you're making a direct RPC call or publishing an event, you need to think about failure. You need smart retries with exponential backoff. You need bulkheads to contain floods. You need circuit breakers to give services a break. Building robust distributed systems isn't about picking a single "best" architecture; it's about combining the right communication model with these disciplined, failure-aware patterns.

Why Your Retries Are Crashing Your System: A Deep Dive into RPC vs. Event-Driven Failures