Have you used retries in your software? Did you consider the pros and cons before choosing them? In this article, we will take a deep dive into exponential retries specifically (the most preferred kind of retry). Do you think retries will only ever save you from transient issues? This article will uncover something that might erupt one day! Let's dive in!
Note:
This article does not claim that exponential retries are bad in general. When using retry patterns, use them with extreme caution.
Retries
"When we fail in attempting X, performing retries often results in success"
Why do we need retries?
- Services often have downtime, which can range from intermittent failures to a complete outage.
- Even highly reliable services typically advertise availability of ~99.9%.
- With distributed architectures, failures or latency in remote interactions are inevitable!
Use-case scenarios
Let's start with a simple use-case scenario where Customer A sends asynchronous requests to a middleware service that delegates them to a SaaS provider. The SaaS provider accepts only 100 requests per second and rejects anything beyond that limit with 429 Too Many Requests.
Customer A sends 10 requests per second, and all of them are executed successfully by the SaaS provider.

Now, Customer A sends 200 requests per second. The SaaS provider accepts the first 100 and responds with 429 Too Many Requests for the remaining 100. The middleware service retries those 100 requests using an exponential backoff schedule (1s, 2s, 4s, 8s).
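The backoff schedule above follows the usual doubling rule. A minimal sketch of how such a schedule is computed (function and parameter names are illustrative, not from any specific library):

```python
def exponential_backoff_delays(max_attempts, base_delay=1.0):
    """Delay (in seconds) before each retry attempt: base * 2^attempt,
    which yields the 1s, 2s, 4s, 8s schedule used in the scenario above."""
    return [base_delay * (2 ** attempt) for attempt in range(max_attempts)]

# Four retries starting at 1 second.
print(exponential_backoff_delays(4))  # [1.0, 2.0, 4.0, 8.0]
```

Note that every rejected request follows this exact same schedule, which is the seed of the problem explored below.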

Now, Customer A sends 1,000 requests per second. The SaaS provider accepts the first 100 and responds with 429 Too Many Requests for the remaining 900. The middleware service retries those 900 requests exponentially. However, after the 4th attempt, the SaaS provider decides to block your account! Why?

SaaS providers usually have a fair usage policy: if a client repeatedly exceeds the request limit over X minutes/hours, the account gets blocked. Below is an example from SharePoint Online:

Introducing Retry Simulator .NET
Recently, I wanted to dive into different types of retries and understand both their potential and the dangers of using them. Searching around the internet, I could not find a simulator to explore and understand them, so I decided to create one using Polly .NET, a NuGet package that wraps calls to downstream services with retry policies. I've been using Polly .NET for about 4 years now and it has never disappointed me: its behavior is perfectly described in the developer documentation, and the code reads like syntactic sugar.
Retry simulator benefits
- Provides a quicker way to try different types of retries.
- Fires concurrent requests at a server to simulate real-time retries. (Note: use with extreme caution.)
- Ships with a local server, so there is no need to build one to simulate retries and understand how they work.
- Simulates retries against a public server directly, using a simple JSON configuration file.
- Provides tokenization (e.g. a correlation-id) to correlate requests between client and server.
- Writes results to the project's /Result directory in CSV format so the data can be visualized.
Evidence
Let's take a closer look at the graph below. For this sample, I executed 5 requests against a server that accepts only 1 request per second. The Y-axis represents the total duration each series of retries took; the X-axis represents the retry series. Each block on the X-axis represents a request number.
Exponential retries do not automatically know how to balance your requests and fire them in a way that streamlines the load on the downstream server. Additionally, time is wasted while the application sits idle, doing nothing but waiting to resume. The volcano is just about to erupt when you have that many requests idling before firing at the downstream server.
Solution
1) Exponential retry with Jitter
The simplest approach is to jitter the requests. Jitter adds a small random delay to each request, which keeps requests from colliding with each other. The downside is that it still does not solve the long idle time per request. If performance is not a concern, this can be an acceptable mitigation.
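A sketch of this idea: the exponential delay stays, but a uniform random offset is added on top (the function name and the 0.5s jitter ceiling are illustrative assumptions, not from a specific library):

```python
import random

def backoff_with_jitter(attempt, base_delay=1.0, max_jitter=0.5):
    """Exponential backoff delay plus a uniform random jitter, so that
    concurrent clients retrying after the same failure do not all fire
    at the exact same instant."""
    return base_delay * (2 ** attempt) + random.uniform(0, max_jitter)

# Two clients retrying attempt 2 get slightly different delays,
# each somewhere between 4.0s and 4.5s.
print(backoff_with_jitter(2), backoff_with_jitter(2))
```

Notice the delays still grow exponentially, so the long-idle problem remains; jitter only spreads the collisions.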
2) AWS Decorrelated Jitter
Amazon has briefly described this issue here. In short, the algorithm randomizes the delay between a minimum and a maximum, down to fractions of a second. This prevents correlated requests and avoids long idle times for requests that could complete in milliseconds. I would definitely recommend this approach when retrying requests where concurrent clients are expected.
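The decorrelated-jitter formula from the AWS article draws each delay between the base and three times the previous delay, capped at a maximum. A minimal sketch (parameter values are illustrative):

```python
import random

def decorrelated_jitter(base=0.001, cap=0.5, attempts=5):
    """AWS-style decorrelated jitter: each delay is drawn uniformly
    between `base` and 3x the previous delay, capped at `cap` seconds.
    Delays no longer double deterministically, so retries decorrelate."""
    sleep = base
    delays = []
    for _ in range(attempts):
        sleep = min(cap, random.uniform(base, sleep * 3))
        delays.append(sleep)
    return delays

print(decorrelated_jitter())
```

Because the next delay depends on the previous random draw rather than on the attempt number, two clients that failed at the same moment quickly drift apart.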
3) Polly community Decorrelated Jitter(NEW)
If you support a service where high concurrency is expected, there seems to be an issue with AWS decorrelated jitter: in a high-concurrency environment, the algorithm can still correlate. In the sample below, I fired 1,000 requests with a minimum delay of 1 millisecond and a maximum delay of 500 milliseconds, to see whether AWS decorrelated jitter can actually clash in practice. In short: yes! In a high-concurrency environment it can correlate; 2 requests indicate a collision.
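The intuition can be reproduced with a small simulation (an assumption-laden sketch, not the simulator's actual code): with a 1 ms base, every client's *first* decorrelated-jitter delay falls in the narrow 1-3 ms window, so many of the 1,000 clients land in the same millisecond bucket.

```python
import random
from collections import Counter

def first_attempt_delays(clients=1000, base=0.001):
    """First-retry delay per AWS decorrelated jitter for independent
    clients: uniform(base, 3 * base). With a 1 ms base, every client's
    first delay lands somewhere in the 1-3 ms window."""
    return [random.uniform(base, base * 3) for _ in range(clients)]

delays = first_attempt_delays()
# Bucket delays into whole milliseconds and count requests that share
# a bucket, i.e. retries firing near-simultaneously.
buckets = Counter(round(d * 1000) for d in delays)
collisions = sum(n for n in buckets.values() if n > 1)
print(f"{collisions} of {len(delays)} requests share a millisecond bucket")
```

The exact collision count varies per run, but with 1,000 clients squeezed into a few milliseconds, collisions are essentially guaranteed, which matches the clash observed in the sampling above. This is the motivation for the newer Polly community algorithm.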
In closing
- Use the right retry strategy for your use-case scenario.
- Exponential retries (without jitter) are helpful when no collisions are expected.
- With concurrent clients, collisions can happen; AWS decorrelated jitter helps resolve them and lets requests succeed faster.
- Exponential retries can destabilize your service (a self-DDoS) and cause memory usage to spike.