Request Hedging

Tail Latencies

Latencies are generally plotted and monitored at percentiles such as p90, p95 and p99. But as you move towards the higher end of the spectrum, the tail latencies keep growing. Google’s The Tail at Scale paper talks about this in detail and describes various techniques to counter it; one of them is Request Hedging.

The Notifications Service

The notifications service at Razorpay is a high-throughput service that receives requests to send out various types of notifications, such as webhooks, SMS and emails, at a peak rate of approximately 2,000 requests/sec.

The Problem Statement

The clients of the notifications service had a strict timeout of 350ms on any API call, but we would often notice client timeouts, which was not desirable. On further debugging, the tail latencies in pushing messages to SQS turned out to be the culprit: the p99.9 latencies would sometimes go up to 600ms!

Request Hedging

Before talking about the solution we employed, let’s look at how the Google paper defines hedged requests: the client first sends the request to the replica believed to be the most appropriate, and if a response has not arrived after some brief delay, it sends a secondary request to another replica, using whichever result comes back first.
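
As a minimal illustration of that read-side form (and not the flow we ended up using), a hedged call fires the request once, waits a short hedge delay, and fires a second identical attempt only if no response has arrived yet, returning whichever attempt finishes first. The hedgedCall function, its call parameter and the delay below are all hypothetical:

import (
    "context"
    "time"
)

// hedgedCall issues call once and, if no response arrives within
// hedgeDelay, issues a second identical attempt; whichever attempt
// finishes first wins. call must be safe to run more than once.
func hedgedCall(ctx context.Context, call func(context.Context) (string, error), hedgeDelay time.Duration) (string, error) {
    type result struct {
        val string
        err error
    }
    // buffered so the slower attempt never blocks when it finishes late
    results := make(chan result, 2)

    attempt := func() {
        val, err := call(ctx)
        results <- result{val, err}
    }

    // primary attempt
    go attempt()

    select {
    case r := <-results:
        return r.val, r.err
    case <-time.After(hedgeDelay):
        // no response yet: hedge with a second attempt
        go attempt()
    }

    // return whichever of the two attempts responds first
    r := <-results
    return r.val, r.err
}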

The Solution

While the Google paper talks about hedged requests primarily in the context of read requests, we applied the idea to the write flow, piggybacking on the database and cron job setup that was already in place: if the SQS push does not succeed within the defined timeout period, the request is written to the database and picked up later by the cron job. One drawback of this approach is that it can lead to duplicate deliveries, but that was acceptable since we promise at-least-once delivery semantics anyway.
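
As a rough sketch (not the exact production code), the write path then looks like this: attempt the SQS push under a soft timeout and, if it does not succeed in time, persist the message for the existing cron job to deliver. The enqueueWithSoftTimeout helper and the Queue type are shown in the implementation section below; the Store interface, its SaveForCronRetry method and the timeout value here are hypothetical stand-ins:

// Store is a hypothetical stand-in for the database that the existing
// cron job already polls for undelivered notifications.
type Store interface {
    SaveForCronRetry(msg string) error
}

// dispatch sketches the hedged write: try the fast path (SQS) under a
// soft timeout, and hedge to the slow-but-reliable path (database +
// cron) when the push does not complete in time.
func (bq BaseQueue) dispatch(msg string, q Queue, db Store) error {
    const softTimeoutMs = 100 // illustrative, kept well under the 350ms budget

    if _, err := bq.enqueueWithSoftTimeout(msg, softTimeoutMs, q); err != nil {
        // the SQS push was too slow or failed; write to the database so
        // the cron job delivers it later (this is where duplicates can arise)
        return db.SaveForCronRetry(msg)
    }
    return nil
}

If the original push does eventually succeed after the soft timeout has fired, the same message may also go out via the cron path, which is exactly the duplicate-delivery trade-off mentioned above.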

The Implementation

The implementation involved calling Enqueue in a separate goroutine with a strict soft timeout, something similar to this:

import (
    "fmt"
    "time"
)

// Queue is the underlying queue client; in our case it wraps SQS.
// (The interface and receiver type are sketched here for completeness.)
type Queue interface {
    Enqueue(msg string) (string, error)
}

type BaseQueue struct{}

func (bq BaseQueue) enqueueWithSoftTimeout(msg string, timeoutInMs int, q Queue) (string, error) {

    // the channel holding the result; buffered so the goroutine can
    // always deliver its result, even if we have already timed out
    c := make(chan struct {
        id  string
        err error
    }, 1)

    // an async goroutine to enqueue the message
    go func() {
        id, err := q.Enqueue(msg)
        c <- struct {
            id  string
            err error
        }{id: id, err: err}
    }()

    timeout := time.Duration(timeoutInMs) * time.Millisecond

    // wait till the timeout for the result to appear in the result channel,
    // else return an error so the caller can fall back to the database
    select {
    case result := <-c:
        return result.id, result.err
    case <-time.After(timeout):
        return "", fmt.Errorf("enqueue timed out after %d ms", timeoutInMs)
    }
}

Why not use http transport timeouts?

This is an interesting question: an alternate approach could have been to rely on the HTTP transport timeouts, such as the dialer timeout, the TLS handshake timeout and the ResponseHeaderTimeout. We decided against it for a few reasons:

  1. Completing the initial connection setup (fetching IAM credentials, DNS resolution, the TLS handshake and establishing the connection), even before the payload can be sent and acknowledged, within 350ms would have been a close call.
  2. Strict transport timeouts could even lead to connections never getting established in case of a minor degradation in any of the aforementioned phases.
  3. Keeping relaxed transport timeouts, combined with a strict timeout at the application level, mitigates this issue (see the sketch after this list).
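
For completeness, relaxed transport timeouts might look roughly like the sketch below; the values are illustrative, not our production settings. The resulting client can then be handed to the SQS client (for example via the AWS SDK config’s HTTPClient field), while the strict 350ms budget is enforced by the application-level soft timeout shown earlier.

import (
    "net"
    "net/http"
    "time"
)

// newRelaxedHTTPClient builds an HTTP client whose transport timeouts are
// deliberately looser than the 350ms request budget, so a briefly slow
// DNS lookup or TLS handshake does not kill the connection outright.
func newRelaxedHTTPClient() *http.Client {
    return &http.Client{
        Transport: &http.Transport{
            DialContext: (&net.Dialer{
                Timeout: 2 * time.Second, // connection establishment
            }).DialContext,
            TLSHandshakeTimeout:   2 * time.Second,
            ResponseHeaderTimeout: 2 * time.Second,
        },
    }
}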
