Canary Deployment for Queue Workers

Canaries were once regularly used in coal mining as an early warning system. Toxic gases such as carbon monoxide, methane or carbon dioxide in the mine would kill the bird before affecting the miners. Signs of distress from the bird indicated to the miners that conditions were unsafe.

A well-understood and widely practiced aspect of the SDLC is the deployment of your code or service. The world of microservices mostly follows rolling deployments, where the new version of the code is gradually rolled out to all instances of a service without requiring downtime.

Canary deployment is a pattern for rolling out releases to a subset of users or servers. The idea is to first deploy the change to a small subset of servers, test it, and then roll it out to the rest. The canary deployment serves as an early warning indicator with reduced impact: if the canary deployment fails, the rest of the servers aren't affected.

The basic steps of a canary deployment are:

  1. Deploy to one or more canary servers.
  2. Test, or wait until satisfied.
  3. Deploy to the remaining servers.
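These steps can be sketched as a small orchestration loop. The `deploy` and `analyze` helpers below are hypothetical stand-ins for whatever your deployment tooling actually provides:

```python
def canary_rollout(servers, new_version, deploy, analyze, canary_count=1):
    """Roll out new_version: canaries first, then the rest if analysis passes."""
    canaries, rest = servers[:canary_count], servers[canary_count:]
    for server in canaries:       # step 1: deploy to the canary servers
        deploy(server, new_version)
    if not analyze(canaries):     # step 2: test, or wait until satisfied
        return False              # a failure affects only the canaries
    for server in rest:           # step 3: deploy to the remaining servers
        deploy(server, new_version)
    return True
```

If the analysis fails, only the canary servers carry the new version, so the blast radius of a revert stays small.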

The test phase of the canary deployment can work in many ways. You could run some automated tests, perform manual testing yourself, or even leave the server live and wait to see if problems are encountered by end-users. In fact, all three of these approaches might be used.

The monitoring of a canary deployment can also be automated by comparing deviations in key metrics of the canary server with those of a baseline server.
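A toy version of such a comparison, assuming error-rate samples collected from the canary and baseline servers (the metric and threshold are illustrative; real tools like Kayenta use proper statistical tests rather than a simple mean ratio):

```python
def canary_passes(canary_samples, baseline_samples, max_ratio=1.5):
    """Compare a canary metric (e.g. error rate) against the baseline.

    Fails the canary if its mean exceeds the baseline mean by more
    than max_ratio. Purely illustrative; Kayenta applies statistical
    tests to judge whether the deviation is significant.
    """
    canary_mean = sum(canary_samples) / len(canary_samples)
    baseline_mean = sum(baseline_samples) / len(baseline_samples)
    if baseline_mean == 0:
        return canary_mean == 0
    return canary_mean / baseline_mean <= max_ratio

# Error rates sampled per minute from each node:
print(canary_passes([0.02, 0.01, 0.02], [0.02, 0.02, 0.01]))  # comparable -> True
print(canary_passes([0.20, 0.25, 0.30], [0.02, 0.02, 0.01]))  # regression -> False
```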

At Razorpay, we use Spinnaker for deployments and Kayenta for canary analysis.

The canary deployment model with nodes or pods serving web or HTTP/gRPC traffic is quite straightforward and involves the following:

  1. A predefined, fixed number of canary web nodes: these nodes get the new version of the code first.
  2. A predefined, fixed number of baseline web nodes: these nodes run the older version of the code and are used for comparing metrics against the canary nodes.
  3. Regular nodes: these serve the majority of the traffic for the service and can be scaled up or down.

The canary deployment flow is illustrated in the diagram below.

PS: Database schema migrations cannot be canary tested, as they are atomic: any change to the schema applies to all types of nodes alike, canary or non-canary.

Canary deployment for an application that uses async processing through queues and queue workers becomes a little more complex. Before we go into canary deployment for queue workers, let's discuss a few points.

Standard Deployment

How should a standard deployment involving web and worker nodes happen?

Should the web nodes be deployed first or the worker nodes?

There are actually just two options here:

  1. Deploy the web nodes first, and then the worker nodes.
  2. Deploy the worker nodes first, and then the web nodes.

If the web nodes are deployed first, messages produced by the new version of the code might be consumed by the old version still running on the workers. This imposes an additional constraint: the worker code must always be forward compatible, which, in my opinion, is hard to achieve.

On the other hand, deploying all the worker nodes before the web nodes makes this easier: the worker code only has to be backward compatible, which is easier to achieve than forward compatibility.
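As a concrete illustration of backward compatibility, a worker can tolerate messages produced by both old and new web nodes. The message schema here is hypothetical:

```python
import json

def handle_payment_message(raw):
    """Consume a payment message, tolerating the older schema.

    Hypothetically, old producers send {"amount": 100} while newer
    producers send {"amount": 100, "currency": "USD"}. The worker
    defaults the missing field instead of failing, which keeps it
    backward compatible with residual messages in the queue.
    """
    msg = json.loads(raw)
    amount = msg["amount"]
    currency = msg.get("currency", "INR")  # default for old-schema messages
    return amount, currency

print(handle_payment_message('{"amount": 100}'))                     # old schema
print(handle_payment_message('{"amount": 100, "currency": "USD"}'))  # new schema
```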

Canary Deployment with a Dedicated Canary Queue

Let's now talk about canary deployment for queue workers. One of the available approaches uses a dedicated canary queue. In this approach, the deployment happens as follows:

  1. Deploy the canary worker node.
  2. Compare the canary metrics of the canary worker node with those of the baseline worker node.
  3. Proceed if the canary analysis passes; otherwise revert the canary worker node.
  4. Deploy the canary web node.
  5. Compare the canary metrics of the web and worker canary nodes with those of the baseline nodes.
  6. Proceed if all is fine; otherwise revert the canary deployment.
  7. Deploy the rest of the worker nodes.
  8. Deploy the rest of the web nodes.

Pros

  1. A dedicated canary setup that deterministically tests the new code end to end.
  2. It also tests backward compatibility through the two-phase canary analysis.

Cons

  1. Additional infrastructure in the form of dedicated canary queues.
  2. Additional logic on the web nodes to selectively push messages to the canary queue. This becomes more complex when you have multiple queues for various use cases.
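The routing logic that second con refers to might look like this sketch, where a small fraction of messages is diverted to a dedicated canary queue (the queue naming scheme and traffic fraction are illustrative):

```python
import random

CANARY_FRACTION = 0.05  # illustrative: 5% of messages go to the canary queue

def pick_queue(base_queue, canary_enabled, rng=random.random):
    """Route a message to the canary queue for a small fraction of traffic."""
    if canary_enabled and rng() < CANARY_FRACTION:
        return f"{base_queue}-canary"
    return base_queue

# With multiple queues, this mapping must exist for every use case,
# which is where the extra complexity comes from.
print(pick_queue("payments", canary_enabled=True, rng=lambda: 0.01))  # payments-canary
print(pick_queue("payments", canary_enabled=True, rng=lambda: 0.50))  # payments
```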

Canary Deployment with a Common or Shared Queue

Another approach to the canary deployment of worker nodes is to route all messages through a common queue. In this approach, the deployment happens as follows:

  1. Deploy the canary worker node.
  2. Compare the canary metrics of the canary worker node with those of the baseline worker node.
  3. Proceed if the canary analysis passes; otherwise revert the canary worker node.
  4. Deploy the new version of the code to all the worker nodes.
  5. Deploy the canary web node.
  6. Compare the canary metrics of the canary web node with those of the baseline web node.
  7. Deploy the new version of the code to all web nodes if the canary analysis passes; otherwise revert the canary web node.

Pros

  1. Tests backward compatibility of the worker code when consuming older messages.
  2. Simpler setup with no additional, dedicated canary queues.

Cons

  1. Does not deterministically test the happy flow, i.e. newer messages being consumed by the new version of the worker. The canary analysis in phase 2 does exercise this scenario, but its effectiveness depends on the volume of new messages reaching the canary and baseline workers. That volume tends to be low: the canary web nodes receive only a small percentage of traffic, and even that traffic is distributed amongst all the available worker nodes.

You can choose either of the above canary deployment strategies based on your requirements. The dedicated queue approach tests both the happy flow and backward compatibility, while the common queue approach only tests backward compatibility in consuming messages. Testing backward compatibility matters because the deployment happens in a rolling fashion, and a queue can always hold residual messages produced by the older code. The shared queue approach is preferable when you want simplicity in your code and infrastructure and are confident that the happy flow is well covered by your functional and integration tests. The dedicated queue approach tests all the scenarios but comes with its own complexity.

Happy Deployments!

Read more at https://www.varlog.co.in/index.html
