Investigating errors coming from our queuing system
Incident Report for Transifex
Postmortem

Due to high memory usage in the version of RabbitMQ we were running, we upgraded our cluster to the latest version. Memory usage patterns stabilised, but the upgrade did not come without problems.

One of the queues we were placing tasks on was not being consumed at all. Trying different versions of the client libraries and workers did not help; no matter what, the tasks stayed in the queue, which kept growing without any work being picked up.
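To illustrate the kind of check involved, the sketch below uses the Python pika client to passively declare a queue and report its backlog and consumer count. This is a minimal example, not our actual tooling; the host and the queue name 'task_queue' are placeholders.

    # A minimal diagnostic sketch, assuming the Python "pika" client.
    # The host and queue name below are placeholders, not real values.
    import pika

    connection = pika.BlockingConnection(
        pika.ConnectionParameters(host='rabbitmq.internal'))
    channel = connection.channel()

    # passive=True only inspects the queue; it does not create or modify it.
    result = channel.queue_declare(queue='task_queue', passive=True)
    print('messages waiting:', result.method.message_count)
    print('consumers attached:', result.method.consumer_count)

    connection.close()

A queue that shows a steadily growing message count while reporting zero attached consumers matches the behaviour we observed.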

We decided to roll back to an older version of RabbitMQ. The rollback was not without problems itself, but thanks to our internal documentation, recovery was quick. During this downgrade, the following systems were affected:

  • The website was unstable for a few minutes.
  • Some notifications had not been sent since the RabbitMQ upgrade last week. This was the reason for downgrading.
  • Some Tx Live publish operations did not run.

We are still investigating why that particular queue kept growing without being consumed, and we will share any further findings with you in this space.

Thank you for your support of Transifex!

Posted Jul 21, 2014 - 14:51 UTC

Resolved
The cluster is behaving normally. Closing the incident.
Posted Jul 21, 2014 - 14:40 UTC
Monitoring
RabbitMQ is now behaving normally. Monitoring.
Posted Jul 21, 2014 - 14:32 UTC
Investigating
We are currently investigating this issue.
Posted Jul 21, 2014 - 14:05 UTC