Investigating errors coming from our queuing system
Incident Report for Transifex
Postmortem

Due to high memory usage in the version of RabbitMQ we were running, we upgraded our cluster to the latest version. Memory usage patterns stabilised, but the upgrade did not come without problems.

One of the queues we were placing tasks on was not being consumed at all. Trying different versions of the client libraries and workers did not help; no matter what, the tasks stayed in the queue, which kept growing without any work being picked up.
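To illustrate the kind of check involved, the sketch below uses the Python pika client to passively declare a queue and report its backlog and consumer count. This is a minimal example, not our actual tooling; the host and the queue name 'task_queue' are placeholders.

    # A minimal diagnostic sketch, assuming the Python "pika" client.
    # The host and queue name below are placeholders, not real values.
    import pika

    connection = pika.BlockingConnection(
        pika.ConnectionParameters(host='rabbitmq.internal'))
    channel = connection.channel()

    # passive=True only inspects the queue; it does not create or modify it.
    result = channel.queue_declare(queue='task_queue', passive=True)
    print('messages waiting:', result.method.message_count)
    print('consumers attached:', result.method.consumer_count)

    connection.close()

A queue that shows a steadily growing message count while reporting zero attached consumers matches the behaviour we observed.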

We decided to roll back to an older version of RabbitMQ. The rollback was not without problems itself, but thanks to our internal documentation, recovery was quick. During this downgrade, the following systems were affected:

  • The website was unstable for a few minutes.
  • Some notifications had not been sent since the RabbitMQ upgrade last week. This was the reason for downgrading.
  • Some Tx Live publish operations did not run.

We are still investigating why that particular queue kept growing without being consumed, and we will share any further findings with you in this space.

Thank you for your support of Transifex!

Posted Jul 21, 2014 - 14:51 UTC

Resolved
The cluster is behaving normally. Closing the incident.
Posted Jul 21, 2014 - 14:40 UTC
Monitoring
RabbitMQ is now behaving normally. Monitoring.
Posted Jul 21, 2014 - 14:32 UTC
Investigating
We are currently investigating this issue.
Posted Jul 21, 2014 - 14:05 UTC