We are having issues with the internal network of some of our servers

Incident Report for Transifex

Postmortem

When rules increase the load

As part of improving our Translation History mechanism, we are redesigning the whole component, because the current architecture is reaching its limits as Transifex grows.

Before we make the final switch to the new system for our users, we need to run both systems in parallel and observe how the new one behaves under real load. The new system makes use of rules, as implemented in Postgres. After it had been tested for quite some time in our staging environment, it was time to move it to production and enable it for a small group of internal users.

Unfortunately, after we enabled a rule on updates, the database load increased and brought the system to its knees. The servers that handle user requests added to the load, since they kept hammering the system with queries while it was already unable to sustain the disproportionately high load that the rule generated. The rule alone generated more traffic than ten times the current number of Transifex users could.

Restarting the database and dropping the rule in question returned operations to normal.

Troubleshooting took a bit longer than expected because the previous outages we had experienced were network and cloud related. We had to get on the wire (ping, curl and tcpdump) and make sure that this time the network was not part of the problem.

We're working towards making the rule system work properly by implementing a replay mechanism that will allow us to test it on our staging environment using loads that closely match the real use of our system.
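
To illustrate the idea of a replay mechanism, here is a minimal Python sketch. It is not our implementation: the `RecordedQuery` type, the `schedule` and `replay` functions, and the `speedup` parameter are hypothetical names chosen for this example. It assumes that captured statements are available together with their time offsets, and that the caller supplies an `execute` callable that sends a statement to the staging database.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class RecordedQuery:
    """One captured statement (hypothetical record format)."""
    offset_s: float   # seconds since the start of the capture
    statement: str    # SQL text as it was sent in production


def schedule(queries: Iterable[RecordedQuery],
             speedup: float = 1.0) -> List[Tuple[float, str]]:
    """Turn a capture into (delay_before_send, statement) pairs.

    Relative timing between statements is preserved; speedup > 1
    compresses the gaps to generate a heavier load than was recorded.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    ordered = sorted(queries, key=lambda q: q.offset_s)
    plan: List[Tuple[float, str]] = []
    previous = ordered[0].offset_s if ordered else 0.0
    for query in ordered:
        plan.append(((query.offset_s - previous) / speedup, query.statement))
        previous = query.offset_s
    return plan


def replay(queries: Iterable[RecordedQuery],
           execute: Callable[[str], object],
           speedup: float = 1.0,
           sleep: Callable[[float], object] = time.sleep) -> int:
    """Send each statement through execute(), waiting between sends.

    Returns the number of statements sent. The sleep function is
    injectable so the pacing can be checked without real waiting.
    """
    sent = 0
    for delay, statement in schedule(queries, speedup):
        if delay > 0:
            sleep(delay)
        execute(statement)
        sent += 1
    return sent
```

Passing a `speedup` greater than 1 would let a test push staging harder than the recorded traffic did, which is one way to look for the kind of amplification described above before a change reaches production.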

Posted Oct 17, 2014 - 09:53 UTC

Resolved

It turns out the slowness was caused by slow replies from one database server. We took the appropriate actions and things are back to normal now.

Posted Oct 15, 2014 - 15:25 UTC

Investigating

Investigating the source of the issue.

Posted Oct 15, 2014 - 14:45 UTC