At 1:44 PM UTC we marked the queuing system issue as resolved after monitoring it for an hour. Right after flipping the switch to "All Systems Operational", we noticed a surge of new users flowing into Transifex and visiting pages, raising the load on our servers again.
This time it wasn't the worker servers that were struggling but the application servers themselves. The problem shifted from pages with heavy background operations to pages with complex database queries, which slowed down pages accessed by thousands of users.
We tracked down the slow pages through New Relic and distributed the optimization work across the engineering team.
One of the optimizations involved a method in the Web Editor that rendered a drop-down listing your team members. While this worked well for teams with dozens to hundreds of members, it broke down for teams with thousands of members, especially when the page itself was being accessed by thousands of users at once. Once the optimizations were deployed, the load dropped back to normal levels.
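As a rough illustration of the kind of fix involved (the names below, such as `team_member_choices` and the `team.members` relation, are simplified stand-ins rather than our actual code), the idea is to stop loading every member object just to render a widget: query only the fields the drop-down needs, cap the result, and defer the full list to an asynchronous lookup for very large teams.

```python
# Simplified sketch, not production code: the team.members relation and
# the "username" field are assumptions for illustration.

def team_member_choices(team, limit=100):
    """Return (id, username) pairs for the drop-down, capped at `limit`."""
    return list(
        team.members.order_by("username")
        .values_list("id", "username")[:limit]  # fetch only what the widget renders
    )
```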
Continuing from where we left off with the queuing system issue:
We are investing more time in optimizing pages that are accessed by many people at the same time. This includes caching more elements of each page; a good example is the collaborator count on the dashboard, which can be cached more intelligently (a rough sketch follows below).
We're funneling New Relic reports to a larger part of our team, so more eyes are looking for data that is suspiciously expensive to fetch or render on a page.
We're focusing on making our pages faster, with time-specific goals for the upcoming months.
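As a sketch of what smarter caching of the dashboard's collaborator count can look like (assuming a Django-style stack; the key format, timeout, and `project.collaborators` relation are simplified assumptions, not our production code), the expensive count query runs only on a cache miss and the cached value is reused for a few minutes:

```python
# Simplified sketch using Django's low-level cache API; names and timeout
# are illustrative assumptions.
from django.core.cache import cache

COLLABORATOR_COUNT_TTL = 300  # seconds; a few minutes of staleness is acceptable here

def collaborator_count(project):
    """Return the dashboard's collaborator count, recomputing at most once per TTL."""
    key = f"project:{project.pk}:collaborator_count"
    count = cache.get(key)
    if count is None:
        count = project.collaborators.count()  # the expensive query, run only on a cache miss
        cache.set(key, count, COLLABORATOR_COUNT_TTL)
    return count
```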
Once again, we would like to apologize for the impact this outage had on you and your team's operations. Our team is dedicated to improving the user experience and the quality of our operations. Thanks for your patience and continued support of Transifex!