At 1:44 PM UTC we marked the queuing system issue as resolved after monitoring it for an hour. Right after flipping the switch to "All Systems Operational", we noticed a surge of new users flowing into Transifex and visiting pages, raising the load on our servers again.
This time it wasn't the worker servers that were struggling but the application servers themselves. The problem shifted from pages with heavy background operations to pages with complex database queries, which slowed down pages accessed by thousands of users.
We tracked down the slow pages through New Relic and distributed the optimization work across the engineering team.
One of the optimizations involved a method in the Web Editor that rendered a drop-down listing your team members. While this worked well for teams with dozens to hundreds of members, it broke down for teams with thousands of members, especially when the page itself was being accessed by thousands of users at once. Once the optimizations were deployed, the load dropped back to normal levels.
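As a rough illustration of the kind of fix involved (the names below, such as `team_member_choices` and the `team.members` relation, are simplified stand-ins rather than our actual code), the idea is to stop loading every member object just to render a widget: query only the fields the drop-down needs, cap the result, and defer the full list to an asynchronous lookup for very large teams.

```python
# Simplified sketch, not production code: the team.members relation and
# the "username" field are assumptions for illustration.

def team_member_choices(team, limit=100):
    """Return (id, username) pairs for the drop-down, capped at `limit`."""
    return list(
        team.members.order_by("username")
        .values_list("id", "username")[:limit]  # fetch only what the widget renders
    )
```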
Continuing from where we left off with the queuing system issue:
We are investing more time in optimizing pages that are accessed by many people at the same time. This includes caching more elements of each page; a good example is the collaborator count on the dashboard, which can be cached more intelligently (a rough sketch follows below).
We're funneling New Relic reports to a larger part of our team, so more eyes are looking for data that is suspiciously expensive to fetch or render on a page.
We're focusing on making our pages faster, with time-specific goals for the upcoming months.
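As a sketch of what smarter caching of the dashboard's collaborator count can look like (assuming a Django-style stack; the key format, timeout, and `project.collaborators` relation are simplified assumptions, not our production code), the expensive count query runs only on a cache miss and the cached value is reused for a few minutes:

```python
# Simplified sketch using Django's low-level cache API; names and timeout
# are illustrative assumptions.
from django.core.cache import cache

COLLABORATOR_COUNT_TTL = 300  # seconds; a few minutes of staleness is acceptable here

def collaborator_count(project):
    """Return the dashboard's collaborator count, recomputing at most once per TTL."""
    key = f"project:{project.pk}:collaborator_count"
    count = cache.get(key)
    if count is None:
        count = project.collaborators.count()  # the expensive query, run only on a cache miss
        cache.set(key, count, COLLABORATOR_COUNT_TTL)
    return count
```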
Once again, we would like to apologize for the impact this outage had on you and your team's operations. Our team is dedicated to improving the user experience and the quality of our operations. Thanks for your patience and continued support of Transifex!