We wanted to share what happened with some downtime we experienced over the weekend, beginning the evening of October 31 and continuing into the morning of November 1. We're really sorry to those of you who were affected by this outage.
What happened?
A subset of users were routed to a web server that was failing intermittently (it kept dropping out of the server pool, then recovering briefly, then failing again -- also known as "flapping").
We have multiple monitoring systems, but because the server was going in and out, they didn't catch the full scope of the issue. This led to a delayed response from our team. Once we diagnosed and fixed the affected server, everything returned to normal.
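For the technically curious, here is a minimal sketch of one way a health check could flag flapping instead of only flagging a server that stays down. It is an illustration under our own assumptions, not the monitoring code we run: `FlapDetector`, `window`, and `max_transitions` are hypothetical names, and the thresholds are arbitrary. The idea is to count how often a server switches between healthy and unhealthy across recent checks.

```python
from collections import deque


class FlapDetector:
    """Hypothetical helper: flags a server whose health keeps changing state."""

    def __init__(self, window=10, max_transitions=3):
        # Keep only the most recent `window` health-check results.
        self.results = deque(maxlen=window)
        # Number of state changes within the window that counts as flapping.
        self.max_transitions = max_transitions

    def record(self, healthy):
        """Store the outcome of one health check (True = healthy)."""
        self.results.append(bool(healthy))

    def transitions(self):
        """Count healthy/unhealthy switches between consecutive checks."""
        items = list(self.results)
        return sum(1 for prev, cur in zip(items, items[1:]) if prev != cur)

    def is_flapping(self):
        """True when the server changed state too often in the window."""
        return self.transitions() >= self.max_transitions


if __name__ == "__main__":
    detector = FlapDetector(window=6, max_transitions=3)
    for result in [True, False, True, False, True, False]:
        detector.record(result)
    print("flapping:", detector.is_flapping())
```

A check that only alerts after several consecutive failures can miss this pattern, because each brief recovery resets the failure count. Counting state changes treats the instability itself as the signal.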
Our communication
We're normally quick to update everyone when something goes wrong, but this time we let you down. The delay in detecting the issue, combined with the timing on Halloween night, meant that we didn't provide updates as we usually would. We're really sorry for the silence; we know it made the outage more frustrating.
What we're doing now
We're making a few changes to help prevent this kind of issue from recurring, including extra monitoring and alerting, more redundancy in our notifications, and better detection of sudden increases in support volume that might indicate an undetected problem. We'll also make sure we communicate sooner if anything like this happens again.
Thank you
Thanks so much for your patience, and thank you to everyone who reached out to let us know that something wasn't right. We're working to make sure that RTM stays reliable, and that we communicate promptly when we encounter issues like this.
If you’re still seeing any issues, please let us know. 💙