Intermittent Connectivity Issue
Incident Report for Uberflip
Postmortem

What Happened

On September 23rd, beginning at around 9:30 AM EDT, we experienced database performance problems that affected all parts of the Uberflip platform. As a result, customer Hubs and the Uberflip application backend were intermittently inaccessible for approximately 14 minutes.

We immediately took action to identify the cause of these issues. By 9:40 AM our systems began to recover, and by 9:44 AM the outage was fully resolved, with all parts of the Uberflip platform functioning normally once again.

We sincerely apologize for the impact that this interruption in our service may have had on your business. Following a thorough examination of the causes of this issue, we would like to share with you what we have learned, and what steps we are taking to prevent it from happening again.

What Caused the Issue

Starting at around 9:25 AM, our servers suddenly began to receive an extremely high volume of traffic. This unusual volume of requests caused the load on our primary database to exceed its capacity, at which point it failed over to our backup database. The backup database also rapidly reached its capacity limit, causing traffic to revert to the primary database (which had recovered in the interim). As the abnormally high volume continued, this cycle repeated several times, with each database becoming overloaded, failing over, recovering, and then becoming overloaded again.
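The ping-pong behaviour above can be sketched in a few lines. This is a deliberately simplified illustration, not Uberflip's actual routing logic, and the capacity figure is a made-up number:

```python
CAPACITY = 1000  # hypothetical requests/sec either database can handle


def simulate(load, ticks):
    """Return which database served traffic at each tick under a given load.

    When the sustained load exceeds capacity, the active database is
    overloaded and traffic fails over to the other one, which has
    recovered in the interim -- producing the alternating cycle.
    """
    active = "primary"
    history = []
    for _ in range(ticks):
        if load > CAPACITY:
            # Active database is overloaded; fail over to the other one.
            active = "backup" if active == "primary" else "primary"
        history.append(active)
    return history


# Sustained overload makes traffic ping-pong between the two databases:
print(simulate(load=1500, ticks=4))   # ['backup', 'primary', 'backup', 'primary']
print(simulate(load=500, ticks=3))    # ['primary', 'primary', 'primary']
```

Under normal load the primary serves everything; only a sustained over-capacity burst triggers the alternation described above.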

Each time the platform switched between the two databases, it took a few moments for all parts of the application to reconnect, and any requests in progress during the switch were lost. Combined, these factors caused the intermittent inaccessibility and unresponsiveness of the Uberflip application and customer Hubs.
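One common way applications cope with requests dropped mid-failover is to retry with a short backoff, giving the connection pool time to re-establish against the newly active database. The helper below is a hypothetical sketch of that pattern; it is not Uberflip's client code:

```python
import time


def with_retry(query_fn, attempts=3, backoff=0.5):
    """Call query_fn, retrying on connection failures.

    If a database failover drops the connection mid-request, wait briefly
    (with exponential backoff) so the application can reconnect to the
    newly active database, then retry. Re-raises after the last attempt.
    """
    for attempt in range(attempts):
        try:
            return query_fn()
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            # Give the application a moment to reconnect before retrying.
            time.sleep(backoff * (2 ** attempt))
```

A caller would wrap each database operation, e.g. `with_retry(lambda: db.execute(sql))`, so a single failover costs one retry rather than a failed user request.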

What We’ve Learned, and How We’re Responding

As with all incidents that affect our platform and its reliability, we have performed a thorough investigation. Here’s what we’ve learned, and what we're doing to make sure it can't happen again.

The root cause of this incident was a sudden, sustained increase in the amount of traffic to our platform. This traffic was highly unusual: it was significantly higher than the amount we would typically expect, ramped up almost instantly, and stopped just as suddenly. We've investigated the source of this traffic, and have determined that there was probably no malicious intent behind it. Given that this kind of anomalous traffic would not occur organically, we believe this should be a very rare occurrence. Nonetheless, we are of course taking measures to prevent our platform from being affected should something similar happen in the future.

Primarily, we will be implementing rate limiting. Rate limiting will allow us to automatically control the volume of traffic to our servers, and prevent unusual bursts of traffic like this one from impacting the reliability of our platform. While we believe rate limiting will be effective in this regard, we are also aware that it could potentially affect legitimate traffic. As a result, we're taking a considered approach, and are currently investigating what the most appropriate limits are before setting them.
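A widely used approach to this kind of limiting is the token bucket, which absorbs normal bursts while rejecting sustained spikes. The sketch below is purely illustrative; the actual limits Uberflip chooses (and the enforcement layer) are still under investigation, per the paragraph above:

```python
import time


class TokenBucket:
    """Minimal token-bucket rate limiter. Rates here are illustrative only."""

    def __init__(self, rate, capacity):
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        """Return True if a request may proceed, False if it is rate-limited."""
        now = time.monotonic()
        # Refill tokens in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # burst exceeded the limit; reject this request
```

Choosing `rate` and `capacity` is exactly the trade-off described above: too low and legitimate traffic is throttled, too high and an anomalous burst still reaches the databases.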

At the same time, we are also looking at whether there are any ways for us to improve our existing warning systems so that we can detect and respond to anomalies faster. In this instance our technical teams very quickly pinpointed the cause of the issue and were ready to respond, but the anomalous traffic actually ceased by itself before they had a chance to do so. That said, while we had the capability to resolve this incident rapidly, our focus is of course on preventing it from happening in the first place. To that end, we are currently examining how we can identify and react to similar situations before our customers are affected.
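The simplest form of the detection described above compares current traffic against a recent baseline and alerts when it spikes. This is a hedged sketch of the idea, not Uberflip's monitoring stack, and the threshold factor is an arbitrary example value:

```python
def is_traffic_anomaly(recent_rates, current_rate, factor=3.0):
    """Return True when current traffic exceeds the recent average by `factor`.

    `recent_rates` is a window of recent requests-per-second samples.
    Real alerting systems use more robust baselines (percentiles,
    seasonality-aware models); this threshold check is illustrative only.
    """
    baseline = sum(recent_rates) / len(recent_rates)
    return current_rate > factor * baseline
```

For example, with a recent baseline around 100 requests/sec, a sudden jump to 1,500 requests/sec, like the near-instant ramp-up described in this incident, would trip the alert, while ordinary fluctuation would not.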

Finally, we are also performing a thorough review of our application, database, and network architecture, as we do following all significant incidents. In this particular review, we will be placing a specific focus on how we can make the entire platform more resilient to issues such as this one.

Once again, we are deeply sorry that this issue occurred. We are committed to making sure it can’t happen again, and we appreciate your patience and trust.

Sep 25, 2019 - 14:14 EDT

Resolved
This incident has now been resolved. All systems are operating as expected.
Sep 23, 2019 - 18:41 EDT
Monitoring
The platform has now stabilized and the team continues to monitor for any recurrence.
Sep 23, 2019 - 10:20 EDT
Investigating
We are observing intermittent connectivity issues with the platform. The team is actively investigating, and indications are that things are now stabilizing.
Sep 23, 2019 - 09:50 EDT
This incident affected: App.