Partial Outage Of Application
Incident Report for Uberflip
Postmortem

What Happened

On April 10th, starting at around 9:40 AM EDT, we experienced a configuration issue with our production database that caused it to become overloaded. The affected database supports most major Uberflip product functionality, so this problem resulted in a product outage that caused Hubs and the application backend to become inaccessible.

We took immediate action to investigate and resolve the outage, and full functionality was restored to most customers by 10:12 AM EDT. Some customers continued to be affected after this time due to their Hubs being cached; after a full purge of our caches, the outage was fully resolved for all customers by 10:31 AM EDT.

We sincerely apologize for the impact that this interruption in our service has had on your business. Following a thorough investigation into what caused this issue, we would like to share with you what we have learned, and what steps we are taking to prevent it from happening again.

Background

Our platform uses a collection of processes, called CTA processors, to handle CTA submissions and send them to the right Marketing Automation Platform (MAP). Each CTA processor consumes server memory as it works, and the amount it uses grows over time.

To prevent our servers from running out of memory, we periodically terminate existing CTA processors and replace them with fresh instances, which frees up all the memory the terminated processor had been using. To determine when a CTA processor should be terminated and replaced, we use capacity management rules. These rules essentially tell the server: “once a CTA processor has handled a certain number of CTA submissions, terminate it and start a new one.”
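
To make the mechanism concrete, a CTA processor’s lifecycle works roughly like the sketch below. The names and the threshold value are illustrative only, not our actual implementation.

    # Illustrative sketch: names and threshold are examples, not our real code.
    SUBMISSION_LIMIT = 50_000  # hypothetical capacity management threshold

    def run_cta_processor(queue, forward_to_map):
        """Handle CTA submissions until the termination threshold is reached."""
        handled = 0
        while handled < SUBMISSION_LIMIT:
            submission = queue.get()    # wait for the next CTA submission
            forward_to_map(submission)  # send it to the customer's MAP
            handled += 1
        # Exiting here lets a supervisor start a fresh processor, which
        # releases all the memory this one had accumulated.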

Some weeks prior to this incident, we implemented a change to how CTA processors work. Originally, there was just one type of CTA processor which handled CTA submissions for all MAPs. This could cause a situation where, if there was a problem with any one of the MAPs, the CTA processors could stop working for all the MAPs. To prevent this from happening, we decided to split the all-purpose CTA processor type into a set of separate, MAP-specific CTA processor types. As part of this change, we also made the decision to retain the capacity management rules we already had in place, on the basis that they had been effective in the past.
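
Conceptually, the split looks something like the sketch below, where each MAP gets its own dedicated pool of processors. The MAP names and pool size are placeholders for illustration, not a description of our production setup.

    MAP_TYPES = ["marketo", "eloqua", "hubspot"]  # illustrative list of MAPs
    POOL_SIZE = 4                                 # illustrative pool size

    def start_processor_pools(start_processor):
        """Start a dedicated pool of CTA processors for each MAP type."""
        pools = {}
        for map_name in MAP_TYPES:
            # Each pool only handles submissions destined for its own MAP,
            # so a problem with one MAP cannot stall the others.
            pools[map_name] = [start_processor(map_name) for _ in range(POOL_SIZE)]
        return pools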

What Caused The Issue

On the morning of April 10th, the Marketo-specific CTA processors suddenly began consuming memory at a much higher rate than normal. This caused our servers to run out of available memory and become unresponsive, resulting in a service outage.

While we had capacity management rules in place to prevent this, they were ineffective in this case. These rules free up memory by terminating CTA processors after a specified number of CTA submissions. This generally works because a CTA processor’s memory usage tends to grow at roughly the same rate as the number of submissions it has processed. In this instance, however, memory usage was rising much faster than the submission count. Because the mechanism that releases memory is triggered by submission counts rather than by memory usage itself, memory was being consumed more quickly than it could be freed.
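
To put illustrative numbers on this: if a processor normally adds about 1 MB of memory for every 1,000 submissions it handles, a rule that recycles it after 50,000 submissions caps its growth at roughly 50 MB. If memory instead grows at 10 MB per 1,000 submissions, that same rule now permits roughly 500 MB of growth before the processor is replaced, even though the rule itself never changed. (These figures are made up for illustration; they are not our actual numbers.)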

Our decision to retain the existing rule configuration when we moved to the new CTA processor types was a major factor in causing this issue. As we have now learned, the thresholds defined in the existing rules were set too high for the new CTA processors. An old CTA processor was much busier than its new, more specialized counterpart, so using the same termination threshold for both meant that the new CTA processors had a significantly longer lifespan. Consequently, the rate at which memory could be freed up was reduced, which rendered the capacity management rules ineffective in this situation.
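
Again using made-up numbers purely for illustration: an all-purpose processor handling 10,000 submissions per hour across every MAP would reach a 50,000-submission threshold and be replaced about every five hours. A Marketo-only processor seeing 2,000 submissions per hour would take roughly 25 hours to reach the same threshold, holding on to whatever memory it had accumulated for five times as long.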

What We’ve Learned, and How We’re Responding

Ensuring the reliability of our platform is one of our highest priorities. We make every effort to anticipate and prevent problems before they happen, but we also know that we’re not perfect. Whenever an issue does occur, we do our best to learn from it by conducting a comprehensive investigation into its causes. Here’s what we’ve learned, and what we’re doing to ensure that this doesn’t happen again.

The most important thing we discovered was that our capacity management rules were misconfigured. We did not adequately account for the lower throughput of the new CTA processors when setting the capacity management rules, relying instead on the assumption that what had worked in the past would continue to work after the change. That assumption proved faulty, and in response, we’ve adjusted the capacity management rules for CTA processors. This change should ensure that CTA processors are terminated frequently enough that our servers cannot run out of memory, even if unexpected spikes in memory usage occur.
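
In terms of the earlier sketch, the adjustment amounts to sizing each processor type’s threshold to its actual throughput. The values below are illustrative, and the memory ceiling shown is one possible additional safeguard rather than a description of our exact configuration.

    SUBMISSION_LIMITS = {"marketo": 10_000, "default": 50_000}  # hypothetical values
    MEMORY_LIMIT_MB = 512                                       # hypothetical backstop

    def should_recycle(map_name, handled, memory_mb):
        """Decide whether a CTA processor should be terminated and replaced."""
        limit = SUBMISSION_LIMITS.get(map_name, SUBMISSION_LIMITS["default"])
        return handled >= limit or memory_mb >= MEMORY_LIMIT_MB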

We also identified room for improvement in our monitoring and alert systems. We have now reconfigured these systems to alert us to this type of behaviour much sooner. This should allow us to take preventative action and avoid an outage before it happens, even if the capacity management rules should fail.
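
One simple form such an alert can take, shown purely as an illustration rather than a description of our monitoring stack, is to watch the rate of memory growth per processor instead of only its absolute usage:

    GROWTH_ALERT_MB_PER_MIN = 20  # hypothetical alerting threshold

    def memory_growing_too_fast(samples_mb, interval_minutes):
        """Return True if memory grew faster than the threshold between any two samples."""
        for prev, curr in zip(samples_mb, samples_mb[1:]):
            if (curr - prev) / interval_minutes > GROWTH_ALERT_MB_PER_MIN:
                return True
        return False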

Finally, we could have been quicker to identify caching as the cause of lingering issues for some of our customers. We have now added cache clearing to our standard outage recovery procedures, so that any future outage is fully resolved for all customers as quickly as possible.

Once again, we are deeply sorry that this issue occurred. We’re committed to making sure it can’t happen again, and we’re grateful for your patience and trust.

Posted May 08, 2019 - 15:43 EDT

Resolved
A permanent fix was deployed a short while ago and all systems remain fully operational. We will continue to monitor but do not expect any further interruptions to service.
Posted Apr 10, 2019 - 18:37 EDT
Update
All services remain operational. A root cause of the issue has been identified and a permanent fix is being prepared for deployment.
Posted Apr 10, 2019 - 14:23 EDT
Monitoring
The problem has been identified and corrected. Services should now be fully operational. The team will continue to monitor for stability.
Posted Apr 10, 2019 - 10:20 EDT
Update
We are continuing to investigate this issue.
Posted Apr 10, 2019 - 09:57 EDT
Investigating
We are currently experiencing an issue impacting access to the backend/management tools, along with intermittent impact to the front end (Hubs). The team is actively investigating.
Posted Apr 10, 2019 - 09:52 EDT
This incident affected: Content experiences (Hubs, Flipbooks) and App.