Cloudflare’s Outage – Key Takeaway, Design for Failures
Downtime and outages: are they common? While the downtime and inaccessibility of small sites go unnoticed, news of massive outages spreads fast and makes the headlines.
The recent internet outage took down many of the biggest sites, including Amazon, Discord, Canva, Crunchyroll, and Medium, and it was reportedly caused by a change pushed by Cloudflare. The key point is that downtime is inevitable: this is not the first such news, nor will it be the last.
Any service will go down at some point. The trouble starts when we presume that services will be up 100% of the time. This is where what Indusface calls Design for failure plays a critical role: a fundamental shift in thinking that ensures the impact of a failure is restricted to the service that failed and nothing more.
That is the essential issue with this outage: its impact went beyond the non-availability of the services Cloudflare provides (security and CDN), and the business sites themselves went down while Cloudflare was recovering.
Could it Happen Again?
Cloudflare’s big outage on the morning of 21st June impacted various sites and caused login difficulties and crashes on multiple services. According to the Cloudflare blog, it took them more than an hour to recover completely. This was a massive outage, but the internet infrastructure service provider also experienced similar issues in 2020.
Cloudflare is not the only CDN (Content Delivery Network) provider to be affected. Other high-profile companies, including Fastly and Akamai, have also experienced service outages. Downtime is not uncommon. Distributed systems are complex, and although a lot of energy is put into secure deployment, slip-ups happen, and outages cannot be avoided altogether.
How is AppTrana Prepared for Outage and Downtime?
“The Cloudflare outage has necessitated a rethink on how such outages, which could cripple business operations temporarily, can be overcome. More often than not, while choosing or building a service, there is a focus on the kind of features and capabilities the service offers. However, it is important to evaluate the service provider/vendor’s ability to support you in a service outage.” – Venkatesh Sundar, Founder and CMO of Indusface.
At Indusface, our experts believe that if we must fail, we should fail gracefully, with a Design for failure mechanism. By failing in the right way, we can roll back and minimize the consequences. Careful planning, proper architectural design, and quick resolution of failures can bring you back up quickly to meet your uptime requirements.
Design for Failure
With the fail-safe mechanism, you can choose whether to remain available or remain secure when something fails. By default, if AppTrana can’t verify a request, it treats the request as malicious and blocks it.
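The choice between staying available and staying secure is often described as fail-open versus fail-closed. The sketch below illustrates that idea only; it is not AppTrana's implementation, and the names `FailMode`, `decide`, and `verify` are hypothetical. It assumes that "can't verify" means the verification step itself raised an error.

```python
from enum import Enum
from typing import Callable


class FailMode(Enum):
    FAIL_CLOSED = "fail_closed"  # stay secure: block when verification cannot complete
    FAIL_OPEN = "fail_open"      # stay available: allow when verification cannot complete


def decide(
    request: str,
    verify: Callable[[str], bool],
    mode: FailMode = FailMode.FAIL_CLOSED,
) -> str:
    """Return "allow" or "block" for a request.

    `verify` is a hypothetical inspection function that returns True for a
    clean request, False for a malicious one, and raises if it cannot decide.
    """
    try:
        return "allow" if verify(request) else "block"
    except Exception:
        # Verification itself failed, so fall back to the configured mode.
        return "allow" if mode is FailMode.FAIL_OPEN else "block"
```

With the default mode, a verifier that crashes leads to a blocked request, which matches the secure-by-default behavior described above. Passing `FailMode.FAIL_OPEN` trades that protection for availability.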
Our WAF SLA is 99.99% uptime. However, there is always a slight probability of disruption due to unexpected technical issues. Besides ensuring availability, we treat failure as inevitable and plan for it, with the intent of limiting the impact of a failure to the services we provide so that the website itself does not go down. Our WAF architecture prepares for failover by adding a separate function known as the bypass fleet.
In an outage like the one Cloudflare recently experienced, the bypass feature can be enabled on the fly to temporarily forward all requests to the backend servers in the target group. This lets you keep serving customers reliably during such outages, restricting the impact to the acceleration and security services being unavailable for that time. Having the entire site go down is a larger issue, and a gap in “Design for Failure” in this case.
A Little Background on The Need for Bypass Fleet
The problem with a cloud WAF is that although traffic passing through the cloud is protected, anyone who knows the server IP can reach your server directly, bypassing the WAF. To avoid this, we provide origin protection in AppTrana.
During every onboarding, we let the customer know the set of IPs from which they will receive requests. This IP range can be whitelisted in their network to protect the origin. But this also brings an operational challenge: if, for any reason such as what happened to Cloudflare, the customer needs to route traffic to the origin directly, they need the IT team to change the whitelist, which is a complex, time-consuming process in big organizations.
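Origin protection of this kind comes down to an allowlist check on the source address. Here is a minimal sketch using Python's standard `ipaddress` module. The ranges are documentation-only placeholders from RFC 5737, not real AppTrana IPs, and `ALLOWED_RANGES` and `is_allowed` are hypothetical names. In practice the check usually lives in a firewall or security group rather than in application code.

```python
import ipaddress

# Placeholder ranges (RFC 5737 documentation addresses), NOT real vendor IPs.
ALLOWED_RANGES = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_allowed(source_ip: str) -> bool:
    """Return True if the connection comes from a whitelisted WAF range."""
    addr = ipaddress.ip_address(source_ip)
    return any(addr in network for network in ALLOWED_RANGES)
```

A request from `203.0.113.10` passes the check, while one from an address outside both ranges, such as `192.0.2.1`, is rejected at the origin.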
It is to avoid this that we built the bypass fleet. The bypass fleet is a redundant architecture in our infrastructure: a simple TCP proxy that redirects traffic through the same IPs the customer has already whitelisted. So, in case of any failure, sites can be switched to bypass to ensure availability while the WAF is brought back up. We give customers options on how they want to react during failures. This is what we call Design for failure.
This feature has had a natural side benefit: it is also used in day-to-day operations when a customer wants to isolate a problem during changes at the origin. It allows us to fail gracefully and gives customers some control over how they want to react in such outages. This is one example of how we have built our system from the ground up around what we call Design for failure.
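To make "a simple TCP proxy" concrete, here is a minimal sketch of a pass-through proxy built on Python's `asyncio` streams. It is a generic illustration, not Indusface's bypass fleet: `pipe`, `make_bypass_handler`, and `serve_bypass` are hypothetical names, and a production proxy would add timeouts, limits, and logging. The point is that the proxy does no inspection. It only relays bytes, so the origin keeps seeing connections from an address it already trusts.

```python
import asyncio


async def pipe(reader: asyncio.StreamReader, writer: asyncio.StreamWriter) -> None:
    """Copy bytes one way until the sender closes, then close the receiver."""
    try:
        while data := await reader.read(65536):
            writer.write(data)
            await writer.drain()
    except ConnectionError:
        pass  # the peer went away mid-transfer; nothing more to relay
    finally:
        writer.close()


def make_bypass_handler(origin_host: str, origin_port: int):
    """Build a connection handler that relays each client to the origin."""

    async def handle(client_reader, client_writer):
        try:
            origin_reader, origin_writer = await asyncio.open_connection(
                origin_host, origin_port
            )
        except OSError:
            client_writer.close()  # origin unreachable: drop the client
            return
        # Relay both directions concurrently, with no inspection of the payload.
        await asyncio.gather(
            pipe(client_reader, origin_writer),
            pipe(origin_reader, client_writer),
        )

    return handle


async def serve_bypass(listen_port: int, origin_host: str, origin_port: int) -> None:
    """Listen on `listen_port` and forward every connection to the origin."""
    server = await asyncio.start_server(
        make_bypass_handler(origin_host, origin_port), "0.0.0.0", listen_port
    )
    async with server:
        await server.serve_forever()
```

Running `asyncio.run(serve_bypass(8443, "origin.example.internal", 443))` would start the relay; the port and hostname here are placeholders. Because the proxy works at the TCP level, TLS sessions pass through untouched and terminate at the origin.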
Continuous Monitoring for Failure
IT teams should not only be aware of how their servers look when uptime is 100%. They should also anticipate the changes to the environment that downtime incidents cause. Our continuous monitoring tracks the website and raises an alarm instantly in case of an outage or downtime. In addition, our real-time visualization and reports help you see future states with the context required to plan for failure.
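As a rough illustration of the monitoring idea, the sketch below pairs an HTTP probe with a small detector that raises an alarm after a configurable number of consecutive failed checks. It is an assumption-laden example, not AppTrana's monitoring: `probe` and `OutageDetector` are hypothetical names. A threshold of 1 alarms on the first failure; higher values are a common way to trade a little speed for fewer false alarms.

```python
import urllib.request


def probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with a 2xx or 3xx status in time."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except OSError:
        # Covers connection errors, timeouts, and HTTP error responses.
        return False


class OutageDetector:
    """Track check results and signal when an alarm should fire."""

    def __init__(self, threshold: int = 1) -> None:
        self.threshold = threshold
        self.consecutive_failures = 0

    def record(self, ok: bool) -> bool:
        """Record one check result; return True when the alarm should fire."""
        self.consecutive_failures = 0 if ok else self.consecutive_failures + 1
        # Fire exactly once, at the moment the threshold is reached.
        return self.consecutive_failures == self.threshold
```

A scheduler would call `detector.record(probe(url))` at a fixed interval and page the on-call team whenever it returns True.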
These features enable you to deliver secure, reliable, and highly responsive IT services.
If you’re serious about your service availability, you should consider a massive paradigm shift. Invest in multiple layers of security to ensure data integrity, proactively design your system for failure, have effective recovery plans, and achieve the high uptime that keeps you moving forward.
Look at every system in your architecture and think deeply about whether it has been designed for failure and whether it gives you enough control to react quickly when things fail, so that the scope of an outage is restricted to the service level and nothing beyond.