When a piece of cloud goes down… modern cloud architecture problems

The giant hosting provider AWS went down on 2021-12-07 at 15:23:00 UTC.

Many sites and services were unavailable because of it:

1. Tinder (no more sex 😱)
2. IMDB (no more bad movie ratings)
3. Ring (no more smart home)
4. Disney+ (no more expensive cartoons)
5. Instacart (you have to MOVE and burn calories in order to buy more calories. Such a disaster)
6. Venmo (I guess you have to use PayPal like your millennial parents, ughh 😩)
7. Robinhood (unable to lose money on meme stocks, this is ADHD without Adderall)
8. Roku (who even uses Roku)
9. Kindle (OK, I agree, Kindle is a good service, expensive, but good)

As well as popular games like

1. Clash of Clans
2. Destiny 2
3. League of Legends
4. Dead by Daylight
(You should read a book!!! Ohh wait, Kindle is down too…)

Up until now, AWS has not released the reason for the outage.

What can we learn?

As developers, first of all we need to start thinking in terms of multi-zone services, especially if we deliver content to the entire globe.
This is also a great way to improve the speed of our services. Less distance between the datacenter and the user means the user gets the information they need faster.
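To make the distance argument concrete, here is a minimal sketch that picks the geographically closest region for a user. The region names and coordinates are made up for illustration, and in practice this is usually handled by DNS- or latency-based routing rather than by raw distance.

```python
from math import asin, cos, radians, sin, sqrt

# Hypothetical region names with approximate coordinates (latitude, longitude).
REGIONS = {
    "us-east": (38.9, -77.4),
    "europe": (50.1, 8.7),
    "asia": (35.7, 139.7),
}


def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(h))


def nearest_region(user_location):
    """Pick the region whose datacenter is geographically closest to the user."""
    return min(REGIONS, key=lambda name: haversine_km(user_location, REGIONS[name]))


print(nearest_region((48.86, 2.35)))  # a user in Paris
```

Distance is only a rough proxy for latency, but it shows why serving everyone from a single region hurts users on the other side of the globe.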

We can also learn from the AWS engineers’ mistakes (as soon as AWS releases the reason for the outage) and make sure that we don’t fall into the same trap.

Let's have a look at the 2021 list of cloud provider outages.

Google Cloud Platform

I will start with GCP by posting screenshots of its outages, but if you want to check for yourself or go into detail, please see the list of 2021 outages provided by Google itself.

Cloud Data Fusion 4 hours of outage

about 99.950% availability

Cloud Developer Tools

about 99.000% availability
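The availability figures above can be roughly reproduced from the total downtime. A minimal sketch, assuming the percentage is measured over a full 365-day year (the function name is my own, and the status page may calculate it differently):

```python
HOURS_PER_YEAR = 365 * 24  # 8760, assuming a non-leap year


def availability_percent(downtime_hours: float) -> float:
    """Return yearly availability as a percentage for a given total downtime."""
    if not 0 <= downtime_hours <= HOURS_PER_YEAR:
        raise ValueError("downtime must be between 0 and one year")
    return 100.0 * (1 - downtime_hours / HOURS_PER_YEAR)


# 4 hours of outage in a year is roughly 99.95% availability
print(f"{availability_percent(4):.3f}%")
```

By the same arithmetic, 99.000% availability corresponds to about 87.6 hours of downtime in a year.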

Cloud Endpoints

Cloud Filestore

Cloud Firestore

Cloud Machine Learning

Cloud Memorystore

Cloud Run

Cloud Run is my absolute favorite since it runs containers and it can scale to 0, so you don’t spend money if it does not need to run.

Cloud Spanner

Google App Engine

This service is very similar to Heroku, so many developers prefer it. Having so many outages is not very nice.

Google BigQuery

This is one of the best services GCP offers. As far as I know, no other cloud provider offers a service that can process such a huge amount of data so cheaply.

Having so many outages is for sure something many clients did not love.

Google Cloud Bigtable

Google Cloud Composer

Google Cloud Console

This is the front-end of the service.

Google Cloud Dataflow

Google Cloud Dataproc

Google Cloud Datastore

Google Cloud DNS

Google Cloud Functions

Google Cloud Infrastructure Components

Google Cloud Networking

The most fundamental service in a cloud infrastructure. Everything is connected through the Cloud Network, so if the network goes down, almost everything stops working.

😅 One screenshot is not enough to show the number of outages the network had.

I think this is pretty bad

Google Cloud Pub/Sub

Google Cloud Scheduler

This is a service that works like a cron job, calling an HTTPS endpoint at a scheduled time. As far as I can tell, GCP is once again the only cloud provider that offers this as such a simple standalone service.

Google Cloud SQL

This is quite embarrassing: while I was writing this article, the Cloud SQL service had an ongoing outage at global scale.

Google Cloud Storage

Google Cloud Support

Google Cloud Tasks

Google Compute Engine

Google Kubernetes Engine

Identity and Access Management

Operations

WOW 😱 and this was just GCP

Amazon Web Services

Let's look at AWS.

AWS has a sh**y way of keeping track of its outages, and I won’t try to post images of it here because it is such a pain.
I'll leave you the link to the page so you can check it yourself.

In order to have a better list of outages, I extracted the data from the page, converted it into a CSV and opened it in Google Sheets (here is the link to the table).

And here is the list of AWS outages.
And I thought that GCP had way too many.

Microsoft Azure

Microsoft Azure has a status system similar to the one AWS uses, so I will just leave you with a link to the page so you can check it yourself.

So how much can we actually trust those cloud providers?

Well we don’t really have a choice, do we?

They offer us compute power, network, storage and so much more, all over the globe, without ever getting tired.
We can configure a cluster with a few CLI commands or a few clicks.
We can set up machine learning and train our models without needing to buy expensive hardware.

So being down a few hours a year is, I think, acceptable.

What can we do to avoid it?

I will not go into detail about how to build such an architecture, but if you search on AWS (or any cloud provider), they have plenty of documentation on how to do it.
Here AWS explains how to build a “Multi zone architecture with an internet gateway”.

It is hard to avoid such a disaster when all your services rely on 1 cloud provider, since you are basically at its mercy, or even worse, when all your services are in 1 zone only.
If we stop for a second and think about it, a zone can be taken down not only by human mistakes or a distributed hacker attack, but also by natural events like floods, earthquakes or hurricanes.
Do we really wanna keep all of our services in 1 single zone, in one datacenter?
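As a rough illustration of why more than one zone helps, here is a minimal sketch of picking the first healthy zone, plus the combined availability of several zones. The zone names and the health check are hypothetical, and the formula assumes zone failures are independent, which is optimistic (a provider-wide incident can hit all zones at once).

```python
from typing import Callable, Sequence


def first_healthy(zones: Sequence[str], is_healthy: Callable[[str], bool]) -> str:
    """Return the first zone that passes the health check, in priority order."""
    for zone in zones:
        if is_healthy(zone):
            return zone
    raise RuntimeError("no healthy zone available")


def combined_availability(per_zone: float, zone_count: int) -> float:
    """Availability of N zones, assuming their failures are independent."""
    return 1 - (1 - per_zone) ** zone_count


# Hypothetical zone names; "zone-a" is simulated as being down.
zones = ["zone-a", "zone-b", "zone-c"]
down = {"zone-a"}
print(first_healthy(zones, lambda z: z not in down))
print(f"{combined_availability(0.999, 2):.6f}")
```

Under that independence assumption, two zones at three nines each get close to six nines together, which is the whole point of not keeping everything in one place.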

Cloud providers try to achieve five nines availability per year

Screenshot from “Achieving ‘five nines’ in the cloud for justice and public safety”

Cloud providers try to offer five nines (99.999%) availability per year, but only some services manage to provide it.

For most services, the cloud providers publicly declare four nines (99.99%) availability.

But this year cloud providers got only three nines (or even worse) for some services. It is enough to scroll through my screenshots to get an idea.
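To put those nines into perspective, here is a small sketch that converts a number of nines into a yearly downtime budget, assuming a 365-day year (the function name is my own):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525600, assuming a non-leap year


def allowed_downtime_minutes(nines: int) -> float:
    """Yearly downtime budget in minutes for an availability of N nines."""
    unavailability = 10 ** -nines  # 3 nines -> 0.001 (99.9% available)
    return MINUTES_PER_YEAR * unavailability


for n in (3, 4, 5):
    print(f"{n} nines: about {allowed_downtime_minutes(n):.1f} minutes per year")
```

Three nines allows almost 9 hours of downtime per year, four nines less than an hour, and five nines only a little over 5 minutes.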

Could this get even worse?

Ohh yes. I love the outage Facebook, ohh I meant META, had this October.
They actually managed to lock themselves out of their own data centers.

To fix it, I imagine a scene like the one from “The Martian”, where Facebook, I mean META, engineers try to communicate remotely with a guard at a remote datacenter to regain access to the servers. 😅

Screenshot from YouTube

Imagine this happening to AWS. Half the internet would be down.

What a disaster…

A part of me wants to live that moment.

Another part of me prays that never happens.

Conclusion

Bad things happen: sometimes it is our fault, sometimes it is someone else's fault, sometimes it is a natural event.

All we can do is try our best to never let this happen.

This will not be the last outage. We may see more even during the holidays.

2022 is right around the corner and it is going to be a year full of events, I can feel it.

Thank you for reading and, as always,
if you enjoyed it, please leave a few claps 👏👏👏
as it really helps me a lot.

Have a lovely day.

Brainless

Full stack developer