6. Scaling, reliability and backups

Your app is live — congratulations! But a new set of questions appears: What happens when 10,000 users visit at the same time instead of 10? What if the server crashes at midnight? What if someone accidentally deletes the database? In this lesson you will learn how cloud infrastructure is designed to scale with demand, stay reliable under pressure, and recover from failure.

Scaling means making your system able to handle more work — more users, more data, more requests per second. There are two fundamental ways to scale:

Vertical scaling (scaling up)

You replace your server with a bigger, more powerful one — more CPU cores, more RAM, faster disk. Think of it like upgrading Abebe's coffee shop from a single espresso machine to a commercial-grade machine that can make ten cups at once. There is a limit to how big one machine can get, and bigger machines cost a lot more.

Horizontal scaling (scaling out)

Instead of one big machine, you add more machines and share the workload across all of them. Think of Abebe opening five identical coffee stalls across Addis Ababa instead of upgrading the one stall. Each stall handles its own customers; a coordinator (called a load balancer) directs each new customer to the least-busy stall.

Most modern cloud apps prefer horizontal scaling because:

• There is no hard upper limit — you can keep adding servers.

• If one server fails, the others keep running.

• Cloud providers let you add or remove servers automatically based on current traffic.

Auto-scaling is a cloud feature that watches your server's CPU and memory usage in real time. When traffic spikes — for example, when Ethio Telecom Learn sends a push notification and thousands of learners open the app at once — auto-scaling automatically starts new server instances to absorb the load. When traffic drops back to normal, it shuts those extra instances down so you are not paying for capacity you no longer need.
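The scaling decision described above can be sketched as a small calculation. This is a minimal illustration, not how any particular provider implements it: the function name `desired_instances`, the 60% CPU target, and the instance limits of 1 and 10 are all assumptions chosen for the example.

```python
import math


def desired_instances(current: int, cpu_percent: float,
                      target_percent: float = 60.0,
                      min_instances: int = 1,
                      max_instances: int = 10) -> int:
    """Return how many instances to run so average CPU moves toward the target.

    All thresholds here are hypothetical defaults, not provider settings.
    """
    if current < 1:
        raise ValueError("current must be at least 1")
    # Grow or shrink the fleet in proportion to how far load is from the target.
    wanted = math.ceil(current * cpu_percent / target_percent)
    # Clamp so the fleet never drops below the floor or exceeds the budget cap.
    return max(min_instances, min(max_instances, wanted))


spike = desired_instances(current=2, cpu_percent=90.0)   # 2 * 90 / 60 -> scale out
quiet = desired_instances(current=4, cpu_percent=15.0)   # 4 * 15 / 60 -> scale in
print(spike, quiet)
```

The upper limit matters as much as the lower one: without it, a traffic spike (or a bug that burns CPU) could start far more instances than the budget allows.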

A load balancer sits in front of your servers and does two things:

1. It distributes incoming requests evenly across all your server instances, so no single server gets overwhelmed.

2. It performs health checks — it regularly pings each server to confirm it is responding. If a server stops responding, the load balancer stops sending traffic to it and waits until it recovers or is replaced.

Without a load balancer, adding more servers does not help — users would not know which server to connect to. The load balancer is the single entry point that hides all the complexity behind it.
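To make the two jobs concrete, here is a minimal sketch of a load balancer that hands out requests in round-robin order and skips servers that failed their last health check. The class name `LoadBalancer`, the server names, and the `is_up` callback are hypothetical; real tools such as Nginx or Caddy do this at the network level with their own configuration.

```python
class LoadBalancer:
    """Toy round-robin balancer with health checks (illustrative only)."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)  # assume everything is up at start
        self._next = 0

    def health_check(self, is_up):
        """Refresh the healthy set; is_up is a callable: server -> bool."""
        self.healthy = {s for s in self.servers if is_up(s)}

    def pick(self):
        """Return the next healthy server in round-robin order."""
        for _ in range(len(self.servers)):
            server = self.servers[self._next % len(self.servers)]
            self._next += 1
            if server in self.healthy:
                return server
        raise RuntimeError("no healthy servers available")


lb = LoadBalancer(["app-1", "app-2", "app-3"])
before = [lb.pick() for _ in range(3)]

# Simulate app-2 failing its health check: traffic is routed around it.
lb.health_check(lambda server: server != "app-2")
after = [lb.pick() for _ in range(4)]
print(before, after)
```

Round-robin is only one possible strategy; sending each request to the least-busy server is another common choice.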

In practice: managed platforms (Vercel, Render, AletCloud App Hosting) set up load balancing and auto-scaling for you automatically. On a VPS you configure them yourself, or use a tool like Nginx or Caddy as a load balancer.

Traffic spike handled by auto-scaling: as the number of users rises, new server instances spin up automatically; when traffic falls, instances are removed to save cost.

Reliability means your application keeps working even when individual components fail. In the cloud, reliability is built through redundancy — having more than one copy of a critical component so that one failing does not take down the whole system.

Availability zones (AZs)

Major cloud providers split their data centres into separate availability zones — physically distinct buildings in the same region, each with its own power supply and network connection. If a fire breaks out in one building, the zone in the next building keeps running.

If you deploy your app to multiple availability zones, a single zone failing does not take your app offline. Load balancers automatically route traffic away from the failed zone.

Replication for databases

For databases, reliability usually means keeping a primary (the database that handles writes) and one or more replicas (copies that continuously receive the primary's changes, usually lagging it by only a moment). If the primary server fails, the managed database service promotes a replica to become the new primary, often within seconds, so your app keeps working.
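A rough sketch of the promotion step, assuming each replica tracks how far it has got through the primary's stream of writes. Replicas typically apply changes with a small delay, so a reasonable choice is to promote the most up-to-date one. The names `Replica`, `applied_position`, `choose_new_primary`, and `writes_lost` are hypothetical and do not come from any specific database service.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    applied_position: int  # how many of the primary's writes this copy has applied


def choose_new_primary(replicas: list[Replica]) -> Replica:
    """Pick the replica that has applied the most writes."""
    if not replicas:
        raise RuntimeError("no replica available to promote")
    return max(replicas, key=lambda r: r.applied_position)


def writes_lost(primary_position: int, promoted: Replica) -> int:
    """Writes the failed primary accepted that the promoted replica never saw."""
    return max(0, primary_position - promoted.applied_position)


replicas = [Replica("replica-a", 1040), Replica("replica-b", 1042)]
promoted = choose_new_primary(replicas)
lost = writes_lost(primary_position=1043, promoted=promoted)
print(promoted.name, lost)
```

The second function shows why replication is not a substitute for backups: a replica can be slightly behind, and it faithfully copies mistakes such as a dropped table.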

Uptime and SLAs

Cloud providers publish a Service Level Agreement (SLA) that promises a minimum percentage of uptime, such as 99.9% (at most about 8.8 hours of downtime per year) or 99.99% (under an hour per year, roughly 53 minutes). Higher availability costs more and requires more redundant infrastructure.
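The downtime figures follow directly from the percentage. A short sketch of the arithmetic, assuming a 365-day year (the function name `max_downtime_hours` is just for illustration):

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def max_downtime_hours(uptime_percent: float) -> float:
    """Hours per year a service may be down while still meeting its SLA."""
    return (1 - uptime_percent / 100) * HOURS_PER_YEAR


for sla in (99.9, 99.99):
    hours = max_downtime_hours(sla)
    print(f"{sla}% uptime allows about {hours:.2f} hours ({hours * 60:.0f} minutes) per year")
```

Each extra "nine" cuts the allowed downtime by a factor of ten, which is why it becomes so much more expensive to achieve.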

Reliability keeps your app running when hardware fails. Backups protect you from a different category of problem: data loss caused by human error, software bugs, ransomware attacks, or accidental deletion.

Consider this scenario: Sara's small injera-order platform has been running for six months. A developer on her team runs a database migration script with a typo and overwrites six months of customer order data. Without a backup, that data is gone forever. With a recent backup, she restores from yesterday's snapshot and loses at most a day of data, not six months.

The 3-2-1 backup rule

The standard best practice for backups is called 3-2-1:

• 3 copies of the data (the original plus two backups).

• 2 different types of storage media (for example, one on a local disk and one in the cloud).

• 1 copy stored off-site or in a different region (so a fire or flood at one location cannot destroy all copies).

In cloud practice this often means:

• Automated daily database snapshots stored in object storage.

• The database snapshot copied to a different cloud region or provider.

• Application code always stored in a git repository — that is itself a backup.
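The three conditions of the rule can be checked mechanically. Below is a minimal sketch under the assumption that each copy is described by its storage type and its location; the class `DataCopy`, the function `satisfies_3_2_1`, and the region and media names are all made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataCopy:
    label: str
    media: str      # storage type, e.g. "local-disk" or "object-storage"
    location: str   # site or cloud region where the copy lives


def satisfies_3_2_1(copies: list[DataCopy], primary_location: str) -> bool:
    """Check 3 copies, 2 storage types, 1 copy away from the primary site."""
    enough_copies = len(copies) >= 3
    enough_media = len({c.media for c in copies}) >= 2
    has_off_site = any(c.location != primary_location for c in copies)
    return enough_copies and enough_media and has_off_site


plan = [
    DataCopy("live database", "local-disk", "region-a"),
    DataCopy("daily snapshot", "object-storage", "region-a"),
    DataCopy("snapshot copy", "object-storage", "region-b"),
]
print(satisfies_3_2_1(plan, primary_location="region-a"))
print(satisfies_3_2_1(plan[:2], primary_location="region-a"))
```

The second call fails the check: with only the live database and one snapshot in the same region, there are too few copies and nothing is stored off-site.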

Point-in-time recovery (PITR)

Managed database services like AWS RDS, AletCloud Managed Databases, and PlanetScale support PITR: the ability to restore your database to the exact state it was in at a chosen moment within the backup retention window (for example, 'restore to 14:32:00 yesterday, one minute before the bad migration ran'). This is much more powerful than a daily snapshot because you can recover with almost no data loss.
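Conceptually, PITR is often built from a base snapshot plus a log of every change made after it: to restore, start from the snapshot and replay the log up to the chosen moment. The sketch below illustrates that idea with a dictionary standing in for the database. It is a simplified model, not the mechanism of any named service, and `restore_to` and the sample data are hypothetical.

```python
from datetime import datetime


def restore_to(snapshot: dict, snapshot_time: datetime,
               log: list, target: datetime) -> dict:
    """Rebuild the state as of `target` from a snapshot plus later changes.

    Each log entry is (timestamp, key, value); a value of None means "delete".
    """
    if target < snapshot_time:
        raise ValueError("target is older than the snapshot; use an earlier one")
    state = dict(snapshot)  # work on a copy so the stored snapshot stays intact
    for when, key, value in sorted(log, key=lambda entry: entry[0]):
        if when <= snapshot_time:
            continue  # this change is already contained in the snapshot
        if when > target:
            break     # stop just before the moment we want to undo
        if value is None:
            state.pop(key, None)
        else:
            state[key] = value
    return state


snapshot = {"order-1": "paid"}
snapshot_time = datetime(2024, 5, 1, 0, 0)
log = [
    (datetime(2024, 5, 1, 14, 0), "order-2", "paid"),
    (datetime(2024, 5, 1, 14, 33), "order-1", None),  # bad migration deletes rows
    (datetime(2024, 5, 1, 14, 33), "order-2", None),
]
restored = restore_to(snapshot, snapshot_time, log, datetime(2024, 5, 1, 14, 32))
print(restored)
```

Restoring to 14:32 keeps the order placed at 14:00 and leaves out the deletions made at 14:33, which a plain midnight snapshot could not do.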

Every serious cloud application should have a disaster recovery (DR) plan — a written procedure for what to do when things go badly wrong (a data centre fire, a major cyberattack, or a catastrophic software bug). Two key metrics define how good your DR plan is:

RPO — Recovery Point Objective

How much data are you willing to lose? If your backups run every 24 hours and your RPO is '24 hours', you accept losing up to one day of data in the worst case. If you use PITR-enabled databases, your RPO can be just a few minutes.

RTO — Recovery Time Objective

How quickly must the system be back online after a disaster? If your RTO is '4 hours', you need to be able to restore the database, redeploy the application, and pass all checks within four hours. RTOs are negotiated with business stakeholders — a hospital's patient records system might need an RTO of under 15 minutes, while a small blog might accept several hours.
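Both targets can be checked with simple arithmetic. The sketch below assumes the worst case for data loss is one full backup interval and that recovery steps run one after another; the function names and the step durations are illustrative, not measurements.

```python
from datetime import timedelta


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """Worst case, a failure just before the next backup loses a full interval."""
    return backup_interval <= rpo


def meets_rto(step_durations: dict, rto: timedelta) -> bool:
    """Recovery time is the sum of every step, assuming they run in sequence."""
    total = sum(step_durations.values(), timedelta())
    return total <= rto


# Hypothetical recovery plan with estimated durations for each step.
recovery_steps = {
    "restore database": timedelta(hours=2),
    "redeploy application": timedelta(hours=1),
    "run checks": timedelta(minutes=30),
}

daily_ok = meets_rpo(timedelta(hours=24), rpo=timedelta(hours=24))
stale_ok = meets_rpo(timedelta(hours=48), rpo=timedelta(hours=24))
rto_ok = meets_rto(recovery_steps, rto=timedelta(hours=4))
print(daily_ok, stale_ok, rto_ok)
```

Estimated step durations are only trustworthy if they come from an actual restore drill, which is one reason regular testing matters.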

Practical steps every team should take:

1. Enable automated daily snapshots on your database — most cloud services offer this as a one-click option.

2. Store snapshots in a different region from your main server.

3. Test your restore process at least once a year — a backup you have never tested is not a real backup.

4. Keep your application code in a git repository with at least two copies in different places (e.g. GitHub plus a second remote or a local clone).
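Step 3 is the one teams most often skip, so here is a small sketch of what a restore test checks. It uses Python's built-in `sqlite3` module and in-memory databases as a stand-in for a real managed database; the `orders` table and the function name `restore_matches_source` are assumptions made for the example.

```python
import sqlite3


def restore_matches_source(source: sqlite3.Connection) -> bool:
    """Copy the database, then confirm the copy holds the same orders."""
    restored = sqlite3.connect(":memory:")
    try:
        source.backup(restored)  # sqlite3's built-in online backup
        query = "SELECT id, item FROM orders ORDER BY id"
        original_rows = source.execute(query).fetchall()
        restored_rows = restored.execute(query).fetchall()
        return original_rows == restored_rows
    finally:
        restored.close()


# Stand-in for the live database.
live = sqlite3.connect(":memory:")
live.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT)")
live.executemany("INSERT INTO orders (item) VALUES (?)", [("injera",), ("shiro",)])
live.commit()

print(restore_matches_source(live))
```

A real drill would restore an actual snapshot into a separate environment and also time the process, but the principle is the same: a backup counts only once you have read data back out of it.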

Scenario

Almaz runs an online grocery delivery app for Addis Ababa. She has one database server with no replicas. A developer accidentally drops a table. She has a backup — but it was taken 48 hours ago. Which improvement would MOST reduce her data loss?

Lesson recap:

• Scaling: vertical scaling means upgrading one machine; horizontal scaling means adding more machines behind a load balancer. Most cloud apps prefer horizontal scaling.

• Auto-scaling: cloud platforms can automatically add or remove servers based on real-time traffic, so you only pay for what you use.

• Load balancer: distributes traffic across multiple servers and routes around failed ones, keeping your app available.

• Reliability: built through redundancy (multiple servers, multiple availability zones, and database replicas).

• Backups: follow the 3-2-1 rule (three copies, two storage types, one off-site). Enable PITR on your database for near-zero data loss.

• RPO and RTO: define how much data you can afford to lose and how fast you must recover. Set realistic targets and test your restore process regularly.

Check your understanding

What is horizontal scaling?