Azure Fundamentals for Web Hosting: From App Service to Virtual Machines

Azure Fundamentals for Web Hosting: Scaling and reliability basics for small-to-medium sites

Chapter 10

Estimated reading time: 9 minutes

Why scaling and reliability matter for small-to-medium sites

Most web sites do not need complex architectures, but they do need two basics: the ability to handle traffic changes (scaling) and the ability to keep serving requests when something goes wrong (reliability). Scaling is about matching capacity to demand. Reliability is about reducing downtime and limiting the impact of failures. In Azure, the practical approach is to pick the scaling knobs that match your hosting option and then watch a small set of metrics that tell you when to adjust.

Scale up vs scale out (and why you usually start simple)

Scale up means giving a single instance more resources (CPU/RAM). It is simple and often the fastest fix for performance issues, but it does not improve availability much because you still rely on one instance. Scale out means running multiple instances and spreading traffic across them. It improves both capacity and availability, but requires your app to behave well in a multi-instance environment (stateless requests, shared session strategy, shared storage for uploads, etc.).

For small-to-medium sites, a common progression is: start with scale up until you hit a cost or performance limit, then add scale out for peak traffic and availability.

App Service: scaling basics without over-engineering

App Service scale up (vertical scaling)

In App Service, scale up means changing the App Service Plan tier/size to get more CPU, memory, and features. This is useful when your app is CPU-bound (slow responses under load) or memory-bound (frequent restarts due to memory pressure).

Step-by-step: scale up an App Service Plan

  • Open the App Service Plan in the Azure portal (not just the Web App).
  • Select Scale up (App Service plan).
  • Choose a larger size within your current tier, or move to a higher tier if needed.
  • Apply changes and monitor response time and CPU/memory metrics for at least one traffic cycle (for example, a business day).
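
If you prefer to script the change rather than use the portal, the same scale up can be done with the Azure SDK for Python. The sketch below is a minimal example, assuming the azure-identity and azure-mgmt-web packages; the resource group, plan name, and target SKU (P1v3) are placeholders, and exact method names can vary between SDK versions.

```python
# Minimal sketch: scale up an App Service Plan with the Azure SDK for Python.
# Assumes azure-identity and azure-mgmt-web are installed; "my-rg", "my-plan",
# and the P1v3 SKU are placeholder values to replace with your own.
from azure.identity import DefaultAzureCredential
from azure.mgmt.web import WebSiteManagementClient

subscription_id = "<subscription-id>"
client = WebSiteManagementClient(DefaultAzureCredential(), subscription_id)

# Read the current plan, change only the SKU, and write it back.
plan = client.app_service_plans.get("my-rg", "my-plan")
plan.sku.name = "P1v3"          # target size within the new tier
plan.sku.tier = "PremiumV3"     # target tier

poller = client.app_service_plans.begin_create_or_update("my-rg", "my-plan", plan)
result = poller.result()
print(f"Plan {result.name} is now on SKU {result.sku.name}")
```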

Practical tip: scale up is typically a quick change with minimal app changes, but it does not protect you from an instance-level failure the way multi-instance scale out does.

App Service scale out (horizontal scaling)

Scale out increases the number of instances running your app. App Service handles distributing incoming requests across instances. This is the primary lever for handling traffic spikes and improving availability.

Autoscale rules (what to use and what to avoid)

Autoscale lets you automatically add/remove instances based on signals. For small-to-medium sites, keep rules simple and stable. The most common signals are:

  • CPU percentage: good general-purpose signal for compute-bound apps.
  • Memory percentage (where available): useful if memory pressure causes restarts.
  • HTTP queue length / requests (if available in your plan/telemetry): useful when CPU is not the bottleneck but request backlog grows.

Avoid overly sensitive rules that scale in and out frequently (thrashing). Use longer evaluation windows and cool-down periods.

Step-by-step: configure App Service autoscale

  • Open the App Service Plan in the Azure portal.
  • Select Scale out (App Service plan) and enable Custom autoscale.
  • Set minimum, maximum, and default instance counts. For example: minimum 1–2, maximum 4–10 depending on budget and expected spikes.
  • Add a rule to scale out, for example: CPU > 70% for 10 minutes, increase by 1 instance.
  • Add a rule to scale in, for example: CPU < 40% for 20 minutes, decrease by 1 instance.
  • Set a cool-down (for example 5–10 minutes) to prevent rapid oscillation.
  • Test by generating controlled load (or during a known peak) and verify instance count changes and response time improvements.
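
To make the effect of evaluation windows and cool-downs concrete, here is a small, self-contained Python sketch (not an Azure API) that applies the example rules above to a stream of one-minute CPU samples. The thresholds and timings mirror the example values and are assumptions you would tune for your own app.

```python
# Illustrative only: how window-averaged thresholds plus a cool-down prevent
# autoscale "thrashing". Rules mirror the examples above:
# scale out if avg CPU > 70% over 10 min, scale in if avg CPU < 40% over 20 min.
from collections import deque

class AutoscaleSimulator:
    def __init__(self, min_inst=1, max_inst=4, cooldown=10):
        self.instances = min_inst
        self.min_inst, self.max_inst = min_inst, max_inst
        self.cooldown = cooldown              # minutes to wait after any scale action
        self.last_action_minute = -cooldown
        self.cpu_history = deque(maxlen=20)   # last 20 one-minute CPU samples

    def observe(self, minute, cpu_percent):
        self.cpu_history.append(cpu_percent)
        if minute - self.last_action_minute < self.cooldown:
            return  # still cooling down: ignore the signal

        last_10 = list(self.cpu_history)[-10:]
        last_20 = list(self.cpu_history)
        if len(last_10) == 10 and sum(last_10) / 10 > 70 and self.instances < self.max_inst:
            self.instances += 1
            self.last_action_minute = minute
            print(f"minute {minute}: scale OUT to {self.instances} instances")
        elif len(last_20) == 20 and sum(last_20) / 20 < 40 and self.instances > self.min_inst:
            self.instances -= 1
            self.last_action_minute = minute
            print(f"minute {minute}: scale IN to {self.instances} instances")

# A short spike followed by quiet traffic: the averaged windows and the
# cool-down keep the instance count from oscillating on every noisy sample.
sim = AutoscaleSimulator()
for minute, cpu in enumerate([85] * 15 + [30] * 30):
    sim.observe(minute, cpu)
```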

Practical example: if marketing campaigns cause sudden bursts, set a higher minimum during campaign hours using a schedule-based profile, then revert after.

Container Apps: replica scaling signals you can trust

How scaling works in Container Apps

In Azure Container Apps, scaling is typically expressed as changing the number of replicas of your container. You define minimum and maximum replicas and choose scaling triggers. Container Apps can scale to multiple replicas quickly and can also scale down when demand drops, helping cost control.

Common scaling signals

  • HTTP concurrency: scale based on how many concurrent requests each replica should handle. This is often easier to reason about than CPU for web APIs.
  • CPU and memory: scale when resource usage crosses thresholds.
  • Event-driven signals (when applicable): scale based on queue length or other event sources (useful for background processing tied to the web workload).

For small-to-medium web sites, start with HTTP concurrency or CPU, and keep min replicas at 1–2 depending on availability needs.

Step-by-step: configure replica scaling (conceptual workflow)

  • Decide your baseline: set min replicas to 1 if minimizing cost is the priority, or 2 for higher availability.
  • Set max replicas based on budget and expected peak traffic.
  • Choose a trigger: for example, HTTP concurrency with a target of N concurrent requests per replica.
  • Load test: increase traffic and observe whether replicas increase before latency becomes unacceptable.
  • Adjust target concurrency or CPU thresholds until scaling happens early enough to keep response times stable.

Practical example: if each replica comfortably handles 50 concurrent requests with acceptable latency, set the concurrency target near that value and allow enough max replicas to cover your peak.
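
As a quick sanity check on those numbers, you can estimate the max replicas you need from expected peak concurrency. The sketch below is plain arithmetic, not a Container Apps API; the values (50 concurrent requests per replica, a peak of 600 concurrent requests, 25% headroom) are assumptions to replace with your own load-test results.

```python
# Back-of-the-envelope replica estimate for a concurrency-based scale rule.
# All numbers are example values.
import math

target_concurrency_per_replica = 50   # what one replica handles with acceptable latency
expected_peak_concurrency = 600       # concurrent requests at your worst expected peak
headroom = 0.25                       # 25% buffer so autoscale never runs at the limit

max_replicas = math.ceil(expected_peak_concurrency * (1 + headroom)
                         / target_concurrency_per_replica)
print(f"Set max replicas to at least {max_replicas}")  # -> 15 with these numbers
```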

Virtual Machines: scaling approaches and load balancing at a high level

Manual scaling (vertical and horizontal)

With VMs, you control scaling more directly. You can scale up by resizing a VM to a larger SKU, or scale out by adding more VMs and distributing traffic across them. Manual scaling is acceptable for predictable patterns (for example, steady business hours) when you can plan changes ahead of time.

Step-by-step: manual scale up for a VM

  • Review performance metrics (CPU, memory, disk, network) to confirm the bottleneck.
  • Plan a maintenance window if resizing requires a restart.
  • Resize the VM to a larger size that addresses the bottleneck (more vCPU/RAM, or better disk throughput if storage is the issue).
  • Validate application performance and confirm the bottleneck is resolved.
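
If you script VM resizes (for example as part of a planned maintenance window), the Azure SDK for Python can perform the same change. This is a minimal sketch, assuming the azure-identity and azure-mgmt-compute packages; the resource names and target size are placeholders, and method names can vary between SDK versions.

```python
# Minimal sketch: resize a VM with the Azure SDK for Python.
# "my-rg", "my-vm", and the target size are placeholders; the resize may
# restart the VM, so run it inside your planned maintenance window.
from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient
from azure.mgmt.compute.models import HardwareProfile, VirtualMachineUpdate

subscription_id = "<subscription-id>"
client = ComputeManagementClient(DefaultAzureCredential(), subscription_id)

poller = client.virtual_machines.begin_update(
    "my-rg",
    "my-vm",
    VirtualMachineUpdate(hardware_profile=HardwareProfile(vm_size="Standard_D4s_v5")),
)
vm = poller.result()
print(f"{vm.name} resized to {vm.hardware_profile.vm_size}")
```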

VM Scale Sets (conceptual)

VM Scale Sets are the Azure-native way to run a group of similar VMs and scale the instance count up or down. Conceptually, you define a VM model (image, configuration), then let the scale set manage multiple instances. Autoscaling can be based on metrics like CPU or custom signals. For web hosting, scale sets are commonly paired with a load balancer so traffic is spread across instances.

Load balancing (high level)

When you run multiple VM instances, you need a way to distribute incoming requests. At a high level, a load balancer routes traffic to healthy instances and stops sending traffic to unhealthy ones. Health probes (health checks) determine whether an instance should receive traffic. The practical takeaway: multi-VM scaling is not just “add another server”; it also requires consistent configuration and a health-checked traffic distribution layer.

Capacity planning for common traffic patterns

Pattern 1: spiky marketing traffic

Characteristics: sudden bursts, unpredictable peaks, short duration. Goal: absorb spikes without paying for peak capacity all day.

  • Prefer autoscale (App Service autoscale, Container Apps replica scaling, or Scale Set autoscale).
  • Set a reasonable minimum to avoid cold starts or slow ramp-up at the beginning of a spike.
  • Use step scaling (add more than one instance when thresholds are exceeded) if spikes are steep.
  • Consider a schedule-based profile for known campaign windows (raise minimum instances during the campaign).

Rule-of-thumb approach: estimate peak requests per second (RPS) from campaign expectations, then validate with a load test. Ensure max instances/replicas can cover that peak with headroom (for example 20–30%) so autoscale does not run at the limit.
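
The same rule of thumb fits in a few lines of arithmetic. The per-instance capacity (what one instance sustains at acceptable latency) comes from your own load test; the other numbers below are placeholders.

```python
# Rule-of-thumb capacity check for spiky traffic: make sure max instances
# can cover peak RPS with 20-30% headroom. All inputs are example values.
import math

peak_rps = 400            # expected requests/second at the campaign peak
rps_per_instance = 120    # measured in a load test at acceptable latency
headroom = 0.30           # 30% buffer above the expected peak

needed = math.ceil(peak_rps * (1 + headroom) / rps_per_instance)
print(f"Max instances should be at least {needed}")  # -> 5 with these numbers
```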

Pattern 2: steady business hours

Characteristics: predictable daily peaks, moderate variation. Goal: stable performance and cost control.

  • Use scheduled scaling where possible (increase capacity before business hours, decrease after).
  • Keep autoscale enabled as a safety net, but with conservative thresholds to avoid unnecessary scaling.
  • Focus on scale up first if the app is consistently near CPU/memory limits and scaling out adds complexity.

Rule-of-thumb approach: size for the 90th percentile load during business hours, then use autoscale to handle unusual days.
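
If you already collect request rates per hour (or per minute), the 90th percentile is easy to compute and gives you the baseline capacity target; autoscale then handles anything above it. A minimal sketch using the Python standard library, with made-up sample data.

```python
# Size baseline capacity for the 90th percentile of business-hours load.
# The sample data is illustrative; feed in your real per-hour request rates.
import statistics

business_hours_rps = [35, 42, 55, 60, 72, 80, 78, 65, 50, 40]  # e.g. 9:00-18:00 averages

# statistics.quantiles with n=10 returns 9 cut points; the last one is p90.
p90_rps = statistics.quantiles(business_hours_rps, n=10)[-1]
print(f"Size the baseline for roughly {p90_rps:.0f} requests/second")
```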

Metrics to watch (and what they tell you)

Core performance metrics

  • Response time / latency: the user-visible outcome; rising latency under load usually means you need more capacity or a bottleneck fix.
  • HTTP 5xx rate: server errors often indicate overload, crashes, or dependency failures.
  • Request rate (RPS): helps correlate traffic changes with resource usage.

Resource saturation metrics

  • CPU percentage: sustained high CPU suggests compute-bound workload; good autoscale trigger.
  • Memory usage: rising memory with restarts suggests leaks or insufficient memory; can drive scale up or replica count.
  • Disk I/O and latency (VM-heavy): high disk latency can cause slow pages even when CPU is fine.
  • Network in/out: helps detect bandwidth constraints or abnormal traffic patterns.

Scaling effectiveness metrics

  • Instance/replica count over time: confirm scaling occurs before latency becomes unacceptable.
  • Time to scale: if scale-out happens too late, adjust thresholds, evaluation windows, or minimum capacity.
  • Queue length / concurrency (where available): indicates backlog; often a better early warning than CPU alone.

Practical workflow: pick one primary trigger (CPU or concurrency), then validate with latency and error rate. If latency rises before the trigger fires, the trigger is too late or the bottleneck is elsewhere (for example, database or external API).
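
One way to validate the trigger is to compare when it would have fired against when latency actually degraded during a load test. The sketch below works on two recorded time series (CPU and p95 latency per minute) and flags the trigger as too late if latency crossed your limit first; the data and thresholds are assumptions.

```python
# Check whether the scaling trigger (CPU > 70%) fires before latency degrades.
# Both series are one sample per minute from a load test; values are made up.
cpu_percent = [40, 55, 62, 68, 74, 81, 85, 88]
p95_latency_ms = [120, 140, 180, 260, 410, 650, 900, 1200]

CPU_TRIGGER = 70       # the autoscale threshold you plan to use
LATENCY_LIMIT = 400    # what you consider unacceptable p95 latency

trigger_minute = next((m for m, c in enumerate(cpu_percent) if c > CPU_TRIGGER), None)
breach_minute = next((m for m, l in enumerate(p95_latency_ms) if l > LATENCY_LIMIT), None)

if breach_minute is not None and (trigger_minute is None or trigger_minute >= breach_minute):
    print("Trigger fires too late: lower the threshold, shorten the window, "
          "or look for a bottleneck outside the web tier (database, external API).")
else:
    print(f"OK: trigger fires at minute {trigger_minute}, latency breach at {breach_minute}")
```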

Simple reliability checklist (small-to-medium site)

Health checks

  • Expose a lightweight health endpoint (for example /health) that verifies the app can serve requests.
  • Ensure health checks reflect real readiness (for example, app started and critical dependencies reachable if required).
  • Use health checks to remove unhealthy instances from rotation (especially important for multi-instance setups).
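
A health endpoint does not need to be elaborate. Below is a minimal sketch using Flask (any web framework works the same way); the dependency check is a hypothetical placeholder you would replace with whatever your app actually relies on.

```python
# Minimal /health endpoint sketch (Flask). Return 200 only when the app can
# really serve traffic; keep the check cheap so probes don't add load.
from flask import Flask, jsonify

app = Flask(__name__)

def critical_dependencies_ok() -> bool:
    # Hypothetical placeholder: e.g. a cheap "SELECT 1" against the database
    # or a cached flag set at startup. Avoid slow or cascading checks here.
    return True

@app.route("/health")
def health():
    if critical_dependencies_ok():
        return jsonify(status="ok"), 200
    return jsonify(status="unhealthy"), 503
```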

Restart behavior

  • Confirm what happens on a process crash: App Service and Container Apps typically restart the process automatically, while on VMs automatic restart depends on your service manager and configuration.
  • Validate startup time: long startups can cause brief outages during scaling or restarts; keep startup fast and predictable.
  • Make sure logs and diagnostics persist across restarts (send logs to a centralized sink rather than local disk only).

Multi-instance considerations

  • Design for stateless web servers: avoid storing session state only in memory on one instance.
  • Ensure uploaded files and generated content are stored in shared storage or external services, not on a single instance’s local filesystem.
  • Verify background jobs are not duplicated unintentionally when multiple instances run (use a single-worker pattern or distributed locks where needed).
  • Test rolling changes: scale out to 2+ instances and confirm requests succeed during an instance recycle.
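
For the background-job concern above, one common pattern is a short-lived distributed lock so only one instance runs the job at a time. A sketch using redis-py against an assumed shared Redis instance; the host, lock key, TTL, and job are placeholders.

```python
# Single-runner pattern for a scheduled job on a multi-instance site:
# whichever instance acquires the lock runs the job; the TTL releases the
# lock even if that instance crashes mid-run.
import socket
import redis  # assumes a Redis instance reachable by all instances

r = redis.Redis(host="my-redis-host", port=6379)

def run_nightly_cleanup():
    # nx=True: only set if the key does not exist (i.e. nobody holds the lock).
    # ex=300: the lock expires after 5 minutes no matter what.
    acquired = r.set("lock:nightly-cleanup", socket.gethostname(), nx=True, ex=300)
    if not acquired:
        return  # another instance is already running the job
    try:
        pass  # ... do the actual cleanup work here ...
    finally:
        # A production version would confirm it still owns the lock first.
        r.delete("lock:nightly-cleanup")
```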

Quick self-check questions (use during a scaling change)

1) If one instance disappears, do users still get responses?
2) If traffic doubles in 5 minutes, will autoscale add capacity fast enough?
3) Are health checks accurate (not always green, not always red)?
4) Can the app run correctly with 2+ instances (sessions, uploads, jobs)?

Now answer the exercise about the content:

When configuring autoscale for a small-to-medium web app, which approach best helps prevent frequent scale-in/scale-out “thrashing”?

To avoid rapid oscillation, autoscale rules should be simple and stable, using longer evaluation windows and cool-down periods rather than reacting to every small metric change.

Next chapter

Azure Fundamentals for Web Hosting: Logging and monitoring with Azure Monitor (metrics, logs, alerts)
