
Connecting the Dots in Cloud Hosting: Routes, Firewalls, and Real Incidents

Chapter 13

Estimated reading time: 10 minutes

How to Think in Incidents: Turn “It’s Down” Into a Network Hypothesis

In cloud hosting, most outages are not “mysteries”; they are mismatches between intent (what you think should be allowed/routed/resolved) and reality (what is actually configured). The fastest way to troubleshoot is to treat each report like an incident: capture symptoms, list known facts, form a hypothesis, run targeted tests, interpret results, apply a minimal fix, and verify from the same vantage points that originally failed.

The case studies below intentionally connect routing, firewalling (security groups and NACLs), and name resolution, because real incidents often span more than one layer.

Case Study 1: New Subnet Can’t Reach the Internet (Missing Default Route)

Symptoms

Instances in a newly created private subnet cannot download packages or reach external APIs.
DNS resolution works (names resolve), but connections time out.
Instances in older subnets in the same VPC work normally.

Known Facts

The new subnet is associated with a new route table.
The instances have private IPs only (no public IPs).
An egress path should exist via NAT (private subnet design).
Security group egress is “allow all” (or at least allows 443/80).

Hypothesis

The new subnet’s route table is missing a default route (0.0.0.0/0) to the NAT gateway (or to an egress device), so traffic has no path to the internet.

Tests

Inspect the subnet’s route table association and routes.
From an instance in the new subnet, attempt a TCP connection on port 443 to a known public IP (for example, a public package mirror's IP). Using an IP rather than a name takes DNS out of the picture.
Compare the route table of a working private subnet to the new one.

Results

The new route table has only local VPC routes (e.g., 10.0.0.0/16 local) and no 0.0.0.0/0 route.
TCP connection attempts to public IPs time out.
Working private subnets have 0.0.0.0/0 pointing to a NAT gateway.

Fix

Add the missing default route in the new subnet’s route table.

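For private subnets: add 0.0.0.0/0 -> NAT Gateway (in the same AZ as the subnet, per best practice). The NAT gateway itself must sit in a public subnet whose route table sends 0.0.0.0/0 to the Internet Gateway.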
For public subnets: add 0.0.0.0/0 -> Internet Gateway and ensure instances have public IPs (or an attached public interface) if they must be directly reachable.
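To see why the missing entry matters, here is a minimal Python sketch of route selection. It assumes the usual longest-prefix-match rule and models a route table as a list of (CIDR, target) pairs. The target name `nat-gateway` and the address 198.51.100.7 (a documentation range) are hypothetical placeholders, not identifiers from any cloud API.

```python
import ipaddress


def lookup(route_table, destination):
    """Return the target of the most specific route matching destination, or None.

    route_table is a list of (cidr, target) string pairs, e.g. ("10.0.0.0/16", "local").
    """
    dest = ipaddress.ip_address(destination)
    best_net, best_target = None, None
    for cidr, target in route_table:
        net = ipaddress.ip_network(cidr)
        # Longest prefix wins: a /16 beats a /0 for addresses inside the VPC.
        if dest in net and (best_net is None or net.prefixlen > best_net.prefixlen):
            best_net, best_target = net, target
    return best_target


# Hypothetical route tables: the new subnet's table has only the local route.
new_subnet = [("10.0.0.0/16", "local")]
fixed_subnet = new_subnet + [("0.0.0.0/0", "nat-gateway")]

print(lookup(new_subnet, "198.51.100.7"))    # None: no path out of the VPC
print(lookup(fixed_subnet, "198.51.100.7"))  # nat-gateway
print(lookup(fixed_subnet, "10.0.2.25"))     # local: the /16 is more specific than the /0
```

Adding the default route does not disturb traffic inside the VPC. The more specific local route still wins for VPC addresses.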

Verification

From the same instance, retry outbound TCP to a public IP:443 and confirm it connects.
Run an OS package update or curl to a public HTTPS endpoint and confirm success.
Confirm return traffic works (stateful NAT): establish a full TLS session (not just a SYN).

Case Study 2: Web Server Reachable by IP but Not by Domain (DNS Record Mismatch)

Symptoms

http://203.0.113.10 loads the website.
http://www.example.com fails (browser shows “server not found” or connects to the wrong site).
Some users report it works, others report it does not (inconsistent behavior).

Known Facts

The web server is healthy and responds on port 80/443 when accessed by IP.
The domain was recently migrated or a new load balancer was introduced.
There may be multiple DNS records (A/AAAA/CNAME) and possibly multiple hosted zones (public vs private).

Hypothesis

The DNS record for the domain points to the wrong target (old IP, wrong load balancer, wrong record type), or clients are receiving different answers due to split-horizon DNS or cached TTL.


Tests

Query DNS from multiple resolvers (local, public, and cloud resolver) and compare answers.
Check authoritative DNS records for example.com and www.example.com.
Confirm whether IPv6 is in play: check for AAAA records that might point elsewhere.

```bash
# Query using a public resolver (example uses Google DNS; use your preferred resolver too)
dig +short A www.example.com @8.8.8.8
dig +short AAAA www.example.com @8.8.8.8

# Query the authoritative nameserver (replace with actual NS)
dig +short NS example.com
dig +short A www.example.com @ns1.authoritative-dns.net

# Compare what your server sees (cloud instance resolver)
dig +short A www.example.com
```

Results

Public resolvers return an old IP (or an unexpected load balancer hostname).
Authoritative records show the wrong A record, or a stale AAAA record exists.
Some resolvers still return the old value due to TTL caching, explaining inconsistent user reports.
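The answers you collect can be compared with a short script. This minimal sketch assumes you have pasted each resolver's `dig +short` output into a string. The resolver labels and the old IP 198.51.100.23 are hypothetical. 203.0.113.10 is the working server IP from the symptoms.

```python
def parse_dig_short(output):
    """Turn `dig +short` output (one answer per line) into a set of answers."""
    return {line.strip() for line in output.splitlines() if line.strip()}


def disagreeing_resolvers(expected, outputs):
    """Return {resolver: sorted answers} for every resolver whose answer differs from expected."""
    want = parse_dig_short(expected)
    report = {}
    for resolver, output in outputs.items():
        got = parse_dig_short(output)
        if got != want:
            report[resolver] = sorted(got)
    return report


# Hypothetical captured outputs; 198.51.100.23 stands in for the old IP.
outputs = {
    "authoritative": "203.0.113.10\n",
    "8.8.8.8": "198.51.100.23\n",
    "vpc-resolver": "203.0.113.10\n",
}
print(disagreeing_resolvers("203.0.113.10", outputs))
# {'8.8.8.8': ['198.51.100.23']}
```

If only caching resolvers disagree while the authoritative answer is correct, the cause is most likely TTL caching rather than a wrong record.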

Fix

Correct the DNS record to the intended target (new IP or load balancer DNS name via CNAME/alias as appropriate).
Remove or correct any unintended AAAA record if IPv6 is not configured end-to-end.
If split-horizon is used, ensure the public hosted zone and private hosted zone are not conflicting for the same name unless that is intentional.
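Lower the TTL before planned migrations, at least one old-TTL period ahead, so caches already hold the short TTL at cutover. After fixing, wait for cached answers to expire (up to the old TTL) or flush caches where possible (internal resolvers, CDN, local DNS caches).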
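The timing behind the TTL advice reduces to simple arithmetic. This sketch assumes resolvers honor the published TTL. Some resolvers clamp or override TTLs, so treat the results as estimates. The dates are hypothetical.

```python
from datetime import datetime, timedelta


def stale_until(change_time, old_ttl_seconds):
    """Latest moment a TTL-respecting cache may still serve the answer cached before the change."""
    return change_time + timedelta(seconds=old_ttl_seconds)


def lower_ttl_by(cutover_time, old_ttl_seconds):
    """Latest moment to publish the short TTL so every cache already holds it at cutover."""
    return cutover_time - timedelta(seconds=old_ttl_seconds)


# Hypothetical timeline: record fixed at 12:00 with an old TTL of one hour.
fix = datetime(2024, 5, 1, 12, 0)
print(stale_until(fix, 3600))    # 2024-05-01 13:00:00

# Planned cutover at 12:00 with an old TTL of one day: lower the TTL by 12:00 the day before.
print(lower_ttl_by(fix, 86400))  # 2024-04-30 12:00:00
```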

Verification

Repeat DNS queries against authoritative servers and multiple public resolvers until they match expected values.
Test HTTP/HTTPS by domain from at least two networks (e.g., your workstation and a cloud instance) to avoid local caching bias.
If a load balancer is involved, verify the Host header routing (virtual hosts) serves the correct site for www.example.com.

Case Study 3: HTTPS Fails After Deployment (Wrong Security Group Inbound Rule)

Symptoms

HTTP (port 80) works, but HTTPS (port 443) times out or is refused.
Health checks for HTTPS fail; monitoring shows the service “down” only for TLS.
From inside the same subnet/VPC, HTTPS might work (depending on rules), but from the internet it fails.

Known Facts

A new security group was attached during deployment, or rules were “tightened.”
The application is listening on 443 (confirmed on the instance), or a load balancer expects 443 on targets.
Cloud security groups are stateful; NACLs are stateless (both may apply).

Hypothesis

The inbound rule for TCP/443 is missing or too restrictive (wrong source CIDR, wrong SG reference, or only allowing 80). Alternatively, the load balancer security group allows 443 from the internet, but the instance security group does not allow 443 from the load balancer.

Tests

From an external client, attempt a TCP connect to 443 and observe whether it times out (filtered) or is refused (reachable but not listening).
Check security group inbound rules on the load balancer and on the instance/target.
If using a load balancer, confirm the target security group allows inbound 443 from the load balancer security group (not from 0.0.0.0/0 unless intended).

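```bash
# From a client outside the VPC (or a test box); -w 5 gives up after 5 seconds
nc -vz -w 5 www.example.com 443

# Or test the TLS handshake (shows more detail); </dev/null makes s_client exit after the handshake
openssl s_client -connect www.example.com:443 -servername www.example.com </dev/null
```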
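The timeout-versus-refused distinction can also be scripted. This is a rough Python analogue of `nc -vz -w` that uses only the standard `socket` module. The labels it returns are this sketch's own. Note that a "refused" can also come from a firewall that actively rejects traffic rather than dropping it.

```python
import socket


def probe(host, port, timeout=3.0):
    """Classify a TCP connect attempt, roughly like `nc -vz -w`.

    'open'    : handshake completed
    'refused' : something answered with a reset (host reachable, nothing listening)
    'timeout' : no answer at all (typical of a silently dropping firewall or a missing route)
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "timeout"
    except OSError as exc:  # e.g., no route to host, name resolution failure
        return f"error: {exc}"
```

For example, `probe("www.example.com", 443)` against a target whose security group lacks the 443 rule would typically return `"timeout"`. Security groups drop disallowed packets without replying.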

Results

TCP/443 connection times out from the internet.
Instance security group has inbound 80 allowed, but 443 missing.
Or: instance SG allows 443 only from a narrow CIDR that does not include the load balancer subnets, or it references the wrong SG.

Fix

Add inbound rule: TCP/443 from the correct source.
If behind a load balancer: allow TCP/443 from the load balancer’s security group (preferred) or from the load balancer subnet CIDRs if SG referencing is not possible in your environment.
If NACLs are restrictive, double-check them as well: inbound TCP/443 must be allowed, and so must outbound ephemeral ports for the return traffic.
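A simplified model of how a security group evaluates an inbound request from a load balancer can help when reading rules. It assumes allow-only rules with an implicit deny, and it treats any source beginning with `sg-` as a group reference. `sg-lb`, 10.0.1.15 and 10.0.9.0/24 are hypothetical values. Real rules carry more fields than modeled here (IPv6 sources, prefix lists, "all" protocols).

```python
import ipaddress


def sg_allows(inbound_rules, protocol, port, src_ip, src_sg_ids=()):
    """Return True if any inbound rule admits the flow.

    Models security groups as allow-only lists with an implicit deny. Each rule is a dict
    with 'protocol', 'from_port', 'to_port', and 'source' (a CIDR or a group id starting 'sg-').
    """
    ip = ipaddress.ip_address(src_ip)
    for rule in inbound_rules:
        if rule["protocol"] != protocol or not rule["from_port"] <= port <= rule["to_port"]:
            continue
        source = rule["source"]
        if source.startswith("sg-"):
            if source in src_sg_ids:  # SG reference: match on the caller's group membership
                return True
        elif ip in ipaddress.ip_network(source):
            return True
    return False


# Hypothetical instance SG after the "tightening": 80 from the LB group, 443 only from a narrow CIDR.
instance_sg = [
    {"protocol": "tcp", "from_port": 80, "to_port": 80, "source": "sg-lb"},
    {"protocol": "tcp", "from_port": 443, "to_port": 443, "source": "10.0.9.0/24"},
]
lb_node = ("10.0.1.15", {"sg-lb"})  # hypothetical LB node address and group

print(sg_allows(instance_sg, "tcp", 443, *lb_node))  # False
instance_sg.append({"protocol": "tcp", "from_port": 443, "to_port": 443, "source": "sg-lb"})
print(sg_allows(instance_sg, "tcp", 443, *lb_node))  # True
```

Referencing the load balancer's group keeps the rule correct even when load balancer node IPs change. That is why the fix above prefers it over CIDRs.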

Verification

Re-run openssl s_client and confirm a successful handshake (certificate details appear and no timeout).
Confirm load balancer target health for HTTPS becomes healthy.
Validate from at least one external network and one internal network to confirm the intended exposure.

Case Study 4: Service-to-Service Calls Fail Across Subnets (Route Table or Network ACL)

Symptoms

Service A (10.0.1.0/24) cannot call Service B (10.0.2.0/24) on TCP/8080.
Calls work when both services are placed in the same subnet, but fail when separated.
Ping might work (if ICMP is allowed) while the application call fails, or everything might fail, depending on policy.

Known Facts

Both subnets are in the same VPC (so “local” routing should exist).
There is a network ACL attached to one or both subnets with custom rules.
Security groups may be locked down to specific sources.

Hypothesis

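Either (1) the route table association is wrong (the subnet is accidentally associated with a route table that does not include the local VPC route, or that uses an overlapping or incorrect CIDR; on platforms where the local route cannot be deleted, such as AWS, look instead for a more specific route, for example one pointing at an appliance, that diverts the other subnet's traffic), or (2) a NACL is blocking either the inbound service port (8080) or the return ephemeral ports, causing one-way failure, or (3) security group rules do not allow the source subnet/SG.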

Tests

Confirm both instances’ IPs and subnets, and verify both subnets are truly in the same VPC CIDR.
Check route tables associated with each subnet and ensure the VPC local route exists and is correct.
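Check NACL rules on both subnets in both directions. Service B's subnet must allow inbound TCP/8080 from Service A and outbound ephemeral ports back to Service A. Service A's subnet must allow outbound TCP/8080 to Service B and inbound ephemeral ports from Service B.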
Check security groups: Service B inbound should allow TCP/8080 from Service A’s security group (preferred) or from Service A subnet CIDR.

```bash
# From the Service A instance, test TCP connectivity to Service B
nc -vz -w 5 10.0.2.25 8080

# If Service B exposes an HTTP health endpoint
curl -v http://10.0.2.25:8080/health

# If you can capture packets (Linux); replace eth0 with the instance's interface name (e.g., ens5)
sudo tcpdump -ni eth0 host 10.0.2.25 and tcp port 8080
```
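For the first test (are the IPs and subnets what you think they are?), a small check with Python's `ipaddress` module can catch typos in an inventory. The inventory format and Service A's host IP 10.0.1.10 are hypothetical. Service B's address is the one used in the `nc` and `curl` examples.

```python
import ipaddress


def check_placement(vpc_cidr, placements):
    """Report hosts outside their subnet and subnets outside the VPC CIDR.

    placements maps a service name to (host_ip, subnet_cidr).
    """
    vpc = ipaddress.ip_network(vpc_cidr)
    problems = []
    for name, (host, cidr) in placements.items():
        subnet = ipaddress.ip_network(cidr)
        if ipaddress.ip_address(host) not in subnet:
            problems.append(f"{name}: {host} is not in {subnet}")
        if not subnet.subnet_of(vpc):
            problems.append(f"{name}: {subnet} is outside VPC {vpc}")
    return problems


# Hypothetical inventory (Service A's IP is illustrative).
expected = {"service-a": ("10.0.1.10", "10.0.1.0/24"), "service-b": ("10.0.2.25", "10.0.2.0/24")}
print(check_placement("10.0.0.0/16", expected))  # []

typo = {"service-b": ("10.0.2.25", "10.1.2.0/24")}
print(check_placement("10.0.0.0/16", typo))
# ['service-b: 10.0.2.25 is not in 10.1.2.0/24', 'service-b: 10.1.2.0/24 is outside VPC 10.0.0.0/16']
```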

Results

Route issue variant: Subnet 10.0.2.0/24 is associated with a route table intended for a peered VPC or a different environment; local route is missing or the VPC CIDR differs from expectation.
NACL issue variant: Service B's subnet NACL allows inbound 8080, but its outbound rules deny the ephemeral ports (e.g., 1024–65535). The SYN reaches Service B, but the SYN-ACK cannot return, causing timeouts.
SG issue variant: Service B SG allows 8080 only from a different SG or CIDR; traffic is dropped at the instance boundary.
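The NACL variant is easiest to see by walking the four stateless checks that a TCP request and its reply must pass. This sketch assumes AWS-style NACL semantics: numbered rules, lowest number first, first match wins, an implicit deny, and a port range matched against the packet's destination port. The rule tuples, the ephemeral port 51514 and Service A's IP 10.0.1.10 are hypothetical.

```python
import ipaddress


def nacl_allows(rules, protocol, port, peer_ip):
    """Evaluate one direction of a stateless NACL.

    rules: (rule_number, protocol, from_port, to_port, cidr, action) tuples.
    Lowest rule number is checked first, the first match decides, and no match means deny.
    port is the destination port of the packet; peer_ip is the other end of the flow.
    """
    ip = ipaddress.ip_address(peer_ip)
    for _num, proto, lo, hi, cidr, action in sorted(rules):
        if proto in (protocol, "all") and lo <= port <= hi and ip in ipaddress.ip_network(cidr):
            return action == "allow"
    return False


def first_block(nacl_a, nacl_b, ip_a, ip_b, service_port, ephemeral_port):
    """Walk the four NACL checks a TCP request/reply crosses; return the first that denies."""
    checks = [
        ("A outbound (request)", nacl_a["outbound"], service_port, ip_b),
        ("B inbound (request)", nacl_b["inbound"], service_port, ip_a),
        ("B outbound (reply)", nacl_b["outbound"], ephemeral_port, ip_a),
        ("A inbound (reply)", nacl_a["inbound"], ephemeral_port, ip_b),
    ]
    for label, rules, port, peer in checks:
        if not nacl_allows(rules, "tcp", port, peer):
            return label
    return None


# Hypothetical NACLs: A's subnet is wide open; B's allows 8080 in but only 443 out.
open_nacl = {
    "inbound": [(100, "all", 0, 65535, "0.0.0.0/0", "allow")],
    "outbound": [(100, "all", 0, 65535, "0.0.0.0/0", "allow")],
}
nacl_b = {
    "inbound": [(100, "tcp", 8080, 8080, "10.0.1.0/24", "allow")],
    "outbound": [(100, "tcp", 443, 443, "0.0.0.0/0", "allow")],
}
print(first_block(open_nacl, nacl_b, "10.0.1.10", "10.0.2.25", 8080, 51514))
# B outbound (reply)

nacl_b["outbound"].append((110, "tcp", 1024, 65535, "10.0.1.0/24", "allow"))
print(first_block(open_nacl, nacl_b, "10.0.1.10", "10.0.2.25", 8080, 51514))
# None
```

Unlike a security group, nothing here remembers that the request was allowed. The reply must win on its own rules.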

Fix

Route table: Associate the correct route table to each subnet; ensure the local VPC route exists and there are no overlapping CIDR mistakes. If using peering/transit, ensure routes to remote CIDRs are present in both directions.
NACL: Add/adjust rules to allow the service port inbound and allow return traffic (ephemeral ports) in the opposite direction. Remember NACLs are stateless: you must allow both directions explicitly.
Security groups: On Service B, allow inbound TCP/8080 from Service A’s security group (tightest) or from the specific source CIDR. Ensure Service A egress allows TCP/8080 to Service B.

Verification

Re-test with nc and curl from Service A to Service B.
If you used tcpdump, confirm you now see a full handshake (SYN, SYN-ACK, ACK) and application data.
Validate from the application layer (actual service call) to ensure no additional dependency (DNS, TLS, auth) is masking as “network.”
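Reading the capture can be partly automated. This sketch counts SYNs sent to Service B and SYN-ACKs returned by it in default `tcpdump -n` output. The sample lines are hypothetical and abbreviated. The regular expression only covers IPv4 lines of the form `IP src > dst: Flags [..]`.

```python
import re

# Matches the "src > dst: Flags [..]" part of a default `tcpdump -n` TCP line.
TCP_LINE = re.compile(r"IP (\S+) > (\S+): Flags \[([^\]]+)\]")


def handshake_counts(lines, server):
    """Count SYNs sent to `server` and SYN-ACKs sent back by it.

    server is the address.port string tcpdump prints, e.g. "10.0.2.25.8080".
    """
    syn = synack = 0
    for line in lines:
        match = TCP_LINE.search(line)
        if not match:
            continue
        src, dst, flags = match.groups()
        if flags == "S" and dst == server:
            syn += 1
        elif flags == "S." and src == server:
            synack += 1
    return syn, synack


# Hypothetical capture on Service A while the reply is blocked: SYN retransmitted, no SYN-ACK.
before = [
    "12:00:00.000001 IP 10.0.1.10.51514 > 10.0.2.25.8080: Flags [S], seq 1, length 0",
    "12:00:01.020001 IP 10.0.1.10.51514 > 10.0.2.25.8080: Flags [S], seq 1, length 0",
]
# Hypothetical capture after the fix: SYN, SYN-ACK, ACK.
after = [
    "12:05:00.000001 IP 10.0.1.10.51520 > 10.0.2.25.8080: Flags [S], seq 1, length 0",
    "12:05:00.000420 IP 10.0.2.25.8080 > 10.0.1.10.51520: Flags [S.], seq 7, ack 2, length 0",
    "12:05:00.000450 IP 10.0.1.10.51520 > 10.0.2.25.8080: Flags [.], ack 8, length 0",
]
print(handshake_counts(before, "10.0.2.25.8080"))  # (2, 0)
print(handshake_counts(after, "10.0.2.25.8080"))   # (1, 1)
```

SYNs with zero SYN-ACKs point to the reply path or to the SYN never arriving. Capturing on Service B as well tells the two apart.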

Compact Reference: Common Hosting Connectivity Root Causes and Fast Confirming Tests

Routing and Egress

Missing default route (0.0.0.0/0) in subnet route table (private subnet to NAT, public subnet to IGW). Test: inspect route table; attempt TCP to a public IP:443 from the instance.
Wrong route table association (subnet attached to the wrong route table). Test: compare subnet associations between working and failing subnets.
Asymmetric routing via peering/transit (route exists one way only). Test: verify routes in both VPCs/attachments; use traceroute from both sides where possible.
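The "both directions" check can be sketched by running a longest-prefix lookup in each VPC's route table for an address in the other VPC. The peering name `pcx-1`, the CIDRs, and the use of a single probe address per side are simplifying assumptions. Real tables may route parts of a CIDR differently.

```python
import ipaddress


def route_target(route_table, ip):
    """Longest-prefix match: target of the most specific route containing ip, or None."""
    addr = ipaddress.ip_address(ip)
    matches = [(ipaddress.ip_network(cidr), target) for cidr, target in route_table
               if addr in ipaddress.ip_network(cidr)]
    if not matches:
        return None
    return max(matches, key=lambda m: m[0].prefixlen)[1]


def broken_directions(vpc_a, vpc_b, link):
    """vpc_a/vpc_b are (cidr, route_table). Report directions whose traffic does not use link."""
    problems = []
    for label, routes, dst_cidr in (("A->B", vpc_a[1], vpc_b[0]), ("B->A", vpc_b[1], vpc_a[0])):
        sample = str(ipaddress.ip_network(dst_cidr).network_address)  # one probe address per side
        if route_target(routes, sample) != link:
            problems.append(label)
    return problems


# Hypothetical peering "pcx-1": A has a route to B, but B only has local + a default route.
vpc_a = ("10.0.0.0/16", [("10.0.0.0/16", "local"), ("10.1.0.0/16", "pcx-1")])
vpc_b = ("10.1.0.0/16", [("10.1.0.0/16", "local"), ("0.0.0.0/0", "nat-gateway")])
print(broken_directions(vpc_a, vpc_b, "pcx-1"))  # ['B->A']

vpc_b[1].append(("10.0.0.0/16", "pcx-1"))
print(broken_directions(vpc_a, vpc_b, "pcx-1"))  # []
```

The example also shows why a default route does not cover the peer. B's replies match 0.0.0.0/0 and head for the NAT gateway instead of the peering link.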

Security Groups and NACLs

Missing inbound rule for the needed port (e.g., 443). Test: external nc -vz host port; check SG inbound sources.
Wrong source scope (allowed from wrong CIDR or wrong SG reference). Test: confirm caller IP/CIDR or SG identity; temporarily widen source to confirm, then tighten correctly.
NACL blocks return traffic (ephemeral ports) because it is stateless. Test: look for SYN without SYN-ACK using tcpdump; review NACL inbound/outbound rules for ephemeral ranges.

DNS and Naming

Record points to wrong target (old IP, wrong LB, wrong zone). Test: dig against authoritative NS and multiple resolvers; compare A/AAAA/CNAME answers.
IPv6 mismatch (an AAAA record exists, but the IPv6 path or service is not configured). Test: query AAAA; attempt a connection to the IPv6 address; remove or repair the AAAA record if unintended.
Split-horizon confusion (public vs private hosted zone answers differ). Test: query from inside VPC resolver and from public resolver and compare.

Quick “Where Is It Breaking?” Tests

Bypass DNS: connect to IP directly to separate name resolution from reachability.
Port-specific reachability: nc -vz to distinguish “port blocked” from “host down.”
TLS handshake visibility: openssl s_client to confirm 443 is reachable and see handshake failures vs timeouts.
Packet-level truth: tcpdump to confirm whether SYN arrives and whether replies leave.
Config diff: compare route tables, SGs, and NACLs between a working subnet/service and the failing one.
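A config diff only needs the two configurations in a comparable form. This sketch assumes you have already exported each section as a list of tuples. The section names and values are hypothetical.

```python
def diff_configs(working, failing):
    """Compare config sections between a working and a failing resource.

    working/failing map a section name ("routes", "sg_inbound", "nacl_outbound", ...) to a
    list of hashable rule tuples. Only sections that differ appear in the report.
    """
    report = {}
    for section in sorted(set(working) | set(failing)):
        good = set(working.get(section, ()))
        bad = set(failing.get(section, ()))
        if good != bad:
            report[section] = {"missing": sorted(good - bad), "extra": sorted(bad - good)}
    return report


# Hypothetical exports: a working private subnet vs the new one from Case Study 1.
working = {
    "routes": [("10.0.0.0/16", "local"), ("0.0.0.0/0", "nat-gateway")],
    "sg_egress": [("tcp", 443, "0.0.0.0/0")],
}
failing = {
    "routes": [("10.0.0.0/16", "local")],
    "sg_egress": [("tcp", 443, "0.0.0.0/0")],
}
print(diff_configs(working, failing))
# {'routes': {'missing': [('0.0.0.0/0', 'nat-gateway')], 'extra': []}}
```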

Exercise

A new private subnet’s instances can resolve DNS names but outbound connections to public IPs time out, while older subnets work. Which action most directly fixes this scenario?


Answer: DNS working but outbound traffic timing out from a private subnet commonly indicates missing egress routing. Adding a default route (0.0.0.0/0) to the NAT gateway provides the internet path needed for outbound connections.


CCNA-Level Networking for Cloud and Web Hosting: The Essentials You Actually Use


13 chapters
