How to Think in Incidents: Turn “It’s Down” Into a Network Hypothesis
In cloud hosting, most outages are not “mysteries”; they are mismatches between intent (what you think should be allowed/routed/resolved) and reality (what is actually configured). The fastest way to troubleshoot is to treat each report like an incident: capture symptoms, list known facts, form a hypothesis, run targeted tests, interpret results, apply a minimal fix, and verify from the same vantage points that originally failed.
The case studies below intentionally connect routing, firewalling (security groups and NACLs), and name resolution, because real incidents often span more than one layer.
Case Study 1: New Subnet Can’t Reach the Internet (Missing Default Route)
Symptoms
- Instances in a newly created private subnet cannot download packages or reach external APIs.
- DNS resolution works (names resolve), but connections time out.
- Instances in older subnets in the same VPC work normally.
Known Facts
- The new subnet is associated with a new route table.
- The instances have private IPs only (no public IPs).
- An egress path should exist via NAT (private subnet design).
- Security group egress is “allow all” (or at least allows 443/80).
Hypothesis
The new subnet’s route table is missing a default route (0.0.0.0/0) to the NAT gateway (or to an egress device), so traffic has no path to the internet.
Tests
- Inspect the subnet’s route table association and routes.
- From an instance in the new subnet, attempt a TCP connection to a known public IP (bypasses DNS as a variable), e.g., a public package mirror IP on port 443.
- Compare the route table of a working private subnet to the new one.
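The route inspection above can be partly automated. The following is a minimal sketch, assuming the route table has already been exported as plain text with one route per line in the form "destination target"; that input format, the function name default_route_target, and the target name nat-example are illustrative assumptions, not any provider's actual output.

```sh
#!/bin/sh
# Sketch: find the IPv4 default route in a plain-text route listing.
# Assumed (hypothetical) input format on stdin, one route per line:
#   <destination-cidr> <target>     e.g. "10.0.0.0/16 local"
default_route_target() {
    # Print the target of the first 0.0.0.0/0 route.
    # Exit status is non-zero when no default route is present.
    awk '$1 == "0.0.0.0/0" { print $2; found = 1; exit }
         END { exit !found }'
}

# Example use (route listing piped in from a file you exported):
#   default_route_target < new-subnet-routes.txt || echo "no default route"
```

A non-zero exit status corresponds to the failure in this case study: only local routes exist, so internet-bound traffic has nowhere to go.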
Results
- The new route table has only local VPC routes (e.g., 10.0.0.0/16 local) and no 0.0.0.0/0 route.
- TCP connection attempts to public IPs time out.
- Working private subnets have 0.0.0.0/0 pointing to a NAT gateway.
Fix
Add the missing default route in the new subnet’s route table.
- For private subnets: add 0.0.0.0/0 -> NAT Gateway (in the same AZ as the subnet, per best practice).
- For public subnets: add 0.0.0.0/0 -> Internet Gateway and ensure instances have public IPs (or an attached public interface) if they must be directly reachable.
Verification
- From the same instance, retry outbound TCP to a public IP:443 and confirm it connects.
- Run an OS package update or curl to a public HTTPS endpoint and confirm success.
- Confirm return traffic works (stateful NAT): establish a full TLS session (not just a SYN).
Case Study 2: Web Server Reachable by IP but Not by Domain (DNS Record Mismatch)
Symptoms
- http://203.0.113.10 loads the website.
- http://www.example.com fails (browser shows “server not found” or connects to the wrong site).
- Some users report it works, others report it does not (inconsistent behavior).
Known Facts
- The web server is healthy and responds on port 80/443 when accessed by IP.
- The domain was recently migrated or a new load balancer was introduced.
- There may be multiple DNS records (A/AAAA/CNAME) and possibly multiple hosted zones (public vs private).
Hypothesis
The DNS record for the domain points to the wrong target (old IP, wrong load balancer, wrong record type), or clients are receiving different answers due to split-horizon DNS or cached TTL.
Tests
- Query DNS from multiple resolvers (local, public, and cloud resolver) and compare answers.
- Check authoritative DNS records for example.com and www.example.com.
- Confirm whether IPv6 is in play: check for AAAA records that might point elsewhere.

    # Query using a public resolver (example uses Google DNS; use your preferred resolver too)
    dig +short A www.example.com @8.8.8.8
    dig +short AAAA www.example.com @8.8.8.8

    # Query the authoritative nameserver (replace with actual NS)
    dig +short NS example.com
    dig +short A www.example.com @ns1.authoritative-dns.net

    # Compare what your server sees (cloud instance resolver)
    dig +short A www.example.com

Results
- Public resolvers return an old IP (or an unexpected load balancer hostname).
- Authoritative records show the wrong A record, or a stale AAAA record exists.
- Some resolvers still return the old value due to TTL caching, explaining inconsistent user reports.
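Disagreement between resolvers, as described in the results above, can be detected mechanically. This is a minimal sketch: the helper names answers_agree and lookup are hypothetical, and lookup assumes dig is installed and the network is reachable.

```sh
#!/bin/sh
# answers_agree: succeed only if every argument equals the first one.
# Each argument is one resolver's answer set, already sorted and joined.
answers_agree() {
    first=$1
    for a in "$@"; do
        [ "$a" = "$first" ] || return 1
    done
    return 0
}

# lookup NAME RESOLVER: sorted A-record answers from one resolver,
# joined with commas so that answer order does not matter.
lookup() {
    dig +short A "$1" "@$2" | sort | paste -sd, -
}

# Example (not run here; needs dig and network access):
#   answers_agree \
#       "$(lookup www.example.com 8.8.8.8)" \
#       "$(lookup www.example.com ns1.authoritative-dns.net)" \
#       || echo "resolvers disagree: suspect stale cache or split-horizon"
```

Sorting before comparing avoids false alarms when a name has several records that resolvers return in rotating order.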
Fix
- Correct the DNS record to the intended target (new IP or load balancer DNS name via CNAME/alias as appropriate).
- Remove or correct any unintended AAAA record if IPv6 is not configured end-to-end.
- If split-horizon is used, ensure the public hosted zone and private hosted zone are not conflicting for the same name unless that is intentional.
- Lower TTL before planned migrations; after fixing, wait for TTL propagation or flush caches where possible (internal resolvers, CDN, local DNS caches).
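To estimate how long a stale answer can linger after the fix, it helps to read the TTL directly from the DNS answer. A small sketch, assuming input in the answer-section format that dig +noall +answer prints (name, TTL, class, type, data); the function name max_ttl is hypothetical.

```sh
#!/bin/sh
# max_ttl: read DNS answer lines on stdin and print the largest TTL in seconds.
# Expected line shape: "www.example.com. 300 IN A 203.0.113.10" (TTL is field 2).
# Prints 0 when no answer lines are present.
max_ttl() {
    awk '$3 == "IN" && $2 + 0 > max { max = $2 + 0 }
         END { print max + 0 }'
}

# Example (not run here; needs dig and network access):
#   dig +noall +answer A www.example.com @8.8.8.8 | max_ttl
```

When the query goes to a caching resolver, the TTL shown is the time remaining in that resolver's cache, so the printed value is an upper bound on how long that resolver may keep serving the old record.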
Verification
- Repeat DNS queries against authoritative servers and multiple public resolvers until they match expected values.
- Test HTTP/HTTPS by domain from at least two networks (e.g., your workstation and a cloud instance) to avoid local caching bias.
- If a load balancer is involved, verify that Host header routing (virtual hosts) serves the correct site for www.example.com.
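One way to check name-based routing before DNS caches have caught up is to send the request for the real hostname to a specific IP. The sketch below uses curl's --resolve option, which pins a host:port pair to an address while keeping the hostname in TLS SNI and the Host header. The wrapper name curl_pinned and the DRY_RUN switch are hypothetical conveniences.

```sh
#!/bin/sh
# curl_pinned HOST PORT IP [PATH]
# Request https://HOST:PORT/PATH but connect to IP instead of using DNS.
# Prints the HTTP status code. With DRY_RUN=1, only prints the command.
curl_pinned() {
    host=$1 port=$2 ip=$3 url_path=${4:-/}
    set -- curl -sS -o /dev/null -w '%{http_code}\n' \
        --resolve "$host:$port:$ip" "https://$host:$port$url_path"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Example (not run here; needs curl and network access):
#   curl_pinned www.example.com 443 203.0.113.10
```

If the pinned request succeeds while the normal request by name fails, the remaining problem is in DNS rather than in the server or load balancer.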
Case Study 3: HTTPS Fails After Deployment (Wrong Security Group Inbound Rule)
Symptoms
- HTTP (port 80) works, but HTTPS (port 443) times out or is refused.
- Health checks for HTTPS fail; monitoring shows the service “down” only for TLS.
- From inside the same subnet/VPC, HTTPS might work (depending on rules), but from the internet it fails.
Known Facts
- A new security group was attached during deployment, or rules were “tightened.”
- The application is listening on 443 (confirmed on the instance), or a load balancer expects 443 on targets.
- Cloud security groups are stateful; NACLs are stateless (both may apply).
Hypothesis
The inbound rule for TCP/443 is missing or too restrictive (wrong source CIDR, wrong SG reference, or only allowing 80). Alternatively, the load balancer security group allows 443 from the internet, but the instance security group does not allow 443 from the load balancer.
Tests
- From an external client, attempt a TCP connect to 443 and observe whether it times out (filtered) or is refused (reachable but not listening).
- Check security group inbound rules on the load balancer and on the instance/target.
- If using a load balancer, confirm the target security group allows inbound 443 from the load balancer security group (not from 0.0.0.0/0 unless intended).

    # From a client outside the VPC (or a test box)
    nc -vz www.example.com 443

    # Or test TLS handshake (shows more detail)
    openssl s_client -connect www.example.com:443 -servername www.example.com

Results
- TCP/443 connection times out from the internet.
- Instance security group has inbound 80 allowed, but 443 missing.
- Or: instance SG allows 443 only from a narrow CIDR that does not include the load balancer subnets, or it references the wrong SG.
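The timeout-versus-refused distinction used in the tests can be made explicit in a script. This is a sketch under stated assumptions: it relies on GNU timeout (which exits with status 124 when the time limit is hit) and on bash's /dev/tcp redirection; the function names classify_probe and probe are hypothetical.

```sh
#!/bin/sh
# classify_probe STATUS: map the exit status of a timed connect attempt
# to a rough verdict.
classify_probe() {
    case $1 in
        0)   echo "open" ;;              # TCP handshake completed
        124) echo "timeout" ;;           # no reply at all: typical of filtering
        *)   echo "refused-or-error" ;;  # immediate failure, e.g. connection refused
    esac
}

# probe HOST PORT [SECONDS]: attempt a TCP connect and print the verdict.
# Assumes GNU timeout and a bash built with /dev/tcp support.
probe() {
    timeout "${3:-5}" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
    classify_probe $?
}

# Example (not run here; needs network access):
#   probe www.example.com 443
```

A "timeout" verdict from the internet combined with "open" from inside the VPC points at an inbound rule problem rather than at the application.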
Fix
- Add inbound rule: TCP/443 from the correct source.
- If behind a load balancer: allow TCP/443 from the load balancer’s security group (preferred) or from the load balancer subnet CIDRs if SG referencing is not possible in your environment.
- Double-check NACLs for 443 inbound and ephemeral ports outbound/return if NACLs are restrictive.
Verification
- Re-run openssl s_client and confirm a successful handshake (certificate details appear and no timeout).
- Confirm load balancer target health for HTTPS becomes healthy.
- Validate from at least one external network and one internal network to confirm the intended exposure.
Case Study 4: Service-to-Service Calls Fail Across Subnets (Route Table or Network ACL)
Symptoms
- Service A (10.0.1.0/24) cannot call Service B (10.0.2.0/24) on TCP/8080.
- Calls work when both services are placed in the same subnet, but fail when separated.
- Ping might work (if allowed), but the application call fails; or everything fails depending on policy.
Known Facts
- Both subnets are in the same VPC (so “local” routing should exist).
- There is a network ACL attached to one or both subnets with custom rules.
- Security groups may be locked down to specific sources.
Hypothesis
Either (1) the route table association is wrong (subnet accidentally associated with a route table that does not include the local VPC route, or uses an overlapping/incorrect CIDR), or (2) a NACL is blocking either the inbound service port (8080) or the return ephemeral ports, causing one-way failure, or (3) security group rules do not allow the source subnet/SG.
Tests
- Confirm both instances’ IPs and subnets, and verify both subnets are truly in the same VPC CIDR.
- Check route tables associated with each subnet and ensure the VPC local route exists and is correct.
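- Check NACL rules for both subnets: Service B’s subnet needs inbound 8080 from Service A and outbound ephemeral ports back to Service A; Service A’s subnet needs the mirror image (outbound 8080 to Service B and inbound ephemeral ports from Service B).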
- Check security groups: Service B inbound should allow TCP/8080 from Service A’s security group (preferred) or from Service A subnet CIDR.

    # From Service A instance, test connectivity to Service B
    nc -vz 10.0.2.25 8080

    # If you have HTTP health endpoint
    curl -v http://10.0.2.25:8080/health

    # If you can capture packets (Linux)
    sudo tcpdump -ni eth0 host 10.0.2.25 and tcp port 8080

Results
- Route issue variant: Subnet 10.0.2.0/24 is associated with a route table intended for a peered VPC or a different environment; local route is missing or the VPC CIDR differs from expectation.
- NACL issue variant: Inbound allows 8080, but outbound denies ephemeral ports (e.g., 1024–65535), so SYN reaches Service B but SYN-ACK cannot return, causing timeouts.
- SG issue variant: Service B SG allows 8080 only from a different SG or CIDR; traffic is dropped at the instance boundary.
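The "SYN without SYN-ACK" pattern from the NACL variant can be spotted in saved tcpdump text output. A minimal sketch, relying on tcpdump's flag notation (a bare SYN prints as Flags [S], a SYN-ACK as Flags [S.]); the function name handshake_state is hypothetical.

```sh
#!/bin/sh
# handshake_state: read tcpdump text output on stdin and summarize
# how far the TCP handshake got.
handshake_state() {
    awk '
        /Flags \[S\],/   { syn++ }      # bare SYN
        /Flags \[S\.\],/ { synack++ }   # SYN-ACK
        END {
            if (synack > 0)   print "syn-ack-seen"
            else if (syn > 0) print "syn-only"
            else              print "no-syn"
        }'
}

# Example (not run here; needs tcpdump and root privileges):
#   sudo tcpdump -ni eth0 -c 20 host 10.0.2.25 and tcp port 8080 | handshake_state
```

On Service A, "syn-only" shows that no reply came back but not where it was lost; running the same capture on Service B tells you whether the SYN arrived there at all.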
Fix
- Route table: Associate the correct route table to each subnet; ensure the local VPC route exists and there are no overlapping CIDR mistakes. If using peering/transit, ensure routes to remote CIDRs are present in both directions.
- NACL: Add/adjust rules to allow the service port inbound and allow return traffic (ephemeral ports) in the opposite direction. Remember NACLs are stateless: you must allow both directions explicitly.
- Security groups: On Service B, allow inbound TCP/8080 from Service A’s security group (tightest) or from the specific source CIDR. Ensure Service A egress allows TCP/8080 to Service B.
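Because NACLs are stateless, it can help to check both directions of one flow against the allowed port ranges. The sketch below is a deliberately simplified model of the server-side subnet only: it ignores rule ordering, explicit deny rules, and source/destination CIDRs, and the function names port_allowed and nacl_path_ok are hypothetical.

```sh
#!/bin/sh
# port_allowed PORT RANGES: succeed if PORT lies inside any "lo-hi" range
# in the space-separated list RANGES.
port_allowed() {
    port=$1
    for r in $2; do
        lo=${r%-*} hi=${r#*-}
        [ "$port" -ge "$lo" ] && [ "$port" -le "$hi" ] && return 0
    done
    return 1
}

# nacl_path_ok SERVICE_PORT EPHEMERAL_PORT INBOUND_RANGES OUTBOUND_RANGES
# The request must be allowed in on the service port, and the reply must be
# allowed out toward the client's ephemeral port.
nacl_path_ok() {
    port_allowed "$1" "$3" || { echo "inbound $1 blocked"; return 1; }
    port_allowed "$2" "$4" || { echo "outbound reply to port $2 blocked"; return 1; }
    echo "ok"
}

# Example: inbound allows 8080, but outbound only allows 443.
#   nacl_path_ok 8080 50000 "8080-8080" "443-443"
```

The example in the closing comment reproduces the NACL variant of this case study: the request gets in, but the reply to the ephemeral port cannot leave.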
Verification
- Re-test with nc and curl from Service A to Service B.
- If you used tcpdump, confirm you now see a full handshake (SYN, SYN-ACK, ACK) and application data.
- Validate from the application layer (actual service call) to ensure no additional dependency (DNS, TLS, auth) is masquerading as a “network” problem.
Compact Reference: Common Hosting Connectivity Root Causes and Fast Confirming Tests
Routing and Egress
- Missing default route (0.0.0.0/0) in subnet route table (private subnet to NAT, public subnet to IGW). Test: inspect route table; attempt TCP to a public IP:443 from the instance.
- Wrong route table association (subnet attached to the wrong route table). Test: compare subnet associations between working and failing subnets.
- Asymmetric routing via peering/transit (route exists one way only). Test: verify routes in both VPCs/attachments; use traceroute from both sides where possible.
Security Groups and NACLs
- Missing inbound rule for the needed port (e.g., 443). Test: external nc -vz host port; check SG inbound sources.
- Wrong source scope (allowed from wrong CIDR or wrong SG reference). Test: confirm caller IP/CIDR or SG identity; temporarily widen source to confirm, then tighten correctly.
- NACL blocks return traffic (ephemeral ports) because it is stateless. Test: look for SYN without SYN-ACK using tcpdump; review NACL inbound/outbound rules for ephemeral ranges.
DNS and Naming
- Record points to wrong target (old IP, wrong LB, wrong zone). Test: dig against authoritative NS and multiple resolvers; compare A/AAAA/CNAME answers.
- IPv6 mismatch (AAAA exists but path/service not configured). Test: query AAAA; attempt connect to IPv6 address; remove/repair AAAA if unintended.
- Split-horizon confusion (public vs private hosted zone answers differ). Test: query from inside VPC resolver and from public resolver and compare.
Quick “Where Is It Breaking?” Tests
- Bypass DNS: connect to IP directly to separate name resolution from reachability.
- Port-specific reachability: nc -vz to distinguish “port blocked” from “host down.”
- TLS handshake visibility: openssl s_client to confirm 443 is reachable and see handshake failures vs timeouts.
- Packet-level truth: tcpdump to confirm whether SYN arrives and whether replies leave.
- Config diff: compare route tables, SGs, and NACLs between a working subnet/service and the failing one.
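The config diff can be as simple as comparing two exported text listings. A minimal sketch, assuming each config (routes, SG rules, or NACL entries) has been exported with one entry per line; the function name missing_in_failing and the file names are hypothetical, and mktemp is assumed to be available.

```sh
#!/bin/sh
# missing_in_failing WORKING_FILE FAILING_FILE
# Print entries present in the working config but absent from the failing one.
missing_in_failing() {
    tmp_a=$(mktemp) && tmp_b=$(mktemp) || return 2
    sort -u "$1" > "$tmp_a"
    sort -u "$2" > "$tmp_b"
    comm -23 "$tmp_a" "$tmp_b"   # lines found only in the first (working) file
    rm -f "$tmp_a" "$tmp_b"
}

# Example (file names are placeholders for listings you exported):
#   missing_in_failing working-routes.txt failing-routes.txt
```

Entries that differ only in an identifier (for example, a per-AZ NAT gateway) will also show up, so treat the output as a list of candidates to inspect rather than as confirmed faults.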