
Cloud-Native Web Serving with Kubernetes Ingress and Service Mesh

Diagnosing Certificate and TLS Handshake Failures

Chapter 18



What “TLS handshake failure” means in Kubernetes and meshes

A TLS handshake failure is any situation where a client and server cannot complete the cryptographic negotiation required to establish an HTTPS connection. In cloud-native environments, this negotiation often involves multiple hops: a client to an Ingress controller or gateway, the gateway to an upstream service, and sometimes sidecars or L7 proxies in between. A failure can happen at any hop, and the error you see depends on where you observe it: the browser might show a generic “SSL error,” curl might show “handshake failure,” and a proxy might log “TLS alert” or “certificate verify failed.” The key diagnostic skill is to identify which hop is failing and why: certificate validity, hostname mismatch, missing intermediate CA, protocol/cipher mismatch, SNI routing issues, or client authentication requirements.

In Kubernetes, certificates are commonly stored as Secrets and mounted into Ingress controllers, gateways, or workloads. In service mesh deployments, certificates may be issued dynamically by a mesh CA and rotated frequently. This creates additional failure modes: stale secrets, mismatched trust bundles, proxies not reloading updated certs, or time skew causing “not yet valid” errors. Diagnosing handshake failures is therefore a combination of: (1) understanding the handshake expectations (server name, trust chain, client auth), (2) collecting evidence from the right place (client, proxy, server), and (3) verifying the certificate chain and configuration that each hop uses.

Common symptoms and how to map them to likely causes

Different tools and components surface different error strings. Mapping symptoms to causes helps you choose the next check instead of guessing.

“x509: certificate signed by unknown authority”: the client cannot validate the server’s chain to a trusted root (missing CA bundle, missing intermediate, wrong root, or a corporate MITM proxy).

“x509: certificate is valid for X, not Y”: the certificate’s Subject Alternative Names do not include the hostname you used, or SNI/Host routing sent the request to the wrong certificate.

“certificate has expired” or “not yet valid”: the certificate validity window is wrong (expired certificate, failed rotation, or clock skew).

“tls: handshake failure” with little detail: often a protocol/cipher mismatch (for example, the client only supports TLS 1.0/1.1 while the server requires TLS 1.2+), or the server requires a client certificate that the client did not present.

“unexpected EOF” or “connection reset by peer” during the handshake: the peer may be closing early because it cannot load its key/cert, cannot match the SNI, or is rejecting the connection due to policy.

In Ingress and gateway scenarios, a frequent pattern is: browser fails, but internal service is healthy. That often indicates the edge certificate is wrong, the Ingress controller is serving a default certificate, or the request is hitting the wrong Ingress class or IP. In mesh scenarios, a frequent pattern is: edge TLS works, but upstream TLS fails between gateway and service because the gateway is configured to originate TLS to the backend with an incorrect SNI, incorrect CA bundle, or because the backend expects plain HTTP while the gateway is attempting HTTPS (or vice versa).

Step-by-step workflow: isolate the failing hop

Start by drawing the path and labeling where TLS terminates and where it is re-established. For example: Client → Ingress (TLS termination) → Service (HTTP) is different from Client → Ingress (TLS termination) → Upstream (TLS origination) or Client → Gateway (TLS passthrough) → Service (TLS termination). Your first step is to identify the termination point by checking configuration: Ingress annotations, Gateway listeners, and upstream cluster settings. Then test each hop independently with a tool that shows handshake details.


Step 1: Confirm DNS and the endpoint you are reaching. Resolve the hostname and ensure it maps to the expected load balancer IP. If multiple Ingress controllers exist, confirm the IngressClass and the controller’s service IP match. A wrong endpoint often leads to “wrong certificate” because you are talking to a different controller that serves a default cert.
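A minimal sketch of this step, assuming a hypothetical hostname app.example.com, namespace app-ns, and Ingress name my-ingress (substitute your own):

```shell
# Resolve the public hostname and note the IP it maps to.
dig +short app.example.com

# List LoadBalancer services cluster-wide and compare their external IPs
# to the DNS answer; a mismatch means you are hitting a different controller.
kubectl get svc -A -o wide | grep -i loadbalancer

# Confirm which IngressClass the Ingress actually uses.
kubectl get ingress -n app-ns my-ingress -o jsonpath='{.spec.ingressClassName}'
```

If the resolved IP belongs to a different controller's service, that controller will likely serve its default certificate, producing the “wrong certificate” symptom described above.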

Step 2: Test from outside with SNI and full verbosity. Use OpenSSL or curl to see which certificate is presented and whether the chain verifies. This tells you whether the edge is serving the expected certificate and whether clients can validate it.

Step 3: Test from inside the cluster. Run the same checks from a debug pod to eliminate corporate proxies, local trust stores, and egress filtering. If it works inside but fails outside, the issue is likely edge exposure, external DNS, or external trust requirements. If it fails inside too, focus on the Ingress/gateway configuration and secrets.

Step 4: If there is TLS origination from a gateway to an upstream, test the upstream directly (ClusterIP/Pod IP) and compare the certificate and SNI requirements. Many upstreams present different certificates based on SNI, so testing by IP without SNI can mislead you.

Practical commands: inspect the server certificate and chain

Use OpenSSL to retrieve the certificate chain and see handshake negotiation details. Always include SNI using -servername when testing a hostname, because modern proxies and servers rely on SNI to select the correct certificate.

openssl s_client -connect example.com:443 -servername example.com -showcerts

Key things to look for in the output: the leaf certificate Subject and SANs, the issuer, whether intermediate certificates are sent, and the verify return code. If you see Verify return code: 0 (ok), the chain validated against your local trust store. If you see unable to get local issuer certificate or unable to verify the first certificate, the server likely did not send required intermediates or the client lacks the correct root/intermediate.
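To validate the chain offline, one approach is to capture everything the server sends and run openssl verify against your own CA bundle. This is a sketch assuming a placeholder host example.com and a CA bundle at ca.pem:

```shell
# Save every certificate the server presents into one PEM bundle.
openssl s_client -connect example.com:443 -servername example.com -showcerts </dev/null \
  | awk '/BEGIN CERT/,/END CERT/' > chain.pem

# openssl x509 reads only the first certificate in a bundle, which is the leaf.
openssl x509 -in chain.pem -out leaf.pem

# Verify the leaf against the CA bundle, treating the rest of the served chain
# as untrusted intermediates. "leaf.pem: OK" means the chain validates.
openssl verify -CAfile ca.pem -untrusted chain.pem leaf.pem
```

This reproduces offline what strict clients do at connect time, so a failure here usually predicts “unknown authority” errors in Go or Java clients.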

To check the negotiated protocol and cipher, look for lines like Protocol : TLSv1.3 and Cipher : TLS_AES_256_GCM_SHA384. If the handshake fails before that, try forcing a protocol version to detect mismatches:

openssl s_client -connect example.com:443 -servername example.com -tls1_2

For HTTP-level confirmation plus certificate verification details, curl is often clearer:

curl -v https://example.com/

If you need to test with a specific CA bundle (for private PKI), use:

curl -v --cacert /path/to/ca.pem https://example.com/

Validate certificate contents: SANs, key usage, and expiration

Once you have the certificate, inspect it locally. A common failure is a certificate that lacks the correct SAN entry for the hostname (especially when migrating from a wildcard to a specific name, or when adding new subdomains). Another common issue is using a certificate intended for client authentication as a server certificate (wrong Extended Key Usage), which some clients reject.

openssl x509 -in server.crt -noout -text

Check these fields: Subject Alternative Name includes the hostname; Not Before and Not After are valid; Key Usage and Extended Key Usage include server auth; and the public key algorithm matches what your environment supports. If the certificate is ECDSA-only and some legacy clients require RSA, you may see handshake failures on older clients. In that case, consider serving dual certificates or using an RSA certificate at the edge depending on your client population.
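The checks above can be run as targeted one-liners instead of scanning the full -text dump (server.crt is the certificate under test; the -ext flag requires OpenSSL 1.1.1 or newer):

```shell
openssl x509 -in server.crt -noout -ext subjectAltName     # SAN entries
openssl x509 -in server.crt -noout -dates                  # Not Before / Not After
openssl x509 -in server.crt -noout -ext extendedKeyUsage   # should include server auth
openssl x509 -in server.crt -noout -checkend 1209600       # warns if expiring within 14 days
```

The -checkend form is convenient in scripts and monitoring: it prints whether the certificate will expire within the given number of seconds and sets the exit code accordingly.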

Kubernetes Secrets and Ingress: verify the right secret is used

At the edge, a very common root cause is that the Ingress controller is not using the secret you think it is. This can happen if the secret is in a different namespace, the secret name is wrong, the Ingress references a secret that does not exist, or the controller does not have permission to read it. The controller may then fall back to a default certificate, which leads to hostname mismatch errors.

Step 1: Check the Ingress TLS section and secret name, and confirm the secret exists in the same namespace as the Ingress (for standard Ingress behavior).

kubectl get ingress -n app-ns my-ingress -o yaml
kubectl get secret -n app-ns my-tls-secret -o yaml

Step 2: Confirm the secret type and keys. For typical TLS secrets, you want type: kubernetes.io/tls and data keys tls.crt and tls.key. If the secret was created incorrectly (for example, generic secret with different key names), some controllers will not load it.
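If the secret was created with the wrong type or key names, the simplest fix is usually to recreate it with kubectl create secret tls, which always produces type kubernetes.io/tls with data keys tls.crt and tls.key (secret, namespace, and file names below are placeholders):

```shell
kubectl create secret tls my-tls-secret \
  --cert=fullchain.pem --key=tls.key \
  -n app-ns --dry-run=client -o yaml | kubectl apply -f -
```

The dry-run/apply pattern lets you update an existing secret in place instead of deleting and recreating it.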

Step 3: Decode and inspect the certificate from the secret to ensure it matches the expected hostname and chain.

kubectl get secret -n app-ns my-tls-secret -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -subject -issuer -dates

Step 4: Check the Ingress controller logs for secret load errors, permission issues, or reload failures. Look for messages about “error obtaining X509 certificate,” “no such file,” “permission denied,” or “falling back to default certificate.”

kubectl logs -n ingress-nginx deploy/ingress-nginx-controller | grep -iE 'cert|tls|secret|x509'

Intermediate CA problems: why “it works in my browser” can still fail

Browsers often cache intermediate certificates and may successfully build a chain even if the server fails to send the intermediate. Many non-browser clients (Go, Java, some API gateways) are stricter and will fail with “unknown authority” if the server does not provide the full chain. In Kubernetes, this frequently happens when you store only the leaf certificate in tls.crt instead of the full chain (leaf + intermediate(s)). The fix is usually to concatenate the leaf certificate and intermediate certificates into tls.crt in the correct order (leaf first, then intermediates), and update the secret.
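A sketch of the fix, assuming the leaf and intermediate are in separate PEM files named leaf.crt and intermediate.crt:

```shell
# Build the full chain in the required order: leaf first, then intermediates.
cat leaf.crt intermediate.crt > fullchain.crt

# Sanity check: openssl x509 reads only the first certificate in a bundle,
# so the subject printed here should be the leaf, not the intermediate.
openssl x509 -in fullchain.crt -noout -subject
```

Use the resulting fullchain.crt as the tls.crt value when updating the secret; the private key stays unchanged.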

After updating the secret, confirm the controller reloads it. Some controllers reload automatically; others require a pod restart or a specific reload mechanism. If you suspect stale configuration, compare the certificate served at the edge (via openssl s_client) with the certificate stored in the secret. If they differ, the controller is not serving the updated secret.
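A quick way to compare the served certificate with the stored one is to compute a SHA-256 fingerprint of each; the host, namespace, and secret names below are placeholders:

```shell
# Fingerprint of the certificate actually served at the edge.
openssl s_client -connect example.com:443 -servername example.com </dev/null 2>/dev/null \
  | openssl x509 -noout -fingerprint -sha256

# Fingerprint of the certificate stored in the secret; the two must match.
kubectl get secret -n app-ns my-tls-secret -o jsonpath='{.data.tls\.crt}' \
  | base64 -d | openssl x509 -noout -fingerprint -sha256
```

If the fingerprints differ, the controller has not picked up the updated secret and needs a reload or restart.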

Time skew and rotation: “not yet valid” and sudden outages

Certificate validity is time-based, so clock issues can cause sudden handshake failures. In clusters, time skew can occur if nodes drift, NTP is blocked, or containers run with incorrect time settings (less common, but possible in certain environments). If you see “certificate not yet valid,” check the node time and the certificate’s Not Before field. If you see “expired,” verify whether rotation occurred and whether the new secret was distributed and reloaded.

Practical checks: inspect the certificate dates, then check node time on the nodes hosting the relevant pods. If you cannot SSH, you can often infer time skew by comparing pod logs timestamps with an external reference, but the most reliable method is checking node time directly via your infrastructure tooling. Also verify that any certificate automation (for example, a controller that renews certificates) is healthy and has permissions to update secrets.
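A sketch of the comparison, assuming a certificate file server.crt and a hypothetical node named worker-1 (kubectl debug node requires a reasonably recent Kubernetes version):

```shell
# The certificate's validity window, next to the current UTC time.
openssl x509 -in server.crt -noout -startdate -enddate
date -u

# Check a node's clock without SSH by running a throwaway debug container on it.
kubectl debug node/worker-1 -it --image=busybox -- date -u
```

If the node's clock falls outside the notBefore/notAfter window, fix time sync before touching any certificates.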

SNI and multi-cert hosting: wrong certificate served

When multiple hostnames share the same IP and port, the server uses SNI to decide which certificate to present. If SNI is missing or incorrect, you may get a default certificate. This is common when testing by IP address, when a client library does not set SNI, or when a gateway is configured with an upstream TLS connection but does not set the correct SNI for the backend. The symptom is usually a hostname mismatch: the certificate is valid, but for a different name.

To diagnose, always test with -servername in OpenSSL and ensure your client is using the correct hostname. If you are configuring a gateway to originate TLS to an upstream, ensure the upstream SNI is explicitly set to the backend’s certificate name (often the DNS name, not the Kubernetes service name unless that is what the certificate contains). If the backend certificate is issued for service.namespace.svc but the gateway uses service or an external name, verification will fail.

Client certificate requirements: handshake fails before HTTP

Some endpoints require client certificates (mTLS at the edge or for specific upstreams). If a server requests a client certificate and the client does not provide one, the handshake may fail with alerts like “handshake failure” or “bad certificate,” and you will never reach HTTP. This can happen unintentionally if a gateway is configured to require client certs for all paths, or if you enabled a strict client-auth mode while only some clients have certificates.

To test, use OpenSSL with a client certificate and key:

openssl s_client -connect example.com:443 -servername example.com -cert client.crt -key client.key

If this succeeds while the no-cert test fails, the server requires client authentication. Next, confirm which CA the server trusts for client certificates; presenting a client cert signed by an untrusted CA will still fail. Also verify that the client certificate has the correct Extended Key Usage for client auth.
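Both follow-up checks can be done locally on the client certificate (client.crt and the CA bundle name below are placeholders; -ext requires OpenSSL 1.1.1+):

```shell
# Confirm the certificate is usable for client authentication; expect
# "TLS Web Client Authentication" in the output.
openssl x509 -in client.crt -noout -ext extendedKeyUsage

# Confirm the client certificate chains to a CA the server trusts.
openssl verify -CAfile server-trusted-clients-ca.pem client.crt
```

A client certificate with only serverAuth EKU, or one issued by a CA outside the server's client-trust bundle, will fail even though the file itself looks valid.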

Protocol and cipher mismatches: TLS versions, FIPS, and legacy clients

Handshake failures can occur when the client and server cannot agree on a protocol version or cipher suite. This is common after tightening security settings (disabling TLS 1.0/1.1, removing RSA key exchange, enforcing modern AEAD ciphers) or when running in constrained compliance modes. Some older clients or embedded devices may not support TLS 1.2+ or modern ciphers. Conversely, some servers may be configured to disallow TLS 1.3 due to policy or library constraints, causing issues with certain clients.

Diagnose by forcing versions in OpenSSL and by checking server configuration in the proxy/gateway. If you control the edge, decide on a supported client matrix and configure minimum TLS version accordingly. If you do not control clients, you may need a compatibility endpoint or separate listener with different TLS policy. In Kubernetes, ensure the Ingress/gateway TLS settings match your intended policy and that you are not unintentionally inheriting a restrictive global policy.
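The version probing can be scripted as a loop (example.com is a placeholder; note that some OpenSSL builds compile out the legacy -tls1 and -tls1_1 options, in which case those probes simply fail):

```shell
# Probe which TLS protocol versions the server will negotiate.
for v in tls1 tls1_1 tls1_2 tls1_3; do
  if openssl s_client -connect example.com:443 -servername example.com "-$v" \
       </dev/null >/dev/null 2>&1; then
    echo "$v: accepted"
  else
    echo "$v: rejected"
  fi
done
```

Comparing this output against your intended policy quickly reveals whether a “handshake failure” is a deliberate restriction or a misconfiguration.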

Backend TLS origination failures: gateway-to-service HTTPS issues

Even if edge TLS termination works, a gateway may be configured to connect to the upstream service using TLS (TLS origination). Failures here often look like 503s or upstream connect errors in gateway logs, but the root cause is still certificate validation. Typical causes include: the gateway does not trust the backend CA, the backend certificate name does not match the SNI/hostname used by the gateway, the backend serves an incomplete chain, or the backend expects plain HTTP while the gateway speaks HTTPS.

Step-by-step: (1) Identify whether the gateway is using HTTPS to the backend. (2) From the gateway’s network context (or a debug pod in the same namespace), run openssl s_client to the backend service DNS name and port with the expected SNI. (3) Verify the backend certificate SANs include that name. (4) Ensure the gateway has the correct CA bundle configured for backend validation. If the backend uses a private CA, you must explicitly provide that CA to the gateway; relying on system roots will fail.
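Steps 2 through 4 can be combined into one check from a debug pod; the service name, port, and CA path below are placeholders for your backend:

```shell
# Reproduce the gateway's view of the backend: same DNS name as SNI,
# same private CA bundle for validation.
openssl s_client \
  -connect my-service.app-ns.svc.cluster.local:8443 \
  -servername my-service.app-ns.svc.cluster.local \
  -CAfile /path/to/private-ca.pem -showcerts </dev/null
```

In the output, check both the verify return code (trust) and the leaf SANs (name match); either one failing explains the gateway's upstream connect errors.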

Debug pods and in-cluster testing: reproduce with minimal variables

To avoid chasing external factors, reproduce the handshake from inside the cluster using a temporary debug pod that includes OpenSSL and curl. This helps you confirm whether the issue is cluster-internal (secrets, proxy config, backend certs) or external (public DNS, CDN, corporate trust store). Use a short-lived pod in the same namespace and, if relevant, with the same service account as the gateway/workload to match network policies and permissions.

kubectl run -n app-ns tls-debug --rm -it --image=curlimages/curl:8.5.0 -- sh
curl -v https://my-service.app-ns.svc.cluster.local:443/

If you need OpenSSL and it is not present in the image, use an image that includes it or install it temporarily. The goal is not the specific image but the method: run the same handshake checks from the same network plane where the failing component runs.
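As one example of this method, nicolaka/netshoot is a commonly used debug image that bundles openssl, curl, and dig (the image choice is an assumption; any image with openssl works, and the service name is a placeholder):

```shell
kubectl run -n app-ns tls-debug --rm -it --image=nicolaka/netshoot -- \
  openssl s_client -connect my-service.app-ns.svc.cluster.local:443 \
  -servername my-service.app-ns.svc.cluster.local
```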

Reading proxy and gateway logs: what to search for

Ingress controllers, gateways, and sidecar proxies often log TLS errors with enough detail to pinpoint the cause. Search for: “x509,” “certificate verify failed,” “unknown ca,” “no peer certificate,” “tls alert,” “SNI,” “handshake error,” “SSL routines,” or “alert number.” Correlate timestamps with client attempts. If your proxy supports access logs with TLS fields, enable logging of SNI, upstream host, and TLS version/cipher to see patterns (for example, only certain hostnames fail, or only certain client TLS versions fail).

When you find an error, tie it back to one of the core categories: trust chain, name mismatch, validity window, client auth, or protocol/cipher mismatch. Then validate with a direct OpenSSL/curl test that reproduces the exact condition (same hostname, same SNI, same CA bundle, same client cert behavior). This loop—observe, hypothesize, reproduce, confirm—is the fastest way to resolve TLS handshake failures without guesswork.

Now answer the exercise about the content:

When testing an HTTPS endpoint that hosts multiple certificates on the same IP and port, what is the most reliable way to avoid receiving a default (wrong) certificate?


With multi-cert hosting, servers use SNI to choose which certificate to present. Including the hostname (for example, with -servername) prevents falling back to a default certificate and avoids hostname mismatch confusion.

Next chapter

Scaling and Performance Bottlenecks in Ingress and Web Server Layers
