A certificate has expired in production. Customers are seeing browser warnings, API calls are failing, and support tickets are piling up. Every minute the service is down is measurable revenue loss and reputation damage.

At this point, response matters more than prevention. This playbook covers the exact order of operations to restore service, what to check along the way, and what to document before the incident closes.

The First Five Minutes

Speed matters, but so does not making things worse. Start here:

  1. Confirm the failure mode. Is it actually expiration, or a related failure (chain break, hostname mismatch, revocation) that's easy to mistake for expiry? Use the SSL Certificate Check on Mr. DNS to see the full certificate details without logging into the server. It will show the exact notAfter date, chain status, and SANs.
  2. Identify the scope. One hostname or many? One server or a pool? The answer determines whether you're fixing a single deployment or a systemic renewal failure.
  3. Check if automation already succeeded on some hosts. If certbot ran correctly on 3 of 5 backend servers, you have a configuration drift problem, not a certificate issue. The fix is to redeploy from the working host.
  4. Do not revoke the old certificate. It's already invalid by expiry. Revoking adds nothing and complicates the audit trail.

Keep this phase to five minutes. After that, you're implementing fixes.
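The expiry check itself takes seconds with openssl. A minimal local sketch (assuming openssl is installed; the generated cert and filenames here are stand-ins so the commands are reproducible — in a live incident you would fetch the cert the server actually presents, e.g. `echo | openssl s_client -connect example.com:443 -servername example.com 2>/dev/null | openssl x509 -outform pem > live.pem`):

```shell
# Generate a short-lived cert locally so the check below is reproducible.
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
  -subj "/CN=example.com" -keyout key.pem -out live.pem 2>/dev/null

# notAfter gives the exact expiry; -checkend N exits nonzero if the cert
# expires within N seconds (172800 = 2 days), a crisp yes/no answer.
openssl x509 -noout -enddate -in live.pem
openssl x509 -checkend 172800 -in live.pem || echo "EXPIRING: renew now"
```

If `-checkend` passes but clients still see errors, the problem is a chain break or hostname mismatch, not expiry, and the recovery paths below are the wrong fix.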

Fast Recovery Paths, Ranked by Speed

In descending order of how quickly they restore service:

1. Renew via existing automation (2 to 10 minutes)

If automation exists but failed earlier, force a manual renewal. For certbot:

certbot renew --force-renewal
systemctl reload nginx

If this succeeds, you're done. Investigate why the automatic timer didn't fire after the incident is resolved.

2. Issue a new certificate from the same CA (5 to 15 minutes)

If automation is broken or was never configured, request a new certificate directly. For Let's Encrypt, certbot can issue on demand (the standalone authenticator binds port 80, so briefly stop anything listening there, or use --webroot to avoid downtime):

certbot certonly --standalone -d example.com

For commercial CAs, the fastest path is usually their ACME endpoint or API rather than a manual portal workflow. Rate limits and domain validation still apply.
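If the commercial CA supports ACME, certbot can usually talk to it directly. A hypothetical sketch — the directory URL is a placeholder for your CA's documented endpoint, and the EAB credentials come from the CA's portal:

```shell
# Issue from a non-Let's-Encrypt ACME endpoint (URL is a placeholder).
# --eab-kid / --eab-hmac-key supply the External Account Binding
# credentials that most commercial ACME endpoints require.
certbot certonly --standalone \
  --server https://acme.example-ca.com/directory \
  --eab-kid "$EAB_KID" --eab-hmac-key "$EAB_HMAC" \
  -d example.com
```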

3. Deploy a standby certificate (1 to 5 minutes if you have one)

Some organizations keep a pre-issued backup certificate for exactly this scenario, ideally with overlapping SANs. If you have one, deploying it is faster than issuing a new one. If you don't, add this to your postmortem actions.
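Deploying a standby cert is a copy and a reload. A hypothetical sketch — the paths are placeholders for wherever your standby material and server config actually live:

```shell
# Install the pre-issued standby key and cert (placeholder paths),
# then validate the config before reloading so a typo can't take
# the service fully down.
install -m 600 /secure/standby/example.com.key /etc/ssl/private/example.com.key
install -m 644 /secure/standby/example.com.crt /etc/ssl/certs/example.com.crt
nginx -t && systemctl reload nginx
```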

4. Temporary workaround: self-signed certificate

Only use this for internal services where clients can be instructed to bypass the warning, never for public-facing endpoints. A self-signed cert is a last resort that stops error logs but doesn't restore user service.
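If it comes to that, generating one is a single openssl command. A sketch assuming openssl 1.1.1 or later (for `-addext`); substitute your internal hostname:

```shell
# 7-day self-signed cert for an internal service. Include a SAN:
# modern clients ignore the CN for hostname matching.
openssl req -x509 -newkey rsa:2048 -nodes -days 7 \
  -subj "/CN=internal.example.com" \
  -addext "subjectAltName=DNS:internal.example.com" \
  -keyout selfsigned.key -out selfsigned.crt 2>/dev/null

# Confirm what was generated before deploying it.
openssl x509 -noout -subject -enddate -in selfsigned.crt
```

Keep the lifetime short on purpose: a 7-day cert forces you to replace it with a real one instead of letting the workaround become permanent.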

Common Complications

CAA records block issuance. If DNS has a CAA record listing only one CA and you're trying to issue from a different one, validation will fail, often with an unhelpful error at the client. Verify CAA with the DNS Lookup on Mr. DNS.
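From the command line, a quick CAA check looks like this (assuming dig is available; an empty answer means any CA may issue):

```shell
# CAA lookup climbs the tree: if the exact name has no record,
# the CA checks the parent zone, so check both.
dig +short CAA www.example.com
dig +short CAA example.com
```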

Rate limits block renewal. Let's Encrypt enforces per-domain and per-account rate limits. If previous failed renewal attempts consumed the budget, you may need to wait or use a staging environment. The Let's Encrypt rate limits documentation has current numbers.

DNS propagation delays DNS-01 challenges. If you use DNS-01 validation and recently made DNS changes, the challenge record may not be visible to the CA yet. Verify propagation at the DNS Propagation Checker before retrying.
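Before asking the CA to retry, confirm the challenge record is visible from public resolvers, not just your authoritative server. A sketch assuming dig:

```shell
# The CA resolves this name itself; if public resolvers can't see the
# TXT record yet, retrying issuance just burns rate-limit budget.
dig +short TXT _acme-challenge.example.com @8.8.8.8
dig +short TXT _acme-challenge.example.com @1.1.1.1
```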

Multiple hosts behind one hostname. If your DNS points to three load balancer IPs and the certificate was only updated on two, a third of requests still fail. Verify every backend, not just the first IP the DNS query returns.
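A quick way to sweep every backend, assuming dig and openssl (substitute your hostname):

```shell
# Connect to each A record by IP while sending the real hostname via
# SNI, so each backend presents the cert it would serve to clients.
HOST=example.com
for ip in $(dig +short A "$HOST"); do
  exp=$(echo | openssl s_client -connect "$ip:443" -servername "$HOST" \
          2>/dev/null | openssl x509 -noout -enddate)
  echo "$ip  $exp"
done
```

Any IP showing the old notAfter date is a backend the deployment missed.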

STARTTLS protocols. Mail servers, LDAP directories, and FTP servers need the certificate deployed to the service's specific configuration, not just to the web server running on the same host. A cert that works for HTTPS may not be loaded by Postfix until it's explicitly configured there.
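openssl can speak STARTTLS directly, so you can check what the mail service (not the web server) actually presents. A sketch with a placeholder hostname:

```shell
# -starttls smtp negotiates STARTTLS on port 25 before the TLS
# handshake, showing the cert Postfix (or similar) is really serving.
echo | openssl s_client -connect mail.example.com:25 -starttls smtp \
  2>/dev/null | openssl x509 -noout -subject -enddate
```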

Verifying the Fix

Before calling the incident resolved, verify from multiple vantage points:

  • Check the new certificate's notAfter date is at least 30 days out
  • Validate the full chain, not just the leaf certificate
  • Query the hostname from several locations (automated monitoring does this continuously; for manual verification, Mr. DNS Propagation is useful)
  • Confirm every IP behind the DNS hostname serves the new certificate
  • Test any dependent services (mail client connections, API integrations, mTLS peers)

A "fix" that only works from the deployment server or from one backend isn't a fix.

Postmortem Priorities

After service is restored, the postmortem should answer three questions:

  1. Why did we not know this was about to expire? If monitoring existed, why didn't it alert? If it didn't exist, why not?
  2. Why did automation fail? A renewal process that silently stopped working at some point needs a root cause, not just a manual intervention.
  3. What else is at risk? If one certificate's renewal failed, others may be failing silently. Audit the full inventory.
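For the inventory audit, a small helper that reports days until expiry makes silently failing renewals surface early. A sketch assuming openssl and GNU date; the demo cert is generated locally so the output is reproducible — in a real audit you would loop over every cert file (or every host's fetched live cert) in the inventory:

```shell
# Print whole days of validity remaining for a PEM cert file.
days_left() {  # usage: days_left /path/to/cert.pem
  exp=$(openssl x509 -noout -enddate -in "$1" | cut -d= -f2)
  echo $(( ( $(date -d "$exp" +%s) - $(date +%s) ) / 86400 ))
}

# Demo on a locally generated 90-day cert.
openssl req -x509 -newkey rsa:2048 -nodes -days 90 \
  -subj "/CN=audit-demo" -keyout ak.pem -out audit.pem 2>/dev/null
days_left audit.pem
```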

The most common postmortem finding is that monitoring was either missing or configured with an alert threshold too close to expiry. SSL certificate expiration monitoring covers the configuration details: choosing the right alert window, covering internal certificates, and validating the full chain.

Preventing the Next One

A good response playbook is one you don't use. Generator Labs certificate monitoring tracks every certificate in your infrastructure, validates the full chain across every backend IP, and alerts on expiration, hostname mismatches, chain breaks, and CA trust failures. When configured with a sensible alert threshold, certificate expiry stops being an incident class and becomes a routine maintenance notification.
