Let's start with the uncomfortable truth
You have a DR plan. It's probably in a SharePoint folder nobody remembers the path to, it references an AWS region you stopped using eighteen months ago, and it says your RTO is four hours. And somewhere in your org, a CISO presented that number to the board last quarter and everyone nodded.
Here's the thing — that four-hour RTO is a fantasy. Not a goal. A fantasy. And the worst part is, you probably won't find out it's a fantasy until the moment you absolutely cannot afford to find out.
I've watched this play out in real environments. A company gets hit with ransomware on a Thursday night — not a fun surprise, ransomware never is — and by Friday morning the incident response call has fifty people on it, half of whom have never spoken to each other. The first question out of the CISO's mouth is "what's our RTO?" Someone reads a number off a document. That number means nothing to the person who has to actually restore the database. By Saturday afternoon, they're calling vendors to ask questions they should have answered in 2022.
RTO, RPO, and the metric nobody puts in their BCP
Let's be precise, because sloppy terminology is how DR plans stay comfortably vague. Your Recovery Time Objective (RTO) is how long you can be down before the business bleeds out. Your Recovery Point Objective (RPO) is how much data loss is acceptable — how far back can you restore before regulators, customers, or your CFO start making calls you don't want to take. Both of these are business decisions masquerading as technical ones.
The metric almost nobody in the BCP document actually addresses is Maximum Tolerable Period of Disruption (MTPD) — sometimes written as MAO, Maximum Acceptable Outage. ISO 22301 defines it as the duration after which an organization's viability is threatened if a process cannot be resumed. That's the cliff edge. RTO should sit well inside it, not right up against it. When your RTO and MTPD are within hours of each other, you have zero margin for the actual messiness of incident response.
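One way to make that margin explicit is to compute it and refuse to sign off when it's gone. A toy sketch — the function name and the idea of encoding this as code at all are mine, not anything a standard prescribes:

```python
def recovery_margin(rto_hours: float, mtpd_hours: float) -> float:
    """Hours of slack between the recovery target and the viability cliff."""
    if rto_hours >= mtpd_hours:
        raise ValueError("RTO at or beyond MTPD: no margin for real incident response")
    return mtpd_hours - rto_hours

# A 4-hour RTO against a 24-hour MTPD leaves 20 hours of slack;
# a 4-hour RTO against a 6-hour MTPD leaves almost none.
print(recovery_margin(4, 24))  # 20
```

The point isn't the arithmetic — it's that the margin should be a number somebody owns, not an implication buried in two separate documents.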
Now here's the hidden RPO problem that will ruin your Thursday night: database replication lag. You've got MySQL or Postgres replicating async to a standby. You've declared a 15-minute RPO. What you haven't measured is that under peak write load, your replica is running 40 minutes behind. Your RPO is not 15 minutes. Your RPO is whatever your worst-case replication lag is, which you have almost certainly never measured under real traffic, and you will discover this fact at the worst possible time.
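You can measure this directly. On Postgres, the replica itself can tell you how far behind it is — the query below is the standard replica-side lag check; the sampling cadence and the helper function are my own sketch, not a product feature:

```python
# Run this on the replica: seconds since the last replayed transaction.
# (Standard Postgres lag query; the psycopg2 wiring is omitted here.)
LAG_SQL = "SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))"

def effective_rpo_minutes(declared_rpo_min: float,
                          observed_lag_samples_min: list[float]) -> float:
    """Your real RPO is the declared number or the worst observed lag,
    whichever is larger. Sample lag under peak write load, not at 3am."""
    return max([declared_rpo_min, *observed_lag_samples_min])

# Declared 15-minute RPO, but peak-hour samples show 40 minutes of lag:
print(effective_rpo_minutes(15, [4, 12, 40, 31]))  # 40
```

If the worst sample ever exceeds the declared number, the declared number is the fiction and the sample is the fact.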
NIST SP 800-34 Rev. 1 frames contingency planning around a Maximum Tolerable Downtime (MTD) derived from actual BIA findings, not guesses. Most organizations skip the BIA entirely or produce a BIA document that's so high-level it couldn't identify a critical process if the process wore a nametag. A useful BIA maps specific systems to specific revenue or operational impacts per hour of downtime, identifies upstream and downstream dependencies, and — critically — identifies what your recovery depends on that isn't in your control. Spoiler: a lot of it isn't in your control.
The "we tested failover in 2019" hall of shame
There is a category of DR plan that is worse than having no DR plan at all: the DR plan that was tested once, passed, and then never touched again while the infrastructure it describes was completely rebuilt around it.
I've seen this exact configuration: a warm standby environment that was perfectly valid when someone set it up three years ago, except the application now depends on three microservices that weren't in the architecture then, the IAM roles assumed during failover haven't been updated to include permissions for a payment processing integration added eighteen months ago, and the runbook references an Ansible playbook in a repo that was deprecated and archived. The warm standby exists. It would fail to actually serve traffic within about four minutes of cutover.
AWS gives you architectural options across a spectrum of recovery strategies — active-active, pilot light, warm standby, backup and restore — and each one has a wildly different actual RTO even if you've declared the same number in your BCP. A pilot light setup where you're keeping just the database replicated and the core AMIs current might have a stated RTO of one hour. But that one hour assumes someone knows what to do, can get into the AWS console, has the right permissions, and the CloudFormation or Terraform that spins up the rest of the stack actually still works. Have you run that Terraform recently? Against the current state of your infrastructure? If the answer is "not recently," your one-hour RTO is fiction.
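One cheap way to keep the pilot-light IaC honest is a scheduled `terraform plan -detailed-exitcode` against the DR workspace: exit code 0 means no drift, 2 means the stack no longer matches the config, 1 means the plan itself errored. A sketch of the interpretation logic — the scheduling and alerting wiring are yours to add:

```python
import subprocess

def classify_plan_exit(code: int) -> str:
    """-detailed-exitcode semantics: 0 = no changes, 1 = error, 2 = drift."""
    return {
        0: "clean: DR stack matches config",
        2: "drift: failover would not build what you think it builds",
    }.get(code, "error: the plan itself failed -- your stated RTO is fiction")

def check_dr_stack(workdir: str) -> str:
    """Run a read-only plan against the DR stack and classify the result."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false"],
        cwd=workdir, capture_output=True, text=True,
    )
    return classify_plan_exit(result.returncode)
```

An exit code of 2 on a Tuesday is an annoyance. The same exit code discovered mid-incident is your one-hour RTO turning into a weekend.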
The real problem with tabletop exercises — and I'll be blunt — is that most of them test nothing. They test whether people can read a scenario and discuss it. That's useful for communication planning and maybe for identifying obvious gaps in who's responsible for what. It does not tell you whether your backup restore actually works, whether your DNS failover propagates in time, or whether your application will start successfully when pointed at a replica database that's been in standby mode for eight months.
Ransomware recovery is not traditional DR and treating it like DR will get you destroyed
This is the one that really gets me. Traditional DR is designed around infrastructure failure — a data center goes dark, an availability zone has a network partition, your primary database server catches fire in a metaphorical or literal sense. The threat model is accidental. The data you're restoring from is assumed to be clean.
Ransomware is an adversarial event. The threat actor has been in your environment. They have likely been in your environment for weeks or months before they pulled the trigger. Your backups from two weeks ago might have the initial access mechanism in them. Your snapshots might have a webshell sitting in a temp directory. When you restore from those backups, you're potentially restoring the compromise along with the data.
This is why backup immutability is non-negotiable and why it addresses a different problem than backup existence. S3 Object Lock in compliance mode means nobody — including a compromised admin account, including root — can delete or modify those backups during the retention period. Veeam's hardened Linux repositories accomplish the same thing on-prem, using filesystem immutability and single-use credentials. You want your backups to be immutable not just because ransomware operators will try to delete them before detonating, but because you need to be able to trust the integrity of what you're restoring from.
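On the S3 side this is a bucket-level setting plus a default retention rule. A minimal boto3 sketch — the bucket name is hypothetical, and note that compliance mode genuinely cannot be shortened or removed until the retention period lapses, which is exactly the point and exactly why you should test it on a throwaway bucket first:

```python
def object_lock_configuration(retention_days: int) -> dict:
    """Default retention rule applied to every new object version.
    COMPLIANCE mode: nobody, including root, can delete or shorten it."""
    return {
        "ObjectLockEnabled": "Enabled",
        "Rule": {"DefaultRetention": {"Mode": "COMPLIANCE",
                                      "Days": retention_days}},
    }

def apply_object_lock(bucket: str, retention_days: int) -> None:
    # Object Lock must be enabled when the bucket is created
    # (ObjectLockEnabledForBucket=True on create_bucket); it cannot
    # be bolted onto an ordinary existing bucket.
    import boto3  # lazy import so the config helper works offline
    boto3.client("s3").put_object_lock_configuration(
        Bucket=bucket,
        ObjectLockConfiguration=object_lock_configuration(retention_days),
    )
```

Governance mode exists for a reason — it's bypassable by principals with the right permission, which makes it the wrong mode for the ransomware threat model described above.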
But here's the second problem: even with clean, immutable backups, your ransomware recovery timeline is not your DR timeline. After a ransomware incident, you're dealing with active incident response running in parallel with recovery. You're making decisions about what to restore first while you don't fully understand the scope of the breach. You're spinning up systems and then taking them back down because forensics needs them. You're working with external IR firms who are on a different timeline and have different priorities than your ops team. Maersk's recovery from NotPetya in 2017 involved rebuilding roughly 45,000 PCs, 4,000 servers, and 2,500 applications across 130 countries. They did it in ten days. That was considered remarkable speed. Their "RTO" for their DR plan was certainly not ten days. Reality doesn't care about your plan.
The lesson from Maersk isn't that they failed — it's that ten days of chaos was the outcome even for a massive organization with significant resources throwing everything at recovery. The differentiating factor was that they had a functional network team who understood the infrastructure and knew how to rebuild from scratch. The plan was irrelevant. The institutional knowledge and the clean offsite copies that the adversary hadn't reached were what mattered.
Your communication plan assumes the tools are up. They won't be.
Quick scenario: your Microsoft 365 tenant is partially down because it shares an identity plane with the environment you just isolated. Or us-east-1 is having a bad night and your Slack is degraded. Or — a personal favorite — your DR runbook is in Confluence, and Confluence is on the same network segment you just cut off.
Almost every communication plan in every BCP I've ever read lists Slack or Teams as the primary incident communication channel. These are SaaS tools that have their own dependencies. During the AWS us-east-1 incident in December 2021, a significant number of AWS-dependent SaaS tools were either degraded or completely unavailable. If your company runs on those SaaS tools and your DR plan assumes you'll coordinate through them, you've built a circular dependency into your crisis response.
You need out-of-band communication paths that don't depend on the infrastructure you're trying to recover. This means phone trees with actual phone numbers — personal mobile numbers, not office extensions routed through a VoIP system that's down. It means a war room that isn't a Teams meeting. It means having your runbooks somewhere offline accessible, or at minimum on a platform with a completely different dependency chain than your primary environment.
ISO 22301 explicitly requires communication procedures that account for degraded or unavailable primary communication systems. Most implementations I've seen give this a paragraph and move on. It deserves a full procedure.
What actually makes a DR test worth running
Stop doing tabletops and calling them DR tests. They're useful for process alignment and they have a place, but they are not a substitute for exercising the actual recovery path. What you actually need to run:
- Full failover tests in production-like conditions — not on a Sunday morning when traffic is at 3% of peak, not with a pre-announcement to the ops team. Run a game day that mimics real conditions. Kill something that matters. See what happens.
- Restore-from-backup validation on a defined cadence — pick a random backup every month, restore it to an isolated environment, verify the data integrity, verify the application actually starts. Document the actual time it took. That's your real RTO data point, not an estimate.
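A monthly drill can be as simple as a script that wraps whatever your restore actually is and records the wall-clock time. Everything here except the timing logic is a placeholder for your own restore and health-check commands:

```python
import json
import time
from datetime import datetime, timezone

def run_restore_drill(restore, health_check, log_path="restore-drills.jsonl"):
    """Time a real restore end to end and append the result to a log.
    `restore` and `health_check` are your own callables -- placeholders here."""
    started = time.monotonic()
    restore()                 # e.g. pg_restore into the isolated environment
    healthy = health_check()  # does the application actually start and answer?
    elapsed_min = (time.monotonic() - started) / 60
    record = {
        "when": datetime.now(timezone.utc).isoformat(),
        "minutes": round(elapsed_min, 1),
        "app_started": healthy,
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record
```

The log file is the valuable artifact. Six months of these records tells you more about your real RTO than any architecture diagram.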
Your runbooks should be written by the person who will actually execute them at 2am, not the person who designed the architecture. There is a meaningful difference between documentation written to describe a system and documentation written to recover a system while tired, stressed, and probably on a call with your CISO at the same time. The latter is a step-by-step procedure with specific commands, expected outputs, and decision points. If your runbook says "restore the database" without specifying the exact command, the target host, the credentials location, and how to verify it worked, it's not a runbook. It's an outline.
Track your actual recovery metrics. Every time you do a restore or a failover test, record the real time from incident declaration to service restoration. Build a historical dataset. Your RTO on paper should be a target derived from your worst observed actual recovery time, not a number someone picked because it sounded reasonable in a meeting.
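With a few drills logged, the paper number stops being a guess. A sketch of deriving the target from the history — the padding factor is an arbitrary choice of mine, standing in for the chaos a real incident adds that a scheduled drill doesn't:

```python
def evidence_based_rto(observed_minutes: list[float],
                       padding: float = 1.25) -> float:
    """RTO target = worst observed recovery time plus a margin for the
    incident-response overhead a clean drill never captures."""
    if not observed_minutes:
        raise ValueError("no drill data: any RTO you publish is a guess")
    return max(observed_minutes) * padding

# Drills took 190, 240, and 660 minutes; the 11-hour outlier is the one
# that sets the number you can defend to the board.
print(evidence_based_rto([190, 240, 660]))  # 825.0
```

Note that it's the maximum, not the average. The board isn't asking how long recovery takes on a good day.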
The gap between your documented RTO and your actual recovery capability is a risk. It belongs in your risk register with a real likelihood and impact assessment, not buried in a DR document that gets reviewed annually by someone who doesn't run the infrastructure. If your CISO is presenting a four-hour RTO to the board and your last actual restore took eleven hours on a good day, that's a material misrepresentation of your risk posture. Own the gap or close it. Don't hide it in a document nobody reads.