Quarterly tested backup restores, with the recovery clock measured
Backups exist at most large organisations. Tested restores do not. The single difference between a six-day outage and a six-hour outage is whether the runbook has actually been run.
- Quadrant: Quick win
- Ease: 4 / 5
- Impact: 4 / 5
- Control family: Recovery
- Cost band: Low
- Catalogued incidents: 9
What the control is
The backup itself is the necessary precondition, not the control. The control is whether the restore from that backup has been tested under realistic conditions, with the clock running, by the people who would actually be doing it during an incident, against a system that genuinely matters to the business. Quarterly. With the metrics — recovery time, recovery point, failed steps, missing dependencies — written down and reported up.
The shape of a useful drill: pick a tier-1 system, declare it ransomwared in the exercise scenario, walk the actual runbook to rebuild it from backup. Measure the elapsed time end to end, including the restore-prerequisite steps that almost always get missed in planning — the AD trust restoration, the DNS, the certificate re-issuance, the secret rotation, the dependency chain on the systems below it in the stack. The metric the board needs is the elapsed time, not the line-item “yes we have backups.”
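A drill like this should produce a number, not a narrative. Below is a minimal sketch of a timing harness in Python, with hypothetical step names standing in for a real runbook: the operator confirms each step as it completes, and the script records per-step and end-to-end elapsed time, which is the measured recovery time the board needs.

```python
import time
from dataclasses import dataclass, field

@dataclass
class DrillReport:
    """Per-drill record: the metrics that get written down and reported up."""
    service: str
    steps: list[tuple[str, float]] = field(default_factory=list)

    @property
    def total_seconds(self) -> float:
        return sum(elapsed for _, elapsed in self.steps)

def run_drill(service: str, runbook: list[str]) -> DrillReport:
    """Walk the runbook in order; the operator confirms each step as it
    completes, and the elapsed time per step is recorded."""
    report = DrillReport(service)
    for step in runbook:
        start = time.monotonic()
        input(f"[{service}] do this step, press Enter when done: {step}")
        report.steps.append((step, time.monotonic() - start))
    return report

if __name__ == "__main__":
    # Hypothetical tier-1 runbook. Note the prerequisite steps that
    # planning usually misses: AD trust, DNS, certificates, secrets.
    report = run_drill("billing-db", [
        "restore AD trust for the recovery enclave",
        "restore DNS zones",
        "re-issue TLS certificates",
        "rotate service-account credentials",
        "restore database from the immutable copy",
        "smoke-test the application tier",
    ])
    for step, elapsed in report.steps:
        print(f"{elapsed:8.1f}s  {step}")
    print(f"measured recovery time: {report.total_seconds:.0f}s end to end")
```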
Why it matters
Backups by themselves do not save organisations. The catalogue is unusually clear on this. Colonial Pipeline (May 2021) had backups; the projected restore time was painful enough that paying the ransom looked like the faster path. JBS Foods paid eleven million dollars weeks later for the same reason. Ireland’s HSE (May 2021) spent months running emergency ambulance-routing manually because the restore process had never been rehearsed at scale and the dependency chain between systems wasn’t documented. The British Library is still, eighteen months on, restoring fragments of the catalogue that broke during the Rhysida ransomware event. Travelex went into administration within months of the Sodinokibi attack of December 2019. CDK Global, Kaseya and Jaguar Land Rover all sat in extended outages that compounded the original event.
Norsk Hydro is the counter-example most often cited because it earned the citation. The 2019 LockerGoga event hit them hard, but their offline-backup posture had been rehearsed, the runbook had been run, and the engineering culture treated the rebuild as a test of an existing process rather than an emergency. They restored from their own backups, refused to pay, and published the timeline. The difference between Norsk Hydro and Travelex is not whether they had backups — both did — but whether the restore had been rehearsed.
The same dynamic shows up in every ransomware incident in the catalogue. The organisations that recovered fastest had pre-built playbooks and had run them. The organisations that paid the ransom did so because the alternative — restoring from a backup nobody had tested — was a worse business risk than the cost of the cryptocurrency transfer.
Where the regulators sit
NCSC’s blog post “Offline backups in an online world” is the most-cited British piece of writing on this, and its framing is direct: backups are not a control unless the restore has been verified. NCSC’s Cyber Assessment Framework principle D1 (“Response and recovery planning”) requires that recovery procedures are exercised. NIST SP 800-34 Rev. 1 (“Contingency Planning Guide for Federal Information Systems”) specifies tabletop and full-rebuild exercises at defined cadences. CIS Controls v8 Control 11 (“Data Recovery”) makes recovery testing an explicit safeguard (11.5, “Test Data Recovery”). CISA’s Stop Ransomware guidance places tested backups at the top of the prevention pyramid. The Australian Essential Eight requires regular backups, with restoration tested as part of disaster-recovery exercises from maturity level one.
The unanimity here is older than most other controls in the catalogue. The argument has not moved in a decade.
Where it usually breaks
Three failure modes show up consistently. The first is the AD-trust dependency. Modern enterprise systems live in a tree of authentication dependencies that runs back to Active Directory or its cloud equivalent. If the restore order is wrong — if the application tier comes back before the identity tier, or if the hypervisor management plane comes back before its own authentication source — the restore stalls, and people discover this at three in the morning during a real incident. The fix is a documented and tested rebuild order.
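One way to keep that rebuild order testable is to hold it as data and derive the sequence mechanically. A minimal sketch using Python’s standard-library graphlib, with hypothetical service names: the dependency map is the runbook’s source of truth, and a circular dependency fails loudly during the drill rather than during the incident.

```python
from graphlib import TopologicalSorter, CycleError

# service -> the services it depends on to come back up (hypothetical names)
dependencies = {
    "active-directory": set(),
    "dns": {"active-directory"},
    "hypervisor-mgmt": {"active-directory", "dns"},
    "database-tier": {"hypervisor-mgmt", "dns"},
    "application-tier": {"database-tier", "active-directory"},
}

try:
    # static_order() yields each service only after everything it
    # depends on, so the identity tier always comes back first.
    order = list(TopologicalSorter(dependencies).static_order())
    print("rebuild order:", " -> ".join(order))
except CycleError as err:
    # A cycle means the runbook cannot be executed as written; this is
    # exactly what a drill should surface before the real incident does.
    print("circular dependency, fix the runbook:", err.args[1])
```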
The second is secret rotation. Backups contain credentials. After ransomware, every credential in the backup is suspect. The restore plan has to include a credential-rotation step before the restored systems come back online — service accounts, API keys, certificates, the lot. Most plans don’t, and the restored environment carries forward the same credentials the attacker has, which means the attacker walks back in.
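The rotation step is easy to state and easy to skip, so it helps to make it a hard gate in the runbook rather than a reminder. A minimal sketch, with hypothetical credential classes: nothing comes back online until every class in the backup’s inventory has been rotated.

```python
# Hypothetical inventory of credential classes present in the backup;
# a real runbook would enumerate these per service.
REQUIRED_ROTATIONS = {
    "domain service accounts",
    "application API keys",
    "TLS certificates and private keys",
    "database passwords",
    "backup-repository credentials",
}

def ready_to_go_online(rotated: set[str]) -> bool:
    """Hard gate: restored systems stay fenced off until every
    credential class in the backup has been rotated."""
    missing = REQUIRED_ROTATIONS - rotated
    if missing:
        # Bringing the environment up now hands the attacker their
        # old credentials back.
        print("HOLD: unrotated credentials:", ", ".join(sorted(missing)))
        return False
    return True

if __name__ == "__main__":
    done = {"domain service accounts", "TLS certificates and private keys"}
    print("online allowed:", ready_to_go_online(done))
```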
The third is the offline copy. Online backups visible to the production AD domain are not backups against ransomware; they are additional targets. The catalogue includes several incidents where the attacker reached the backup repository and encrypted or deleted it before encrypting the production systems. The fix is air-gapped or immutable storage for the recovery copies, with separate authentication, and the restore drill has to include the network/permission steps to access them.
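Verifying the immutable copy belongs in the drill, not in a procurement document. A minimal sketch assuming AWS S3 Object Lock as the immutability mechanism (a tape library or Azure immutable blob storage needs its own equivalent check); the bucket name is hypothetical, and the credentials should come from the separate authentication domain the control requires, not from the production account.

```python
import boto3
from botocore.exceptions import ClientError

def is_immutable(bucket: str) -> bool:
    """Check whether a recovery bucket has a compliance-mode Object
    Lock default; returns False if Object Lock is absent entirely."""
    s3 = boto3.client("s3")
    try:
        cfg = s3.get_object_lock_configuration(Bucket=bucket)
    except ClientError:
        return False  # no Object Lock configuration on this bucket
    rule = cfg.get("ObjectLockConfiguration", {}).get("Rule", {})
    mode = rule.get("DefaultRetention", {}).get("Mode")
    # COMPLIANCE mode cannot be shortened or removed, even by the
    # account root; GOVERNANCE mode can, so it fails the check here.
    return mode == "COMPLIANCE"

if __name__ == "__main__":
    # Hypothetical bucket name; substitute your recovery repository.
    print("recovery copy immutable:", is_immutable("recovery-copies-example"))
```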
What good looks like
A documented restore runbook for every tier-1 service, owned by the team that runs that service, signed off annually by the service owner. A quarterly drill schedule that rotates through the tier-1 services so each one is tested at least once a year. An air-gapped or immutable recovery copy for every system in scope, with a documented and tested access procedure. A measured recovery time and recovery point for each drill, reported to a senior risk forum. A standing budget line for closing the gaps the drills surface.
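The rotation itself is trivial to mechanise, which removes one excuse for skipping a quarter. A minimal sketch with hypothetical tier-1 service names: a round-robin across four quarters guarantees every service lands in at least one drill per year.

```python
from itertools import cycle

# Hypothetical tier-1 inventory; substitute your own service list.
TIER1_SERVICES = [
    "billing", "identity", "erp", "customer-portal",
    "warehouse-mgmt", "payments", "data-platform", "email",
]

def quarterly_schedule(services: list[str], year: int) -> dict[str, list[str]]:
    """Assign services round-robin across the four quarters, so each
    tier-1 service is drilled at least once in the year."""
    quarters = [f"{year}-Q{q}" for q in (1, 2, 3, 4)]
    schedule: dict[str, list[str]] = {q: [] for q in quarters}
    for service, quarter in zip(services, cycle(quarters)):
        schedule[quarter].append(service)
    return schedule

if __name__ == "__main__":
    for quarter, services in quarterly_schedule(TIER1_SERVICES, 2025).items():
        print(quarter, "->", ", ".join(services))
```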
The cost of the control is the people-hours to run the drills. The benefit is the difference between a Norsk Hydro outcome and a Travelex outcome. The catalogue has both.
Where this control would have changed the outcome
- Colonial Pipeline — DarkSide ransomware: encrypted Colonial's billing systems, prompting a six-day shutdown of the largest US East Coast fuel pipeline; Colonial paid $4.4M, and the DOJ later recovered $2.3M.
- JBS Foods — REvil ransomware: took the world's largest meat processor offline globally; JBS paid an $11M ransom to restore operations within days, then disclosed the payment.
- Ireland's HSE — Conti ransomware: entered the Health Service Executive via a phishing email, encrypted core clinical systems, and forced hospitals to cancel tens of thousands of appointments.
- British Library — Rhysida ransomware: encrypted the Library's systems in October 2023; the Library refused to pay, lost 600GB of data to publication, and faced a £6–7M recovery bill.
- Norsk Hydro — LockerGoga ransomware: pushed via Active Directory to every Windows workstation simultaneously, halting aluminium production globally and costing over $70M to recover.
- Travelex — Sodinokibi ransomware: a New Year's Eve deployment took the foreign-exchange systems offline for weeks, contributed to the August 2020 administration, and forced UK store closures.
- Jaguar Land Rover — production halt: vishing calls and stale infostealer credentials gave attackers admin access to JLR's SAP systems; ransomware halted five-plant production for five weeks on the UK's busiest plate-change day.
- CDK Global — auto-dealer SaaS ransomware: BlackSuit ransomware took CDK offline for two weeks, halting transactions at 15,000 North American auto dealerships; CDK reportedly paid a $25M ransom rather than rebuild from backup.
- Kaseya VSA — REvil supply-chain ransomware: REvil exploited a zero-day authentication bypass in Kaseya VSA to push ransomware through managed service providers to roughly 1,500 downstream businesses in July 2021.
Sources
- NCSC — Offline backups in an online world // primary
- NCSC Cyber Assessment Framework — D1: Response and recovery planning // primary
- NIST SP 800-34 Rev. 1 — Contingency Planning Guide for Federal Information Systems // primary
- CIS Controls v8 — Control 11: Data Recovery // primary
- CISA — Stop Ransomware: Backup and recovery guidance // primary
- ACSC Essential Eight — Regular backups // primary