// Controls Desk · 30 April 2026 · Recovery

Quarterly tested backup restores, with the recovery clock measured

Backups exist at most large organisations. Tested restores do not. The single difference between a six-day outage and a six-hour outage is whether the runbook has actually been run.

Quadrant: Quick win
Ease: 4 / 5
Impact: 4 / 5
Control family: Recovery
Cost band: Low
Catalogued incidents: 9

What the control is

The backup itself is the necessary precondition, not the control. The control is whether the restore from that backup has been tested under realistic conditions, with the clock running, by the people who would actually be doing it during an incident, against a system that genuinely matters to the business. Quarterly. With the metrics — recovery time, recovery point, failed steps, missing dependencies — written down and reported up.

The shape of a useful drill: pick a tier-1 system, declare it ransomwared in the exercise scenario, walk the actual runbook to rebuild it from backup. Measure the elapsed time end to end, including the restore-prerequisite steps that almost always get missed in planning — the AD trust restoration, the DNS, the certificate re-issuance, the secret rotation, the dependency chain on the systems below it in the stack. The metric the board needs is the elapsed time, not the line-item “yes we have backups.”
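The two numbers the drill produces can be computed directly from three timestamps. A minimal sketch, with illustrative function and field names (none of this is from a real tooling API):

```python
from datetime import datetime, timedelta

# Hypothetical sketch: turn a drill's raw timestamps into the two
# numbers the board needs. Names and example timings are illustrative.
def drill_metrics(last_good_backup: datetime,
                  incident_declared: datetime,
                  service_restored: datetime) -> dict:
    """Recovery time: clock time from declaration to restored service.
    Recovery point: data lost, i.e. the gap between the last good
    backup and the moment of the (exercise) incident."""
    return {
        "recovery_time": service_restored - incident_declared,
        "recovery_point": incident_declared - last_good_backup,
    }

# Example drill: nightly backup at 02:00, incident declared at 09:00,
# service back at 17:30 the same day.
m = drill_metrics(datetime(2026, 4, 1, 2, 0),
                  datetime(2026, 4, 1, 9, 0),
                  datetime(2026, 4, 1, 17, 30))
assert m["recovery_time"] == timedelta(hours=8, minutes=30)
assert m["recovery_point"] == timedelta(hours=7)
```

Reporting the pair per drill, per service, is what turns “yes we have backups” into a trend line a risk forum can actually challenge.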

Why it matters

Backups by themselves do not save organisations. The catalogue is unusually clear on this. Colonial Pipeline (May 2021) had backups; the projected restore time was long enough that paying the ransom looked like the faster path. JBS Foods paid eleven million dollars weeks later for the same reason. Ireland’s HSE (May 2021) spent months running emergency ambulance routing manually because the restore process had never been rehearsed at scale and the dependency chain between systems wasn’t documented. The British Library is still, more than two years on, restoring fragments of the catalogue that broke during the Rhysida ransomware event. Travelex went into administration within months of the Sodinokibi attack of December 2019. CDK Global, Kaseya, MOVEit and JLR all sat in extended outages that compounded the original event.

Norsk Hydro is the counter-example most often cited because it earned the citation. The 2019 LockerGoga event hit them hard, but their offline-backup posture had been rehearsed, the runbook had been run, and the engineering culture treated the rebuild as a test of an existing process rather than an emergency. They were back online quickly, refused to pay, and shared the timeline publicly. The difference between Norsk Hydro and Travelex is not whether they had backups — both did — but whether the restore had been rehearsed.

The same dynamic shows up in every ransomware incident in the catalogue. The organisations that recovered fastest had pre-built playbooks and had run them. The organisations that paid the ransom did so because the alternative — restoring from a backup nobody had tested — was a worse business risk than the cost of the cryptocurrency transfer.

Where the regulators sit

NCSC’s blog post “Offline backups in an online world” is the most-cited British piece of writing on this, and the framing is direct: backups are not a control unless the restore has been verified. NCSC’s Cyber Assessment Framework principle D1 (“Response and recovery planning”) requires that recovery procedures are exercised. NIST SP 800-34 Rev. 1 (“Contingency Planning Guide for Federal Information Systems”) specifies tabletop and full-rebuild exercises at defined cadences. CIS Controls v8 Control 11 (“Data Recovery”) makes recovery testing an explicit sub-control. CISA’s Stop Ransomware guidance places tested backups at the top of the prevention pyramid. The Australian Essential Eight requires regular backups with periodic restore tests at maturity level 1, and full restore validation at level 3.

The unanimity here is older than it is for most other controls in the catalogue. The argument has not moved in a decade.

Where it usually breaks

Three failure modes show up consistently. The first is the AD-trust dependency. Modern enterprise systems live in a tree of authentication dependencies that runs back to Active Directory or its cloud equivalent. If the restore order is wrong — if the application tier comes back before the identity tier, or if the hypervisor management plane comes back before its own authentication source — the restore stalls, and people learn this at three in the morning during a real incident. The fix is a documented and tested rebuild order.
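A rebuild order is just a topological sort of the dependency graph, which is worth computing once and pinning in the runbook. A minimal sketch using Python’s standard library — the service names and dependency map are illustrative, not from any real estate:

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency map: each service lists what must already be
# up before it can be restored. All names are illustrative.
depends_on = {
    "active-directory": [],
    "dns":              ["active-directory"],
    "pki":              ["active-directory", "dns"],
    "hypervisor-mgmt":  ["dns"],
    "database-tier":    ["active-directory", "dns", "pki"],
    "application-tier": ["database-tier", "pki"],
}

# static_order() returns every node after its prerequisites, and raises
# CycleError on a circular dependency -- a useful failure to surface in
# a drill rather than at three in the morning.
restore_order = list(TopologicalSorter(depends_on).static_order())
print(restore_order)
```

The identity tier falls out first by construction, which is exactly the property the runbook needs to guarantee.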

The second is secret rotation. Backups contain credentials. After ransomware, every credential in the backup is suspect. The restore plan has to include a credential-rotation step before the restored systems come back online — service accounts, API keys, certificates, the lot. Most plans don’t, and the restored environment carries forward the same credentials the attacker has, which means the attacker walks back in.
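The rotation step can be enforced as a gate rather than a checklist item: nothing restored reconnects until every credential class it carried has been rotated. A minimal sketch — the class, host name and credential labels are all hypothetical:

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a pre-cutover gate. Every credential present
# in the backup is treated as compromised until rotated.
@dataclass
class RestoredHost:
    name: str
    credentials: set                     # credential classes in the backup
    rotated: set = field(default_factory=set)

    def rotate(self, cred: str) -> None:
        self.rotated.add(cred)

    def safe_to_reconnect(self) -> bool:
        # Reconnect only once every suspect credential has been rotated.
        return self.credentials <= self.rotated

host = RestoredHost("erp-app-01",
                    {"service-account", "api-key", "tls-cert"})
assert not host.safe_to_reconnect()      # straight from backup: unsafe
for cred in ("service-account", "api-key", "tls-cert"):
    host.rotate(cred)
assert host.safe_to_reconnect()
```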

The third is the offline copy. Online backups visible to the production AD domain are not backups against ransomware; they are additional targets. The catalogue includes several incidents where the attacker reached the backup repository and encrypted or deleted it before encrypting the production systems. The fix is air-gapped or immutable storage for the recovery copies, with separate authentication, and the restore drill has to include the network/permission steps to access them.
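The offline-copy test reduces to two checks per repository: is the copy immutable, and is it reachable with production-domain credentials? A minimal sketch of that audit, with illustrative repository records:

```python
# Hypothetical sketch: flag recovery copies that fail either offline-copy
# test. The repository records and field names are illustrative.
repos = [
    {"name": "nightly-snapshots", "immutable": False, "auth_domain": "prod-ad"},
    {"name": "vault-copy",        "immutable": True,  "auth_domain": "isolated"},
]

def at_risk(repo: dict) -> bool:
    # A copy visible to the production domain, or one an attacker with
    # domain admin could delete, is a target rather than a backup.
    return repo["auth_domain"] == "prod-ad" or not repo["immutable"]

exposed = [r["name"] for r in repos if at_risk(r)]
assert exposed == ["nightly-snapshots"]
```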

What good looks like

A documented restore runbook for every tier-1 service, owned by the team that runs that service, signed off annually by the service owner. A quarterly drill schedule that rotates through the tier-1 services so each one is tested at least once a year. An air-gapped or immutable recovery copy for every system in scope, with a documented and tested access procedure. A measured recovery time and recovery point for each drill, reported to a senior risk forum. A standing budget line for closing the gaps the drills surface.
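The quarterly rotation is a round-robin: with four or fewer tier-1 services per rotation group, one drill a quarter guarantees each service is tested at least once a year. A minimal sketch, with illustrative service names:

```python
from itertools import cycle, islice

# Hypothetical sketch: assign one tier-1 service to each quarterly
# drill slot, round-robin. Service and quarter labels are illustrative.
services = ["erp", "payments", "warehouse-mgmt", "customer-portal"]

def drill_schedule(services: list, quarters: list) -> list:
    """Pair each quarter with the next service in rotation."""
    return list(zip(quarters, islice(cycle(services), len(quarters))))

quarters = ["2026-Q3", "2026-Q4", "2027-Q1", "2027-Q2"]
schedule = drill_schedule(services, quarters)
# Four services, four quarters: every service drilled within the year.
assert {s for _, s in schedule} == set(services)
```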

The cost of the control is the people-hours to run the drills. The benefit is the difference between a Norsk Hydro outcome and a Travelex outcome. The catalogue has both.
