How to test your backup and restore plan, the right way

The explosion of ransomware threats over the past year has put added emphasis on the importance of not only backing up critical data but also being able to quickly restore it after disaster strikes.

Yet, if pressed for an honest answer, many IT leaders would likely struggle to explain how the increased precautions they’ve been taking would get their organisations up and running after an attack.

Why?

Because few companies test their restore plans often or effectively enough – if at all.

Phil Goodwin, an IDC analyst focused on data management, says about 25 percent of all data restoration jobs fail, and about half of all organisations have experienced a “non-recoverable data event” in the past 12 months. Many of those events could have been avoided had the organisations better prepared for real-world scenarios with more robust recovery strategies, he says.

“You need to have immutable copies of your data located both on site and off site to maximise cyber-recovery and the chances of zero data loss,” says Goodwin. “Beyond that, you also have to be incredibly diligent in testing your restoration plans. Too many organisations put plans together that are great when written but lose their relevancy as the cyberthreat or disaster landscapes shift. You have to test regularly to minimise downtime that could disrupt your business and hurt your bottom line.”

Best practices for beating downtime

There are numerous best practices for implementing a robust restoration test plan and avoiding downtime, which can cost as much as $11,600 per minute for large enterprises.
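To put that per-minute figure in perspective, the short sketch below multiplies it out for an outage of a given length. The $11,600 value is the upper-bound figure quoted above; the 90-minute outage and the function name are illustrative assumptions, not data from any real incident.

```python
# Upper-bound downtime cost quoted in the article for large enterprises.
COST_PER_MINUTE_USD = 11_600


def downtime_cost(minutes, cost_per_minute=COST_PER_MINUTE_USD):
    """Estimate the cost of an outage as duration times per-minute cost."""
    return minutes * cost_per_minute


# Hypothetical 90-minute outage at the quoted upper-bound rate.
print(downtime_cost(90))  # 1044000
```

At that rate, an hour and a half of downtime already passes the $1 million mark, which is why the restore-time targets discussed later matter.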

First, regularly scrutinise your backups. Up to 60 percent of backups are never fully completed, and 77 percent of tape backups fail altogether. Without complete and uncorrupted backups, recovery is impossible. To overcome this situation, experts recommend considering modern disaster recovery solutions that automate and ensure backup integrity across on-premises and cloud environments. Solutions from the likes of Cohesity, Commvault, Rubrik, Veeam, Veritas and Zerto (owned by Hewlett Packard Enterprise) are available as standalone tools or through services from third-party consultants and advisers.
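For teams that want a basic integrity check of their own, one common approach is to compare a cryptographic checksum of the original file with that of the restored copy. The sketch below is a minimal illustration of that idea using Python’s standard library. It is not how any of the products named above work, and the file names are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_matches(source, restored):
    """True when the restored copy is byte-for-byte identical to the source."""
    return sha256_of(source) == sha256_of(restored)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        original = Path(tmp) / "orders.db"              # hypothetical file
        good_copy = Path(tmp) / "orders.restored.db"
        bad_copy = Path(tmp) / "orders.truncated.db"

        original.write_bytes(b"critical business data" * 1000)
        good_copy.write_bytes(original.read_bytes())
        # Simulate a backup that never fully completed.
        bad_copy.write_bytes(original.read_bytes()[:-10])

        print(backup_matches(original, good_copy))  # True
        print(backup_matches(original, bad_copy))   # False
```

A check like this only catches incomplete or corrupted copies. It says nothing about whether the application can actually start from the restored data, which is what the restore tests described below are for.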

As a next step in developing a restoration and test plan, Amir Chaudry, vice president of storage at HPE, recommends cataloging all applications in the enterprise and assigning criticality to all of them.

“You have to be able to stack-rank the importance of each of your apps so if you do have to restore from your servers, you’re bringing things back online according to business and customer priorities,” he says. “That means knowing, in advance, which apps are most vital to your ongoing operations.”
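The stack-ranking Chaudry describes can be captured in something as simple as a sorted catalogue. The sketch below assumes a hypothetical scale where 1 is the most critical tier; the application names and tiers are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class App:
    name: str
    criticality: int  # assumption: 1 = most critical, higher = less critical


def restore_order(apps):
    """Return apps in the order they should be brought back online.

    Sorted by criticality first, then by name so the order is deterministic
    when two apps share a tier.
    """
    return sorted(apps, key=lambda app: (app.criticality, app.name))


# Hypothetical catalogue.
catalogue = [
    App("internal-wiki", 3),
    App("payments", 1),
    App("email", 2),
    App("customer-portal", 1),
]

for app in restore_order(catalogue):
    print(app.criticality, app.name)
```

A real catalogue would also need to record dependencies between applications (a portal that cannot start before its database, for example), which this sketch leaves out.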

Chaudry says another part of the plan should identify an organisation’s goals for recovery point objectives (RPO) and recovery time objectives (RTO). RPO measures how much data an organisation can withstand losing before it runs aground. RTO gauges how much time an organisation can be without key apps without causing significant damage.
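One way to make the RPO concrete is to compare it with the backup schedule. The sketch below rests on a simplifying assumption: with periodic backups, anything written since the last successful backup can be lost, so the worst-case loss window is roughly the backup interval. The class name, function names and targets are hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryObjectives:
    rpo: timedelta  # maximum tolerable window of lost data
    rto: timedelta  # maximum tolerable time without the application


def worst_case_data_loss(backup_interval):
    """Assumption: the worst-case loss window equals the backup interval."""
    return backup_interval


def backup_schedule_meets_rpo(backup_interval, objectives):
    """True when the backup schedule keeps potential loss within the RPO."""
    return worst_case_data_loss(backup_interval) <= objectives.rpo


# Hypothetical targets for a billing system.
billing = RecoveryObjectives(rpo=timedelta(minutes=15), rto=timedelta(hours=1))

print(backup_schedule_meets_rpo(timedelta(hours=24), billing))   # False
print(backup_schedule_meets_rpo(timedelta(minutes=5), billing))  # True
```

In this example a nightly backup cannot satisfy a 15-minute RPO, however reliable the backup itself is.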

“Keep it real when considering RPO and RTO,” says Chaudry. “Many IT leaders go about modeling with rose-colored glasses and decide their networks will do just fine because they planned so well. In truth, it doesn’t take long for lost data or a downed application to wreak havoc.”

Aligning tests and KPIs

Once IT leaders better map their data and application dependencies, they can then decide how best to test against them.

Haim Glickman, a senior vice president at Sungard Availability Services, a managed disaster recovery company in Pennsylvania, notes testing can be complex. As such, many organisations offload those responsibilities to firms such as his. This gives them access to outside expertise and the most up-to-date tools so they can concentrate on other business or operational priorities.

But if companies decide they have the staff, expertise, and gumption to handle it themselves, Glickman recommends that they remember their high school Biology 101 lab courses: do everything in controlled “bubbles.” Create a staged environment that won’t interfere with your live production data. The test environment can be on premises or in the cloud. Whichever ends up being the case, do not allow the two environments to intermingle, because it could skew results and even lead to operational hazards.
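A simple guard can help keep the two environments apart when test restores are scripted. The sketch below refuses any restore target that sits inside, contains, or is the same as a production path. The directory names are hypothetical, and a real setup would also need network and credential isolation, which a path check cannot provide.

```python
from pathlib import Path


def is_isolated(target, production_roots):
    """True when the restore target does not overlap any production path."""
    target = Path(target).resolve()
    for root in production_roots:
        root = Path(root).resolve()
        overlaps = (
            target == root             # same directory
            or root in target.parents  # target lives inside production
            or target in root.parents  # target contains production
        )
        if overlaps:
            return False
    return True


# Hypothetical production locations.
PRODUCTION_ROOTS = [Path("/srv/prod/data"), Path("/var/lib/app")]

print(is_isolated("/srv/restore-test/data", PRODUCTION_ROOTS))  # True
print(is_isolated("/srv/prod/data/orders", PRODUCTION_ROOTS))   # False
```

A restore script could call this check first and stop with an error whenever it returns False.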

Ken van Wyk, president of KRvW Associates, a small cybersecurity and incident response firm in Virginia, takes it a step further with a concept called table-topping. As he defines it, this is where an IT department simulates an actual emergency. Rather than just running digital tests in a bubble, he comes in, looks over the organisation’s cataloging, criticality, and RPO and RTO analyses, and then goes to work.

“I try to get clients to take a hard look at their assessments, but ultimately, you’ll always have people in the room who are hopelessly optimistic about their estimates of downtime,” he says. “So I push them to do live testing. We disconnect a system for a while and simulate downtime in a very real way. That can be really disruptive, of course. You need to carefully consider how you do it so you don’t make things worse than what you’re simulating. But the exercise can be eye-opening, allowing you to realistically practice and prepare for a major disaster.”

The disconnect process should not be viewed as a general solution, van Wyk adds. Rather, it should be reserved for specific instances where production resilience is regularly tested, he says.

Creative testing

Chaudry agrees, noting it’s critical not only to run both digital and live simulations but also to do so weekly or monthly, because data is constantly changing. In addition, he suggests testing restores before and after system changes or upgrades. What’s more, tests should examine both new and old data. Too often, it’s the oldest data that becomes most troublesome during recovery efforts, he says.

In the end, Chaudry says any testing program needs to map back to RPO and RTO goals. IT teams should check the duration of restore simulations against those targets. If they’re misaligned, they need to adjust and test again.
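Checking restore durations against RTO targets is easy to automate once each drill is timed. The sketch below times a restore callable and reports which applications overran their targets. The function names, application names and figures are illustrative assumptions.

```python
import time
from datetime import timedelta


def timed_restore(restore_fn):
    """Run a restore callable and return how long it took."""
    start = time.monotonic()
    restore_fn()
    return timedelta(seconds=time.monotonic() - start)


def missed_rto(measured, targets):
    """Return {app: overrun} for every app whose restore exceeded its RTO.

    Raises KeyError if an app was measured but has no RTO target, so that
    gaps in the catalogue surface rather than pass silently.
    """
    return {
        app: duration - targets[app]
        for app, duration in measured.items()
        if duration > targets[app]
    }


# Hypothetical drill results and RTO targets.
measured = {"payments": timedelta(minutes=50), "email": timedelta(hours=5)}
targets = {"payments": timedelta(hours=1), "email": timedelta(hours=4)}

for app, overrun in missed_rto(measured, targets).items():
    print(f"{app} missed its RTO by {overrun}")
```

An empty result means the drill met every target. Anything else is the cue to adjust and test again.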

“Rinse and repeat,” Chaudry says. “Tools and outside consultants help you do that more effectively, especially those offering continuous data protection features. But if you think you can handle all of this yourself, make sure you’re backing up efficiently and testing all of your data and apps as much as possible. Your organisation’s very existence may depend upon it.”

